The internet holds a wealth of information. With search engines, people can find almost anything, but far from everything. One thing that remains hard to find is data for research, or simply datasets.
Datasets differ from the regular web content that web users and Google can read. "In many cases, information about these datasets is neither linked nor has it been indexed by search engines, making data discovery tedious or, in some cases, impossible," said Google.
Because datasets exist across thousands of repositories, researchers and scientists looking for datasets to use in their research can't just use search engines with normal queries to find that information. What's more, some organizations curate and publish their data on a regular basis, but only through dedicated data portals.
Those who want access to that data need to be familiar with the process of locating it through those portals.
And because the internet has lots of information, researchers also have to locate the right sources, and get the right data from those sources.
This is not an easy task. Wouldn't it be much easier if they could use one search engine and find everything out there, the same way most people search for something on the web? Well, Google is making that process possible.
The search engine's attempts to understand the web by using structured data and semantics have become fruitful. The key element here is schema.org.
Using schema.org's controlled vocabulary, which describes entities in the real world and their properties, Google can better understand the context of content. So for example, when a schema.org term is used to annotate content on the web, it lets search engines know what that content is, as well as its properties.
According to Google's blog post, the search engine giant started the project by creating guidelines for dataset providers to ensure Google could understand the content of a dataset. For example, Google suggested that providers should include particular information in the dataset’s metadata, such as how the provider collected the data and who can use it.
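To make the idea of dataset metadata more concrete, here is a minimal sketch of what such a description could look like when written with schema.org's Dataset type and serialized as JSON-LD. The helper names, the example.org URLs and all the values are hypothetical placeholders chosen for illustration; the property names (name, description, license, creator, measurementTechnique, distribution) come from the schema.org vocabulary, but this is not presented as Google's complete or required set of fields.

```python
import json


def build_dataset_jsonld(name, description, url, license_url, creator_name,
                         measurement_technique, download_url, encoding_format):
    """Return a schema.org Dataset description as a JSON-LD string.

    This helper is hypothetical; it only assembles a dictionary and
    serializes it with the standard json module.
    """
    metadata = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        # Who can use the data: a link to the license terms.
        "license": license_url,
        "creator": {"@type": "Organization", "name": creator_name},
        # How the provider collected the data.
        "measurementTechnique": measurement_technique,
        "distribution": [{
            "@type": "DataDownload",
            "encodingFormat": encoding_format,
            "contentUrl": download_url,
        }],
    }
    return json.dumps(metadata, indent=2)


def wrap_in_script_tag(jsonld):
    """Wrap JSON-LD in the script tag commonly used to embed it in a page."""
    return '<script type="application/ld+json">\n' + jsonld + "\n</script>"


# All values below are made-up placeholders.
example = build_dataset_jsonld(
    name="Example river temperature readings",
    description="Hourly water temperature readings from a fictional river.",
    url="https://example.org/datasets/river-temperature",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    creator_name="Example Research Institute",
    measurement_technique="Automated sensor readings, one per hour",
    download_url="https://example.org/datasets/river-temperature.csv",
    encoding_format="text/csv",
)
print(wrap_in_script_tag(example))
```

A provider would place the printed script tag in the HTML of the page that describes the dataset, so the metadata stays readable by crawlers even when the data file itself sits elsewhere.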
The first step was to make it easier to discover tabular data in search, which uses this same metadata along with the linked tabular data to provide answers to queries directly in search results. This has been available for a while.
What has changed is that Google has officially turned on support for dataset entities in schema.org.
Datasets described according to Google's guidelines can be indexed by the search engine, so that it can show the relevant ones in response to users' search queries.
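As a rough illustration of how a crawler might pick up such markup, the sketch below scans a page for JSON-LD script blocks and keeps the entries typed as a schema.org Dataset. It uses only Python's standard library, and the class name, function name and sample page are hypothetical. It is an assumption about how this kind of discovery could work in principle, not a description of Google's actual indexing pipeline.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of application/ld+json script tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed markup instead of failing the whole page.


def find_datasets(html):
    """Return every JSON-LD object on the page whose @type includes Dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    datasets = []
    for block in parser.blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Dataset" in types:
                datasets.append(item)
    return datasets


# A made-up page used only to exercise the sketch.
sample_page = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset",
 "name": "Example river temperature readings",
 "license": "https://creativecommons.org/licenses/by/4.0/"}
</script>
</head><body><p>Landing page for a fictional dataset.</p></body></html>
"""

for dataset in find_datasets(sample_page):
    print(dataset["name"], "|", dataset.get("license", "no license given"))
```

Pages without this markup simply yield an empty list, which mirrors the point made here: a dataset becomes discoverable only once its metadata is published in a form the search engine can read.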
According to Google Research:
"Furthermore, because the standard is open and used by other companies, we know that many felt that they are doing it because it is 'the right thing to do.' While we reached out to a number of partners to encourage them to provide the markup, we were surprised to find schema.org/dataset on hundreds, if not thousands, of sites."
"So, at launch, we already have millions of datasets, although we estimate it is only a fraction of what is out there. Most just marked up their data without ever letting us know."
And by combining web ranking for the pages where datasets come from with dataset-specific signals, such as the quality of metadata and citations, Google can list pages with datasets just like regular pages.
But to make this possible, the metadata needs to be open, even though the dataset itself doesn't have to be.
Google has launched Dataset Search, a search engine designed specifically for collections of data. The company hopes that the platform can help scientists locate datasets quickly and effortlessly.
This is Google's first attempt to make this possible. Dataset Search includes datasets focused on the environmental and social sciences, as well as datasets from government websites and various news organizations focused on other topics.
As more organizations and companies follow the metadata guidelines on their web pages, the number and types of datasets included in the search engine should continue to grow, according to Google.
Eventually, with Google capable of accessing millions of available datasets that were previously hidden from its sight, the search engine should allow anyone to get that information by typing just a few words into its search box.