You may have heard that Google announced on September 5 a new search for datasets. I know some of you did because a bunch of people let me know. Y’all are awesome! I found this new search compelling enough that I didn’t want to just mention it in ResearchBuzz; I wanted to take a little time and play with it and write an article. So here we are.
Google Dataset Search
At the moment you might be wondering what a dataset is. Isn’t every Web page a set of data?
Not really. I would describe a data set as a collection of information that’s organized and delineated so that it can be viewed and used by both people and computer programs. To me, a dataset is to a Web page what an RSS feed is to a Web site — the same kind of information may be in each, but in a dataset, it’ll be better organized and structured so that a computer program or script can take advantage of it. (Lost Boy blog also takes a look at datasets with several different possible definitions, if you want a more in-depth look.)
Google has created a set of guidelines for publishers to make datasets available. As you’ll note when you look at those guidelines, Google could consider as a dataset anything from a plain old CSV file to a collection of files that all feed into the same collection of data. That’s a lot of different potential kinds of datasets, so remember that when you’re using the search engine.
And as for the search engine itself — it’s quite a bit like Google, which means you can use special syntax with its associated powers for seriously narrowing down your search results. Let’s take a look.
Basic Dataset Searching
Google’s Dataset search (which I’ll just call Google Dataset) is available at https://toolbox.google.com/datasetsearch . It’s in beta. We’ll start with a basic search so you can get an idea of how it works. How about cows? You’ll notice that the search results look different from Google’s Web search results, and indeed what you get from each search result can vary a lot.
In the screenshot above you’ll note that there’s a link to Google Scholar for articles where the dataset is cited, information on licensing from Creative Commons, an excellent description, and a list of the authors. On the other hand you might get something like this, where the description is minimal, there are no links to Google Scholar articles, etc.
What these results both have in common is a link underneath the name of the dataset, but again, what you get when you click on that link will also vary. If you click on a dataset associated with the government of Canada, you might get a landing page and a prompt to download the database:
While if you click on a dataset from Figshare, you might get dropped right into the data itself!
If this Google tool is restricted to just datasets, then of course it’s going to be much smaller than Google’s Web search. That doesn’t mean you should stick to keyword searches. Special search syntax will help you drill down into Google Dataset’s offerings.
Advanced Searching for Datasets
When you do a basic keyword search, you might notice that you’re getting a lot of results from data repository sites like Figshare and the Dryad Digital Repository. If you want to limit your search results to large repositories, you can use the site: syntax and the pipe symbol (|) to search only those repositories.
cows (site:datadryad.org | site:figshare.com | site:datacite.org)
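If you find yourself typing that kind of query a lot, here is a minimal Python sketch that assembles it from a keyword and a list of sites. The function name any_site_query is my own invention for illustration; it only builds the query text and does not talk to Google at all.

```python
def any_site_query(keywords, sites):
    """Build a query that matches the keywords on any one of the given sites."""
    # Join the site: terms with the pipe symbol, which acts as OR.
    site_terms = " | ".join(f"site:{site}" for site in sites)
    return f"{keywords} ({site_terms})"


repositories = ["datadryad.org", "figshare.com", "datacite.org"]
print(any_site_query("cows", repositories))
```

Swap in your own keyword or repository list and paste the printed query into the search box.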
On the other hand, maybe you don’t want large sites. Maybe you want to stick to data sets offered by universities. Add a simple site:edu modifier.
In this case you’ll notice that multiple sites are mentioned in some search results, which looks like a dataset being held at both an institution and a dataset repository. So you won’t escape the repositories entirely, but you will bring educational institution results to the top.
I found similar results when I did a search for cows site:uk, but when I searched for cows site:gov I found most of the datasets on gov sites did not appear to be available in other places. So try cows site:gov as well.
Sometimes a dataset you’re searching for does not have your keyword in its title. It doesn’t appear that Google is restricting the search in datasets to just the name of the dataset. If you search for cow you’ll find, say, “Milk Production, from the USDA National Agricultural Statistics Service,” or “Weighing Weddell seal pups and mothers as a measure of ecosystem conditions.” In that case you can use intitle to restrict the search to just the dataset title. Try this:
Or maybe you want to know a particular thing about cows and other livestock. Like their methane production. Put the keyword in a title syntax and then add the kinds of animals you want to know about.
intitle:methane cows goats livestock
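Here is a rough Python sketch of putting that query together and turning it into a link you can bookmark or share. Treat the details as assumptions: the helper names are made up for illustration, and the /search path with a query parameter is my guess at how the results page is addressed, not something Google documents.

```python
from urllib.parse import urlencode

# Assumption: the results page lives at this path and reads a "query" parameter.
SEARCH_URL = "https://toolbox.google.com/datasetsearch/search"


def title_query(title_word, *keywords):
    """Require one word in the dataset title, then add ordinary keywords."""
    return " ".join([f"intitle:{title_word}", *keywords])


def search_link(query):
    """URL-encode the query so colons and spaces survive in a link."""
    params = urlencode({"query": query})
    return f"{SEARCH_URL}?{params}"


query = title_query("methane", "cows", "goats", "livestock")
print(query)
print(search_link(query))
```

The encoding step matters because special syntax is full of colons, spaces, and sometimes pipes, none of which belong in a URL unescaped.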
Out of curiosity I tried the filetype: search on Google Dataset. It’s a search that allows you to get results based on filetype (try report filetype:pdf on Google’s Web search for a quick example). That syntax didn’t work. But the inurl: syntax, which lets you specify a keyword in the site URL, does work.
It doesn’t always work perfectly, however. Searching for intitle:water infrastructure inurl:csv does find a couple of datasets, but only one goes directly to an embedded CSV file.
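Part of the reason is that inurl:csv matches "csv" anywhere in the URL, not just as a file extension. If you collect result links and want to sort the direct CSV files from the landing pages, a small check like this sketch can help. The example links are hypothetical, and the test only looks at the URL path, so it is a hint rather than proof of what the link serves.

```python
from urllib.parse import urlparse


def looks_like_direct_csv(url):
    """Return True when the URL path itself ends in .csv."""
    return urlparse(url).path.lower().endswith(".csv")


# Hypothetical result links, for illustration only.
links = [
    "https://data.example.org/water/infrastructure.csv",
    "https://data.example.org/datasets/csv-water-infrastructure",
]
for link in links:
    print(link, looks_like_direct_csv(link))
```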
The descriptions and information presented with the datasets in Google’s new search vary so much that you’ll probably have to do several iterations of searches to find everything you might be looking for. Add in a little special syntax mojo and these data collections might be a little easier to winkle out.