Will Google succeed?
I'm not the only one to note the irony of Google calling for good metadata while also saying it is almost impossible to get researchers to produce structured metadata.
I believe researchers could easily be persuaded to do this, particularly if libraries make it easier to enter very simple metadata (UX!) that is converted to Schema.org. Plenty of faculty have deposited their own articles onto ResearchGate, with very little incentive and pretty much no brand name at the time.
Google may or may not get bored and give up, but I do expect that even if it becomes a standard to provide metadata via Schema.org, the quality will be minimal. You would get title, author, license, maybe funder but nothing much else.
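To give a sense of how little such a minimal record contains, here is a rough sketch of how a simple deposit form might be converted into a Schema.org `Dataset` description in JSON-LD. The function name `to_schema_org_dataset` and the sample values are hypothetical, made up for illustration. Only the property names (`name`, `creator`, `license`, `funder`) come from the Schema.org vocabulary, and this is not a description of how Google or any particular repository does it.

```python
import json


def to_schema_org_dataset(title, authors, license_url=None, funder=None):
    """Build a minimal Schema.org Dataset record (as a dict) from simple form fields."""
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": title,
        # One Person entry per author name typed into the form.
        "creator": [{"@type": "Person", "name": author} for author in authors],
    }
    # Optional fields are only emitted when the depositor filled them in.
    if license_url:
        record["license"] = license_url
    if funder:
        record["funder"] = {"@type": "Organization", "name": funder}
    return record


# Hypothetical sample values, standing in for what a researcher might enter.
example = to_schema_org_dataset(
    title="Example survey responses (hypothetical)",
    authors=["A. Researcher"],
    license_url="https://creativecommons.org/licenses/by/4.0/",
    funder="Example Funding Agency",
)

# The JSON-LD string that a repository landing page could embed.
print(json.dumps(example, indent=2))
```

A record like this says nothing about variables, methods, coverage or units, which is exactly why relevance ranking over such metadata is likely to be hard.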
Unlike articles (where poor metadata can be compensated for somewhat by indexing the full text), with datasets Google must rely almost wholly on metadata, and poor, limited metadata is likely what it is going to get. Users will then be searching a large pool of items with limited metadata, and relevancy ranking is going to struggle. This makes me wonder whether subject-specific data search is going to predominate. This echoes the web-scale discovery vs subject databases debate.
To be fair, Google is aware of the difficulties.
Should we hope for Google Dataset Search to succeed? Worries about open infrastructure
Google Scholar is beloved by many, including me. It is a free yet excellent tool. Assuming Google Dataset Search grows to the same level, should we cheer it on to success?
Discussion under the #DontLeaveItToGoogle hashtag stresses the importance of not letting Google (a commercial company) dominate, and of pushing for open source infrastructure.
For those who don't know -- Google provides open APIs for some of its products, but not for Google Scholar and not for Google Dataset Search. Furthermore, at least for Google Scholar, Google has said it never will provide open APIs.
That said, one suspected reason for the lack of APIs in Google Scholar is a condition set by journal publishers in return for allowing the Google crawler to crawl their full text. With datasets this might not be a barrier, and the Nature article notes that APIs might be added in the future, but you never know.
I personally think Google is playing as fair as it could under the circumstances. It is using Schema.org, an open standard, and nothing prevents others from crawling the web and using the standard to index datasets.
Realistically speaking, of course, Google has more experience in crawling and indexing the web, and its clout means what it "recommends" will be taken seriously. This move has clearly given legitimacy to the push towards data discovery.
Still, leaving Google Dataset Search as the one and only comprehensive dataset search is dangerous. Google is a commercial company and could shutter the service at any time. There has been recent talk about the need for open infrastructure on top of open data and open access, and clearly the same holds for dataset search tools.
Conclusion
All in all, I believe this move by Google will be a big step in making the importance of data discovery front and centre in the eyes of many. In many ways, this new discovery challenge might mirror developments for article discovery unless we are careful. Will we be wise enough to learn from history?
___________________________________________
Aaron Tay is Librarian at Singapore Management University. You can follow him on Twitter.
His blog Musings about Librarianship covers a wide range of topics including OA, library analytics, tech tools and marketing.
This is an edited version of an article originally featured on Aaron’s blog. The full article features screenshots of search options, Twitter debates and other key resources.
__________________________________________