
Google and Harvard's Open-Access Repository

How well does Google index Harvard’s open-access repository? Three interns with the Office for Scholarly Communication (OSC) tackled this question over three semesters.

 

More than half the visitors to DASH, Harvard’s open-access repository, are referred by Google or Google Scholar. We in the OSC knew that Google and Google Scholar indexed DASH, and had their own reasons to do it thoroughly. But we also knew that configuration mistakes on our side could deter or derail the Google crawlers.

OSC director Peter Suber wanted to know just how well Google and Google Scholar were indexing the Harvard scholarship in DASH. The idea gained momentum when two OSC developers attended a conference presentation by Anurag Acharya, the founder of Google Scholar, on ways in which repositories like DASH could cooperate with the Google search engines.

From spring 2015 to spring 2016, Rebecca Lewis (UMass-Boston), Alexis Dhembe (Simmons College), and Mark Jemerson (Simmons College) systematically searched for DASH works in Google and Google Scholar. They picked samples from several different categories of DASH records: peer-reviewed articles, working papers, dissertations, conference presentations, old deposits, new deposits, long deposits, short deposits, PDFs, and non-PDFs. They searched for these works by title and by unique phrases from within the texts, in both plain Google and Google Scholar. Altogether they tested the Google-discoverability of nearly 1,000 works in DASH.

The results are reassuring. If the samples are representative, then 99.5% of DASH works are indexed by either Google or Google Scholar, and 93.2% are indexed by both. When the study turned up works not appearing in either Google or Google Scholar, we identified the problems and fixed them.

“I’m very happy with the study and for two reasons,” said Suber. “First, it answers a question that we couldn’t answer without data. We wanted to know whether DASH was friendly to Google crawling. Now we know that it is, and don’t have to assume it, hope it, or wonder about it. Second, there’s the fact itself, apart from our new degree of confidence in it. Google indexes DASH comprehensively. That’s important to us. The point of opening up Harvard’s research is to make it easier to find, retrieve, read, and apply. We spend a good deal of our time making DASH compatible with search engines and discovery tools. One result is that DASH has had almost eight million downloads in seven years. Another, now confirmed, is that researchers around the world can find Harvard research in DASH even if they don’t know that DASH exists, don’t know where it’s located, don’t know what it contains, and don’t visit to run local searches.”

Article written by the Office for Scholarly Communication.

Article published on July 20, 2016. 
