[This post is the first in a series that serialises the One Repo whitepaper in digestible chunks. Do please weigh in with comments! See also Part 2: the solution]
It was more than twenty years ago that Stevan Harnad published his “subversive proposal” that scholars should make the manuscripts of their publications freely available on the Internet. In the initial version of this proposal, the mechanism was FTP sites, but these were quickly replaced by institutional repositories (IRs), collections of manuscripts generated by all of a university’s authors. Many of these IRs are implemented using well-established software packages such as EPrints and DSpace.
The OpenDOAR directory of repositories contains metadata records describing not only IRs but also governmental and disciplinary repositories. At present, OpenDOAR lists some 2860 repositories. Assuming that perhaps another third to half as many again exist but are not registered, the true total lies somewhere between about 3800 and 4300, so the number in the world is probably close to 4000.
Although these repositories in aggregate make an enormous amount of research freely available, the fragmentation of this knowledge across 4000 repositories makes much of it effectively undiscoverable, and therefore useless. In practice, IRs form an archipelago of isolated islands rather than a continent of discoverable knowledge.
Google does not solve the problem
The de facto discovery tool for most purposes is the Google search engine and, for scholarly research, its cousin Google Scholar. These are helpful, but they are far from complete solutions. While Google Scholar indexes articles from many different sources, including commercial databases and some repositories, it does not solve the repository problem, for several reasons:
- It is not focussed on repositories, and has no mandate to focus on them.
- Its coverage is patchy and haphazard.
- There is no clear statement of what sources are and are not covered.
- There is no accountability to a board or the wider public.
- There is no API, and screen-scraping is prohibited.
This last point is crucial for practical purposes. The only thing that can be done with Google Scholar is to read its results from a screen. Those results cannot be automatically queried, aggregated, analysed, harvested, backed up or otherwise reused.
Perhaps worst of all, Google has no commitment to Scholar. It provides the service at present, but it could withdraw it at any time. (Google has a history of doing this, for example recently closing down Google Wave, Google Reader and Google Code.) The scholarly community simply cannot rely on a service that is opaque, closed, unaccountable and unreliable in the long term.
National repositories do not solve the problem
Some valuable initiatives already exist to gather repository content within individual countries: for example, JISC’s CORE (COnnecting REpositories) in the UK, and HAL in France. As significant as these are, however, they only reduce the number of places to be searched from 4000 repositories to (potentially) 200 national services, one per country. What is needed is a single point of access for the whole world.
National solutions have on occasion been mothballed at the end of experimental periods: for example, the ARROW project of Australia closed in December 2008, and the DARE project in the Netherlands ended in 2006 (although Narcis fulfils some of the same role). Such demises arguably indicate that national solutions are not broad enough, and therefore do not offer enough value, to sustain themselves for the long term.
[Read on to part 2: the solution]