[This post is the second in a series that serialises the One Repo whitepaper in digestible chunks. Do please weigh in with comments! See also Part 1: the problem]
We offer The One Repo (http://onerepo.net) as a solution to these challenges. This is a system, already existing in proof-of-concept form, to gather all the content of all the world’s repositories into a single database, in a uniform format, freely accessible to all as a Web UI, as embeddable widgets, as a set of web services, and as harvestable data.
The One Repo is not a research project, but is built on battle-tested components that are in use in high-volume commercial systems. It has been proven robust, efficient and scalable.
The policy of The One Repo is to accept all objects deposited in the included repositories, including:
- Actual manuscripts, with full text available.
- Metadata records describing manuscripts that are not available. These are important for at least three reasons. First, in some cases, they describe manuscripts that will become freely available after the expiry of an embargo period; second, such metadata records provide a means of discovering the author and requesting a copy directly – a process that may be facilitated by an “ask author for a copy” button; and third, records of manuscripts that should be available (but are not) are important data for tracking compliance with open-access policies.
- Associated data-sets, such as specimen photos, matrices for phylogenetic analysis, databases of observations and survey results.
Data objects deposited with third-party services such as GenBank, FigShare or Morphbank are out of scope.
Methods of harvesting and searching
The One Repo works by seamlessly integrating two approaches: searching remote systems in real time, and searching locally harvested data. Harvested data is quicker to access and enables more efficient and accurate faceting and sorting, but harvesting is more expensive to set up and an initial harvest can take some time to complete, so direct searching provides a useful alternative in difficult cases. Different approaches are appropriate for different databases.
Harvesting works by any of these methods:
- Metadata transfer using the OAI-PMH protocol
- Bulk download of records in any XML format
- Bulk download of records in any MARC-based format
- Any XML-based harvesting API
- Crawling of any web-based UI, when no better solution is available
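To make the first of these methods concrete, here is a minimal sketch of an OAI-PMH harvest loop. It is an illustration only, not The One Repo's actual harvester: the function names `harvest` and `fetch_url` are hypothetical, and it assumes the repository serves the standard `oai_dc` (Dublin Core) metadata format. The protocol details it relies on (the `ListRecords` verb and the `resumptionToken` used for paging) come from the OAI-PMH specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def fetch_url(url):
    """Default fetcher (hypothetical helper): return the body of a URL."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, fetch=fetch_url, metadata_prefix="oai_dc"):
    """Yield (identifier, title) for every record the repository lists."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        for record in root.iter(OAI + "record"):
            identifier = record.findtext(OAI + "header/" + OAI + "identifier")
            # Deleted records carry no metadata, so title may be None.
            title = record.findtext(".//" + DC + "title")
            yield identifier, title
        token = (root.findtext(".//" + OAI + "resumptionToken") or "").strip()
        if not token:
            break  # no token (or an empty one) marks the last page
        # A resumption token is an exclusive argument: it replaces the others.
        params = {"verb": "ListRecords", "resumptionToken": token}
```

Passing the fetcher in as a parameter keeps the paging logic separate from the network, so the same loop can be exercised against canned responses before it is pointed at a live repository.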
Similarly, real-time searching works by means of several methods:
- The ANSI/NISO Z39.50 protocol
- The SRU family of web-service searching protocols
- The Solr protocol
- Any XML-based searching API
- Screen-scraping of any web-based UI
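As an illustration of the web-service end of this list, the following sketch builds an SRU `searchRetrieve` request and reads the hit count from the response. The helper names `sru_search_url` and `sru_hit_count` are hypothetical, and the sketch assumes an SRU 1.2 server whose responses use the `http://www.loc.gov/zing/srw/` namespace; other SRU versions differ in detail.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by SRU 1.x responses (an assumption about the target server).
SRW = "{http://www.loc.gov/zing/srw/}"


def sru_search_url(base_url, cql_query, start=1, count=10):
    """Build the URL for an SRU searchRetrieve request with a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "startRecord": start,       # SRU record positions are 1-based
        "maximumRecords": count,
    }
    return base_url + "?" + urlencode(params)


def sru_hit_count(response_xml):
    """Return the total number of matching records reported by the server."""
    root = ET.fromstring(response_xml)
    return int(root.findtext(SRW + "numberOfRecords"))
```

For example, `sru_search_url("http://example.org/sru", "sauropod", count=5)` produces a URL that asks the (placeholder) server for the first five records matching the term.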
All databases are treated essentially equally within the One Repo, and a uniform web service API is provided by which any of them can be searched.
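One way to picture this uniformity is as a set of interchangeable backends behind a single search function. The sketch below is purely illustrative and makes no claim about how The One Repo is implemented: `Record`, `harvested_backend` and `search_all` are hypothetical names, and a plain in-memory list stands in for a harvested index.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Record:
    source: str  # name of the database the record came from
    title: str


# A backend is any callable mapping a query string to a list of titles.
# It may wrap a local harvested index or a live remote search.
Backend = Callable[[str], List[str]]


def harvested_backend(local_titles: List[str]) -> Backend:
    """Search a locally harvested copy (here, just an in-memory list)."""
    def search(query: str) -> List[str]:
        return [t for t in local_titles if query.lower() in t.lower()]
    return search


def search_all(backends: Dict[str, Backend], query: str) -> List[Record]:
    """Run one query against every backend and merge the results."""
    results: List[Record] = []
    for name, backend in backends.items():
        results.extend(Record(name, title) for title in backend(query))
    return results
```

The caller of `search_all` never needs to know whether a given database was harvested in advance or is being searched live; a remote backend would simply be another callable with the same signature.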