[This post is the second in a series that serialises the One Repo whitepaper in digestible chunks. Do please weigh in with comments! See also Part 1: the problem]
We offer The One Repo (http://onerepo.net) as a solution to these challenges. This is a system, already existing in proof-of-concept form, to gather all the content of all the world’s repositories into a single database, in a uniform format, freely accessible to all as a Web UI, as embeddable widgets, as a set of web services, and as harvestable data.
The One Repo is not a research project, but is built on battle-tested components that are in use in high-volume commercial systems. It has been proven robust, efficient and scalable.
The policy of The One Repo is to accept all objects deposited in the included repositories, including:
- Actual manuscripts, with full text available.
- Metadata records describing manuscripts that are not available. These are important for at least three reasons. First, in some cases, they describe manuscripts that will become freely available after the expiry of an embargo period; second, such metadata records provide a means of discovering the author and requesting a copy directly – a process that may be facilitated by an “ask author for a copy” button; and third, records of manuscripts that should be available (but are not) are important data for tracking compliance with open-access policies.
- Associated data-sets, such as specimen photos, matrices for phylogenetic analysis, databases of observations and survey results.
Data objects deposited with third-party services such as GenBank, FigShare or Morphbank are out of scope.
Methods of harvesting and searching
The One Repo works by seamlessly combining live searches of remote systems with searches of locally harvested data. Harvested data is quicker to access and supports more efficient and accurate faceting and sorting, but harvesting is more expensive to set up, and an initial harvest can take some time to complete, so direct searching provides a useful alternative in difficult cases. Different approaches are appropriate for different databases.
Harvesting works by any of these methods:
- Metadata transfer using the OAI-PMH protocol
- Bulk download of records in any XML format
- Bulk download of records in any MARC-based format
- Any XML-based harvesting API
- Any web-based UI can be crawled when no better solution is available
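To make the first of these methods concrete, here is a minimal sketch of one page of an OAI-PMH harvest in Python. It is not The One Repo's own harvester: the function names and the choice of fields (identifier and Dublin Core title) are our assumptions for illustration. The `verb`, `metadataPrefix` and `resumptionToken` arguments come from the OAI-PMH 2.0 specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build the ListRecords request URL for one page of a harvest."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, next_token) extracted from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI_NS + "record"):
        records.append({
            "identifier": rec.findtext(OAI_NS + "header/" + OAI_NS + "identifier"),
            "title": rec.findtext(".//" + DC_NS + "title"),
        })
    token = root.findtext(".//" + OAI_NS + "resumptionToken")
    # An absent or empty token means the harvest is complete.
    return records, (token.strip() if token and token.strip() else None)
```

A full harvest would fetch `list_records_url(base_url)`, parse the response, and keep requesting `list_records_url(base_url, token)` until no token comes back. Fetching, retries and error responses are left out of this sketch.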
Similarly, real-time searching works by means of several methods:
- The ANSI/NISO Z39.50 protocol
- The SRU family of web-service searching protocols
- The Solr protocol
- Any XML-based searching API
- Any web-based UI can be screen-scraped
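As an illustration of the SRU option, the sketch below builds a searchRetrieve request URL in Python. The function name, the example endpoint and the default page size are hypothetical. The parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`) are those of SRU 1.2, and the query is written in CQL.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, start_record=1, maximum_records=10):
    """Build an SRU 1.2 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,                 # CQL, e.g. title = "open access"
        "startRecord": start_record,        # 1-based offset into the result set
        "maximumRecords": maximum_records,  # page size requested from the server
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint, used only to show the shape of the request.
url = sru_search_url("https://repo.example.org/sru", 'title = "open access"')
```

The response is an XML document of records, which would then be normalised into the common format alongside harvested data.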
All databases are treated essentially equally within the One Repo, and a uniform web service API is provided by which any of them can be searched.
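One way to picture this uniformity is a thin layer in which every database, harvested or searched live, is wrapped as the same kind of search function. The Python sketch below is an assumption about how such a layer could look, not a description of The One Repo's actual architecture; `Record`, `make_harvested_backend`, `make_remote_backend` and `search_all` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Record:
    source: str  # which database the hit came from
    title: str


# Every backend has the same shape: a query string in, records out.
Backend = Callable[[str], List[Record]]


def make_harvested_backend(name: str, index: List[str]) -> Backend:
    """Search titles that were harvested ahead of time into a local list."""
    def search(query: str) -> List[Record]:
        return [Record(name, t) for t in index if query.lower() in t.lower()]
    return search


def make_remote_backend(name: str, fetch: Callable[[str], List[str]]) -> Backend:
    """Wrap a live search; `fetch` stands in for a Z39.50, SRU or Solr call."""
    def search(query: str) -> List[Record]:
        return [Record(name, t) for t in fetch(query)]
    return search


def search_all(backends: Dict[str, Backend], query: str) -> List[Record]:
    """Run one query against every backend and pool the results."""
    results: List[Record] = []
    for backend in backends.values():
        results.extend(backend(query))
    return results
```

The caller of `search_all` never needs to know whether a given database was harvested or is being searched in real time, which is the property the paragraph above describes.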
How do you propose to ensure perpetual operation? Endowment?
You may have a PR problem in the USA, where repo is short for repossession, usually of a car by the bank’s repo man.
At this point, David, we are considering all sorts of different options for funding.
Interesting point on the word “repo”.
I like the concept, but what would the core record structure of your database be: Dublin Core? When I scan the different repositories, one of the key things I notice is that they all have vastly different structures for a “record”.
Calvin, your question is one of the key ones! We’ve convened a small, informal working group to discuss just this. There is plenty of prior art: for example RCUK’s RIOXX Profile includes some of the necessary additional fields. But we may need to merge this and other existing profiles in order to have the expressiveness to capture all relevant fields. We’ll blog about this as we make progress.