The Harvard Library is committed to making as much of its metadata as possible available through open access in order to support learning and research, to disseminate knowledge and to foster innovation. Open access to metadata aligns with the university’s established commitment to open access for scholarly communication.
Harvard Library Policy On Open Metadata
The Harvard Library provides open access to library metadata, subject to legal and privacy factors. In particular, the Library makes available its own catalog metadata under appropriate broad use licenses. The Library Board is responsible for interpreting this policy, resolving disputes concerning its interpretation and application, and modifying it as necessary.
Does the policy apply only to catalog metadata?
No, it applies to all metadata that the library holds. For instance, the metadata from the DASH repository is also distributed under an open license.
What legal and privacy factors might restrict access to library metadata?
Some metadata may have been placed under contractual obligations preventing distribution prior to the establishment of this policy. In such cases, of course, the library cannot legally, and will not, distribute the metadata beyond what such agreements allow.
Metadata that involves the usage of library materials by individual patrons will not be distributed without sufficient anonymization or aggregation to provide reasonable protection against the reconstruction of individual patron usage.
What license is used for metadata?
Because each metadata set may have individual legal and privacy characteristics, appropriate licenses are designed on an individual dataset basis. However, the goal is to make these licenses as broad as possible.
This dataset contains over 12 million bibliographic records for materials held by the Harvard Library, including books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials.
The metadata has been created, acquired and modified over decades, and represents a range of cataloging rules and practices. The records have not been altered or quality-checked during the export process and are offered as is. For more information about the dataset, please see the Documentation file, below.
Harvard makes this set of bibliographic records available for public use under its Bibliographic Dataset Use Terms.
We suggest the following language to provide proper attribution when using this dataset:
This [title of report or article or dataset] contains information from the Bibliographic Dataset, which is provided by the Harvard Library under its Bibliographic Dataset Use Terms and includes data made available by, among others, OCLC Online Computer Library Center, Inc. and the Library of Congress.
DASH is Harvard's digital repository for scholarly articles, theses and dissertations, and other Harvard-affiliate generated literature. Harvard Library makes the bibliographic data openly available for all uses, with a standard set of APIs.
DASH supports two standard APIs for extracting article information: OpenSearch and OAI-PMH.
OpenSearch API. As of the November 9, 2009 (1.1.8) release, DASH includes an OpenSearch interface. OpenSearch is a RESTful Web service that performs a query and returns search results as RSS or Atom feeds. It provides complete, unmediated access to the full power of DSpace's Lucene search engine, whereas the UI inherently limits the queries you can construct.
OpenSearch URLs. The URL of an OpenSearch request starts with http://dash.harvard.edu/open-search/ (the trailing "/" is required). All parameters are specified as query arguments after that:
|Parameter||Description|
|query||Lucene query string (see below)|
|format||Output format, must be one of: atom, rss, or one of the supported specific format versions, e.g. atom_1.0, rss_1.0, rss_2.0. No default.|
|scope||Search is restricted to a collection or community with the indicated handle.|
|rpp||Number indicating the number of results per page (i.e. per request). Default is 10. Specifying 0 invokes the default, so to get all results use an improbably large rpp, e.g. 500.|
|start||Number of the page to start with (if paginating results)|
|sort_by||Index of sorting criteria (same as DSpace advanced search values). Must match a sort-option index in the DSpace configuration. Currently, they are: 0 - by relevance (default); 1 - by title; 2 - by date of issue; 3 - by date of accession (i.e. submit date).|
|order||Ordering of sorted entries, either ascending or descending. Only effective when sort_by is nonzero.|
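The parameters above can be combined into a request URL with a few lines of Python. This is a minimal sketch rather than an official client: the function name `build_opensearch_url` is illustrative, and the literal value `descending` for `order` is taken from the wording of the table, not from a verified server response.

```python
from urllib.parse import urlencode

# Base URL from the text above; the trailing "/" is required.
BASE_URL = "http://dash.harvard.edu/open-search/"


def build_opensearch_url(query, fmt="atom", scope=None, rpp=None,
                         start=None, sort_by=None, order=None):
    """Build an OpenSearch request URL (hypothetical helper).

    `format` has no default on the server side, so it is always sent.
    Optional parameters are included only when a value is supplied.
    """
    params = {"query": query, "format": fmt}
    optional = {"scope": scope, "rpp": rpp, "start": start,
                "sort_by": sort_by, "order": order}
    params.update({k: v for k, v in optional.items() if v is not None})
    # urlencode percent-escapes the Lucene query for use in a URL.
    return BASE_URL + "?" + urlencode(params)


# Example: titles containing the phrase "open access", as RSS,
# 25 results per request, sorted by date of issue (sort_by=2).
url = build_opensearch_url('title:"open access"', fmt="rss",
                           rpp=25, sort_by=2, order="descending")
print(url)
```

Keeping the escaping inside `urlencode` means the Lucene query can be written in its natural form and never needs to be percent-encoded by hand.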
Lucene Query Syntax
The query string specifies what field (index) to match with a value. It also supports Boolean combinations, ranges, proximity, and other advanced features. The value of the query parameter must be in the Lucene query language (suitably escaped for inclusion in a URL, of course).
DASH is configured with the following indexes. Not all of them appear on the Advanced Search page.
|OpenSearch/Lucene name||Advanced Search||Description|
|default||Full Text||All metadata indexes and extracted text.|
|author||Author||Keyword in author name, wildcards and phrases acceptable.|
|author_authority||n/a||Authority key value of a Harvard-affiliated author.|
|title||Title||Keyword or phrase in title of article or journal.|
|subject||Keyword||Subject keywords (dc.subject.* fields)|
|abstract||Abstract||Keyword or phrase in the dc.description.abstract metadata field.|
|fasDepartment||FAS Department||Keyword or phrase matching FAS Department name|
|identifier||Identifier||Any of the identifiers such as DOIs and URLs associated with the work, including the published version, other sources, and the DASH NRS identifier.|
|issued||Issue Date||Date of issue (original publication), as full date.|
|issued.year||n/a||Year only of the date of issue.|
|accessioned||n/a||Date of accession, i.e. when the submission is entered into the archive.|
|accessioned.year||n/a||Year only of the date of accession.|
- When matching FAS departments, search for an unambiguous phrase, e.g. fasDepartment:"Molecular and Cellular Biology", not just fasDepartment:biology.
- Specify search terms in lower case to match case-insensitively.
- Full dates are specified as YYYYMMDD, e.g. 20091031. See the Lucene Query Language for further details about specifying ranges, etc.
- To match a Harvard author by authority key, go to the DASH UI and choose "Browse by Harvard Author" from the Options menu. Find your author, and copy the URL it links to. That URL contains query arguments, including a value for "authority"; that value is the authority key you need. For example:
This gets an Atom feed for all of the articles by Stuart Shieber, using his Harvard-affiliated authority identification. They are ordered by descending issue date so the newest papers appear first, although issue dates with only year granularity may not order reliably within a year.
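A request of that shape could be assembled as in the following sketch. The authority key shown is a made-up placeholder, not a real key; the helper names `lucene_term` and `issued_range` are illustrative; and the `order` value follows the wording of the parameter table. The range helper uses standard Lucene `[a TO b]` syntax with the YYYYMMDD dates described in the notes above.

```python
from urllib.parse import urlencode

BASE_URL = "http://dash.harvard.edu/open-search/"

# Placeholder only: copy the real value from the "authority" query
# argument of the author's "Browse by Harvard Author" link.
AUTHORITY_KEY = "EXAMPLE_AUTHORITY_KEY"


def lucene_term(index, value):
    """Return index:value, quoting the value as a phrase if it has spaces."""
    if any(ch.isspace() for ch in value):
        value = '"' + value.replace('"', '\\"') + '"'
    return f"{index}:{value}"


def issued_range(first_day, last_day):
    """Inclusive Lucene range on the 'issued' index (YYYYMMDD strings)."""
    return f"issued:[{first_day} TO {last_day}]"


# All works by one author, newest issue date first, as an Atom feed.
query = lucene_term("author_authority", AUTHORITY_KEY)
author_url = BASE_URL + "?" + urlencode({
    "query": query,
    "format": "atom",
    "sort_by": 2,           # 2 = date of issue
    "order": "descending",
    "rpp": 500,             # large value to fetch everything in one request
})

# The same author, restricted to a date range with a Boolean AND.
ranged_query = query + " AND " + issued_range("20090101", "20091031")

print(author_url)
print(ranged_query)
```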
This shows all articles from the department of Molecular and Cellular Biology:
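A request of this kind, together with one way of reading the Atom feed it returns, might look like the sketch below. The function names are illustrative, and `SAMPLE_FEED` is an invented stand-in for a server response so that the parsing step can run without network access. A real feed will carry more elements than the two read here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://dash.harvard.edu/open-search/"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}  # standard Atom namespace


def department_url(department, rpp=500):
    """URL for all items whose FAS department matches the exact phrase."""
    query = f'fasDepartment:"{department}"'
    return BASE_URL + "?" + urlencode(
        {"query": query, "format": "atom", "rpp": rpp})


def parse_atom_entries(xml_text):
    """Return a list of {'title', 'link'} dicts, one per Atom entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        link = entry.find("atom:link", ATOM)
        entries.append({
            "title": title.strip(),
            "link": link.get("href") if link is not None else None,
        })
    return entries


def fetch_entries(url):
    """Download a feed and parse it (performs a network request)."""
    with urlopen(url) as response:
        return parse_atom_entries(response.read())


# Invented sample data, used here instead of a live request.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample results</title>
  <entry>
    <title>An Illustrative Article</title>
    <link href="http://example.org/item/1"/>
  </entry>
</feed>"""

mcb_url = department_url("Molecular and Cellular Biology")
entries = parse_atom_entries(SAMPLE_FEED)
print(mcb_url)
print(entries)
```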
OAI-PMH API. We provide OAI-PMH access in conformity with its open standard.
As a DASH-specific example, here is an OAI-PMH URL configured to show all articles in the Graduate School of Education collection ("set=hdl_1_3345928") for a specified date range:
The resulting XML file contains metadata about each article and a URL for the DASH abstract page.
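The sketch below shows how such a request could be built and its response read. The verb and argument names (`ListRecords`, `metadataPrefix`, `set`, `from`, `until`) come from the OAI-PMH 2.0 standard, and the set name comes from the text above. The endpoint in `OAI_BASE_URL` is an assumption made for illustration and should be replaced with the address DASH actually publishes. The dates are arbitrary, and `SAMPLE_RESPONSE` is invented data standing in for a server reply.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# ASSUMPTION: placeholder endpoint, not confirmed by the text above.
OAI_BASE_URL = "http://dash.harvard.edu/oai/request"

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(set_spec, date_from, date_until, metadata_prefix="oai_dc"):
    """Build a ListRecords request limited to one set and a date range."""
    return OAI_BASE_URL + "?" + urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
        "from": date_from,     # YYYY-MM-DD per the OAI-PMH standard
        "until": date_until,
    })


def parse_records(xml_text):
    """Extract the Dublin Core title and identifiers from each record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            "identifiers": [e.text for e in record.iterfind(".//dc:identifier", NS)],
        })
    return records


# Invented sample data in the standard oai_dc layout.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Illustrative Thesis</dc:title>
          <dc:identifier>http://example.org/abstract/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

oai_url = list_records_url("hdl_1_3345928", "2011-01-01", "2011-12-31")
records = parse_records(SAMPLE_RESPONSE)
print(oai_url)
print(records)
```

Large result sets in OAI-PMH are paged with a `resumptionToken` element; a fuller harvester would repeat the request with that token until none is returned.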
Frequently Asked Questions
Q1: Why is Harvard University making its library catalog metadata publicly available?
A: Open access to data and metadata is a cornerstone value of the Harvard Library. From the Open Collections Program to harvestable metadata from DASH (Harvard’s open access scholarly repository) and a range of digital collections, Harvard libraries have long been working to open collections and metadata for public use and reuse. With growing interest in and benefits from integrating library information into the web, the time seems right to support innovation in this space with as much metadata as we can.
Q2: Most libraries make their catalogs available online. How is this different?
A: Library catalogs make use of metadata to allow for online searching of information about library collections, but catalogs generally do not make the metadata itself available for harvesting so that it can be reused in innovative ways. Library metadata is the foundational information such as author, title, publisher and subject about the books, journals, and many other forms of knowledge traditionally collected by libraries. These are important cultural objects, but without libraries making the information about them available in a reliable, reusable form, it has been hard for developers to create applications that make full use of them. Harvard hopes not only that the release of its catalog metadata will enrich the Web ecosystem, but also that more institutions will be encouraged to release their metadata.
Q3: How many records will be available?
A: The initial dataset includes over 12 million records for items such as books, journals, manuscripts, electronic resources, archival collections, audio, video, scores, and other formats from Harvard's dozens of libraries.
Q4: How is Harvard making this metadata available?
A: You can download the MARC21 records from the Bibliographic Dataset section of this page. MARC21 is the standard format for encoding bibliographic information. In addition, developers can get programmatic access to that information through an API offered by the Digital Public Library of America (DPLA) beta platform. An API (application programming interface) enables a computer program to request information from a site. So, if you want to write a program that, for example, retrieves information about books classified both in science and in cooking, your program could make that request via the API.
Q5: What is Harvard's relationship with the DPLA?
A: The DPLA is an independent national organization, with a diverse steering committee and contributors from all across the nation. Harvard’s metadata is being used in the DPLA beta platform.
Q6: Is Harvard placing any restrictions on use of the data?
A: No, Harvard is not imposing any limitations on use of the data. However, we are requesting that users comply with a simple set of Harvard Library community norms. These norms request attribution and that if others improve this data, they make those improvements equally freely available. In addition, for data that originated in WorldCat, at OCLC’s request, we are asking users to observe the WorldCat community norms. We believe that observing these community norms will help promote good practices, foster trust among partners, and encourage growth of the open metadata community.
Q7: What is Creative Commons Zero (CC0), and why is Harvard releasing the metadata under CC0?
A: CC0 is a public domain designation developed by the Creative Commons for use when a person wants to relinquish all copyright and related rights the person has in a work. More information about CC0 1.0 is available at http://creativecommons.org/publicdomain/zero/1.0/. With the CC0 public domain designation, Harvard waives any copyright and related rights it holds in the metadata. We believe that this will help foster wide use and yield developments that will benefit the library community and the public.
Q8: How big is the downloadable file?
A: The initial set of MARC21 records consists of a single file of approximately 3.1 gigabytes.
Q9: Is it possible to get selected subsets of records from the database?
A: Only the full set is available as a downloadable file. Applications that incorporate the metadata, including the DPLA beta platform, can provide additional functionality.
Q10: What are potential uses for the downloadable file of MARC21 records and API access?
A: You'd want the downloadable file of MARC21 records if you want to do some intense processing of the data, or if you want to integrate those records into other data you already have. You might use the API if your site or application needs to pull up information about items as part of a service it’s providing to users. For example, if you have a site that lets people review textbooks, you might use the API to fetch the page count of a book as a user begins a review.
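For bulk processing of the downloaded file, records can be read one at a time so that the whole file never has to fit in memory. The sketch below uses only the Python standard library and relies on the MARC21 (ISO 2709) record layout: a 24-byte leader whose first five bytes give the record length, a directory of 12-byte entries, and fields ending in the 0x1E terminator. The function names are illustrative, and decoding as UTF-8 is an assumption, since the text above does not state the character encoding of the export.

```python
SUBFIELD = b"\x1f"  # subfield delimiter; each field ends with 0x1E


def iter_raw_records(stream):
    """Yield the raw bytes of each record from a binary stream.

    The first 5 bytes of every record hold its total length in ASCII
    digits. Reading stops at end of file or at a truncated record.
    """
    while True:
        head = stream.read(5)
        if len(head) < 5:
            return
        length = int(head)
        rest = stream.read(length - 5)
        if len(rest) < length - 5:
            return
        yield head + rest


def parse_record(raw):
    """Return a list of (tag, value) pairs for one record.

    Control fields (tags below 010) have a string value. Data fields
    have a list of (subfield_code, text) tuples; indicators are dropped.
    """
    base = int(raw[12:17])            # leader 12-16: base address of data
    directory = raw[24:base - 1]      # directory ends with a terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        # Slice the field data, leaving off its trailing terminator.
        data = raw[base + start:base + start + length - 1]
        if tag < "010":
            fields.append((tag, data.decode("utf-8", "replace")))
        else:
            subfields = [
                (chunk[:1].decode("ascii", "replace"),
                 chunk[1:].decode("utf-8", "replace"))
                for chunk in data.split(SUBFIELD)[1:]
            ]
            fields.append((tag, subfields))
    return fields


def first_subfield(fields, tag, code):
    """Return the first matching subfield text, or None if absent."""
    for field_tag, value in fields:
        if field_tag == tag and isinstance(value, list):
            for sub_code, text in value:
                if sub_code == code:
                    return text
    return None
```

A typical use would open the downloaded file in binary mode, loop over `iter_raw_records(fh)`, and call `first_subfield(parse_record(raw), "245", "a")` to read each title. For production work, an established MARC library is likely a better choice than this hand-rolled reader.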
Q11: Will the data be updated?
A: Yes, Harvard Library plans to update the records in this dataset on a weekly basis.
Q12: Have other libraries released records?
A: Some have. For example, the British Library has released 3 million records, Cologne libraries have released 5.4 million and the University of Cambridge has released 3.6 million. OCLC has released 8 million bibliographic records as part of the OhioLINK–OCLC Collection and Circulation Analysis Project.
Rev. April 23, 2012
Bibliographic Dataset Use Terms
Pursuant to its Open Metadata Policy, the Harvard Library makes this set of bibliographic records and the metadata contained therein (together, the “Metadata”) available for public use under the CC0 1.0 Public Domain Designation:
Although Harvard does not impose any legally binding conditions on access to the Metadata, Harvard requests that you act in accordance with the following Community Norms of the Harvard Library with respect to the Metadata:
First, Harvard requests that the Harvard Library and OCLC Online Computer Library Center, Inc. (“OCLC”) and the Library of Congress be given attribution as a source of the Metadata, to the extent it is technologically feasible to do so.
Second, Harvard requests that you make the Metadata and any improvements thereto freely available on the same terms as Harvard has done, i.e., without claiming any legal right in, or imposing any legally binding conditions on access to, the Metadata or your improvements, and with a request to act in accordance with these Community Norms.
Third, with respect to Metadata consisting of or contained in records Harvard has obtained from the OCLC WorldCat database, Harvard requests that you respect and act in accordance with the community norms set forth in the WorldCat Rights and Responsibilities for the OCLC Cooperative. Use of metadata from the WorldCat database for study and research is consistent with those norms, but if you plan to use such Metadata for other purposes, whether or not you are an OCLC member, we ask that you review and comply with those norms.