Overview
Are you a developer looking to build better experiences for library users? Are you a data scientist studying library information architecture? Are you interested in text mining Harvard Library's records to look for trends and insights related to your field of study?
Harvard Library is among the world's largest academic libraries. The data behind our collections has the power to tell compelling stories and open our eyes to new ways of doing things — making the knowledge we preserve for the world accessible in new and exciting ways.
That's why we provide open access to our metadata through our APIs.
Available APIs
LibraryCloud
Harvard LibraryCloud is a metadata hub that provides granular, open access to a large aggregation of Harvard library bibliographic metadata.
The public LibraryCloud Item API supports searching LibraryCloud and obtaining results in a normalized MODS or Dublin Core format.
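As a rough illustration of what a search request might look like, the sketch below only builds a request URL and does not contact the service. The base URL, the `q` and `limit` parameter names, and the `.dc` and `.json` format suffixes are assumptions made for this example; check them against the current LibraryCloud documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL for the Item API (not stated in this page).
BASE_URL = "https://api.lib.harvard.edu/v2/items"

# Assumed format suffixes: MODS by default, Dublin Core or JSON by extension.
FORMAT_SUFFIX = {"mods": "", "dc": ".dc", "json": ".json"}


def build_item_query(keyword, fmt="mods", limit=10):
    """Return a search URL for the given keyword and output format."""
    if fmt not in FORMAT_SUFFIX:
        raise ValueError(f"unsupported format: {fmt!r}")
    # urlencode escapes spaces and other reserved characters in the keyword.
    params = urlencode({"q": keyword, "limit": limit})
    return f"{BASE_URL}{FORMAT_SUFFIX[fmt]}?{params}"
```

The resulting URL can be passed to any HTTP client. Keeping URL construction separate from the network call makes the query logic easy to test offline.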
LibraryCloud contains records from Harvard's Alma instance (over 12.7M bib records), SharedShelf (4M image records), and ArchivesSpace finding aids (2M finding aid components). Alma metadata has also been enriched with the Stackscore usage metric, holdings, and LC classification subject headings.
LibraryCloud also contains an alpha release of a Collections API, which is planned for use as a digital collection definition and export service. The Collections API allows a group of LibraryCloud records to be labeled as part of a named collection. The collection can then be harvested through OAI-PMH to import its metadata into online digital exhibit platforms, such as Spotlight or DPLA. The full build-out of the Collections API and a collection builder web application is still a work in progress.
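For readers unfamiliar with OAI-PMH harvesting, the sketch below shows the general shape of a `ListRecords` harvest: build a request URL, parse one page of the response, and continue with the returned resumption token. The verb, argument names, and XML namespace come from the OAI-PMH 2.0 protocol itself. The endpoint URL and set name are not given on this page, so the function takes them as parameters and any values passed in are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, set_spec=None, metadata_prefix="oai_dc",
                     resumption_token=None):
    """Build a ListRecords request URL for a caller-supplied endpoint."""
    if resumption_token:
        # Per the protocol, resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty or missing token element means the list is complete.
    has_token = token_el is not None and token_el.text
    token = token_el.text.strip() if has_token else None
    return identifiers, token
```

A harvester would call `list_records_url` once with a set name, then repeat with each token returned by `parse_page` until the token is `None`.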
Presto Data Lookup
The Presto Data Lookup service is a RESTful web API that offers programmatic access to data in the library's central online systems.
The Data Lookup API uses a simple URL request syntax and returns results in XML or JSON format.
Note that some of the resources available in this service can only be accessed from a pre-registered IP address. To request access, write to the Presto support team and include the IP address that needs access and your planned usage.
Available Datasets
Public Domain Corpus
Harvard Library offers the Harvard community free access to the Harvard Library Public Domain Corpus, a collection of approximately one million digitized public domain books. This resource, created through a previous partnership with Google Books, is intended to foster creative reuse, including for training large language models (LLMs).
The corpus is made available to the Harvard community by request and is limited to non-profit, educational, and research uses within the Harvard community.
Bibliographic Metadata
The Harvard Library Bibliographic Metadata collection is an open access data set that provides a snapshot of HOLLIS bibliographic records and holdings records. These are available as a bulk download via Harvard Dataverse.
Generated from metadata in Harvard's instance of Alma, the library’s information management system, this collection contains all active (i.e., not suppressed or deleted) bibliographic records that have one or more active holdings. Because of size limitations, the over 12.7 million bibliographic records are split across multiple files. Each file contains approximately 200,000 bibliographic records, as well as their associated holdings records, in MARC XML format.
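Because each file holds a large number of records, it is usually better to stream a file than to load it whole. The sketch below is one minimal way to do that with Python's standard library, assuming the files use the standard MARCXML namespace and wrap their records in a single root element. The function name and the choice of fields (001 for the control number, 245 $a for the title) are illustrative only.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML namespace published by the Library of Congress.
MARC_NS = "http://www.loc.gov/MARC21/slim"
RECORD_TAG = f"{{{MARC_NS}}}record"
CONTROL_PATH = f"{{{MARC_NS}}}controlfield[@tag='001']"
TITLE_PATH = (
    f"{{{MARC_NS}}}datafield[@tag='245']/{{{MARC_NS}}}subfield[@code='a']"
)


def iter_titles(source):
    """Yield (control number, title) for each record in a MARCXML file.

    `source` is a filename or a binary file object. Records with no
    245 $a subfield, such as holdings records, yield None as the title.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != RECORD_TAG:
            continue
        control = elem.find(CONTROL_PATH)
        title = elem.find(TITLE_PATH)
        yield (
            control.text if control is not None else None,
            title.text if title is not None else None,
        )
        # Drop the record's children so memory use stays roughly flat.
        elem.clear()
```

At roughly 200,000 records per file, 12.7 million records works out to about 64 files (12,700,000 / 200,000 = 63.5), so a full pass over the data set means running a loop like this once per downloaded file.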
Additional information about the contents of the data set is available in an informational datasheet posted along with the data in Dataverse.
Bibliographic Dataset Use Terms
Pursuant to its Open Metadata Policy, the Harvard Library makes this set of bibliographic records and the metadata contained therein (together, the “Metadata”) available for public use under the CC0 1.0 Public Domain Designation:
Harvard Library Bibliographic Metadata by President and Fellows of Harvard College is marked with CC0 1.0 Universal
Although Harvard does not impose any legally binding conditions on access to the Metadata, Harvard requests that you act in accordance with the following Community Norms of the Harvard Library with respect to the Metadata:
First, Harvard requests that the Harvard Library and OCLC Online Computer Library Center, Inc. (“OCLC”) and the Library of Congress be given attribution as a source of the Metadata, to the extent it is technologically feasible to do so.
Second, Harvard requests that you make the Metadata and any improvements thereto freely available on the same terms as Harvard has done, i.e., without claiming any legal right in, or imposing any legally binding conditions on access to, the Metadata or your improvements, and with a request to act in accordance with these Community Norms.
Third, with respect to Metadata consisting of or contained in records Harvard has obtained from the OCLC WorldCat database, Harvard requests that you respect and act in accordance with the community norms set forth in the WorldCat Rights and Responsibilities for the OCLC Cooperative. Use of metadata from the WorldCat database for study and research is consistent with those norms, but if you plan to use such Metadata for other purposes, whether or not you are an OCLC member, we ask that you review and comply with those norms.
Caselaw Access Project
The Caselaw Access Project (“CAP”) expands public access to U.S. law. It includes roughly 40 million pages of U.S. court decisions from the collection of the Harvard Law Library, making 6.4 million cases freely available to the public online in a consistent, machine-readable format. That data is fully available for open use, without limitations.
The CAP dataset includes all state courts, federal courts, and territorial courts for American Samoa, Dakota Territory, Guam, Native American Courts, Navajo Nation, and the Northern Mariana Islands. The earliest case is from 1658, and the most recent cases are from 2020.
There are a few ways to access the data. The Library Innovation Lab hosts the bulk data for download at https://case.law/caselaw/, and our partners at the Free Law Project provide tools and services for CAP data, including search functionality and an API via CourtListener, a free legal research portal and archive.
Harvard Library Policy On Open Metadata
The Harvard Library provides open access to library metadata, subject to legal and privacy factors. In particular, the Library makes available its own catalog metadata under appropriate broad use licenses. The Library Board is responsible for interpreting this policy, resolving disputes concerning its interpretation and application, and modifying it as necessary.
This policy applies to all metadata that the library holds. For instance, the metadata from the DASH repository is also distributed under an open license.
Some metadata may have been placed under contractual obligations preventing distribution prior to the establishment of this policy. In such cases, of course, the library cannot legally, and will not, distribute the metadata beyond what such agreements allow.
Metadata that involves the usage of library materials by individual patrons will not be distributed without sufficient anonymization or aggregation to provide reasonable protection against the reconstruction of individual patron usage.
Because each metadata set may have individual legal and privacy characteristics, appropriate licenses are designed on an individual dataset basis. However, the goal is to make these licenses as broad as possible.