Harvard Library offers the Harvard community free access to the Harvard Library Public Domain Corpus, a collection of approximately one million digitized public domain books.
This resource, created through a previous partnership with Google Books, is designed to support a wide range of research, teaching, and creative endeavors, including innovative applications such as training large language models (LLMs).
The corpus includes:
- 350 million digitized page images
- 220 billion tokens of machine-readable text
- Materials spanning over 230 languages, with most works in English, German, and French
- A diverse range of topics and genres, primarily from the 1800s and early 1900s
In addition to the texts, associated metadata is available in an easily usable format to encourage further exploration and reuse.
Use Policies
Policy for the Harvard Library Public Domain Corpus
The Harvard Library Public Domain Corpus (“HLPDC”) is made available to the Harvard community by request and is limited to non-profit, educational, and research uses within the Harvard community.
Requestors must agree to a binding Terms of Use agreement that reflects these Policies; the agreement is presented for review when access is requested.
Harvard faculty, students, and staff who use the HLPDC are responsible for complying with the copyright laws in their respective jurisdictions.
As a matter of good scholarly practice, Harvard Library requests that faculty, students, and staff using the Harvard Library Public Domain Corpus provide the following citation to the source of reproductions:
“Harvard Library Public Domain Corpus” by Harvard Library
Policy for the Metadata of the Harvard Library Public Domain Corpus
Pursuant to its Open Metadata Policy (detailed here, at the bottom of the page), the Harvard Library also makes the Harvard Library Public Domain Corpus’s bibliographic records and associated collection metadata available for public use under the CC0 1.0 Public Domain Designation.
Although Harvard does not impose any legally binding conditions on access to the Metadata, Harvard requests that you act in accordance with the Community Norms of the Harvard Library with respect to the Metadata, and provide the following appropriate citation:
“Harvard Library Public Domain Corpus Metadata” by Harvard Library is marked with CC0 1.0
These policies are subject to the clarifications below:
Public domain
Some works that are in the public domain in the U.S. may remain in copyright in other countries. The nature of collections, however, is such that copyright or other information about restrictions may be difficult or even impossible to determine. The absence of explicit information on copyright is no guarantee, therefore, that a work is in the public domain either in the U.S. or abroad. Nor can Harvard Library guarantee the accuracy of any information about copyright status of the Harvard Library Public Domain Corpus that it does provide. The Library makes no express or implied warranty to others who wish to use digital reproductions of items found in its collections. Users are solely responsible for making independent legal assessments of an item's status in the arena in which it is to be used.
Non-copyright restrictions
Some uses of materials within the Harvard Library Public Domain Corpus may be restricted by trademark, privacy, publicity rights, donor requirements, or other such rights or restrictions. It is the user's sole responsibility to consider the possibility that such rights or restrictions may be involved and to secure any needed permissions.
Citation and credit
Harvard Library requests, as a matter of good scholarly practice, that any work using the corpus or its metadata cite the source appropriately:
“Harvard Library Public Domain Corpus” by Harvard Library
“Harvard Library Public Domain Corpus Metadata” by Harvard Library is marked with CC0 1.0
Harvard Marks/Name
Users of the Harvard Library Public Domain Corpus should not suggest or imply that Harvard Library endorses, approves of, or participated in their projects. Harvard's name should not be used in the title or name of the product, should not be added by the user as a metadata search term, website name, or web address, and should not be displayed in a large or prominent manner. The use of the Harvard name or trademarks for any purpose other than standard source citation requires the prior approval of the Harvard University Trademark Program.
Disclaimer
The use of the Harvard Library Public Domain Corpus made available by the Harvard Library shall be at the user's sole risk. To the fullest extent permitted by law, Harvard disclaims all warranties of any kind (express, implied or otherwise), including but not limited to any implied warranties of merchantability, fitness for a particular use, non-infringement, and/or as to the accuracy or completeness of content or related information, in connection with such digital reproductions and your use thereof. Harvard shall not be responsible or liable for any damage that may occur due to your use of any material that Harvard Library makes openly available. As used in this paragraph, "Harvard" includes Harvard Library, Harvard University and their respective governing board members, officers, employees, agents and affiliates.