Harvard Library Shares the Public Domain: Unlocking Centuries of Knowledge for AI and Research

This year, Harvard Library made its Public Domain Corpus available to the public for research, teaching, learning, and creative activities. Kyle K. Courtney, Director of Copyright and Information Policy at Harvard Library, explains what the corpus is and why it’s an important public access resource. 


Harvard Library: To start, what exactly is the Harvard Library Public Domain (HLPD) Corpus?

Kyle K. Courtney: The HLPD Corpus is a dataset of nearly one million digitized public domain books from Harvard’s collection, spanning more than six centuries, multiple languages, and countless genres. It’s not just a collection of texts; it’s a vast record of human knowledge and cultural memory, transformed into structured, research-ready data. By making this data accessible to the public, we’re inviting researchers, educators, and innovators to explore these works in ways that were never possible when they were bound to physical volumes. It’s unlocking the public domain for transformative scholarship now and into the future.

HL: Where did the materials in Harvard Library’s Public Domain Corpus come from? And what does it mean that they’re “public domain”?

KC: The corpus comes from nearly a million books in Harvard Library’s collection that were scanned during a partnership with Google in the early 2000s as part of the Google Books project. We identified the set of works that were firmly in the public domain, that is, no longer under copyright, and prepared them as a large-scale, machine-readable dataset. “Public domain” means works are free of copyright restrictions and may be freely used by all. By releasing them openly, we’re taking that legal principle and turning it into a lived reality: the public domain as data, ready for anyone to use in research, teaching, or innovation.

Kyle K. Courtney at the AI @ FAS Symposium, May 2024. Photo by Stephanie Mitchell.

HL: How does making this corpus available support research and innovation?

KC: When you digitize books at scale, their value multiplies. Once they’re no longer bound by covers, their knowledge becomes available for all kinds of uses including computational analysis, training AI, mining trends, even building open-source public tools. But the real impact comes when everyone has access. You only get the full value of knowledge when anyone can engage with it. That’s why Harvard Library is putting this deep, structured, open dataset into the public’s hands: to spark new questions, empower research, inspire unexpected innovation, and support the responsible development of future tools and technologies.

HL: What does this project change in the landscape of AI and scholarship?

KC: This project marks a big improvement in the AI training landscape. Instead of relying on proprietary, opaque, or ethically questionable sources, researchers and AI developers can use documented, open-source, and legally sound data. By shifting control of training data from tech companies to public institutions, Harvard Library is empowering smaller players and AI researchers. We are giving public institutions and researchers a fair chance to compete in building the next generation of AI. This is a step forward for libraries as leaders of open and ethical infrastructure.

HL: Why does open access to the Public Domain Corpus matter so much right now?

KC: We don’t want knowledge to remain trapped. Until now, these kinds of large-scale datasets have been locked behind paywalls and licensing restrictions that limit reuse. With the HLPD Corpus, we are expanding the role of library-led open infrastructure while promoting ethical and equitable AI development. We aren’t just offering a technical resource; we’re serving the public good. By creating a foundation of clean, legally reusable data, we are declaring that the public domain, in practice as well as principle, belongs to everyone. This is about democratizing knowledge, preserving access for everyone, and opening new paths for exploration and innovation.

HL: What do you hope Harvard Library’s work in this space will accomplish?

KC: Our mission is to advance the pursuit of knowledge that is at the heart of Harvard. As a library, we steward and democratize access to knowledge. This corpus is just the newest form of knowledge. That is why it matters. We are putting these resources into the hands of anyone with curiosity and vision, regardless of their affiliation or ability to pay. Free access is what turns public domain from a legal category into a lived reality. We hope to see new ideas, new questions, and new technologies emerge from this foundation. The real value is not just in the data itself, but in the doors it opens for people to participate freely in the future of research, scholarship, and AI.

HL: Harvard Library is the first institution to make a public domain dataset of this scale openly available to the world. Why is it significant that a library has taken this step?

KC: It’s significant because libraries exist to serve the public, not to profit from information. For centuries, libraries have been the stewards of knowledge: collecting, preserving, and making it accessible to everyone. By being the first to release a dataset of this scale openly, Harvard Library is reaffirming that mission in the digital era.

When a library takes this step, it shifts the power dynamic. Instead of public knowledge being filtered through proprietary platforms or paywalls, the data is being offered freely, transparently, and ethically. That means the rules of engagement are different: anyone, anywhere, can use this resource to ask new questions, build new tools, and advance scholarship without fear of hidden costs or legal uncertainty.

This is about setting a precedent for what public access can look like when it’s guided by public values. Libraries are uniquely positioned to ensure that the infrastructure of knowledge remains open, equitable, and accountable to the communities it serves.