Past Projects & Initiatives

Archive Ingest and Handling Test (AIHT)

By both policy and design, the LDI Digital Repository Service (DRS) is intended for highly "curated" digital assets; that is, those that are owned and submitted by known users, created according to well-known workflows and meeting well-known technical specifications, and in a small set of approved formats. The Library of Congress organized its Archive Ingest and Handling Test (AIHT) to investigate issues surrounding the import, export, and manipulation of a sizeable test corpus - approximately 57,000 files (12 GB) in more than 90 formats - of unknown provenance by institutions with very different preservation strategies and technological infrastructures. Harvard participated in this test along with Johns Hopkins University, Old Dominion University, and Stanford University. The results of the test highlighted the need for community agreement on descriptive and packaging standards, file transfer and validation tools, and best practices for repository operation and preservation planning. 

Digital Repository Certification

The Research Libraries Group (RLG) and the National Archives and Records Administration (NARA) created a joint task force to develop criteria for identifying digital repositories that meet minimal standards for trustworthiness in the professional management and preservation of digital content. Harvard staff participated in the development of these criteria, which include both technical and organizational metrics, as well as in follow-on work by the Center for Research Libraries (CRL) to develop a specific audit methodology. Harvard staff use the audit guidelines developed by the RLG/NARA task force for self-assessment of the Digital Repository Service (DRS) and for planning its future functional and operational enhancements.

DuraCloud Pilot Project

From May to August 2011, Harvard conducted a pilot project using DuraCloud, the cloud storage solution offered by DuraSpace. DuraCloud was tested as a potential additional preservation storage location for DRS content.

E-journal Archiving

With funding from the Andrew W. Mellon Foundation, Harvard conducted an in-depth study of the licensing, economic, organizational, and technical issues involved in building a large-scale archive of electronic journals. Working with Blackwell Publishers, the University of Chicago Press, and John Wiley & Sons, Harvard analyzed e-journal content and technical formats, the contractual arrangements under which an archive could be assembled and accessed, and the questions of who would benefit from archiving and who should pay for it. Internally, a technical team studied what changes to the LDI infrastructure were required in order to archive content of this complexity and scale over extended time frames. The results of the Harvard study were widely reported and discussed at meetings of the Digital Library Federation (DLF), the American Library Association (ALA), the American Association for the Advancement of Science (AAAS), the Society for Scholarly Publishing (SSP), and the Coalition for Networked Information (CNI), among others. The April 2002 report is available from the DLF website.

As a follow-up to the 2002 Mellon Foundation-funded e-journal archiving project, LDI funded a collaborative project with the National Library of Medicine (NLM) to produce an open-source archival and interchange XML Document Type Definition (DTD). The DTD is designed to increase the ease of interchange between publishers and archives for article-level e-journal content. Without this DTD, the structure of e-journal content can vary widely, requiring costly human intervention and multiple parallel workflows within archival repositories. The DTD was designed after extensive document analysis in many subject domains to ensure that it does not reflect the bias of any particular academic discipline. Based on public standards, the DTD features a modular structure that allows customization and that should be an easy target of transformation from existing XML- or SGML-encoded content. In addition to being used by NLM for the PubMed Central archive, this DTD is well positioned to become a standard format for the transfer and archival storage of scholarly literature.

Election 2012 Web Archive

As a member of the International Internet Preservation Consortium (IIPC), Harvard Library, including staff at the Harvard Kennedy School Library and at the Library Technology Services, is collaborating with other academic libraries as well as non-profit and government organizations on a web archiving project. Through the Election 2012 Web Archive, we will collect and preserve web sites related to the 2012 election campaign in the United States. Our goal is to capture this important historical record to ensure long-term preservation and accessibility for teaching, research and the general public.

Subject experts at the Harvard Kennedy School and at other academic institutions in areas such as political science and public policy are identifying relevant web sites for long-term preservation.

For more information about the IIPC, see their web site. For more information about web archiving at Harvard, see About WAX. To search or browse Harvard’s web archive, see the WAX start page.

Global Digital Format Registry (GDFR)

Preservation activities depend upon extensive knowledge of the formats used to represent digital content. Since this same information is useful to all institutions interested in preserving their digital assets, great economies of scale can be achieved from a centralized repository for this format information. LDI staff have been instrumental in articulating this concept within the digital library and preservation communities. With funding provided by the Andrew W. Mellon Foundation, Harvard and OCLC are engaged in the development of a Global Digital Format Registry (GDFR), a peer-to-peer network of independent but cooperating format registries that will use a common protocol to synchronize their holdings of important format documentation and technical information. Beyond the technical work in creating the software underlying the GDFR network, Harvard is also cooperating with the National Archives and Records Administration (NARA) in an investigation of the business and governance issues that need to be addressed in order to ensure that the GDFR will remain viable over time as a core service to the preservation community.

JHOVE (JSTOR/Harvard Object Validation Environment)

An extensive technical description of the formal characteristics of a digital resource is a necessary precursor to preservation planning for, or intervention on, that resource. These characteristics are highly dependent upon the format used to represent the resource's abstract content. With funding from the Andrew W. Mellon Foundation, LDI staff collaborated with the JSTOR Electronic-Archiving Initiative (now known as Portico) to produce an extensible tool, called JHOVE (the JSTOR/Harvard Object Validation Environment, pronounced "jove"), for automating format-specific identification, validation, and characterization of digital resources. Harvard and JSTOR have made this tool available to the wider community under an open source license, and it is widely deployed internationally. JHOVE has facilities to extract important technical characteristics of resources created in many commonly used formats, such as AIFF and WAVE (audio); GIF, JPEG, JPEG 2000, and TIFF (still image); ASCII and UTF-8 (text); and PDF.
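To make the identification step concrete, the following is a minimal sketch of signature-based format identification for several of the formats named above. It is not JHOVE's code or API (JHOVE is a Java tool with per-format modules); the function name `identify_format` is hypothetical, and the sketch covers only identification by leading bytes, not validation or characterization.

```python
def identify_format(header: bytes) -> str:
    """Guess a format from the first bytes of a file (illustrative only)."""
    # Formats with a fixed signature at offset 0.
    if header.startswith((b"GIF87a", b"GIF89a")):
        return "GIF"
    if header.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    if header.startswith(b"\x00\x00\x00\x0cjP  \r\n\x87\n"):
        return "JPEG 2000"
    if header.startswith((b"II*\x00", b"MM\x00*")):
        return "TIFF"
    if header.startswith(b"%PDF-"):
        return "PDF"
    # Container formats: a 4-byte chunk ID, a 4-byte size, then a form type.
    if header[:4] == b"FORM" and header[8:12] == b"AIFF":
        return "AIFF"
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "WAVE"
    return "unknown"


if __name__ == "__main__":
    print(identify_format(b"%PDF-1.4\n"))
    print(identify_format(b"RIFF\x24\x00\x00\x00WAVEfmt "))
```

A matching signature shows only what a file claims to be. Validation, in the sense described above, would go on to check the rest of the file against the format's specification.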

LOCKSS (Lots of Copies Keep Stuff Safe)

The goal of LOCKSS is to preserve access to web-based content, primarily e-journals, by maintaining multiple copies at physically disparate locations and by conducting periodic comparisons among them to ensure that materials remain consistent, authentic, and accessible. LDI staff participated in the alpha and beta development phases of the LOCKSS system.
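The idea of periodic comparison among copies can be illustrated with a simplified sketch. This is not the LOCKSS polling protocol; it assumes that all copies can be read in one place, and the names `audit_copies`, `site_a`, `site_b`, and `site_c` are hypothetical. Each copy is hashed, and any copy whose digest differs from the most common one is flagged for repair.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of one copy's content."""
    return hashlib.sha256(data).hexdigest()


def audit_copies(copies: dict[str, bytes]) -> tuple[str, list[str]]:
    """Return the most common digest and the sites whose copy disagrees with it."""
    digests = {site: digest(content) for site, content in copies.items()}
    # The digest held by the most sites is treated as the reference copy.
    majority, _count = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(site for site, d in digests.items() if d != majority)
    return majority, damaged


if __name__ == "__main__":
    copies = {
        "site_a": b"article text",
        "site_b": b"article text",
        "site_c": b"article t3xt",  # simulated damage
    }
    majority, damaged = audit_copies(copies)
    print("copies needing repair:", damaged)
```

A simple majority is only meaningful with an odd number of copies or a clear plurality; a production system would also need to handle ties and untrusted peers.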


PDF/A

Adobe's Portable Document Format (PDF) has rapidly become a de facto standard for the dissemination and presentation of electronic documents on the web. Unfortunately, the feature-rich nature of PDF permits tremendous variability in the internal structure of these documents. Further, it allows documents to be dynamically composed at the time of their display from separate external resources, which leads to significant difficulties in ensuring their long-term viability. In order to address these concerns, the International Organization for Standardization (ISO) convened a Joint Working Group to produce a constrained version of PDF suitable for archival preservation, known as PDF/A. Stephen Abrams, then LDI's digital library program manager, was the project leader and document editor for the initial version of the PDF/A standard. The PDF/A standard defines the features that should be required, recommended, restricted, or prohibited in order to make electronic documents more amenable to long-term preservation.

PREMIS (Preservation Metadata Implementation Strategies)

Registry of Digital Masters

To avoid unnecessary and expensive duplication of digital reformatting efforts, LDI staff participated with the Digital Library Federation (DLF) in plans for a national digital registry of born-digital materials and digitally reformatted books and journals. By consulting the registry before digitization efforts are undertaken, a content owner can determine if an appropriate digital version already exists and is being preserved in a professional manner that obviates the need for local management.

Page Image Compression for Mass Digitization

In late 2006, Harvard University Library, the California Digital Library, the Internet Archive, and the Bibliothèque nationale de France conducted a collaborative investigation of the use of lossy JP2 compression for mass digitization of texts. The findings are documented in the IS&T Archiving 2007 Conference Proceedings. Please consult the published paper or this preprint.


Unified Digital Format Registry (UDFR)

The UDFR is an initiative begun in April 2009 to build a single shared format registry. UDFR builds on years of work performed by a number of institutions internationally, including Harvard, on PRONOM, the Global Digital Format Registry (GDFR), and other format registry projects. The UDFR was developed at the University of California Curation Center (UC3) with funding from the Library of Congress.

Zone 1 Rescue Repository

Zone 1 is a project begun in June 2011 and funded by the Harvard Library Lab. The Zone 1 prototype rescue repository will serve as a proof of concept for a future production repository that would close the current gap in secure storage solutions at Harvard. The repository will have a low deposit barrier to ensure that valuable content at Harvard won't be lost. It will also serve as a conduit for review by the Harvard community for potential re-use or long-term stewardship of the content and will facilitate transfer to other repositories. For more information see the Zone 1 web page.