Archive Ingest and Handling Test (AIHT)
By both policy and design, the LDI Digital Repository Service (DRS) is intended for highly "curated" digital assets; that is, those that are owned and submitted by known users, created according to well-known workflows and meeting well-known technical specifications, and in a small set of approved formats. The Library of Congress organized its Archive Ingest and Handling Test (AIHT) to investigate issues surrounding the import, export, and manipulation of a sizeable test corpus - approximately 57,000 files (12 GB) in more than 90 formats - of unknown provenance by institutions with very different preservation strategies and technological infrastructures. Harvard participated in this test along with Johns Hopkins University, Old Dominion University, and Stanford University. The results of the test highlighted the need for community agreement on descriptive and packaging standards, file transfer and validation tools, and best practices for repository operation and preservation planning.
- AIHT: Conceptual Issues from Practical Tests, D-Lib Magazine (December 2005)
- Harvard's Perspective on the Archive Ingest and Handling Test, D-Lib Magazine (December 2005)
Digital Repository Certification
The Research Libraries Group (RLG) and the National Archives and Records Administration (NARA) created a joint task force to develop criteria for identifying digital repositories meeting minimal standards for trustworthiness in the professional management and preservation of digital content. Harvard staff participated in the development of these criteria, which include both technical and organizational metrics, as well as in follow-on work by the Center for Research Libraries (CRL) to develop a specific audit methodology. Harvard staff use the audit guidelines developed by the RLG/NARA task force for self-assessment of the Digital Repository Service (DRS) and for planning its future functional and operational enhancements.
DuraCloud Pilot Project
From May to August 2011, Harvard conducted a pilot project using DuraCloud, the cloud storage solution offered by DuraSpace. DuraCloud was tested as a potential additional preservation storage location for DRS content.
E-Journal Archiving Project
With funding from the Andrew W. Mellon Foundation, Harvard conducted an in-depth study of the licensing, economic, organizational, and technical issues involved in building a large-scale archive of electronic journals. Working with Blackwell Publishers, the University of Chicago Press, and John Wiley & Sons, Harvard analyzed e-journal content and technical formats, the contractual arrangements under which an archive could be assembled and accessed, and the question of who would benefit from archiving and who should pay for it. Internally, a technical team studied what changes to the LDI infrastructure were required to archive content of this complexity and scale over extended time frames. The results of the Harvard study were widely reported and discussed at meetings of the Digital Library Federation (DLF), the American Library Association (ALA), the American Association for the Advancement of Science (AAAS), the Society for Scholarly Publishing (SSP), and the Coalition for Networked Information (CNI), among others. The April 2002 report is available from the DLF website.
E-Journal Archival and Interchange DTD
As a follow-up to the 2002 Mellon Foundation-funded e-journal archiving project, LDI funded a collaborative project with the National Library of Medicine (NLM) to produce an open-source archival and interchange XML Document Type Definition (DTD). The DTD is designed to increase the ease of interchange between publishers and archives for article-level e-journal content. Without this DTD, the structure of e-journal content can vary widely, requiring costly human intervention and multiple parallel workflows within archival repositories. The DTD was designed after extensive document analysis in many subject domains to ensure that it does not reflect the bias of any particular academic discipline. Based on public standards, the DTD features a modular structure that allows customizing and that should be an easy target of transformation from existing XML- or SGML-encoded content. In addition to being used by NLM for the PubMed Central archive, this DTD is well positioned to become a standard format for the transfer and archival storage of scholarly literature.
Global Digital Format Registry (GDFR)
Preservation activities depend upon extensive knowledge of the formats used to represent digital content. Since this same information is useful to all institutions interested in preserving their digital assets, great economies of scale can be achieved from a centralized repository for this format information. LDI staff have been instrumental in articulating this concept within the digital library and preservation communities. With funding provided by the Andrew W. Mellon Foundation, Harvard and OCLC are engaged in the development of a Global Digital Format Registry (GDFR), a peer-to-peer network of independent but cooperating format registries that will use a common protocol to synchronize their holdings of important format documentation and technical information. Beyond the technical work in creating the software underlying the GDFR network, Harvard is also cooperating with the National Archives and Records Administration (NARA) in an investigation of the business and governance issues that need to be addressed to ensure that the GDFR will remain viable over time as a core service to the preservation community.
JHOVE (JSTOR/Harvard Object Validation Environment)
An extensive technical description of the formal characteristics of a digital resource is a necessary precursor to preservation planning for, or intervention on, that resource. These characteristics are highly dependent upon the format used to represent the resource's abstract content. With funding from the Andrew W. Mellon Foundation, LDI staff collaborated with the JSTOR Electronic-Archiving Initiative (now known as Portico) to produce an extensible tool, called JHOVE (the JSTOR/Harvard Object Validation Environment, pronounced "jove"), for automating format-specific identification, validation, and characterization of digital resources. Harvard and JSTOR have made this tool available to the wider community under an open source license, and it is widely deployed internationally. JHOVE has facilities to extract important technical characteristics of resources created in many commonly used formats, such as AIFF and WAVE (audio); GIF, JPEG, JPEG 2000, and TIFF (still image); ASCII and UTF-8 (text); and PDF.
- JHOVE is now being maintained external to Harvard as an open source project.
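To make the "identification" step concrete, here is a minimal Python sketch that guesses a format from the leading signature bytes of a file. It is not JHOVE's code or API: the function name `identify_format` is hypothetical, only a handful of the formats named above are covered, and a signature match says nothing about whether the file is valid. Validation and characterization require parsing the whole file against the format specification.

```python
def identify_format(data: bytes) -> str:
    """Guess a format name from leading signature ("magic") bytes.

    Hypothetical helper for illustration only; this is identification,
    not validation, and it is not how JHOVE itself is implemented.
    """
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "GIF"
    if data.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    # TIFF starts with a byte-order mark followed by the number 42.
    if data.startswith((b"II*\x00", b"MM\x00*")):
        return "TIFF"
    if data.startswith(b"%PDF-"):
        return "PDF"
    # WAVE and AIFF are chunk-based: a 4-byte container tag, a 4-byte
    # length, then a 4-byte form type.
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "WAVE"
    if data[:4] == b"FORM" and data[8:12] == b"AIFF":
        return "AIFF"
    return "unknown"
```

In practice the function would be given the first few bytes read from a file opened in binary mode, for example `identify_format(open(path, "rb").read(16))`, where `path` is whatever file is being examined.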
LOCKSS (Lots of Copies Keep Stuff Safe)
The goal of LOCKSS is to preserve access to web-based content, primarily e-journals, by maintaining multiple copies at physically separate locations and by conducting periodic comparisons among them to ensure that materials remain consistent, authentic, and accessible. LDI staff participated in the alpha and beta development phases of the LOCKSS system.
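The idea of comparing copies can be illustrated with a short Python sketch that hashes each copy and flags any that disagree with a strict majority. This is a deliberate simplification, not the LOCKSS protocol: the function name `audit_copies` and the site labels are hypothetical, and the real system's polling among peers is considerably more involved.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Optional, Tuple


def audit_copies(copies: Dict[str, bytes]) -> Tuple[Optional[str], List[str]]:
    """Compare copies of one object by SHA-256 digest.

    Returns (majority_digest, sites_that_disagree). If no digest is held
    by a strict majority of sites, returns (None, all sites) because no
    copy can be trusted over the others. Hypothetical helper for
    illustration; not the LOCKSS polling protocol.
    """
    if not copies:
        raise ValueError("at least one copy is required")
    digests = {site: hashlib.sha256(content).hexdigest()
               for site, content in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(copies):
        return None, sorted(copies)
    return winner, sorted(site for site, d in digests.items() if d != winner)


# Usage: three sites hold the same article; one copy has been altered.
copies = {"site-a": b"article text", "site-b": b"article text",
          "site-c": b"article t3xt"}
majority, damaged = audit_copies(copies)
```

A site reported as disagreeing would then be a candidate for repair from one of the majority copies. Requiring a strict majority means that two copies that differ from each other are both flagged rather than one being chosen arbitrarily.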
NISO Z39.87, Data Dictionary – Technical Metadata for Digital Still Images
PDF/A
Adobe's Portable Document Format (PDF) has rapidly become a de facto standard for the dissemination and presentation of electronic documents on the web. Unfortunately, the feature-rich nature of PDF permits tremendous variability in the internal structure of these documents. Further, it allows documents to be dynamically composed at the time of their display from separate external resources, which leads to significant difficulties in ensuring their long-term viability. In order to address these concerns, the International Organization for Standardization (ISO) convened a Joint Working Group to produce a constrained version of PDF suitable for archival preservation, known as PDF/A. Stephen Abrams, then LDI's digital library program manager, was the project leader and document editor for the initial version of the PDF/A standard. The PDF/A standard defines the features that should be required, recommended, restricted, or prohibited in order to make electronic documents more amenable to long-term preservation.
- For more information visit the PDF/A Competence Center.
Registry of Digital Masters
To avoid unnecessary and expensive duplication of digital reformatting efforts, LDI staff participated with the Digital Library Federation (DLF) in plans for a national digital registry of born-digital materials and digitally reformatted books and journals. By consulting the registry before digitization efforts are undertaken, a content owner can determine if an appropriate digital version already exists and is being preserved in a professional manner that obviates the need for local management.
Page Image Compression for Mass Digitization
In late 2006, Harvard University Library, the California Digital Library, the Internet Archive, and the Bibliothèque nationale de France conducted a collaborative investigation of the use of lossy JP2 compression for mass digitization of texts. The findings are documented in the IS&T Archiving 2007 Conference Proceedings. Please consult the published paper or this preprint.