Page Image Compression for Mass Digitization
 
 

Project Description

 

In late 2006, Harvard University Library, the California Digital Library, the Internet Archive, and the Bibliothèque nationale de France conducted a collaborative investigation of the use of lossy JP2 compression for mass digitization of texts. We documented our findings in the IS&T Archiving 2007 Conference Proceedings.

Please see the published paper, or this preprint.

 

Production notes for the Harvard test suite

 
Harvard University Library Test Suite
Text pages

  • 003176581_0007
  • 003176581_0008
  • 003298279_0001
  • 006393844_0001

Text + b/w illustration

  • 003298279_0004
  • 002010967_0026

Color images

  • 002024214_0033
  • 006393844_0008

b/w image

  • 006051784_0002

The digitized pages in this suite were selected to represent a segment (but not the full range) of page characteristics for volumes published in the 19th and 20th centuries. This test suite contains nine book pages. For each page, the suite provides:

  • a one-page report summarizing evaluations of image quality for JP2 images of various sizes (from the Aware and Kakadu codecs)
  • three baseline images created by HCL Imaging Services:
    • a 300 ppi uncompressed 24-bit RGB TIFF "uncorrected camera" image, captured with a Zeutschel OS10000 Bookscanner
    • a 300 ppi uncompressed "processed RGB TIFF," created from the above image with an Adobe action script optimized for works on paper (from general library collections such as books and journals). The action script makes global and local color corrections and tonal adjustments through a combination of curves, levels, and hue/saturation controls. Images were also sharpened through a fairly complex multi-step process.
    • a 600 ppi 1-bit TIFF, created from an intermediary 8-bit grayscale TIFF that was derived from the "processed RGB TIFF" with Photoshop's default "convert to grayscale" function (30% Red, 60% Green, 10% Blue)
  • three sets of JP2 images, each produced from the same 300 ppi uncompressed processed RGB TIFF:
    • one lossless and six lossy JP2 files with the Aware version 3.11.2 command-line codec
    • one lossless and six lossy JP2 files with the Kakadu version 5.5.2 command-line codec
    • one lossless and six lossy JP2 files with the LuraWave command-line codec 2.1.11.05
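The grayscale weighting quoted for the 1-bit baseline (30% Red, 60% Green, 10% Blue) can be illustrated with a minimal Python sketch. This is not the Photoshop implementation. The function names are hypothetical, the bitonal threshold of 128 is an assumption chosen only for illustration, and the resampling from 300 ppi to 600 ppi is not covered.

```python
# Channel weights, in percent, as quoted in the text for the
# "convert to grayscale" step.
RED_PCT, GREEN_PCT, BLUE_PCT = 30, 60, 10


def rgb_to_gray(r, g, b):
    """Weighted sum of 8-bit RGB channels, rounded to an 8-bit gray value.

    Integer arithmetic is used so the rounding is deterministic.
    """
    return (RED_PCT * r + GREEN_PCT * g + BLUE_PCT * b + 50) // 100


def gray_to_bitonal(gray, threshold=128):
    """Return 1 (white) at or above the threshold, otherwise 0 (black).

    The threshold value is an illustrative assumption, not a setting
    taken from the production notes.
    """
    return 1 if gray >= threshold else 0


def page_to_bitonal(pixels, threshold=128):
    """Convert rows of (r, g, b) tuples to rows of 0/1 values."""
    return [
        [gray_to_bitonal(rgb_to_gray(*px), threshold) for px in row]
        for row in pixels
    ]
```

Because the weights sum to 100%, a pure white pixel stays at 255 and a pure black pixel stays at 0. The green channel contributes most to the gray value.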

The IS&T paper details the peak signal-to-noise ratio (PSNR) and mean square error (MSE) functions we used in Aware and Kakadu, respectively, to optimize the human-perceived quality of text and illustrations in lossy-compressed images. The LuraWave codec provides quality settings ranging from a low of Q1 to a high of Q100. We encourage you to consult the full paper for an explanation of our research questions, methodology, and findings.
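For readers unfamiliar with the two metrics, the sketch below shows the general textbook definitions of MSE and PSNR for 8-bit samples. It is not the internal computation of the Aware or Kakadu codecs, whose settings are described in the paper. The function names and the flat-list pixel representation are assumptions made for illustration.

```python
import math


def mse(reference, test):
    """Mean square error between two equal-length sequences of samples."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)


def psnr(reference, test, max_value=255):
    """Peak signal-to-noise ratio in decibels.

    max_value is the largest possible sample value (255 for 8-bit data).
    Identical inputs have zero error, so the result is infinite.
    """
    err = mse(reference, test)
    if err == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / err)
```

A lower MSE corresponds to a higher PSNR, so either quantity can serve as a target when choosing how aggressively to compress an image.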