Case Studies

Delivering customer success since 1981

“DCL played a critical role in helping NYPL create this resource for the world. They quickly understood the big picture and helped us extract, structure, and parse thousands of records accurately and on time.”

-Sean Redmond, Senior Project Manager, New York Public Library

Technologies Used

XML
OCR
Automated QC software

Project highlights

450,000 pages of previously digitized content structured into XML.
Data field identification to facilitate efficient search in final product.
A cost-conscious project plan and automated QC ensure that funding from Arcadia (a charitable fund of Lisbet Rausing and Peter Baldwin) and the Ford Foundation is honored and respected.

New York Public Library

Data Extraction and Content Structure

Keywords: digitization, XML transforms, OCR, QC validation

Background

The New York Public Library (NYPL) has been an essential provider of free books, information, ideas, and education for more than 100 years. Serving more than 17 million patrons a year, and millions more online, the Library holds more than 55 million items, from books, e-books, and DVDs to renowned research collections used by scholars from around the world.

NYPL obtained historical records of the United States Copyright Office, which had previously been scanned into image-based PDF files. The Catalog of Copyright Entries is published annually by the Copyright Office and is a vast collection of copyright entries dating back to 1891. Scanning is a good first step in information preservation; however, limitations arise when one wants to search that content in a meaningful way with modern tools and interfaces.

NYPL envisioned a database that could be used to quickly determine the copyright status of a work. NYPL understood that to create a product that was previously impossible, the historical records of the Copyright Office needed intelligent content structure and semantic enrichment.

Solution

DCL developed and configured custom software to scan the image-based PDF files and transcribe embedded data. After deep analysis of the source material, content was extracted and converted to XML.

Content experts at NYPL and DCL identified required fields that would facilitate efficient searching. Data were parsed and delivered to the NYPL copyright database.
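
To make the extract, structure, and parse steps concrete, here is a minimal sketch in Python. It is not DCL's software: the entry layout, the field names (author, title, date, regnum), and the functions parse_entry, entry_to_xml, and is_complete are all assumptions chosen for illustration. It shows one OCR'd line being split into fields, wrapped in XML, and passed through a simple automated QC check.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical entry layout, assumed for illustration only:
#   "AUTHOR. Title. (c) 12Mar48; A12345."
ENTRY_PATTERN = re.compile(
    r"^(?P<author>[^.]+)\.\s+"                       # author, up to the first period
    r"(?P<title>.+?)\.\s+"                           # title, up to the copyright marker
    r"\(c\)\s+(?P<date>\d{1,2}[A-Za-z]{3}\d{2});\s+"  # date such as 12Mar48
    r"(?P<regnum>[A-Z]+\d+)\.?$"                     # registration number
)

# Field names are assumptions, not the fields NYPL and DCL actually chose.
REQUIRED_FIELDS = ("author", "title", "date", "regnum")


def parse_entry(line):
    """Return a dict of fields, or None if the line does not match the layout."""
    match = ENTRY_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def entry_to_xml(fields):
    """Wrap the parsed fields in an <entry> element, one child per field."""
    entry = ET.Element("entry")
    for name in REQUIRED_FIELDS:
        ET.SubElement(entry, name).text = fields[name]
    return entry


def is_complete(entry):
    """Simple QC check: every required field is present and non-empty."""
    return all((entry.findtext(name) or "").strip() for name in REQUIRED_FIELDS)


if __name__ == "__main__":
    sample = "ADAMS, JOHN. The river road. (c) 12Mar48; A12345."
    fields = parse_entry(sample)
    if fields is None:
        print("unparsed line, route to manual review")
    else:
        xml_entry = entry_to_xml(fields)
        print(ET.tostring(xml_entry, encoding="unicode"))
        print("QC passed:", is_complete(xml_entry))
```

In this sketch, lines that do not match the assumed layout return None rather than a partial record, so they can be set aside for review instead of entering the database with missing fields.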

Results

DCL completed a successful 10,000-page pilot project and is now transforming all 450,000 pages of digitized copyright records into a searchable, easy-to-use data set. NYPL is committed to making this data freely available, without restriction on any type of use. The data will also be programmatically accessible through APIs so that it can be integrated with other tools. Unlocking these records and structuring previously digitized content results in an amazing record of American creativity that is available worldwide.

Related webinar featuring The New York Public Library

How Content Structure and Data Extraction Facilitate New Product Dev - DCL Learning Series Webinar

Content structure, data extraction, and semantic enrichment facilitate product development

Technology and content are empowering products and ideas that were previously impossible. Organizations can "digitize" content, but without proper structure and metadata enrichment, digitized content is still considered flat. DCL incorporates technologies such as computer vision, natural language processing, and machine learning to structure content and bring it to life.

TECHNOLOGY ENABLES MULTIPLE DIMENSIONS

Where content was previously static and "flat," content structure and technology enable interactivity and discovery.
