61-18 190th Street, Suite 205

Fresh Meadows, NY 11365

+1 718.357.8700

info@dclab.com

© 2019 Copyright Data Conversion Laboratory, All Rights Reserved.

White Papers

Authoritative reports on a variety of content structure topics

Check back often to browse our growing collection of industry-related white papers. Got an idea for a topic? Email us with suggestions!

Keywords: data harvesting, web scraping, XML feeds, automation

Vast amounts of business-critical information appear only on public websites that are constantly updated with new and modified content. While the information on many of these websites is extremely valuable, no standards exist today for how content is organized, presented, and formatted, or for how individual websites are constructed or accessed.

DCL Data Harvester is a streamlined, automated process that crawls websites, scrapes content and metadata, and transforms the content into a standardized XML format.

Keywords: NLM, conversion, agile development, quality assurance

When faced with the challenge of converting eight highly technical journals spanning 95 years, how do you divide responsibility between the content owner and the conversion vendor? This paper demonstrates how an agile approach to content conversion, with close collaboration between the publisher and the conversion vendor, allowed The Optical Society of America (OSA) and DCL to navigate between the two extremes and create a high-quality digital archive that will serve OSA’s strategic aims for developing innovative products and services.

Keywords: image-based PDFs, automated transformation, USPTO

Many governmental and private organizations gather massive collections of content, including legal documents, filings, and contracts. Most such collections consist of images and image-based PDFs; they’re not searchable or minable for the critical information that these organizations need to function. As data collections grow larger and are measured in terabytes, conventional conversion techniques, however efficient, are not economically feasible. The Holy Grail has always been a fully automated process without human intervention. This paper describes the implementation of such a system at the United States Patent and Trademark Office (USPTO).