
White Papers

Authoritative reports on a variety of content structure topics

Check back often to browse our growing collection of industry-related white papers.


Identifying XML Issues That Impact Content Interchange

Publishers' content collections are complex, often spanning decades during which standards have evolved. JATS XML from 2006 is significantly different from JATS XML from 2022. Errors and issues with content structure have a serious impact on downstream discoverability and content interchange.


content structure, JATS, JATS-Con, scholarly publishing, content analysis, XML conversion


White Hat Data Harvesting: Industrial-Strength Web Crawling

Vast amounts of business-critical information appear only on public websites. DCL Data Harvester offers a streamlined, automated process to crawl websites, scrape content and metadata, and transform the content into a standardized XML format. 


data harvesting, web crawling, DCL Data Harvester, XML feeds, automation


A Case Study in High-Quality Legacy NLM Conversion

This paper demonstrates how an agile approach to content conversion, with close collaboration between Optica Publishing Group (formerly OSA) and DCL, created a high-quality digital archive that will serve Optica's strategic aims for developing innovative products and services.


NLM, NLM conversion, agile development, quality assurance, Optica Publishing Group, XML conversion services


Content Structure: The Building Blocks of Innovation

This report explores how any organization can leverage new technologies to create intelligent, multidimensional content. By extracting and enriching data, organizations accelerate digital transformation to innovate, meet modern consumer expectations, and propel business forward.


content structure, semantic enrichment, digital transformation, New York Public Library, American Water Works Association


Using AI to Create Structured Data from Static Documents

The United States Patent and Trademark Office needed a system to process millions of image-based PDFs, forms, and TIFF files and transform them into XML feeds delivered back to its system. The system had to run 24/7 with no human intervention. Read more about this amazing project.


image-based PDFs, automated transformation, United States Patent and Trademark Office, content structure, XML feeds
