Marianne Calilhanna

Think You Know DCL? Technology That Might Surprise You: Automated Table-to-XML Extraction

While "conversion" is Data Conversion Laboratory's (DCL) middle name, we offer much more than one-off transactional conversions from PDFs to XML (or InDesign, FrameMaker, Word, HTML, etc.). DCL employs some serious technology wizards who are skilled at mining, extracting, structuring, enriching, and otherwise manipulating content and data in almost any way you can imagine.


We address intricate content obstacles with skilled teams who specialize in solving puzzles. The following is one service your organization might find useful.


Automated Table-to-XML Extraction

Tables are tough to structure because of inconsistencies in tabular content, a high diversity of layouts, complicated elements such as straddle headings, varied alignment of contents, the presence of empty cells, and other intricacies. Often it is easier to represent a table as an image than to convert and structure it as XML.
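To make two of those intricacies concrete, here is a minimal Python sketch of how a straddle heading and an empty cell might be carried into HTML-style table markup. The sample rows, the `table_to_xml` helper, and the choice of `colspan` markup are assumptions made for illustration; they are not DCL's schema or tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample table: each cell is (text, colspan).
# The first header row holds a straddle heading that spans two columns.
rows = [
    [("Region", 1), ("Revenue (USD M)", 2)],
    [("", 1), ("2022", 1), ("2023", 1)],
    [("North", 1), ("1.2", 1), ("", 1)],  # empty cell: no value reported
]
HEADER_ROWS = 2


def row_width(row):
    """Number of grid columns a row occupies once spans are counted."""
    return sum(colspan for _, colspan in row)


def table_to_xml(rows, header_rows):
    """Build an HTML-style <table> element from (text, colspan) cells."""
    table = ET.Element("table")
    thead = ET.SubElement(table, "thead")
    tbody = ET.SubElement(table, "tbody")
    for i, row in enumerate(rows):
        is_header = i < header_rows
        tr = ET.SubElement(thead if is_header else tbody, "tr")
        for text, colspan in row:
            cell = ET.SubElement(tr, "th" if is_header else "td")
            if colspan > 1:
                # A straddle heading is recorded as a column span.
                cell.set("colspan", str(colspan))
            cell.text = text
    return table


# Every row must cover the same number of grid columns, spans included.
widths = {row_width(r) for r in rows}
table = table_to_xml(rows, HEADER_ROWS)
xml_string = ET.tostring(table, encoding="unicode")
print(xml_string)
```

Even in this tiny case, the structure depends on knowing which rows are headers, how far each heading straddles, and where a blank really is a cell. An image of the table carries none of that explicitly.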

The following example highlights some typical complexities we see with tabular content.

Transforming tabular content into a structured model such as XML or HTML is nearly always a manual or semi-manual process. Tabular content is particularly important in regulatory, financial, and scientific documents where complex alphanumeric content is often presented in tabular format.

 

DCL created an AI model that finds and extracts information from all tables in a document using a combination of Computer Vision (CV) and Natural Language Processing (NLP). We developed a hybrid approach that combines rules-based processes with machine learning to identify and extract tabular data, and we augmented the training data to develop an AI model that automates table-to-XML extraction.
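As a loose illustration of the general "rules plus model" idea, and not a description of DCL's actual pipeline, the following Python sketch groups the words on one table row into columns. A whitespace-gap rule decides the clear cases, and a pluggable scoring function decides the ambiguous ones. The function names, the gap thresholds, and the stand-in "model" are all hypothetical.

```python
# A word is (text, x_left, x_right) in arbitrary page units (hypothetical format).

def split_columns(words, min_gap=10.0, sure_gap=25.0, model_says_boundary=None):
    """Group the words of one table row into cell strings.

    Rule-based part:
      gap >= sure_gap  -> column boundary
      gap <  min_gap   -> same cell
    Anything in between is ambiguous and is handed to model_says_boundary,
    a callable standing in for a trained classifier.
    """
    if model_says_boundary is None:
        # Default stub: never split on an ambiguous gap.
        model_says_boundary = lambda left, right: False

    columns = []
    for word in sorted(words, key=lambda w: w[1]):
        if not columns:
            columns.append([word])
            continue
        previous = columns[-1][-1]
        gap = word[1] - previous[2]
        if gap >= sure_gap:
            boundary = True
        elif gap < min_gap:
            boundary = False
        else:
            boundary = model_says_boundary(previous, word)
        if boundary:
            columns.append([word])
        else:
            columns[-1].append(word)
    return [" ".join(w[0] for w in col) for col in columns]


def looks_numeric(left, right):
    """Toy stand-in for a model: split when both neighbours start like numbers."""
    starts = "(0123456789"
    return left[0][0] in starts and right[0][0] in starts


row = [("Net", 0, 20), ("income", 24, 60), ("1,204", 100, 130), ("(3.1)", 148, 170)]

rules_only = split_columns(row)
hybrid = split_columns(row, model_says_boundary=looks_numeric)
print(rules_only)
print(hybrid)
```

With the rules alone, the two numeric values sit too close together to be separated with confidence. The second call shows where a learned component could resolve what a fixed threshold cannot.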


Tables are Tough - DCL Learning Series

Watch the following video to hear more about the complexities of transforming tables into XML:



 

If you have a complex issue related to managing your content, we can help.



