Document Formats That Leo Hates*
At DCL we often joke about how “we’ve seen it all” in terms of document formats. It’s true we’ve been in the business since before SGML was even an ISO standard. And it’s also true we’ve seen a LOT of document formats during our 40+ years in the business.
Leo Belchikov is a Senior Project Manager at Data Conversion Laboratory and has spearheaded many conversion projects over his 25 years at DCL. Internally we often say “Leo is going to hate this” when we encounter a format that is not up to par with his requirements of content portability, accessibility, and flexibility. Leo likes XML and the power it provides organizations that manage content. We asked Leo to share some of his observations about the typical formats he encounters when converting from one format into XML. He shared considerations that make automation challenging if you want to achieve a high accuracy rating. He also shared what DCL must consider during the QA process with these formats.
Image-based PDF—An image-based PDF can be thought of as a photocopy: just as a photocopy is a facsimile of an original document, so is an image-based PDF. What appears as text to a human reader is just a series of pixels to a computer or mobile phone. In an image-based PDF, text is no different from the images or graphical elements that may also be in the document. Search and retrieval of keywords and concepts within the content are impossible. In terms of accessibility, assistive screen readers cannot read the content aloud for users who are visually impaired or have learning disabilities. Converting an image-based PDF to XML requires thorough proofreading, and the quality of the converted content depends heavily on the quality of the original scan.
PDF Normal—A true PDF (also called a “digitally created PDF” or “PDF Normal”) is typically created as an export from another desktop publishing format, such as saving a Microsoft Office file to PDF or saving another Adobe format as a PDF. These files natively carry electronic character data for both the text and the corresponding metadata. However, complex content elements, such as special characters, math, and chemical formulae, are still often “digitized” as images. Therefore, filtered search and data analysis cannot be performed on those complex content elements. Extraction from this format depends on how the PDF file was created. Depending on the level of accuracy the client desires, DCL may need to undertake manual clean-up, or we can explore other ways to automate the process, with the understanding that additional analysis and business requirements will be involved.
MS Word—Word files are not structured. Although the use of styles and formatting might suggest that structure is being applied (or attempted), it can complicate conversion from Word to XML. Specific templates used in source files can also complicate extraction. Word documents always require extensive preparation because, even within the same company, there is typically no standard template enforced, and elements are not consistently styled or defined. Even organizations that believe they apply styles consistently in Word templates usually turn out not to when all the content is analyzed for a large-scale conversion project.
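To make the point about inconsistent styling concrete, here is a minimal sketch of a style audit. It is an illustration only, not DCL's tooling. It relies on the fact that a .docx file is a ZIP archive whose main body lives in word/document.xml, where each paragraph (w:p) may name its style in a w:pStyle element. The function names and the "(no explicit style)" label are assumptions made for this example.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def count_paragraph_styles(document_xml):
    """Count how many paragraphs use each paragraph style ID.

    Paragraphs without a w:pStyle element are grouped under a
    placeholder label (a name chosen for this sketch).
    """
    root = ET.fromstring(document_xml)
    counts = Counter()
    for para in root.iter(f"{{{W_NS}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        if style is not None:
            style_id = style.get(f"{{{W_NS}}}val")
        else:
            style_id = "(no explicit style)"
        counts[style_id] += 1
    return counts


def count_styles_in_docx(path):
    """Hypothetical convenience wrapper: read document.xml from a .docx file."""
    with zipfile.ZipFile(path) as archive:
        return count_paragraph_styles(archive.read("word/document.xml"))


# A tiny hand-written document body used only to demonstrate the function.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Title</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Body text with no style applied</w:t></w:r></w:p>'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Another</w:t></w:r></w:p>'
    '</w:body></w:document>'
)

sample_counts = count_paragraph_styles(SAMPLE)
```

Running a count like this across a whole corpus is one way to see how many paragraphs carry no explicit style, or how many near-duplicate style names are in use, before committing to an automated conversion.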
InDesign—INDD files are created using Adobe InDesign. InDesign can support XML structure, but it is in no way true XML. In InDesign, the Structure pane displays, in hierarchical form, the items in a document that have been marked with XML tags. However, an INDD file is not a fully structured format. Several elements require further identification, and InDesign does not handle linking between elements well. When an InDesign file is converted to XML, text (objects, elements) might not come through in the proper reading order, which is often a challenge. Because InDesign is a layout program, reading order must be carefully assessed during the analysis phase, and post-conversion QA processes are required.
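The reading-order problem can be illustrated with a small sketch. It assumes text frames have already been extracted with a page number and a position; the Frame class, the column_width parameter, and the sample coordinates are all hypothetical and do not come from InDesign or from DCL's process. The heuristic sorts frames by page, then by column, then from top to bottom.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    text: str
    page: int
    x: float  # left edge of the frame
    y: float  # top edge, assumed to increase downward from the top of the page


def reading_order(frames, column_width=300.0):
    """Sort frames by page, then column (bucketed by x), then top to bottom.

    column_width is an assumed layout parameter; a real layout would need
    it measured per document or per page.
    """
    def sort_key(frame):
        column = int(frame.x // column_width)
        return (frame.page, column, frame.y)

    return sorted(frames, key=sort_key)


# Frames listed in the order a file might store them, not the order a person reads them.
stored = [
    Frame("column 2, top", page=1, x=320, y=50),
    Frame("column 1, bottom", page=1, x=20, y=400),
    Frame("column 1, top", page=1, x=20, y=50),
    Frame("page 2", page=2, x=20, y=50),
]

ordered_texts = [frame.text for frame in reading_order(stored)]
```

A heuristic this simple fails on headings that span columns, sidebars, and callouts, which is why reading order still has to be checked in analysis and post-conversion QA.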
*It’s most certainly a joke that Leo “hates” any document format that is NOT XML! It’s actually quite the opposite—Leo and all of DCL’s analysts and project managers love a good problem to solve. While we can’t say definitively that we’ve seen every document or data format, we have most certainly worked with a lot. If you have a complex conversion project, we welcome the opportunity to speak with you.