The Inconvenient Truth: Someone Has to Clean the Content

Marianne Calilhanna
Jun 5

Updated: Jun 9

Anthropomorphized LLM that is cleaning and validating content.

A situation is unfolding inside AI initiatives at many organizations. Teams invest in AI models, platforms, and workflows and still get outputs they can't trust. The culprit isn't the AI model itself. Rather, it's the content going in.

Two inconvenient truths keep surfacing as organizations move from AI experimentation into production:

The document formats they've relied on for decades are fundamentally misunderstood by machines
Even well-intentioned content collections are often sprawling, redundant, and inconsistent in ways that quietly undermine AI before it ever gets started

The PDF Problem: Appearance Isn't Meaning

The PDF was a triumph of engineering. When Dr. John Warnock launched the Camelot Project in 1990, the goal was simple and elegant: a document that could be authored once and viewed reliably anywhere, with its appearance perfectly preserved. It worked. Today, more than 290 billion new PDF documents are created every year, and the PDF software market is projected to nearly triple to $5.72 billion by 2033.

The triumph of the PDF carries a hidden cost for AI.

PDFs encode appearance. They do not encode meaning. When a human looks at a PDF, the brain instantly reconstructs hierarchy:

headings
tables
captions and associated figures

But none of this structure actually exists inside the file. What's there are text fragments, x/y coordinates, font sizes, and disconnected objects. There is no "heading." No "table." No "footnote." Just positioned text.

AI systems don't have the cognitive machinery (yet!) that humans bring to a page. They see only what the file contains. This flattened, position-based representation creates serious downstream problems:

Multi-column text gets read out of order
Tables lose row-column relationships
Footnotes bleed into body text
Charts and diagrams are reduced to noise or dropped entirely
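
To make the reading-order problem concrete, here is a minimal sketch in Python. It assumes a hypothetical two-column page whose text has already been extracted as positioned fragments. The Fragment class, the coordinates, and the column split at x=200 are all illustrative assumptions, not the output of any particular PDF library.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One piece of positioned text, roughly what a PDF extractor sees."""
    text: str
    x: float  # left edge, in points
    y: float  # distance from the top of the page (a simplification:
              # native PDF coordinates start at the bottom-left corner)


# Hypothetical two-column page: left column at x=50, right column at x=320.
fragments = [
    Fragment("Left column, line 1.", 50, 100),
    Fragment("Right column, line 1.", 320, 100),
    Fragment("Left column, line 2.", 50, 115),
    Fragment("Right column, line 2.", 320, 115),
]


def naive_reading_order(frags):
    """Sort top-to-bottom, then left-to-right, with no notion of columns."""
    return [f.text for f in sorted(frags, key=lambda f: (f.y, f.x))]


def column_aware_reading_order(frags, column_split_x):
    """Read the entire left column first, then the entire right column."""
    left = sorted((f for f in frags if f.x < column_split_x), key=lambda f: f.y)
    right = sorted((f for f in frags if f.x >= column_split_x), key=lambda f: f.y)
    return [f.text for f in left + right]


naive = naive_reading_order(fragments)
aware = column_aware_reading_order(fragments, column_split_x=200)

print(naive)  # the two columns are interleaved line by line
print(aware)  # left column in full, then right column
```

The column-aware version only works because we told it where the columns split. Nothing in the positioned text itself carries that knowledge, which is exactly the gap a human reader fills without noticing.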

And here's what makes this particularly dangerous: AI doesn't fail loudly. It produces output that sounds fluent and confident, even when the underlying structure is broken. AI doesn't fix bad structure. It obfuscates it.

The more powerful AI becomes, the easier it is to trust results built on weak foundations.

Structure Is What AI Needs

Generative AI systems perform best when the content they consume is organized, explicit, and precise. Formats like XML and JSON provide exactly that. Unlike PDFs, these formats don't merely describe how content looks; they encode what content means: parent-child relationships, semantic tagging, metadata, hierarchy. Structure that AI can truly use.
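
As a rough illustration of what "encoding meaning" buys you, the sketch below stores a small table in JSON and then answers a question about it by following the structure rather than guessing from layout. The document, its field names ("sections", "heading", "table", "rows"), and the lookup helper are hypothetical, invented for this example.

```python
import json

# Hypothetical structured document. The schema is an assumption for
# illustration, not a standard.
document_json = """
{
  "title": "Pump Maintenance Guide",
  "sections": [
    {
      "heading": "Torque Specifications",
      "table": {
        "columns": ["Bolt", "Torque (Nm)"],
        "rows": [
          {"Bolt": "M6", "Torque (Nm)": 10},
          {"Bolt": "M8", "Torque (Nm)": 25}
        ]
      }
    }
  ]
}
"""

doc = json.loads(document_json)


def lookup(doc, heading, key_column, key_value, target_column):
    """Find a table cell by section heading, row key, and column name."""
    for section in doc["sections"]:
        if section["heading"] != heading:
            continue
        for row in section["table"]["rows"]:
            if row[key_column] == key_value:
                return row[target_column]
    return None  # no matching section or row


torque = lookup(doc, "Torque Specifications", "Bolt", "M8", "Torque (Nm)")
print(torque)  # 25
```

The heading, the table, and the row-column relationship are all explicit in the data, so the answer can be traced back to a specific cell. With positioned text alone, the same question depends on correctly inferring which number sits in which row.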

The difference shows up in every layer of an AI workflow:

Retrieval-Augmented Generation (RAG): Structured content enables RAG pipelines to retrieve the right chunk of information rather than guessing. Less ambiguity means fewer hallucinations and more accurate responses.
Tables and complex data: Structured formats preserve row-column relationships, nested headers, and key-value pairs. These are precisely the things PDF extraction routinely destroys.
Consistency and reuse: Structured content management enables single-source updates, automatic propagation of changes, and compliant outputs across multiple formats. The content says what it means, every time.
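
Here is a toy sketch of the retrieval point above: when content is already divided into labeled sections, each chunk can carry its own heading and source, so a retriever has something explicit to match against. The handbook content, the chunk_by_section and retrieve helpers, and the word-overlap scoring are all assumptions for illustration. Production RAG pipelines typically use embedding-based search rather than word overlap.

```python
# Hypothetical structured source document.
doc = {
    "title": "Employee Handbook",
    "sections": [
        {
            "heading": "Vacation Policy",
            "text": "Employees accrue vacation days each month.",
        },
        {
            "heading": "Expense Policy",
            "text": "Submit expense reports within 30 days.",
        },
    ],
}


def chunk_by_section(doc):
    """Create one chunk per section, keeping its heading and source title."""
    return [
        {
            "source": doc["title"],
            "heading": section["heading"],
            "text": section["text"],
        }
        for section in doc["sections"]
    ]


def retrieve(chunks, query):
    """Toy retrieval: pick the chunk sharing the most words with the query."""
    query_words = set(query.lower().split())

    def score(chunk):
        chunk_words = set((chunk["heading"] + " " + chunk["text"]).lower().split())
        return len(query_words & chunk_words)

    return max(chunks, key=score)


chunks = chunk_by_section(doc)
best = retrieve(chunks, "how do I submit expense reports")
print(best["source"], "/", best["heading"])  # Employee Handbook / Expense Policy
```

Because every chunk knows which document and section it came from, an answer built on it can be cited and audited. Chunks cut from flattened PDF text at arbitrary character offsets carry no such provenance.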

Without structure, AI systems must infer relationships that humans glean visually. That guesswork is a core driver of hallucinations and auditability gaps. When organizations point AI at repositories of PDFs and legacy documents, they often get immediate-seeming value: answers that sound right. But under the surface, tables are flattened, reading order is guessed, context is incomplete, and relationships are lost. Trustworthy AI requires structured, high-quality, well-governed content. Poor structure doesn't just lower quality; it produces unverifiable results.

Before You Structure, You Have to Clean

Here's where the challenge deepens. Even organizations that understand the value of structured content face a second obstacle: their content collections often comprise decades of documents, multiple versions that may be redundant or nearly redundant, and various document formats.

Years of accumulated knowledge, versioning without governance, copy-paste reuse, and siloed authoring leave most repositories deeply fragmented, inconsistent, and full of redundant text. This is content sprawl: the accumulated weight of disorganized, duplicative information that silently undermines AI initiatives before they even begin.

AI-driven systems depend heavily on clean data, consistent terminology, low redundancy, clear structure, and accurate metadata. Most content collections fail on multiple counts. The machine will consume whatever you give it, but messy content in means unreliable, untrustworthy output.
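
For a sense of what "nearly redundant" looks like in practice, the sketch below flags near-duplicate passages using Python's standard-library difflib. The sample passages and the 0.8 similarity threshold are assumptions chosen for illustration. This is a simple pairwise comparison, not a description of how any commercial redundancy-analysis tool works, and it would need a more scalable approach for a large repository.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical passages pulled from different documents in a repository.
passages = [
    "Disconnect power before servicing the unit.",
    "Disconnect the power before servicing the unit.",
    "Store the unit in a dry location.",
]


def similarity(a, b):
    """Return a 0..1 character-level similarity ratio between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_near_duplicates(passages, threshold=0.8):
    """Return index pairs whose similarity meets or exceeds the threshold.

    Compares every pair, so the cost grows quadratically with the
    number of passages.
    """
    return [
        (i, j)
        for i, j in combinations(range(len(passages)), 2)
        if similarity(passages[i], passages[j]) >= threshold
    ]


pairs = find_near_duplicates(passages)
for i, j in pairs:
    print(f"Near-duplicate: {passages[i]!r} ~ {passages[j]!r}")
```

The first two passages differ by a single word, the kind of drift that copy-paste reuse produces over time. An AI system ingesting both has no way to know which one is authoritative.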

This is why content cleanup and harmonization aren't optional pre-steps before an AI initiative. They're foundational to it. Before you structure the content, you must understand what you have and eliminate what's working against you.

Join us for the next DCL Learning Series

Before You Feed the Machine: Cleaning Content for AI Workflows

July 22, 2026 - 1:00 pm to 1:45 pm ET

In this session, you'll learn:

Why content redundancy analysis is a necessary precursor to AI automation and model training
How content cleanup and harmonization serve as foundational steps for AI-ready content workflows
How structured, harmonized content amplifies the value of AI technologies across the enterprise

We'll show how Harmonizer, DCL's content redundancy analysis and reuse platform, reveals exactly how much of your content is duplicative and where true reuse potential lies, enabling organizations to reduce maintenance costs, streamline content operations, and increase editorial accuracy. Harmonizer helps teams collapse redundant material into lean, high-value knowledge assets that are easier to manage, update, translate, and ultimately prepare for AI consumption.

The goal: turn messy repositories into strategic enterprise knowledge.