
DCL Learning Series

Trustworthy AI: Optimizing Content for Large Language Models

Marianne Calilhanna

Hello, and welcome to the DCL Learning Series. Today's webinar is titled "Trustworthy AI: Optimizing Content for Large Language Models." My name is Marianne Calilhanna and I'm the VP of Marketing here at Data Conversion Laboratory. DCL, as we're also known, is the industry-leading XML conversion provider, and we've been in the business of transforming content since 1981. Today, structured content and data are foundational to AI because LLMs don't reason over documents; they reason over structured, machine-readable content. And the way content is converted, structured, enriched, and connected directly determines what an AI system can retrieve, weight, and use in generating responses. Without intentional structure and metadata, even the most advanced models struggle with context, accuracy, and reliability. At DCL, we see modern conversion not as a back-office task, but as the critical layer that enables trustworthy AI, precise retrieval, and meaningful responses at scale.


So let's begin. I want you to know that this session is being recorded. It will be available in the On Demand section of our website at dataconversionlaboratory.com. And we'll have plenty of time at the end to answer any questions you have, but feel free to submit them, as they come to mind, and you can do that via the Question dialogue box here in the GoToWebinar platform.


Today's speakers are my colleagues, David Turner and Rich Dominelli. David collaborates with our clients on a variety of content technology initiatives, and Rich is a senior solutions architect at DCL. Honestly, you can throw any content challenge at them and they will deliver a solution. It's an honor to work with both of them and have them here today to discuss trustworthy AI. David, I'm going to turn it over to you. 


David Turner

All right. Thanks so much, Marianne. I appreciate it. And yeah, so I also have great things to say about Rich. In fact, I actually have a t-shirt that declares me the president of Rich's fan club, but I was not allowed to wear that today, well, for obvious reasons. Rich, if you're ready to go, I thought maybe we would kick off a quick poll here. 


Rich Dominelli

Sounds great. 


David Turner

Nice. I'm going to go here. And Marianne, I'm going to just count on you to throw up this poll. But the topic is, where is your organization today when it comes to preparing your content for use with LLMs? All right. You got a couple of different choices there. We have a whole host of people on here, so I'm anxious to see how this comes out. Are you deploying LLMs and you just need to improve the accuracy? Are you just piloting at this phase? Are you planning something, but you hadn't really thought about the whole content structure thing? Or are you not actually working with any LLMs? Rich, any guesses you want to venture while the results are coming in? 


Rich Dominelli

I'm going to suspect that most people are still in the planning stage, or may have a prototype that they're playing with, but haven't actually deployed yet. 


David Turner

I'm probably with you there as well. Although I'm never surprised anymore when I see that people are actually deploying things and are working and moving along. 


4:04

Marianne Calilhanna

Gentlemen, I'm going to pop in here and say that most of the participants have voted, so I'm going to end and share the results. Do you see the results there on the screen? 


David Turner

I see where it says Poll 1, but maybe everybody else does. Oh, wait. There it is. Yeah, it looks like you were right, Rich. Yeah. 


Marianne Calilhanna

He always is!


David Turner

Yeah! It's very evenly distributed. Okay. Good to know. Good to know. Good to start. All right. Well, Marianne, let me just take it back over then, and let's move on to our next slide. I'm going to kick these off today and talk about this whole topic. Now I'm going to give a very US-based example, and maybe a little cheesy example, but my wife and I stumbled across this show recently from the early 2000s called Everwood. And as we watched the first episode, we were like "Wait a second, is that Chris Pratt?" And sure enough, it was Chris Pratt. I think it was maybe his first role, or one of his first major roles. And then my wife was also watching another show, The Resident, at the time, and she was like "Hey, the girl there, she's on The Resident." So we were, of course, talking, wondering if any of these characters had been in any other shows. And so the obvious one came up, which is one of the favorites in our house, which is Law & Order: SVU. And so we were curious, have any of the actors and actresses in Everwood ever been on Law & Order: SVU? Because SVU's been on for forever. They always have cameo roles. So I threw it out there.


I got on to Copilot and I said "Hey, tell me what actors have been in Everwood and what episodes they've been in on SVU." And it gave me, almost instantly, an entire list. It got the actors' and actresses' names right. And it would say "This person was in season 3, episode 8. This person was in season 12, episode 16." But then I had the question, I thought "Well, is this really trustworthy?" And so I started taking some of the episodes and I went and looked at the SVU website, at the cast lists, and I wasn't finding a match. So then I thought "Well, I'll go to Wikipedia and I'll look." And sure enough, about 60% of the results that I got were completely fabricated, completely made up. It felt really good when the answer came up. It was fun. We thought "Oh, this is cool." But in the end, the data ended up not being very accurate. Now, obviously that's a meaningless example and there are a lot more important things, but it does just give you an idea. What was crazy about it was how silver-tongued Copilot was when it told me that it had these results for me. And so it was very believable.


And the truth is that when we think about AI today, we are talking about this idea of generative AI, and it does a lot of amazing things. We're hoping that we can take all these many different inputs out here and get some trustworthy something on the back end, right? Whether that's reliable data, a correct search result, an accurate piece of content. But the truth is, that can be a challenge because all of the knowledge is locked in different places. There are a lot of different file formats out there, there are tables, there are images, and there's just so much out there. And there are also a lot of things like different versions of content out there. If you're just looking, generally, at whichever AI or LLM you use, it could be pulling data from journal articles that have been retracted.


8:03

It could be pulling things where there are multiple conflicting pieces of information, or just people's opinions on things. So as it pulls that, it really makes you question that trustworthiness. So how do you get better? Well, let's talk about that. I'm going to mention one of our partners that I think always has good things to say. Val Swisher used to always say, and I'm paraphrasing here, that people and companies love investing in expensive technology, right? It's one of the first things that you think of. If I'm going to solve this problem, I just need better technology. If Copilot's not working, let's look at Claude. Claude's not working? Let's look at Gemini. What's the latest here? What's the latest there? And the truth is that while people love expensive technology, expensive technology plus crappy content really just ends up giving you expensive, crappy content. And my little corollary to this, or axiom, or whatever it is, is that it doesn't just give you expensive, crappy content, it gives you really, really questionable and problematic results, right? Even if you're just looking at the content that's on your server, how accurate can it be? If you don't spend time investing in the content and the underlying data, it can be a problem.


The SAS Institute did a study a few years ago that found bad data is the leading cause of IT project failure. It's one of the driving factors behind customer attrition, because people don't keep up with their data, or their metadata, correctly, and it causes a lot of different problems for them. Just as another example here: with a lot of AI applications, it's like you've got this big messy playroom, right? And you're asking it to go and find something, analyze something, massage something from these massive stores of data, but the messier it is, the harder it is to surface what you're actually looking for. Now, as we learned from my Copilot example, and I don't know, Rich, you may have other examples that are similar to this, the AI is going to try to make you happy.


At some point, it's going to come back, and it's going to make a decision: "Oh, hey, I've taken too long. I've got to present something." It's like me sending a child into this messy playroom and saying "Hey, find me a red 2X3 Lego brick for this thing that I'm working on." After a few minutes of searching, because it's such a mess in here, how are they going to find this? They might come back with something red, they might come back with something that's a Lego brick, but it may or may not be a red 2X3 Lego brick. On the other hand – 


Rich Dominelli

So one of –


David Turner

Yeah, go ahead. 


Rich Dominelli

One of my favorite examples of that is early on, when we were first looking at AI, we had an ask from Standard & Poor's to start ingesting financial 10-K reports. And one of my favorite tests was "Here's the 10-K report," because 10-K reports are supposed to have the same sections. One of those sections lists the executive officers within the company, their ages, I'm not really sure why the ages are there, and their positions within the company. And then one of my favorite asks was "Here's the document, please give me a list of the executive officers that appear in section 2 of the document."


12:01

And early on, ChatGPT-3, and even the early versions of ChatGPT-4, would come back with that confidently incorrect answer. It might get one right, maybe even two, but typically it would start pulling executive officers from other companies around the world. Instead of IBM executive officers, we had the CEO of Cisco mixed in there. The ages were never correct, and that type of thing. So it's interesting, and it's almost like having a toddler that really wants to please you in some way, and is always going to give you some answer and act like it's exactly what you want.


David Turner

Absolutely. Absolutely. And if you just extrapolate that, you can think of all sorts of different applications, whether it's data on pharmaceuticals or medical devices, or information about something that's going on in an airplane engine. You don't want it to just find you something, you need it to find you the right thing, the accurate thing. How much easier would it be if you sent that, I don't know if you'd send a child, but you sent somebody in to look in a nice, organized, clean area like this, where things are put away and labeled? You could go right to where the red bricks are, you could go right to where the certain red bricks are, and you should be able to surface that 2X3 Lego brick almost instantly. And I think that's the same thing that we're finding here in content. So when you just focus on the expensive technology, but you don't think about the content piece, you're not going to really get the results that you want.


So it really also takes a piece of digging in. In fact, on Anthropic's site, there's a whole section about how prompts that have XML in them, these XML-style prompts, promote clarity and reduce misinterpretation. Because at the end of the day, when the computer knows it's looking at some data, it knows "Hey, that is the author's name. In fact, that's the author's last name. And in fact, we know it's this author because he has this ID." It makes it a lot less susceptible to grabbing the wrong data.
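To make that concrete, here is a minimal sketch of an XML-tagged prompt. The tag names and the commented-out send step are illustrative assumptions, not DCL's or Anthropic's actual code; the point is simply that tags give the model unambiguous boundaries between data and instructions.

```python
# Minimal sketch: wrapping structured fields in XML-style tags before
# prompting an LLM, along the lines of the guidance described above.
# Tag names and the send_to_llm() helper are illustrative assumptions.

def build_prompt(article_xml: str, question: str) -> str:
    """Embed structured article data in XML tags so the model can tell
    the data apart from the instructions."""
    return (
        "Answer the question using only the content inside <article>.\n"
        "<article>\n"
        f"{article_xml}\n"
        "</article>\n"
        f"<question>{question}</question>"
    )

article = (
    "<author id='A123'><given>Jane</given><surname>Smith</surname></author>\n"
    "<title>PFAS Exposure in Firefighters</title>"
)
prompt = build_prompt(article, "What is the author's surname?")
# send_to_llm(prompt)  # hypothetical call to whichever chat API you use
```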


Anyway, so that's one of the things that we do at DCL. And just from a high level, we typically approach projects with these four pillars of information, right? We're organizing and structuring the content. We're going out and grabbing it, putting it through a process to get it to where it can be useful for AI applications, for search, for creating new content, things like that. So just quickly on each of these: in terms of ingestion, we're focused on things like, what's the most efficient way to pull in the content? Is there some feed that we get? Is it XML already? Is it the right XML? Is it JSON? Are we dealing with ePub files or PowerPoint?


And sometimes, depending on the format, it can get pretty tricky, because formats like PDF lock up the content. If you've ever done the simple transform from PDF to Word, there's usually some measure of error. So, Rich, I think one of the things that you've been working on a lot is helping to get these things normalized, mining content, auto-styling. And I know we've listed AI as the next step, but really, isn't there some machine learning and similar types of automation here?


Rich Dominelli

Sure. So especially in the realm of PDF, there's document segmentation, which is typically implemented with machine learning, sometimes with AI overseers.


15:58

So you start seeing tools, like Docling or MinerU, that use neural networks to try to decode the structure of the content within the document. They will identify things like mathematical or chemistry formulas, or figures, and identify the information that's embedded in them. There has been more work than you could shake a stick at trying to decode tables in PDF format. Because PDF is really designed as a publishing and printing format, table representation within a document was never oriented towards extracting the information in a meaningful way.


Instead, it was "this is how it should lay out on the page," and good luck trying to pull out what data is in that column, and which column belongs to which. So now you're seeing AI approaches, like MinerU, like DocSeg, like TableAI and TableML, that are starting to be able to pull this information out in a meaningful way. And it's interesting, because it's almost like a circular loop. Because people need to mine this information to support AI training mechanisms, a lot of the work is being done to try to get that information in a more accurate way. You've seen much more development in the last couple of years than you have in probably the previous 10 or 15.


David Turner

Gotcha. So with that normalization then, that really does set the table for the things that other people are thinking of, in terms of how they can use the LLMs and AI. Can you talk about some of these tools that we apply here, or the environments that we're building with these things? 


Rich Dominelli

Absolutely. So a lot of DCL's work, as you mentioned, is taking unstructured journal articles, or unstructured information in whatever format it's in, and extracting meaningful bits out of it: the front matter, the back matter, things like references, authors and affiliations, keyword extraction. That type of information, traditionally, DCL has excelled at, but we had been relying on pattern matching and some of the tools like spaCy and GATE, and some of the other machine learning tools out there, to pull it out.
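As a rough illustration of the kind of entity pass Rich describes (this is not DCL's production pipeline, and a real system would use custom-trained components), an off-the-shelf spaCy model can surface candidate person and organization names that a later step maps to authors and affiliations:

```python
# Rough sketch of an entity-recognition pass over article front matter.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

front_matter = (
    "Jane Smith and Robert Chen, Department of Chemistry, "
    "University of Example, Springfield."
)

doc = nlp(front_matter)
for ent in doc.ents:
    # PERSON entities are author candidates; ORG and GPE entities are
    # affiliation candidates that a later normalization step would map
    # to a controlled list.
    print(ent.label_, ent.text)
```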


But AIs have really made things a lot easier. We're doing a project for a medical journal where we're using AI to normalize and pull structured information, like authors and affiliations, and references, out in a consistent way, because most of these publications, and even the indexes, have a standard that they want to meet, but the authors can't be bothered to meet those standards from their side. So they hire conversion companies like DCL and other companies to normalize the information, to make sure, accuracy-wise, that all the information that they require is there, or to kick it back, or flag potential problems ahead of time.


David Turner

And then it allows us to set up everything successfully so that when we're actually in production, we can get things processed, we can get things set up, we can get the QC done, all to really meet that client's need at scale. And I think, to that point, that sets up the little case study/demo we're going to do today. We recently had to work on a project that dealt with a bunch of legal content, right? So this business case is about optimizing content so that an LLM can do its job better, where we basically took this large multi-district litigation case.


19:56

It basically involved bringing a lot of different lawsuits together with a whole ton of plaintiffs and a whole ton of defendants, and things like that. The issue here is that because all this content was in Word and PDF documents and tables, and things like this, there was critical information just buried all over the place.


The knowledge that you were trying to get to, to get you the answers, was very fragmented. And when you're in the middle of a case like this, there's just lots and lots of time pressure, if you will. So there's this sense of, and I tried to create it with this slide, just a sense of everything being all over the place. It's like being in that messy playroom. And when you're the lawyer, or you're working on the case in whatever way for the law firm, and you're being asked to go and find this information, you're having to dig. Now, maybe if you had it in a database or something, you could run a SQL query, but typically the people that are on the front lines, they're not doing any of that. So I don't know. Rich, you worked on this mostly. Why don't you take another minute or two to talk a little bit about the use case and what you were asked to do here? And then I'll hit a quick solution slide, and then we'll jump into your demo.


Rich Dominelli

Sure. So in this particular case, first of all, this is going to apply to a lot of different documents, but what we needed to do is support an interactive query system for a law firm handling a very large litigation. In this particular case, it's PFAS chemicals and firefighters. These are firefighters suffering tragic injuries, cancers of varying sorts, because of exposure to firefighting foam, as well as fireproof gear. Because there are so many plaintiffs within this case (the sample document we're going to show today has around 400 plaintiffs just listed out), the nature of their injuries, where they're located, and what they were doing at the time of the injuries is difficult to surface and keep track of.


So the idea is that, taking this information, loading it within a document corpus, in this case, to support an AI-based query tool makes a lot of sense because it allows the person to work with the data in a way that doesn't require them to be knowledgeable about how the data's being stored, what structures are there. Or even, we don't need to develop new functionality programmatically to support additional queries or additional lookups. We've worked on other cases where, when the attorney had a question about something, we had to generate SQL and try to answer the question as best we can. In this particular case, the AI is doing a lot of heavy lifting. That being said, we're also using AI in much of the way I indicated before, where we're extracting information out of these documents ahead of time, using AI, to make it easier to query it after the fact. 


David Turner

Yeah. And I would think that this has got to be one of these examples where if you just start with the AI on the backend, and a user just opens up and says "Hey, search our company SharePoint for this information," there could be some trust issues, because it's just grabbing details and it may or may not grab the exact right detail. It'll give you something red or something that's a Lego, but maybe not the red 2X3 Lego. 


23:55

Rich Dominelli

So what we see, and there's been a lot of research in this: retrieval augmented generation is the commonly used term for having AIs work with document corpuses. The more accurate that retrieval is, the better your results are going to be. The early versions of that just used vector searches to match the person's prompt against all of the data in the documents. The problem with that is that vector similarity can be a tricky number to work with. Let's say you're looking for a particular plaintiff with the last name Brown; that last name may appear multiple times for other plaintiffs, and the AI will have a harder time teasing out the details. So what seems to work better is constraining the results of the AI as much as possible. And the mechanism for doing that is storing the information using both vector storage, in other words, each line of the document is converted into a form that's easy for the AI to query against, and metadata, extracting the key query points, so it almost becomes a hybrid search.
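A deliberately simplified sketch of that hybrid idea follows: filter on extracted metadata first, then rank the surviving chunks by vector similarity. The toy data and the stand-in embed() function are assumptions for illustration; a real system would call an embedding model and a vector store.

```python
# Simplified hybrid retrieval: constrain by extracted metadata, then rank
# by vector similarity. embed() is a stand-in for a real embedding model.
import numpy as np

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))  # stand-in only
    return rng.random(8)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

chunks = [
    {"text": "Michael Brown, firefighter, Mississippi ...", "party": "Michael Brown"},
    {"text": "Larry Turner, firefighter, prostate cancer ...", "party": "Larry Turner"},
    {"text": "General allegations applying to all plaintiffs ...", "party": None},
]

def hybrid_search(query: str, party: str | None, top_k: int = 3):
    # 1. Metadata constraint: keep chunks tagged with the requested party,
    #    plus untagged general chunks.
    candidates = [c for c in chunks if party is None or c["party"] in (party, None)]
    # 2. Vector ranking within the constrained set.
    q = embed(query)
    ranked = sorted(candidates, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:top_k]

print(hybrid_search("What is Michael Brown's injury?", party="Michael Brown"))
```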


The easiest analogy I can give you is, if you've worked with Google, and I think everybody has, or any sort of web search tool, you want to make sure that those first few search results that come back, for whatever you entered as your query, are the most accurate and hit the key points that the person is interested in. And the way you do that is through keyword extraction, through metadata analysis, through the traditional means that were already being used for a lot of these metadata extractions, which were briefly ignored during the AI ramp-up because everybody thought AIs were going to solve everything, and which are now starting to become pertinent again. Because on the other side of things, now that you have AIs with much larger context windows, and a context window basically means how much the AI can have in memory at any given time, people are discovering that the accuracy of the AI's analysis of whatever you stick into it goes down dramatically as the amount of information it's expected to look at grows.


So this particular document that we're using as an example is 100-and-change pages long. I think it's 150 pages or so. And even with that, just loading the whole document into the AI is absolutely possible. Getting accurate information out of it directly at that point starts getting a little wonky. My favorite example for that is, there is a website called A Thousand Names. It's a great, useful testing website if you're a developer, because literally it will just give you a list of a thousand names, nothing else. The names mean nothing. It's a great way to find test data. If you're a writer, it's a great way to find character names for a book, or whatever. If you take that thousand names and paste it into an AI, and you ask it "How many names are in this list?", it will never come back with a thousand –


David Turner

Interesting. 


Rich Dominelli

– ever. So if you start with a hundred, it'll come back with a hundred, 200, no problem. Between 200 and 300 is where things start getting weird, where you start getting different answers. So you'll get, sometimes it'll be accurate at 300, sometimes it'll say 280 or a different number. Or if you ask "Is this name in the list?" Sometimes it will be correct, sometimes it won't. And it's those little use cases and demos that make people start realizing "Well, maybe AI is not solving the be-all end-all.


27:59

Maybe if I can get my information into small, easy-to-digest chunks, or have enough information surrounding my data, then I'll know when something is wrong." That's very important. As you mentioned earlier, we're using AI for a lot of ingestion things. We're using AI for a lot of structured extractions. What we discovered very early on is, we need a way of validating the data. We need a way of making sure that what the AI is coming back with is accurate. So a very simple one to do is counts. If we're looking at a paper and we know the paper has 30 references at the end of it, we check the list of references that the AI extracted. Are there 30? Nope. Okay. We need to rerun that prompt, or feed it to a different AI, or break it up into smaller pieces.
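That count check can be as small as the sketch below; the extract function is a placeholder for whatever LLM extraction call the pipeline actually makes.

```python
# Sketch of a count-based validation loop around an LLM extraction step.
# extract_fn stands in for the real LLM call, which may return wrong counts.

def validated_extract(extract_fn, document_text: str, expected_count: int,
                      max_attempts: int = 3) -> list:
    """Re-run the extraction until the item count matches a count we can
    establish independently (e.g., by counting numbered references)."""
    items = []
    for _ in range(max_attempts):
        items = extract_fn(document_text)
        if len(items) == expected_count:
            return items
    raise ValueError(
        f"Extraction returned {len(items)} items, expected {expected_count}; "
        "re-prompt, try another model, or flag for human review."
    )
```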


David Turner

Excellent example. All right. Well, I'm going to jump into this demo just really quickly before we get there, just to summarize what we were talking about, in terms of the solution that we put together here. One of the first things that we did is, we just started by creating a centralized structured data store. Sometimes people don't actually bring this part up, but there's wisdom in just instructing your LLM where to look. Here are the things that we want to look at. And in structuring those items so that it can find them. Then we also worked on creating some AI- and LLM-powered – well, I should say interaction. Sorry, I guess I typed a little too fast, but integration might be a new word. And then of course, relying on the power of RAG.


So, let's talk about this demo. I think what we're trying to say here is, and don't get us wrong, this is not a demo of us trying to sell you something, or some completed product; that's not what this is about. We're trying to just show an example of "Hey, here's something that we did behind the scenes, where we helped make sense of a pile of documents where just applying generic AI tools on top was not doing the job. And so we did some things there to fix that." So, Rich, if you'll just start again with what the big ask of this was, and take over, and we'll go from there.


Rich Dominelli

So the big ask was to be able to –


David Turner

Oh, and I need you to share your screen, right? 


Rich Dominelli

I will do that right now. So the big ask was, we have this litigation. The litigation has hundreds of plaintiffs in it. How do we track the information for the plaintiffs? How do we get the information into a format that's easy to use, easy to work with? So this is our document. It consists of a huge litigation section at the top. And then it has a bunch of numbered paragraphs underneath. So the first thing we do is, we take the PDF file and we use traditional tools to convert it to markdown. When we convert it to markdown, it's in a more text-oriented form. It's usually not perfectly clean at that point. So what we find that we have to do is, after we convert to markdown, we go through it and clean up the information within it: make sure that the names are consistently spelled, make sure that any blank lines are skipped, and that type of thing. Once we're done –
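A sketch of that first conversion-and-cleanup pass is below. The Docling call reflects its converter API as we understand it, and the filename and cleanup rules are illustrative placeholders; real cleanup rules are document-specific.

```python
# Sketch: PDF to markdown with Docling, then a toy cleanup pass.
import re
from docling.document_converter import DocumentConverter

result = DocumentConverter().convert("complaint.pdf")   # hypothetical file
markdown = result.document.export_to_markdown()

def clean(md: str) -> str:
    lines = [ln.rstrip() for ln in md.splitlines() if ln.strip()]  # drop blank lines
    joined, pending_number = [], ""
    for ln in lines:
        # Re-join numbered paragraphs split by a hard line wrap,
        # e.g. "12." on one line and the sentence on the next.
        if re.fullmatch(r"\d+\.", ln):
            pending_number = ln + " "
        else:
            joined.append(pending_number + ln)
            pending_number = ""
    return "\n".join(joined)

print(clean(markdown)[:500])
```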


David Turner

Are you using any AI for any of that part?


Rich Dominelli

At that point, no. We're doing our traditional pattern matching and that piece of information is being loaded into the database. And at that point, we do use AI to do some extraction. So, the first thing you're seeing here is the database diagram, listing the information that we're loading in from that document.


32:00

So first of all, this is a typical entity relationship diagram. We have court cases, and court cases have documents, and documents have chunks within them. And then we have this party table. For this party table, we are using traditional pattern matching and entity recognition to extract the list of plaintiffs out of the document.
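In rough terms, that entity model can be expressed as a handful of tables; the names and columns below are illustrative assumptions, not the production schema.

```python
# Illustrative schema for the model described above: cases have documents,
# documents have chunks, and extracted parties link back to a case.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS court_case (
    case_id   SERIAL PRIMARY KEY,
    docket_no TEXT,
    title     TEXT
);
CREATE TABLE IF NOT EXISTS document (
    document_id SERIAL PRIMARY KEY,
    case_id     INTEGER REFERENCES court_case(case_id),
    source_uri  TEXT
);
CREATE TABLE IF NOT EXISTS chunk (
    chunk_id    SERIAL PRIMARY KEY,
    document_id INTEGER REFERENCES document(document_id),
    line_no     INTEGER,
    text        TEXT
);
CREATE TABLE IF NOT EXISTS party (
    party_id   SERIAL PRIMARY KEY,
    case_id    INTEGER REFERENCES court_case(case_id),
    full_name  TEXT,
    occupation TEXT,
    location   TEXT,
    injury     TEXT
);
"""

with psycopg2.connect("dbname=litigation") as conn:  # hypothetical connection
    with conn.cursor() as cur:
        cur.execute(DDL)
```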


David Turner

Hold on a second. I can hear right now, some people are looking at this and they're going "What is this thing on my screen? I'm not a database nerd."


Rich Dominelli

Okay. So –


David Turner

"How do you expect me to use this? Where's the AI?"


Rich Dominelli

Okay. So, we're going to get there. So the next thing we want to do is, we want to identify those chunks within the document that pertain to a particular party. So for each of the plaintiffs that we've identified in the document, we're iterating through them, and we're using AI to start looking at unstructured information and turn it into structured information. So in this particular case, this is our query right here. This is a canned query at the top. Let me move my mouse over here so I can highlight it. So essentially what you're seeing is, this is canned. We replace the plaintiff name for each of the plaintiffs we have in the database, and we do an AI call. At the bottom of this, these two lines come from the document chunks that we've extracted from the data. And this is literally baby steps of: does this string appear in this document chunk?


The AI comes back with a structured result, which is what you're seeing here, which says Michael Brown, where he's located, what his occupation was, what chunks of the document actually identify Michael Brown as a military or civilian firefighter. In this case, it was line 12 of it. Where does he live? What is the nature of his injury? All of that is done as part of the ingestion step. The information is loaded into the database. And the reason why we load it into the database is, now we have concrete elements that we can query against and concrete pieces of information that are loaded into additional tables. 
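A stripped-down version of that per-plaintiff extraction step might look like the following; the prompt wording, JSON field names, and the call_llm() helper are illustrative stand-ins for the production prompt and model call.

```python
# Sketch of the per-plaintiff extraction pass: a canned prompt with the
# plaintiff's name and the relevant chunks substituted in, asking for a
# structured JSON answer. call_llm() stands in for the real model call.
import json

EXTRACTION_PROMPT = """\
From the document excerpts below, return JSON with the fields
"name", "occupation", "location", "injury", and "source_lines"
for the plaintiff named {plaintiff}.

{chunks}
"""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for the real LLM API call")

def extract_party(plaintiff: str, chunks: list) -> dict:
    prompt = EXTRACTION_PROMPT.format(plaintiff=plaintiff, chunks="\n".join(chunks))
    # Expected shape, e.g. {"name": "Michael Brown", "occupation": "firefighter",
    # "location": "...", "injury": "...", "source_lines": [12]}
    return json.loads(call_llm(prompt))
```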


David Turner

Essentially, instead of having a big mess of data around there, you've created that Lego storage that we looked at before. You've got things that are established in places with labels. We know that this is a name, this is an occupation, and this is – 


Rich Dominelli

Exactly. 


David Turner

Okay. 


Rich Dominelli

Exactly. And all of those become entities and rows within the database. Now, this is horrible to work with, unless you're a database nerd, like I am, but this is not what an average person wants to ever use for their day-to-day lives. And I don't – 


David Turner

So all you non-IT people out there, again, this is all setup. This is all before it gets to you. We're getting to the good part. Getting there. 


Rich Dominelli

Let's talk about the second piece of AI. So what you're seeing here is Claude Desktop. Now, what we did is, after we loaded all the information into the database, we put a model context protocol, or MCP, interface in front of it. And all that means is, it's a way to basically say "Here's a tool that an AI can work with." So what you see is, I've asked Claude – and this, behind the scenes, is calling the Claude LLM – "What cases do you know about?" And what it did is, it knows that it has a tool named Cases, and it, behind the scenes, queried the database and came back with information. And this is the tool call you're seeing right here. And I'll shrink that back down so you can actually see the English version of this.
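For anyone curious what that interface looks like in code, here is a minimal sketch using the Python MCP SDK's FastMCP helper; the tool names, SQL, and connection string are illustrative assumptions. Claude Desktop is then pointed at the server through its MCP server configuration.

```python
# Minimal sketch of an MCP server exposing the litigation database as tools.
import psycopg2
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("litigation-cases")

def query(sql: str, params=()):
    with psycopg2.connect("dbname=litigation") as conn:  # hypothetical DSN
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()

@mcp.tool()
def list_cases() -> list:
    """List the court cases loaded into the corpus."""
    return query("SELECT docket_no, title FROM court_case")

@mcp.tool()
def find_plaintiffs(last_name: str) -> list:
    """Find plaintiffs whose name contains the given string."""
    return query(
        "SELECT full_name, occupation, location, injury FROM party "
        "WHERE full_name ILIKE %s",
        (f"%{last_name}%",),
    )

if __name__ == "__main__":
    mcp.run()  # Claude Desktop launches this over stdio via its config file
```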


36:03

So now you see that there's an AFFF multi-district litigation. This is the docket number. This is all information that was loaded as part of the ingestion step, but it's presented in a much more user-friendly way. So are there any questions you would like to ask, David?


David Turner

Oh, sure. Let's see. Are there any plaintiffs with the last name "Turner"? 


Rich Dominelli

Sure thing. So – 


David Turner

See if I got a cousin out there that's on the lawsuit. 


Rich Dominelli

Sorry, you said "plaintiffs." I can't spell today. 


David Turner

It's T-U-R-N-E-R. [Laughs]


Rich Dominelli

Okay. So again, this is a baby MCP. And this is running on a consumer-grade laptop, so it would be much faster in an actual server environment. So what it's doing is, it's going out and it's listing all of the plaintiffs that it knows about in the database, and it's going to come back with any that have the name "Turner."


David Turner

And this is reliable. We know that because this is looking at our structured Lego storage, we know it's pulling us a red 2X3 brick. 


Rich Dominelli

Exactly. This is going against our database. It's only using the information that's in the database that was extracted from our documents and are now part of our corpus of this information. So let's see what's wrong with Larry Turner. 


David Turner

Larry Turner. Actually, I don't know a Larry, but okay. He must be a cousin. 


Rich Dominelli

So again, no SQL. What's going on behind the scenes is, Claude is figuring out how to get the information we care about. 


David Turner

And obviously it's taking a little bit of time on this, because this is basically a demo setup here. But again, compared to the time it would take to sort through however many documents to find that information, the savings are probably pretty significant. And I'm guessing that you could do some additional searches. Well, we'll get there in a second. What would we find?


Rich Dominelli

So he was a firefighter. He has prostate cancer, unfortunately. He lives in Mississippi. So why don't we do something a little more broad? What others live in Mississippi? I can never spell "Mississippi." It was a terrible time in second grade. 


David Turner

And I guess too, the other nice thing about it is, again, it does use the natural language here. So you're able to make these queries as a lawyer, or whatever, out there looking for this, without having to write a SQL query or something.


Rich Dominelli

So here's the list. There's four people who live in Mississippi and here's the cities and states. No SQL involved, no technical knowledge required. We can ask it to summarize. This just found four people in Mississippi, but we could also ask it to give us how many people have bladder cancer or prostate cancer as a result of their injuries. So this gives users a much easier time. What other queries would you like me to run? 


David Turner

Well, I don't know. Can you do a two part? What was the nature of the injury and whatever? I don't know. Something like that. Maybe it has like two questions in it. 


40:03

Rich Dominelli

What is Herbert's injury and what was his unit? How's that? 


David Turner

Yeah, sure. 


Rich Dominelli

So in this particular case, notice I didn't actually say Herbert's last name. The LLM, based on the context of my query interaction, knows that I was talking about Herbert Hamilton because he happened to appear in the previous list. And it came back with he has kidney cancer – 


David Turner

More quickly too. 


Rich Dominelli

Yes. Yeah. 


David Turner

You had mentioned that since doing this little proof of concept that there'd been some further development by some others, what are some other things that have been developed and added on, in terms of maybe enhanced functionality, enhanced speed, to something like this? 


Rich Dominelli

So there are a couple of other items. We have another prototype, which unfortunately I don't have in an easy-to-demo form, that runs against MarkLogic, which is considerably snappier than this. We also have the ability to review and interact with the data. So if there are minor spelling differences within the context of the document, about locations or users, the user can correct that without having to re-OCR the document, that type of change. One of the things that we needed to iterate back on is identifying and appending general language. So if at the end of the document it says "Everybody prior to this point also suffers from X," recognizing that language and then ingesting it and adding it to our corpus is something that the LLM is uniquely qualified to do. Because it understands the language, it says "Oh, okay, this is going to apply to everyone prior to this point within the document."


David Turner

Gotcha. Let's jump over here. I'm going to take over the sharing again, if I can. I think you have to stop and then I start, maybe. I don't know, maybe. There we go. Thank you, Marianne. 


Rich Dominelli

Our producer handled it. 


David Turner

Yep. I love it here. 


Rich Dominelli

What you're seeing here is our ingestion flow. So we start with the PDF. The PDFs are being published to us through an RSS feed that most of the court systems publish. We're using Docling in this case. There's a second version of Docling out there, called Granite Docling. This is a publicly available, open source toolkit that IBM has put out for understanding PDF structure. We're converting it to markdown. We do a cleanup step, which is very important. Then we do an extraction step. And this is what I was mentioning before, where we're extracting the plaintiffs. We're making sure that the naming is correct. If for some reason a name was broken up because it was word-wrapped in the document, we fix that there. And then we're also getting the numbered lines within the document. We're tossing it into a Postgres database. Then we iterate through the Postgres database, and we have this RAG builder step.


And in this particular case, we're using an LLM to do that structured data extraction for things like occupation, location, and nature of the injury. And in this case, we're calling against Gemini. For this project, we used Gemini. You can also use ChatGPT or Grok or Anthropic. It comes back with structured information, and that gets loaded into the Postgres database as well. Then down at the bottom here, we have this MCP server, and that honestly is almost no code whatsoever.


44:02

It's almost just a description and configuration for Claude Desktop, that basically says "Here's a tool, here's what the tool's doing behind the scenes." And then you configure Claude Desktop to be able to call those tools. And that gives us that UI that we're able to interact with and talk to.


David Turner

All right. So those of you who are listening today, if you've got an application that you're trying to develop, or a need like this that you're trying to meet, just contact us and we'll let Rich sit down with you and talk through some of these things. Maybe we can help you in some way. Rich, I'm also wondering, is there any place in here for comparing multiple LLMs, pitting them against each other and scoring results? Has that been any part of it? 


Rich Dominelli

So there are a lot of different ways of validating LLM results. The LLMs, even the state-of-the-art ones, are still not what they call deterministic. In other words, if you ask the same question multiple times, you will get multiple different answers. Anthropic did a demonstration of this, where they asked an LLM for a biography of Richard Feynman 50 times and got 50 unique answers back. So LLM-as-a-judge is the mechanism you're referencing. That typically would end up being in the RAG builder piece. And then you want to make sure the information is as clean as possible. And then, in theory, we can put a layer between the MCP server and Claude, if we'd like, to also run LLM-as-a-judge to make sure that what's coming back is accurate.


Unfortunately, because the LLMs are cloud hosted, there is a latency involved. So if you can preload that as early in the process as possible, it's less antagonistic to the user. You don't want somebody sitting there while four LLMs are duking it out over which one has the right answer, and that type of thing. You really want to make that determination as early in the process as possible. So if you can do as much of it in the RAG builder step, that would be great. And that's one of the things that we've adopted for our other ingestion and structured data extraction pieces: a voting engine where, even with the same LLM, we'll try running the same prompt against it multiple times, or we'll run a second check prompt against it. So, "How many references were in this document?" "Oh, there are 30 references in this document." "How many authors?" That type of thing. "How many plaintiffs are in this document?"
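A bare-bones version of that voting idea is sketched below; ask_fn stands in for the real model call, which is not deterministic, so repeated runs can disagree.

```python
# Sketch of the "ask more than once and vote" check described above.
from collections import Counter

def majority_answer(ask_fn, prompt: str, runs: int = 3, min_agreement: int = 2) -> str:
    """Run the same prompt several times and accept the answer only when
    enough runs agree; otherwise flag for review."""
    answers = Counter(ask_fn(prompt) for _ in range(runs))
    answer, votes = answers.most_common(1)[0]
    if votes >= min_agreement:
        return answer
    raise ValueError(f"No consensus across {runs} runs: {dict(answers)}; flag for review.")
```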


So you want to make sure that you have as many checks as possible, and never, ever take the LLM's answer at face value. You want to make sure that it gives you an indicator of where the data came from and how it got the answer it did, and that it's not just relying on training. It needs to be able to come back and say "Okay, I know there's X number of people here, or X number of things here, and this is why I know that."


David Turner

Excellent. All right. I'm going to transition here. Just hit a couple other real-world examples of some things that are going on that we just would encourage you to check out. We weren't involved with this project in any way, but many of you might've seen this in the news. I think Google put out a big press release and a video about it that you might want to check out, but it talks about how they used AI, coupled with RAG, to really create these trustworthy results for these doctors over in, what, the Netherlands? Yeah, I think. 


48:00

Rich Dominelli

Yes, I believe so. 


David Turner

And really were able to get very significantly improved results. Rich, are you familiar with this one at all? 


Rich Dominelli

Absolutely. So PubMed is one of the largest indexes of medical journals and articles in existence. They teamed up with – Princess Maxima teamed up with Google to take the PubMed data and build a RAG system to support pediatric oncology research, to basically identify the most promising treatments based on all of the articles and data. It's a fascinating case study. I encourage everybody to read up on it. There's a great intro, non-technical video, and they have a couple of GitHub links on how that works. But essentially, what's happening behind the scenes is, they've given a Claude desktop-esque interface to pediatric oncologists to summarize latest research on PubMed data, to identify journal articles that have been cited frequently.


And it's an interesting study also because they are going through the full mechanism. Not only are they doing the vector database search, using vector embeddings to try to figure out what the person's looking for, they're also using a lot of hybrid methodologies, where keywords and journal rankings, and how old a particular article is, all play into what results get fed into the summary and what is exposed to the users. So it really highlights the importance of having not only the text of these documents, but also whatever metadata you can extract from them as structured information.
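As a toy illustration of that kind of blended ranking (the weights and the signal set here are invented for the example, not the actual formula used in that project), each candidate article's score can combine vector similarity with keyword overlap, citation count, and recency:

```python
# Toy blended-ranking function; weights and signals are invented for
# illustration only.
from datetime import date

def hybrid_score(vector_similarity: float, keyword_overlap: float,
                 citation_count: int, published: date) -> float:
    age_years = (date.today() - published).days / 365.0
    recency = 1.0 / (1.0 + age_years)            # newer articles score higher
    citations = min(citation_count, 500) / 500   # cap so one signal can't dominate
    return (0.5 * vector_similarity + 0.2 * keyword_overlap
            + 0.2 * citations + 0.1 * recency)

print(hybrid_score(0.82, 0.60, citation_count=140, published=date(2022, 5, 1)))
```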


David Turner

Love it. Love it. One other example that we'll just share, this is one that's personal to us. As accessibility requirements roll out around the world, we're seeing a lot of interest in creating alt text and its companion, long descriptions, for images. And a lot of tools have these automated "Oh, hey, we can look at your image" features, and they'll say "That's a man on a horse. Image of a person in a room with a blue shirt." But when you're dealing with really complex data, like we deal with for these scientific publishers, what do you do when you get an image like this?


And so Rich went to work and started finding out whether there was a way that we could use GenAI to get a really trustworthy alt text description, and not only an alt text description, but a long-form long description that could be used without any human interaction. And we're pretty pleased with the results. I think, Rich, on this particular one here, you can see it's got all sorts of details and panels. You've got a person playing the cello inside an MRI machine, and all this. But talk a little bit about what you did, and then I'll show the results of what you got, after you talk quickly about your methodology here.


Rich Dominelli

So the interesting thing about it is, multimodal AI is probably about 18 months old right now. And when ChatGPT first released this, they had this fantastic demonstration where somebody had a picture of a broken bicycle, and they fed it to ChatGPT and said "What is wrong with this bicycle? What can be done to fix it?" And ChatGPT identified all of the problems with the bicycle and came back with it. So I did that as a demonstration. I do a lot of evangelism for AI within the company, as you can tell.


52:01

And Mark Gross, the CEO, said "Hey, it would be great. We have a lot of publishers that are asking us right now, is there a way of generating alt text and accessibility information from these documents, especially these older corpuses, that doesn't require human interaction?"


So we went through, we did a number of prompts, and it's taken off since then. We've had all sorts of different documents come through, and the AI does a really good job of recognizing what's in the picture. But the secret sauce, the difference between somebody who's doing this by hand, somebody looking at a document and writing a description, versus handing the AI a picture with no context, is that context piece. So now what we've done is, we've taken it a step further: we take the text that's surrounding the picture and include that with what we're feeding the AI, and the results are much, much better.
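A schematic version of that context-aware alt-text call is below; the prompt wording and the call_multimodal_llm() helper are placeholders, and the point is simply what gets packed into the request alongside the image.

```python
# Schematic sketch of context-aware alt-text generation: the image is sent
# together with its caption and surrounding paragraphs, not on its own.
import base64
from pathlib import Path

def call_multimodal_llm(prompt: str, image_b64: str) -> str:
    raise NotImplementedError("placeholder for the real vision-model call")

def describe_figure(image_path: str, caption: str, before: str, after: str,
                    article_title: str) -> str:
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    prompt = (
        f"Article: {article_title}\n"
        f"Paragraph before the figure: {before}\n"
        f"Figure caption: {caption}\n"
        f"Paragraph after the figure: {after}\n"
        "Write a one-sentence alt text and a detailed long description "
        "of this figure for accessibility purposes."
    )
    return call_multimodal_llm(prompt, image_b64)
```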


David Turner

Absolutely. And these are the results that we're getting. And I should say, by the way, I should have mentioned this at the beginning, this is from a PNAS journal. It is not something that PNAS asked us to do. We found it publicly available, there's been a whole set of publicly available journals from several different journal providers, and we pulled those and ran these tests on them. So I did just want to throw that out. The result we got, here's the short alt text for it. It's pretty good. But to me, the real sizzle, if you will, is what it gives you in terms of, without any human intervention whatsoever, it created this very detailed, robust long description.


Now, you could sit there and try to read this while we're giving this presentation, but let's just say, we've shown this to a few people that were medical publishers and they were very pleased. And I think you're right, Rich, it's because, again, they weren't looking across all journals and weren't looking across all images. There's data that's there from that, but really it was specifically focused on, what's in and around this image? What does the caption say? What is the paragraph before? What is the paragraph after? And what's the article about? All of those different pieces. And I think that's what really helped create this accuracy. 


Rich Dominelli

So now the interesting thing about this also is, it opens up even broader search possibilities. Because now you have a document, and the document has figures and illustrations within it, you can search not only the text of that document, which is traditional search, but you can actually search descriptions of the images within that document, and get accurate results. So it gives you the possibility to do multimodal search very cheaply, essentially. Because you do the legwork upfront, it's done at the ingestion step, and then it becomes part of the document representation in your corpus, and you can search against it and come back with information very quickly.


David Turner

Love it. Well, we are a couple of minutes away from the top of the hour, and of course, we've gone on longer than we're supposed to. But we do have a quick poll. Marianne, shall I turn it back over to you and let you throw up the poll? 


Marianne Calilhanna

We do have a number of questions. I feel like we should maybe go through some questions. 


David Turner

Let's jump to those instead.

 

Marianne Calilhanna

Why don't we do that? So, Rich, in your demo, you used a PDF that's in the public domain; that was not proprietary.


56:01

What if someone had other formats, like Excel spreadsheets or PowerPoint presentations?


Rich Dominelli

So as somebody who has suffered through all of these document formats, if you give it to us in electronic format, normally we can get it into a format that the LLMs will understand. We have a tool called Harmonizer, which I'm not pitching right now, but we ingest everything, PowerPoint, Excel, Word, RTF, right? If it's an electronic format and we can convert it to something we can feed the LLMs, we'll do it. 


Marianne Calilhanna

There was a question about processing confidential information. I just do want to reiterate that the –


Rich Dominelli

Sure, okay.


Marianne Calilhanna

Can you speak to that? 


Rich Dominelli

I can certainly talk to that a little bit because we've had a number of go-arounds. Most of the LLMs that are cloud-based right now, if you are paying for the LLM in some way, shape or form, they have a no train clause as part of your payment. The information that you're sending the LLM is encrypted. The LLM has access to it, but they promise not to store it for training purposes as long as you're paying for it. If you're doing the unpaid tier, all bets are off. So pay for your LLMs, it's worth it.


Marianne Calilhanna

All right. Well, we have come to the top of the hour. We are just about there. If you could just do a quick Next – hit the arrow button, David – we want to be sure to thank everyone who's taken time out of their day to join us. The DCL Learning Series comprises webinars like this, but we also have a monthly newsletter and a blog, and it is our goal to open up discussion around these things: structured content, XML standards, AI. If you have any ideas, please reach out to us. If you'd like to continue this conversation privately, or if you can think of use cases you'd like to explore, let us know. We're happy to help. For today, it's time for all of us to return to our regular work schedules, and we hope to see you in future webinars. Have a great day.


