
DCL Learning Series
Genius Without the Gibberish: How RAG and Structured Content Boost Generative AI Reliability
Marianne Calilhanna
Hello, and welcome to the DCL Learning Series. Today's webinar, our first of 2025, is titled "Genius Without the Gibberish: How RAG and Structured Content Boost Generative AI Reliability." My name is Marianne Calilhanna, and I'm the VP at Data Conversion Laboratory. Before we begin, I want to let you know this session is being recorded and will be available in the on-demand section of our website at dataconversionlaboratory.com. We will absolutely save time at the end of this conversation to answer any questions you have, but please feel free to submit your questions, your comments, your thoughts via the questions dialog box in the platform.
So before we begin, I do want to briefly introduce my company, Data Conversion Laboratory, or DCL, as we are also known. We are the industry-leading XML conversion provider. We offer services that involve structuring content and data to support our customers' content management and distribution endeavors. Increasingly, we help prepare businesses to be AI ready. DCL's core mission is transforming content and data into the formats our customers need to be competitive in business. We believe that well-structured content is fundamental to fostering innovation and foundational for your own AI readiness.
So, today's speakers: I'm thrilled to introduce DCL's power trio. We have Mark Gross, president of Data Conversion Laboratory; Tammy Bilitzky, CIO; and Rich Dominelli, senior solutions architect here at DCL. Honestly, you can throw any complex data content challenge at them and they will deliver a solution. So, it's an honor to work with these three, and I'm so thrilled to have them here today to discuss RAG with you. Mark, I'm going to turn it over to you.
Mark Gross
Oh, I'll turn on my camera. There I go. This is an interesting quote from Gartner. It says by 2025, and the date on it is October 2024, so they're already prognosticating: 100% of generative AI virtual customer assistants and virtual agents that lack integration with modern knowledge management will fail to meet their customer experience goals. And that's probably true. I mean, if you look at the chatbots you go through and the automated things you're working with, unless they're very, very sophisticated and really have a lot of information at hand, they're not a very pleasant experience. And I think what's happening is, I mean, we've been doing data for a long time. We've been working with all kinds of data systems for over 40 years, and every few years a new challenge comes along, and a new opportunity. And the opportunity of the last couple of years has been generative AI and AI in general, and how do you use these systems to make them work for you? The issue is that large language models, the source of generative AI, have not proved to be good enough.
4:06
Although, from what we talked about two years ago and at every conference I went to, this was going to totally change the world, people have been disappointed. Part of it is that while the answers look very good, they're not necessarily complete or truthful, or there are hallucinations, all those kinds of things. And the problem is that those models, first of all, don't have current information, because it takes a long time to create a model. So, it might be two or three or four years old by the time you get to it, and it won't have your specific information. So, while making your own up-to-date LLM for your own company is a theoretical possibility, it's rarely a practical one. It's an expensive process, a lengthy process, and very few companies are in a position to really spend that kind of money.
So, what we're talking about today is splitting up the roles between what LLMs do very, very well, which is generate text, and the other part, how you get up-to-date information, which is RAG. Go to the next slide, please. There it is. So the question is, why is RAG important in a business setting? Because it really supplements where LLMs alone fail. LLMs are very good at generating text and images when they have the complete information, but if they don't, funny things happen. For example, this image over here was generated entirely by, I think it was ChatGPT, and it came from a prompt. I could read part of it: can you create an image for my presentation? I need something that visually represents the key point, that LLMs are great for doing things, but they need more information. You can ask one to give you a creative way of saying a phrase, or give you an image that does this or that, but it's even better at summarizing information, and so on. So, it went and created this from that sentence and the base of information it already had.
But if it doesn't have all the information – so, think about it: LLMs are good at doing something like this, and it's summary information, but they need more information than they currently have. So, let's go to the next slide. So, key applications today of LLMs: chatbots, sometimes helpful; Google results, mostly valid, sometimes valid, you don't always know; internal LLMs, very costly, as we said. So LLMs themselves are very good at generating human-like text, but they make mistakes. If you look at these, both of these logos were actually generated by ChatGPT, and ChatGPT has its own typos there, not us; that was generated by ChatGPT when it produced this logo, and the same kind of thing with Llama 3 below. It got the llama part of the picture right, but it got the text a little bit off. So, the limitation is that it doesn't have all the information. Let's go to the next slide. This is just a test we did with a simple question asked of ChatGPT.
8:00
We told it "Use only the information available here." So we specified it, told it where the PDF is, to go look at it and print a list of officers, their positions, and their ages that appear in the section entitled "Executive Officers of the Registrant." So, that's a pretty clear question. So, let's take a look at the results that we have here. And by the way, that PDF that I told it to go to, that's this, and this is the page we told it to go to, and this is the section we told it to go to. Again, very clear, but what happened? So, here are the results. First of all, rather than grabbing and reparsing the PDF file that it had, it decided it also needed to go out to its own initial database, someplace out there. And the version it came back with looks pretty good, but there are a few things to note when you compare side by side.
For example, Michelle Browdy's age: it gave her an extra three years, which she probably doesn't appreciate. And where did it get that? The title is different because it grabbed it from someplace else. It missed a few people for whatever reason. It did get James Kavanaugh, the SVP down there; it expanded his name and also gave him about four years extra. By the way, the green is the chairman; it did get that completely right, so that's impressive. And then it added a few people. Fletcher Previn is indeed a CIO, but at Cisco, not IBM. Kenneth Keverian is a chief strategy officer at IBM who's not even in the listing, and so on. So for whatever reason, these things happen; there are hallucinations, and that stuff happens. And I'll turn this over to Tammy in a second. What RAG does, in the big picture, is take information that is truthful and feed it into ChatGPT, so that you limit the places it could get wrong information from. And to get a little more detail on that, let me turn it over to Tammy.
Tammy Bilitzky
Thank you. So, enter RAG. "RAG" stands for Retrieval Augmented Generation, and it's a term that, unlike many other technical terms, actually describes the capabilities that it offers. Mark mentioned how AI may be error-prone and hallucinate, delivering results that may or may not be correct, and this is largely not because of the LLM model itself, but rather because of the data the LLM is using to respond to your query. To explain this more clearly, a standard LLM will pull data from any source it can access, whether it's structured, unstructured, verified, or just information that's out there on the internet. And this clarifies the value of RAG and exactly what it brings to the table. RAG allows you to pass your data, ideally structured, curated, and verified, with your prompt. And it harnesses the power of the LLM by telling it to respond to the query using only the data that's being passed to it.
11:59
So in essence, RAG is enhancing AI's ability to provide accurate, relevant and consistent responses by augmenting it with external knowledge, the data that you pass it, to leverage its generative capabilities instead of relying just on the data that was used to train the model and what the model will pull from other sources. So the primary advantage that RAG offers is the ability for you to integrate your content into an existing LLM in a manageable way, reduce hallucinations, and deliver answers based on your trusted and verified sources. So, now I'm going to pass it over to Rich to go through the RAG architecture. Rich?
Rich Dominelli
Hello, everyone. So as the slide demonstrates in a simplified format, there are really two aspects to the RAG architecture. The first is the data preparation step, and this is very similar to what a lot of people do when building knowledge bases, where you're going through a list of data sources and extracting information. The information can be unstructured, it could be on external websites, it could be in your own databases, it could be in PDFs. So, we want to go through the steps necessary to take that information and make it easily retrievable. Doing that is basically going through a structuring and chunking phase, which is something we do pretty regularly around DCL, where we're taking the information, we're extracting front matter and back matter, we're extracting keywords, we're extracting as much metadata as we can out of the document, as well as subject matter, all to eventually end up in this hybrid vector database to act as our corpus.
The other thing we do now, which is new to the picture, is an embedding step. What we're doing for the embedding step is taking the body of the article and running it through a text embedding step, which identifies semantically what the article is about, and that will allow us to take the user's query and match it, based on subject matter, against the body of the article. So, all of that occurs before LLMs or the user's prompt ever enter the picture. That's the prep step. This is stuff that people are doing already for their existing knowledge bases, but this is taking it one step further. So now, we enter the RAG pipeline. So the user who is trying to use either a chatbot or query your knowledge base in some way is going to type in a prompt.
The form of that prompt, or how they do that, can vary based on your system, but we're going to take that prompt the person entered and run it through a text embedding step as well. What that does is convert the text prompt, much the same way we converted the actual body of the articles, into a format where we can do semantic matches, which is basically an array of numbers that identifies not only the words in the prompt but also its essential meaning. In addition to that, we're going to look at the prompt and extract whatever relevant data we can: keyword extractions, hard data points like referenced authors, and that type of information, things that we can use to really narrow down what we want to get out of our knowledge base. Then we take what the user entered, automatically captured in the text embedding step and those hard data points, and query our vector database.
15:54
Now the results of that query are going to be appended to the person's prompt, and we're going to modify the person's prompt to essentially say, "From the information I'm presenting here, answer my question." And what that does is eliminate the reliance on any previous training. It minimizes the amount of hallucination the LLM is going to engage in, and it's going to give you a much cleaner answer from the information that is most pertinent to the person's query. And it's coming from your data, as opposed to previous training or information that you're not aware of. So you can actually source the elements from your vector database and know where they come from. So that being said, next slide, and back to Tammy.
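To make that flow concrete, here is a minimal sketch of the query side in Python. It assumes a generic sentence-transformers embedding model, and retrieve_chunks() and call_llm() are hypothetical stand-ins for the vector-database lookup and whichever LLM API you use; it is not the specific toolchain demonstrated in the webinar.

```python
# Query-side sketch: embed the user's prompt with the same model used to index
# the corpus, retrieve the closest chunks, and prepend them to the prompt so
# the LLM answers only from that material.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def answer(user_prompt: str, top_k: int = 5) -> str:
    query_vector = embedder.encode(user_prompt).tolist()
    hits = retrieve_chunks(query_vector, top_k=top_k)   # hypothetical vector-DB query
    context = "\n\n".join(hit["text"] for hit in hits)
    augmented_prompt = (
        "Answer the question using ONLY the information below.\n\n"
        f"{context}\n\n"
        f"Question: {user_prompt}"
    )
    return call_llm(augmented_prompt)                    # hypothetical LLM call
```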
Marianne Calilhanna
Tammy, you're on mute.
Tammy Bilitzky
Good point. Thank you. All right, thanks, Rich. So let's dive a little more into the content preparation portion of that RAG architecture slide, which is critical to improving the accuracy of the AI results. The importance of this step should not be underestimated, as it allows you to define and create your RAG universe. It makes it possible for you to get accurate, reliable responses from the LLM. The key areas to focus on now are the first two steps on the architecture diagram. The first is content extraction, where you take disparate data sources in a variety of formats and quality levels, including PDF, audio, and website content, and use different methods to extract the content and turn it into quality data. At DCL, we've been doing this for years, and we use many different approaches and solutions to accurately get actionable data from print, PDF, and Word documents, and a growing area now is video transcription.
Then once you've extracted the content, the next step is to structure it and transform it from raw, extracted data into intelligent content. Intelligent content is essentially synonymous with structured content, and at DCL we typically recommend XML, because it's an attribute-rich format and remains one of the most effective structured content frameworks in use today. XML is optimized for authoring, producing and distributing technical and non-technical information, making it the ideal medium for organizing and managing your valuable content. Flipping back to our diagram for a minute, on the next slide, please. Thank you. Let's stay focused on the left side of the diagram, the data preparation. So data preparation, as we mentioned, is a core DCL service, and while we've been delivering data preparation services for decades, we're continually developing new and advanced solutions to extract and prepare the data, taking full advantage of and contributing to evolving technologies.
And it all begins with selecting the right sources and extracting quality data from these sources. Extracting the data and structuring it, again, is DCL's area of expertise. This is where we develop and optimize tools and solutions to process large repositories of varying sources. And then we transform that data into XML and populate the XML with any required metadata that can be automatically extracted from the source content and even external sources, external databases or related systems. The last step, which we touched upon a little earlier, is embedding, where we prepare the data for the LLM by converting the structured information into the format required by the model: numerical arrays.
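As an illustration of that preparation step, here is a minimal ingestion sketch in Python. The chunk size, the sentence-transformers model, and the Chunk structure are illustrative assumptions rather than DCL's actual pipeline; in practice, the resulting records would be loaded into the hybrid vector database described earlier.

```python
# Ingestion sketch: chunk each document, embed the chunks, and keep the
# extracted metadata alongside them so it can be queried later.
from dataclasses import dataclass, field
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any text-embedding model works

@dataclass
class Chunk:
    doc_id: str
    text: str
    metadata: dict
    embedding: list = field(default_factory=list)

def chunk_text(text: str, size: int = 1000) -> list[str]:
    """Naive fixed-size chunking; real pipelines often split by section instead."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(doc_id: str, body: str, metadata: dict) -> list[Chunk]:
    chunks = []
    for piece in chunk_text(body):
        vector = model.encode(piece).tolist()   # numeric representation of meaning
        chunks.append(Chunk(doc_id, piece, metadata, vector))
    return chunks  # these rows would go into the (hybrid) vector database
```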
20:03
And now I'll hand it off to Rich to explain a little bit more about that intricate process and tell us what that means. Rich? Rich, you're on mute now.
Rich Dominelli
So I am, sorry. So, text embedding is actually a big jump, because prior to Word2vec and some of these other text-embedding toolkits that have hit the market, we would extract as much information as we could out of the article and create front matter and back matter. But the essence of the article's meaning was still lost in the body of the XML that we were capturing. Text embedding takes it a step further, because now what we're doing is creating a numerical representation of each individual sentence or each individual section within the article, and we're putting it in a format that computers can understand and query against based on content. So, what this diagram is trying to represent here is how close in answer space, theoretically, each of these individual phrases is. So "canine companions say 'woof,'" the text-embedded vector, which will just be an array of numbers, is going to be close to "felines say 'meow'" and "bovine buddies say 'moo.'" But it's going to be a much further distance from "a quarterback throws a football."
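A toy illustration of what those distances look like in practice, assuming a generic sentence-transformers model; the exact scores vary by model, but the relative ordering is the point.

```python
# Compare the example phrases above in embedding space using cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
phrases = [
    "Canine companions say 'woof'",
    "Felines say 'meow'",
    "Bovine buddies say 'moo'",
    "A quarterback throws a football",
]
vectors = model.encode(phrases)
scores = util.cos_sim(vectors[0], vectors)  # similarity of the first phrase to each
# Expect the animal-sound phrases to score closer to one another than to the
# football phrase; that relative ordering is what semantic search relies on.
print(scores)
```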
It's almost like, for those of you who are old enough to remember the Dewey Decimal System as a way of organizing the library, it's almost like taking each individual part of a document or each individual part of a corpus and organizing it by detailed subject matter, so that the relationships between those parts become represented by the numerical arrays that come out of that content. So if we can go a step – next slide. What does that let you do? Well, we already have a well-understood problem space of trying to find data based on keyword extraction and relevant pieces of information that are being pulled out of the articles. This is the structured data, like references and authors and keywords and dates and institutions and funding sources, that has been the bread and butter of conversion companies, DCL being at the top of that, for years.
And the use for that is very clear as far as being able to locate content very quickly in a large corpus. I mean, we're not talking about a couple of documents here and there; we're talking about hundreds of thousands or millions of documents that you're combing through. Embedding takes it one step further, because now we're also able to look at the unstructured bits of the information, the bodies of those articles, and query directly against that in a meaningful way. Next slide, please. So if you're like me and you're reading and trying to find out what RAG is about, behind the scenes, what's really going on is as simple as what you're seeing on the screen right now. We're going out, we're doing a query against an external data source. In our example, it's a hybrid data source. It could be a graph database. It could be a lot of different sources of information, and we're taking the results of that query and appending them to the prompt.
23:52
So, looping back around to our 10-K example from earlier, instead of saying "Here's a PDF, here's the section we care about in the PDF, go find it yourself, Mr. LLM, and come back with relevant data," what we're doing is performing a query, obtaining the information out of our corpus, and appending it to the prompt that the user created. So we're rewriting the prompt to include the results of the query. In this particular case, we're rewriting the prompt to include the relevant section from the PDF and saying "Please go out and get the information using only this data." So this is the updated prompt. And at first glance, when you see an example like this, it almost looks like you're cheating. And this is a very simplified example, but the same holds true if you have a huge collection of data out there or an entire database full of 10-K information and you want to extract only the information you care about. But given that simplified example, we can go to the next slide and see what the LLM produced when given this prompt. So, now you'll notice that it has a perfect score.
So when presented with only the data we care about, which is really the secret to LLMs, it was able to actually summarize that data and give us the correct results, minus the hallucinations and the people who work for other companies, and minus any information it's trying to pull out of its training, and give us absolutely accurate results based on the data that we gave it. Next slide, please. And now back to Tammy.
Tammy Bilitzky
Thank you. So, now we've covered what RAG is and what RAG isn't. The key message that's become very clear is that the data that you pass along with your prompt, your verified, curated, structured data, is fundamental to receiving accurate responses from the LLM. Building on that, you want to make sure that you're passing all the data that is needed to get the best response. The challenge is that today, with the current technology, there are size limitations as far as how much data you can pass to the LLM with your prompt or your query. So you retrieve the data from the repositories, you merge the data with the prompt, and if there were no limitations, you could append the entire data set to your prompt to generate the results.
But context size, which is the amount of data you can pass to the LLM, is limited today. And as you can see from this slide, each LLM has different restrictions. And understanding that the data you pass to the LLM is the key to RAG, it's not surprising that LLMs are increasing their context size, and I do expect that this will continue to be a competitive area of growth for LLMs for a long time to come. But the reality is that LLMs do have limitations on the context size, which controls how much data you can pass to the LLM with your query, which offers another advantage to structured data. For example, when data is structured, we'll typically extract the graphics, which are often large, and save those as referenced files. So the structured data that we pass to the LLM may be smaller, allowing you to pass more of the relevant content and get more accurate responses. And now, I'll hand it back to Mark.
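One common way that limit shows up in code is as a token budget applied to the retrieved chunks before they are appended to the prompt. The sketch below uses OpenAI's tiktoken tokenizer as a stand-in; the actual limit and tokenizer depend on which LLM you call.

```python
# Sketch of fitting retrieved chunks into a model's context window.
# The cl100k_base encoding and the 8,000-token budget are illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fit_to_budget(chunks: list[str], max_tokens: int = 8000) -> list[str]:
    kept, used = [], 0
    for chunk in chunks:                 # chunks arrive ranked by relevance
        n = len(enc.encode(chunk))
        if used + n > max_tokens:
            break
        kept.append(chunk)
        used += n
    return kept                          # only what fits travels with the prompt
```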
Mark Gross
So, I love the way Rich introduced that; it's almost like cheating. You're giving it all the answers, but it's really very simple.
You could just as easily have given it that entire annual report as a source, which would be 25 or 30 or 40 pages, and it still probably would've come back with similar results, because it's now not looking at external sources. And RAG, as we're talking about it, really gives you that ability. The question has always been, what do I do with the stuff that's behind my firewall? What do I do with my private data? What do I do with all those things that are really my competitive edge? And the answer is that becomes very valuable. Your data repository is still valuable. It's probably even more valuable than it ever was before, because people can now get to it and be able to get information from things that only you have in your repository that others don't. So, I think there's a lot to be said over here, and I think it's all very exciting, a new frontier of how to use information. So Rich, Tammy, what do you think? Is structured content the future of AI? Is this where we're going?
Tammy Bilitzky
I guess I can start that off. I strongly believe that the answer to this right now is yes. There may be, and there will be, different techniques applied over time to mine data, vet out relevant data, and even to verify data. Following the patterns of technology we've seen evolve over time, RAG will continue to evolve, and it's likely going to be replaced or transformed by newer methodologies, newer technologies. But regardless, at the core, data, quality data, remains your key to AI, and being able to pass your structured, verified data to the LLM is likely to continue, for the foreseeable future, to deliver faster, more consistent and more accurate results. Rich, what are your thoughts?
Rich Dominelli
So I absolutely think it is. But my reasoning for it is that coming into the LLM space and looking at what was out there, I expected, and I think a lot of people have this expectation, to be able to ask it anything, and it's going to go out and find the answers and the information about what I'm asking it. So, I like to use this example of the 10-Ks because it's one of those problems that we tried to tackle a number of years ago, and it had its challenges because of formatting and other things like that. And I was routinely disappointed, and I have tried this problem against multiple LLMs: ChatGPT, Claude, Llama. I always found that if the LLM doesn't know what it's talking about, it should just come back with an answer that says "Sorry, I'm not able to figure that out" or "I don't have access to that information," but frequently what happens is it'll just make something up.
And that's the hallucination conversation that people are always having, and why hallucinations are limiting the use of LLMs and causing a lot of disappointment in the business community. What RAG does is allow you to go out and constrain what the LLM is doing and have more refined control over it, and that lets you turn around and get much more accurate results and improve, as that Gartner quote said at the beginning of the webinar, the customer experience and hit that customer experience goal.
32:01
Well, then it becomes an exercise in a fairly well-understood problem. We have a large corpus of data; it's our data, or it's our knowledge base, or it's our articles that we're feeding to an index or whatever. It becomes a well-understood problem of how do I accurately query that information. And then the LLM becomes like a summarization tool. It becomes a tool for "this is roughly the content that I think you should be looking at; summarize it and present it in the way that I'm asking you to present it." That's a much better use case for LLMs, and it almost becomes part of your presentation layer, where it's going out and getting the information through traditional means. We've added semantics to it, which is a big jump. But then the summarization piece is really where the LLM shines, and the structured data component, being able to have those metadata bits, is a huge piece of finding the data that you want to query and feed to the LLM, and doing it in a quick and responsive manner. It goes without saying, you need that structured data out there to be able to query your content, now more than ever.
Mark Gross
And actually, I love your Dewey Decimal System metaphor; that's still out there. I still go to the library occasionally, and that's still the way you find books, but just like other things, computers have allowed you to take it to the next level. Because just looking for a book by that number forgets that some parts of the book belong in other places. Here, the embedding part of it lets you tag basically every paragraph, so every sentence or every phrase is really identified by where it belongs. The Dewey Decimal System, when it was introduced 150 years ago, was a tremendous next step in how you find information. This takes it to a level that could only be done with the high-speed computers we have today. You mentioned the use cases, and going back to the question before about what the use cases are and how you integrate them, let's just go through a few of them.
A customer support system: it's not just answering questions about "Where's my order?"; it's also supporting all the technical information they've got, where customers have very sophisticated queries. A telephone company, a telecom company that's trying to support all kinds of technical material, can have that database that they've built with all their different products. It could be thousands of products and hundreds of thousands of pages, millions of pages, whatever pages are today. And the RAG part of it lets them find that particular piece of information that a customer support person will need.
And then the LLM part of it, the chat part of it, will be able to put it together in a way that's understandable to a customer. And by doing this, you don't necessarily always need a customer support representative doing that. That chat function on your computer can become much, much more powerful. Same in healthcare and finance and other professional areas where you have lots of very specialized information that's not on the web, and shouldn't be on the web. This lets you get access to those very large collections of medical publications and journals, and in the law area, any place in the world, and be able to answer questions. And with RAG, you find the relevant parts instead of having to look through a million pages, which LLMs can't really do very well; they can't do it.
36:07
It'll identify the 50 or a hundred pages it needs to look at. E-commerce: more and more people, I don't know if most people are using online ordering these days, but I know I do a lot, and a lot of times I'm asking questions about comparison information. How do you get that information together? It's not there today. I still have to go to a bunch of websites to find things. But an agent that can go out and find the relevant information and put it in a format that you can use, put it into a table, put it into a picture, put it into something usable, would be very valuable.
The same for all these other places. You can see crisis management, one of those places where, talk about the fires on the West Coast now, but in any crisis management area you suddenly need lots and lots of information that you've been collecting and that's not easily available. But if it's all digitized, if it's got the right information attached to it, if it's got the right structuring attached to it, it can be instantly findable by a computer, so that the end user can get to it, or a specialist in the middle can get to it. And in those places, when seconds count, this can be a game changer in how people process information. We could go on with lots and lots of case studies like this, all over. Some of them are, I think, on our website. And I think that's as far as I – I mean, Marianne, do we have any questions, or do you want to go any place from here?
Marianne Calilhanna
We do have some questions. So I just want to share this screen with our attendees. We've been collecting RAG resources. Rich helps keep us all informed internally, and we do share them on our blog on our website. You can scan this QR code, or our colleague Leigh Anne is also going to push out that URL so you can go there directly. So, Rich, one question for you: when will AI auto-mute and unmute participants in a meeting?
All
[Laughing]
Marianne Calilhanna
But seriously –
Rich Dominelli
We could do that. Look at the person's mouth and then automatically – that would be good.
Marianne Calilhanna
So, one question –
Mark Gross
It could probably do that today, because there are devices that muffle and cut out outside noise but let in a person's voice directly, so they're figuring it out somehow.
Marianne Calilhanna
Right, right. Yeah.
Tammy Bilitzky
Teams will do that today. When I start talking, sometimes it will say "You're muted." It is detecting it. I think it's getting better.
Marianne Calilhanna
That's true, that's true, different applications. Okay, so, with the current context size limitations, how can my large corpus be effective in a RAG environment?
Tammy Bilitzky
So, I think the concept that we're discussing is bringing the value of the structured data into that. So if you have a large corpus, number one, there are LLMs that are allowing larger and larger corpora and larger and larger context sizes. And then again, it's taking a look at your data. That's part of the data preparation. What is your data? Why is it so large? Is there data that we can remove that is extraneous, that's not relevant, that makes it not AI/RAG-ready?
40:01
What is in your data that doesn't have to be there? And as part of the preparation, you pull that out, pull out the graphics, pull out anything that's not relevant, so that the size of what you're passing with your prompt becomes manageable, comes within the guidelines, but still helps you drive the right answer. That would be my first response. Rich, do you want to add to that one?
Rich Dominelli
Sure. So one of the strategies for accommodating something like that is, when you're ingesting your data from your corpus into your database, you're not doing it at a document-by-document level; you're doing it in manageable chunks. So in addition to having structured information about the document as a whole, you would have a collection of chunks that are part of each of those documents. The chunking strategy will vary based on the use case. It might be section by section, it might be page by page, it might be smaller or larger than that. But then what you're doing is going out and doing a semantic and keyword query to find the most relevant chunks of information that you're going to append to your query. And if it's the correct hit, then the query will get a much better response.
So, there was a case study, which is probably in our resources, and if not I can dig it up, from Carnegie Mellon, where they were going through all their academic papers and their internal correspondence, and they had a huge curated collection. And they decided to arbitrarily chunk at 1K chunks. So, every document they had, in addition to having metadata and that sort of thing, they broke into 1,000-byte pieces. And then when they went and did their query, they did it based on the metadata and they did a semantic search. And then they used the metadata to further refine the list of information that was appended to the query and then did a much more constrained query after that. And that's one approach, and it really depends on the scenario. There isn't a perfect cookie-cutter answer across the board. Finding your data is the most important part, but chunking your data is a big piece of it, and that comes back to the structured content discussion.
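A rough sketch of that two-stage idea, filtering on hard metadata first and then ranking the surviving chunks semantically; the chunk records and field names here are illustrative, not the Carnegie Mellon implementation.

```python
# Two-stage retrieval sketch: apply hard metadata constraints, then rank the
# remaining chunks by cosine similarity to the query embedding. Each chunk is
# assumed to carry a .metadata dict and a .embedding vector.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(chunks, query_vector, metadata_filter, top_k=5):
    # Stage 1: hard constraints (author, date, institution, keywords, ...)
    candidates = [
        c for c in chunks
        if all(c.metadata.get(k) == v for k, v in metadata_filter.items())
    ]
    # Stage 2: semantic ranking of whatever survives the filter
    candidates.sort(key=lambda c: cosine(c.embedding, query_vector), reverse=True)
    return candidates[:top_k]
```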
Marianne Calilhanna
All right, thank you. And if we do not answer your question, please know that we will be in touch after this webinar; that goes out to our attendees. Another question: does your raw data source have to be consistent? For example, we get technical documents from automotive engineers, and they're in Excel spreadsheets, but not every spreadsheet is set up the same way for all vehicles.
Tammy Bilitzky
Well, again, that's what – sorry.
Rich Dominelli
Yeah, it's content in every imaginable form. Go ahead. I mean, we tackle that sort of thing all the time. That's not a problem.
Tammy Bilitzky
That's DCL's bread and butter: taking repositories with content in all different formats, totally inconsistent. Consistency is so rare. When we do a project, we will get a sample from the client, and you do your first customizations and setup based on that. And when you start getting production data, it often doesn't look anything like the samples, because content is just so variable. So that's why the techniques and tools that we've developed are specifically designed to handle the fact that content is not consistent, because I would think inconsistent content is far more common than consistent content. The beauty is that preparation stage.
Marianne Calilhanna
Right.
Mark Gross
It goes further than just Excel spreadsheets. I mean, today content is coming in so many different forms. I mean, so much information is in YouTube videos and all kinds of other places.
44:02
And the technology to extract text out of those, a lot of it is just at the very early stages now, but over the next year or two or three, all this information that people have been collecting will become very valuable resources. So, these corporate databases, this private data, these journals, these are very, very important assets and very valuable assets, and they'll continue to be important as long as they're maintained properly.
Rich Dominelli
One thing I want to just add to that: you should always approach these projects as if there's going to be a decent percentage of stuff that's going to be an outlier. And it's important that you design whatever your ingestion tool chain is, whatever steps you're taking to pull that content in, with a QA step or something that's going to raise a red flag when these things come up. I mean, even in some of the projects we do with government resources, we're supposed to have a very specific form document, and then maybe one out of every 40 of them is a handwritten, scribbled note that looks like it came off the back of a cocktail napkin that we're supposed to deal with. And it's important to catch those kinds of things and be able to raise the red flag in your tool chain, and assume that even if it looks like it's 100% consistent or 99% consistent, there are going to be 10 or 15 out of every hundred that are just a little wonky or a little off.
Marianne Calilhanna
Rich, a question about the classic test we always do with the 10-K reports. When you provided the PDF, didn't you tell the LLM to only use the PDF, which was the specific piece of curated information? Was the temperature turned down to zero? How is that different from providing a source where that PDF was contained?
Rich Dominelli
So, an interesting thing in my journey of playing with LLMs is that a lot of them supported temperature and controls to keep it to the best results possible. Until relatively recently, ChatGPT did not have the ability to go out and hit external URLs. It wasn't always consistent about telling you this, though. Sometimes it acted like it went out and did it, as in this case, even though behind the scenes it was relying on training data it already had to pull that information. Regardless of that, the idea is being able to explicitly find that information for it and give it to it to summarize. In this case it was a single 10-K, and it's not a hard thing to pick out that information yourself.
When you're talking about thousands or tens of thousands of documents in a variety of formats, then it becomes a much more interesting problem to solve, because the formatting is different between all of them, getting back to our consistency discussion, and even with the temperature turned all the way down and with the best results turned all the way up, you're still going to experience a large number of hallucinations that are not always clearly identifiable. We had a project where we were extracting authors and affiliations, as a good example, and we ran through several hundred documents and were getting great results, and then we tried to start using it in production and discovered that, oh, well, there's a percentage of these that are coming through that look great but aren't. And that's really what RAG is trying to solve.
47:58
You know, we're trying to give as much of a lift and a helping hand to the LLM as we can when we're telling it to pull information out, to try to mitigate those problems and those errors, which are sometimes very hard to discover.
Mark Gross
In my simplified view, really, we're splitting the responsibility between the LLM, which is very good at summarizing and putting information together, and the part about finding the information in the first place. The RAG part becomes responsible for finding the information, which it's much better at. And that's the way I think of it: give it the hundred pages and summarize the hundred pages, don't summarize the million pages out there, because you can't possibly do that. Even you, Mr. LLM, you can't possibly do that.
Rich Dominelli
And I have to say I was a little disappointed that it couldn't do that either. I was looking forward to this bright new future of our AI overlords being able to answer any questions I had, because I have a lot of questions all the time. And I'm routinely disappointed in the quality of the answers.
Marianne Calilhanna
Okay, do you think that RAG would benefit upfitting existing content for accessibility?
Rich Dominelli
That's a funny question.
Tammy Bilitzky
Since we've been going through this.
Rich Dominelli
We've been literally looking at that as a specific problem right now, using LLMs for generating things like alt text and identifying foreign words in text passages. Short answer is yes.
Tammy Bilitzky
Yes.
Marianne Calilhanna
Yes.
Tammy Bilitzky
And dramatically so. I think we're all blown away and impressed by how good a job it's doing.
Marianne Calilhanna
Right. Here's another question. What safeguards should be considered with RAG approaches involving proprietary data to deter misuse, content theft or misrepresentation?
Mark Gross
Wait, wait. Can you repeat the question?
Tammy Bilitzky
It's a good question.
Marianne Calilhanna
Yes. What safeguards should be considered with RAG approaches involving proprietary data to deter misuse, content theft or misrepresentation?
Tammy Bilitzky
And we're talking about using LLMs that are external, not an LLM that you're running and is only using your data.
Rich Dominelli
Correct.
Tammy Bilitzky
Because that's one of the first safeguards that we talk about, is not using... Because typically, unless you have a contract, when you're using the APIs and passing the data, they're allowed to use the data for training. So what other companies do, as we have done, is create LLMs we're running in-house; the data's not going outside, and we have complete control over the data, so it's not leaving our firewall. So, it's as controlled and subject to all the stringency and controls that we have for all our in-house data. And Rich looks like he wants to say something more about that.
Mark Gross
Splitting off the data itself, it's the same safeguards you always use with your proprietary data. Your database is your database, and all the firewalls and all that stuff are there to protect it. It's really that front end, and that's the part that really has to be software that's disconnected from the internet. I'll leave it to you, Rich. You have better words for that.
Rich Dominelli
The short answer is, if you're feeding your prompt to a hosted LLM, whatever you're including in that prompt typically becomes part of the training material for the model.
52:06
So, how do you deal with that? You can use a locally hosted, on-prem LLM. That was the Samsung lesson that came out a few years back, where Samsung discovered that all of the information they were sharing, I believe it was with ChatGPT, might have been a different LLM, was suddenly becoming part of the training data and cropping up in other people's queries. What they did is they brought their LLM in-house. Llama, which is Meta's LLM offering, runs locally; you can run some of the larger models locally if you have the hardware for it.
The other thing that's starting to crop up, Anthropic, I think, was the first to do it, but you also see it with Microsoft's cloud offering, is they will host it for you in an isolated instance. So you're still calling a cloud resource, you're not bringing in these monster GPUs and servers, but your information is contractually isolated to only the instance that they're holding for you, and it will not be shared with other customers and will not become part of the general training set.
So, if you have the facility and you're happy with the results from an on-prem LLM, I would recommend doing that, because there's no way it's getting out then. You can certainly put tools in place to make sure your LLM isn't sneakily communicating with the outside world, like GlassWire or that type of thing, or look at some of these hosted solutions that are dedicated to you. They're not cheap, but they are out there, and they are becoming more available as things move forward, because it's a pretty common use case where people want to use an LLM but don't want to share anything they're doing with the outside world.
Marianne Calilhanna
And I would just add, from the business side, it's really important that all organizations develop their own AI usage policy. Just as we do at DCL: because we're ISO certified, at our company meetings Tammy always makes sure that we have an education session, and our staff all know how to responsibly use any LLMs that we might be interacting with. Okay, another question: do you use the same embedding tools for both data prep and processing the user query?
Rich Dominelli
Ideally, yes. Sorry, I think I echoed. Ideally, yes, you would use the same library for both. One of the things that people aren't normally that conversant with is that when you're doing the embedding step, it is looking at the problem in the answer space of text embeddings that you've run before. So if possible, if you can store it and continue to use the same one, you'll get better results. There are a number of them out there. There are some specialty ones out there for supporting things like medical journals and that type of thing. I would encourage playing with a few, not just picking the first example you come across just because it happened to be what some Stack Overflow tutorial recommended. Play with different ones, because your results will vary.
Marianne Calilhanna
All right. The question is: are you using a vector database? If so, which one?
55:59
Rich Dominelli
Right now my favorite is a hybrid vector database, which is the pgvector add-on to Postgres 16, because that lets me have the best of both worlds. I can have relational data associated with the information that we're storing in the database, as well as being able to do the cosine-similarity-type queries, again to do semantic queries and do the distance-type checks. So, that's my personal favorite. There are a ton of them out there. I came to this solution with a Postgres predilection, and I've used it on a bunch of projects, so that might be swaying my answer slightly. But it works well, and I don't hesitate to recommend it. It's a little bit of a bear to set up, but once you get it set up, it's great.
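For readers curious what that hybrid approach looks like in practice, here is a hedged sketch of a pgvector similarity query from Python. The connection string, table, columns, and filter are invented for illustration; <=> is pgvector's cosine-distance operator, so smaller distances mean closer matches.

```python
# Sketch of a similarity query against Postgres with the pgvector extension,
# combining a relational/JSON filter with vector distance ranking.
import psycopg2

conn = psycopg2.connect("dbname=rag user=rag")          # illustrative connection
query_embedding = "[0.12, -0.03, 0.58]"                  # serialized query vector (stand-in values)

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT doc_id, chunk_text, metadata,
               embedding <=> %s::vector AS distance      -- cosine distance
        FROM chunks
        WHERE metadata ->> 'source' = %s                  -- relational filter alongside the vector search
        ORDER BY distance
        LIMIT 5;
        """,
        (query_embedding, "10-K filings"),
    )
    rows = cur.fetchall()
```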
Marianne Calilhanna
We have more questions that I know we're not going to get to, so we will respond; myself, Rich, or our other colleague, David, will definitely be getting back to some of these questions. Final question: do you know of any publishers currently using RAG in the way that you've shown today?
Rich Dominelli
Well, Carnegie Mellon University, for one.
Marianne Calilhanna
All right. Well, we are at the top of the hour, so thank you everyone for taking time out of your day to join us here at the DCL Learning Series. And our learning series comprises webinars such as this. We also have a monthly newsletter where we curate content such as this that we find very informative. We have our blog, and you can access many other webinars related to content structure, XML standards, AI, and more from the on-demand section of our website at dataconversionlaboratory.com. We do hope to see you at future webinars and hope everyone has a great rest of their day. Thanks, Rich, Mark, and Tammy.
Mark Gross
Thank you.
Marianne Calilhanna
This concludes today's broadcast.