
DCL Learning Series

Hallucinate, Confabulate, Obfuscate: The Perils of Generative AI Going Rogue

Marianne Calilhanna

Hello and welcome to the DCL Learning series. Today's lunch and learn conversation is titled Hallucinate, Confabulate, Obfuscate: The Perils of Generative AI Going Rogue. Welcome, everyone. My name is Marianne Calilhanna, I'm the VP of marketing here at Data Conversion Laboratory. And I'm so thrilled with the reception this webinar received. I think it's really important that we educate ourselves on how these large language models work, so hopefully you'll gain something from this. But before we get into the good stuff, I do want to share that the format of today's event is a lunch and learn discussion, so I hope you have a couple nibbles and a cool drink.


Also, my colleague Leigh Anne is working behind the scenes, and if you have any issues with the GoTo Webinar platform, just send her a chat and she'll do her best to help you out. A reminder, this webinar is being recorded and will be available in the on-demand section of our website at www.dataconversionlaboratory.com. We invite you to submit questions at any time during our conversation. We will save time at the end to answer any questions submitted. I'm so thrilled to have here with us today my colleagues Tammy Bilitzky, CIO here at Data Conversion Laboratory, and Rich Dominelli, systems architect extraordinaire. Welcome, Tammy and Rich.


Rich Dominelli

Thanks, Marianne.


Tammy Bilitzky

Thanks, Marianne. My thanks to all of you for joining. I'm also looking forward to having this conversation with Rich. Rich, any discussion with you is, as always, informative and enlightening.


Rich Dominelli

Well, thanks, Tammy. I apologize to everybody ahead of time. I'm at the tail end of a cold, so you may see me taking a drink every once in a while. I'm sorry, I'm not trying to be rude, but it's good to be here. I think we want to start with some definitions, right? Let's talk about what we mean by artificial intelligence. Artificial intelligence is any time we're trying to use computers to emulate the way human thought works in order to perform things that are typically hard for computers to do: understanding natural language, interpreting vision, or making decisions in a fashion that's much closer to the way that people think.


Back in the day, Ray Kurzweil, when he was working on OCR, did a lot of modeling of how the human brain processes vision and how it interprets characters. And that was, I don't want to say the foundation of modern AI, but it was a big step forward in AI. Machine learning, on the other hand, is taking neural networks and basically building a statistical model. It's a branch of artificial intelligence, a subset of it, but it's basically using large statistical models to infer results based on previous training. Deep learning is a subset of that. Again, we have these large statistical models. Sometimes we have emergent thought processes, and these statistical models have grown to the point where it's getting hard to understand what they're doing under the covers. There are a couple of interesting academic papers that talk about what GPT-2 was doing under the covers, and it took quite a bit of time to understand everything that was going on.


4:07

And then we have natural language processing, where we're trying to have computers understand the nuances of unstructured data and speech and prose. The English language compared to a computer language, or honestly any human language compared to a computer language, is much less rigid in its interpretation. It's much more nuanced. There are a lot of context clues going on. So, these are things where we're trying to have the computer understand exactly what's written.


Tammy Bilitzky

Right. That's perfect, Rich, and just to pick up on that and maybe tie all three of these together, we now have LLMs, or large language models, which are more than a natural progression of machine learning and natural language processing; they're really becoming a revolution, because all the factors have come together now to make it possible. We have a growing set of deep learning algorithms, as you mentioned. We have powerful, relatively affordable GPUs, and we have this incredible proliferation of unstructured data on the widest variety of subjects all over the internet. We can now train on these massive data sets like Wikipedia and Stack Overflow, parse the data out of the unstructured content, and model it into these large neural networks, primarily transformers, that actually mimic how we think in ways that are exciting, but frankly a little scary too. They understand the content, they can summarize the content, predict content, and generate new content that's often better than what a typical person would write, which is why it's become so useful to the general public. We can even improve the results by layering on training data that's structured, fine-tuning the LLMs with more task-specific data sets.


Rich Dominelli

It's true that we're awash in LLMs right now. Of course, there's GPT-4, which everybody's talking about and which seems to be the state of the art for what's out there, but there are a lot of other LLMs being released. There's Google Bard, and there's Google Gemini, which is just about to hit. There's Meta's Llama 2, which they open sourced and which has now become a foundational piece of many of the open source efforts that are going on. There's Claude 2 from Anthropic. There's MPT, Falcon, Vicuna. I think over the course of this conversation this next hour, there are probably going to be six or seven more released into the wild. As for the training data that was fed in, OpenAI is a little secretive about what the latest versions of their training sets are.


However, for GPT-3 they released a paper that talked about what they trained GPT-3 on. At the time it was Common Crawl, which is a diverse set of web crawls and captures, including things like Stack Overflow and news articles, and that was 410 billion tokens. A token in this case basically refers to a word. There's WebText2, and Books1 and Books2, which include Gutenberg and many reference titles. There's Wikipedia, and Reddit was fed into it: any Reddit answer that had three or more upvotes was included in the training data. So, there was a wide variety of information that was used to build this massive inference model that we can now use to answer questions, or generate text, or do all sorts of wonderful things.
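
As a rough illustration of what a token is, here is a minimal sketch using OpenAI's tiktoken library (the sample sentence is arbitrary, and counts vary by encoding):

    import tiktoken  # OpenAI's open source tokenizer library

    # cl100k_base is the encoding used by the GPT-3.5 / GPT-4 family of models.
    encoding = tiktoken.get_encoding("cl100k_base")

    text = "Tokens roughly correspond to words, but not exactly."
    tokens = encoding.encode(text)

    print(f"{len(tokens)} tokens: {tokens}")
    print(encoding.decode(tokens))  # round-trips back to the original text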


8:04

Tammy Bilitzky

Great. So these LLMs are great and we've been able to take advantage of them and leverage them effectively. But we've also, Rich, had to deal with the main topic of our conversation, which is AI hallucinations. We saw this firsthand: some of our pilot results looked really great. We were processing a number of documents and we thought we'd hit gold, but then our LLM confidently gave us results on other content that were totally off target. And had we not had our alternative QC methods already built into the pipeline, we wouldn't have been so quick to catch that problem. It really went south very quickly as we introduced varying content.


Rich Dominelli

So it's interesting that you say that, because when Google Bard first hit the streets back in February-ish, as a general release, Google started running an ad where somebody asked Bard, "Please share with me a piece of information about the James Webb Space Telescope that would be interesting for my nine-year-old daughter." And very confidently Bard came back with "Well, the James Webb Telescope was the first manmade telescope to photograph an exoplanet," which is flat out wrong. The first exoplanet was photographed in 2004 from the Very Large Telescope in Chile, which is run by the European Southern Observatory. But it was very confident in its answer. And we have seen that internally. So, we have a project, one of our periodic projects that we go back to and try to accomplish in the most efficient way possible, where we get these financial 10-K forms. They're a financial document that every publicly traded company must produce. The form itself is structured in a similar fashion from one company to the next, but each company seems to make its own stylistic and layout choices for that document.


So, one of the standard things that I do is try to decide whether I can use an LLM to answer questions out of this document. The original ask we had, way back when we were trying to do this as a project, was "Is it possible to get the list of executive officers and their financial compensation out of this document?" And both of those things appear within the document itself. The layout is a little odd and can vary quite a bit, but the sections are always there in the document. I have one from 2020 from IBM, a very, very simple layout, and I pointed ChatGPT 3.5, at the time, at it and said "Please give me a list of executive officers from the PDF located at this URL." Now, at the time ChatGPT was not able to access the web. It was not able to ingest a PDF file directly. But it did not tell me this. Instead, what it did is come back with a list of executive officers for IBM from 2022.


Now, the interesting thing about that is it was intelligent enough to parse the URL and figure out what I was talking about in the first place, which I thought was clever. However, the list of executive officers that it came back with for IBM was far more extensive than what was in the document. It had 20 or 25 names, where the document had nine. So I looked at the document and thought, "Well, that's weird. Some of the same names are here, but this is definitely not from the document." Then I asked it about the table. If you've ever dealt with PDF documents, you know what a wonderful thing it is to try to deal with tables in those documents.


12:06

They are extremely difficult to programmatically parse. Sometimes they're laid out consistently, sometimes they're just a series of text boxes and positions, so they're not a fun thing to rip out. So I said, "Okay, on page 15 of this document," I think it was 15, "there's a table containing executive compensation. Please take that table and summarize it in a structured format like an HTML table and return it back to me." It came back with an HTML table that was absolutely fantastic, and I was blown away. I thought, "OMG, this is the holy grail. I have a way of converting PDF tables just like that." And then I looked at the data, and the data was completely fictitious. It had nothing to do with the table in question; it was made up on the spot. It was like a toddler who didn't want to give you an answer that it knew you wouldn't like, so instead it just makes up something on the spot, because it thinks that's what you want to hear.


Tammy Bilitzky

And it's a fascinating example. With the other example that we had, we shared news of it internally a little too early. We thought we'd hit gold and we thought we were going to be able to integrate it into the production pipeline very quickly. In the end we had to go back and tell people that what we thought was a viable LLM wasn't, and we had to switch to a different LLM for our production pipeline. The first thing we were asked when we uncovered it was "Why? Why do these models hallucinate, and what are the implications of these hallucinations?" And I know, Rich, that we shared a few different reasons with the organization, and each of them was tied to a number of horror stories that are out in the general public today.


Rich Dominelli

So okay, internally we have to understand that ChatGPT is not actually reasoning on its own. In this particular case we had a task to go out and capture authors and affiliations for academic papers that we were converting to a new format. It looked great on the first few, and we thought this was going to work fantastically. Internally, ChatGPT is trying to predict what you want based on its text prediction engine. It is basically saying, "Okay, if I have this string of inputs or this string of text, the most likely responses that I should respond with are X, Y, and Z based on my training data." We talked earlier about what GPT-3 was trained on, and let's use Stack Overflow as an example. On Stack Overflow, you'll have a great answer, which is usually upvoted and selected as the number one choice, but then you'll also have six or seven terrible answers, which were also fed into the model.


Same thing with Reddit. Reddit can have great information and it can have nightmare information. Same thing with Wikipedia. It's user-edited content. Sometimes it's good, sometimes it's a little out there. When you're calling ChatGPT's API, they have a couple of settings you can use to tweak how creative it is. Say you're reaching out to ChatGPT and asking, "Give me a Python program that computes the first 20 digits of Pi." This is a fairly straightforward task. You want it to be very deterministic; there are maybe three or four answers to that question that are accurate and correct, and you don't want it to be too creative.


16:04

On the other hand, if you're saying "Write me a blog post about the state of street cleaning in New York City," you want it to be creative. It will generate topics, and it may have some more recent information, especially since now they're allowing it to go out and search the web interactively. The dial you have to control whether you get more creative answers or more deterministic answers is something called temperature. A low temperature answer is a more deterministic answer, and a high temperature answer is going to be something that is much more "creative," which basically means that it will look at a larger field of potential predictions and pick from a larger field than what you may want it to. Then there's Top P, and another parameter you sometimes see with different LLMs called Top K, which is similar. The best analogy for understanding Top P is this: if you're searching Google and you get back a hundred search results, and the first five are what you want, that's a much smaller Top P. Top P is basically "show me the top 5% of your answers," and Top K would be "show me the first five answers" as a flat number.
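
As a rough illustration of those dials, this is approximately what setting temperature and top_p looks like when calling the OpenAI chat completions API from Python (the model name and prompts are illustrative; this is a sketch, not production code):

    from openai import OpenAI  # assumes the official openai Python package

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Low temperature: deterministic, code-like answers.
    deterministic = client.chat.completions.create(
        model="gpt-4",  # illustrative model name
        messages=[{"role": "user",
                   "content": "Give me a Python program that computes the first 20 digits of Pi."}],
        temperature=0.0,  # stick to the most likely tokens
        top_p=0.1,        # sample only from the top 10% of probability mass
    )

    # High temperature: creative, open-ended prose.
    creative = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": "Write a blog post about street cleaning in New York City."}],
        temperature=1.2,  # allow a much wider field of candidate tokens
        top_p=0.95,
    )

    print(deterministic.choices[0].message.content)
    print(creative.choices[0].message.content)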


Between those two you can tweak it a lot more. So, looping back to what Tammy was talking about as far as our authors and affiliations problem: in this particular case we had a couple of dozen papers. We took ChatGPT untrained; we didn't attempt to do any fine tuning or anything like that. We literally said, "Okay, here's the text of the PDF document," and we took just the first couple of pages, and that's all we fed it. "Please give us the authors and affiliations from those pages," because academic papers are usually a big collaboration; sometimes there are five authors or 20 authors, and they're from all over the world. And for our sample set, it was fantastic. For our later tests, it started to drift a little far afield, so we started having issues actually pulling that information out.
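
A minimal sketch of that kind of prompt, assuming the text of the first couple of pages has already been pulled out of the PDF (the JSON output format, model name, and helper function here are illustrative, not our production pipeline):

    import json
    from openai import OpenAI  # assumes the official openai Python package

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def extract_authors(first_pages_text: str) -> list:
        """Ask the model for authors and affiliations as JSON, then parse the reply."""
        prompt = (
            "Below is the text of the first pages of an academic paper.\n"
            "List every author and their affiliation as a JSON array of objects "
            'with the keys "name" and "affiliation". Use only the text provided; '
            "if an affiliation is not stated, use null.\n\n" + first_pages_text
        )
        response = client.chat.completions.create(
            model="gpt-4",       # illustrative model name
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,     # extraction, not creativity
        )
        # Production code would validate this against the source text rather than trust it.
        return json.loads(response.choices[0].message.content)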


Tammy Bilitzky

Right. And those parameters are so important for anyone using these models, because they need to know that these parameters exist and how to adjust them. And we know firsthand the value of accurate, high-quality training data. I would say we spend significant amounts of time and effort to cull through all of our data repositories and public information and make sure that the data we're using to train our LLMs has been vetted and verified. I'd venture to say we probably spend as much effort on creating the training data as on everything else in our pipeline combined, because we've heard these horror stories that stem from incorrect training data.


Everyone knows the story about the résumés: HR used an LLM to sift through résumés for engineers and only got back male engineers, because the training data it was trained on only had résumés from male engineers. And there's the very famous one more recently, where lawyers submitted fictitious legal research in an aviation injury claim. ChatGPT had totally fabricated it, and it resulted in egg on their faces and penalties imposed by the judge. Now, Rich, it's essential that consumers have processes and test sets in place, like you did with the 10-K, to ensure that the models remain accurate and effective.


20:07

Do you want to share some concrete examples of the best ways to detect and mitigate model drift and decay scenarios?


Rich Dominelli

Sure. I think we should probably talk about what model drift and model decay are. When I used to be an application programmer back in the day, we used to joke that no application survives contact with the users. And the reason for that is, it works great, you even go through an extensive UAT test and integration test, and then you get out into the real world and you find that users are beating up your software in ways you never expected. We used to have a tester who I swear would test programs in the most bizarre ways, and still, when we went out on the street, the information that end users would enter was often surprising compared to what we tested with. The same is really true with your large language model and your AI-based, machine learning-based pipelines, because what ends up happening is like our 10-K forms: we had a sample set, we trained on the sample set, we tested our sample set and it looked great, and then reality hit it.

 

We started hitting a much wider variety than what we initially expected. The model itself started to drift away from what reality actually was. The samples we were processing six months into the process, or three weeks into the process, didn't match what we trained with, and they were unexpected because it was a larger world than we expected. The larger your training set, the better your results will be, but there's still always going to be that one case, or those ten cases, out there where things start to be different. So it's very important to have something in your pipeline that is periodically checking, understanding that what your inputs and outputs are will change over time. Especially if you don't have control over that input set, if you're just getting papers in, you need to make sure that you have something in your pipeline that's taking a percentage of what you're producing and testing it in ways other than the LLM.
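
One minimal sketch of that kind of spot check, assuming you supply two extractor functions of your own (both llm_extract and reference_extract here are hypothetical placeholders for an LLM call and a tried-and-true method):

    import random

    def spot_check(documents, llm_extract, reference_extract, sample_rate=0.05):
        """Re-run a random slice of production output through a non-LLM method
        and flag disagreements; a rising disagreement rate is a drift signal."""
        sample_size = max(1, int(len(documents) * sample_rate))
        sample = random.sample(documents, sample_size)
        disagreements = [d for d in sample if llm_extract(d) != reference_extract(d)]
        rate = len(disagreements) / len(sample)
        if rate > 0.10:  # threshold is illustrative
            print(f"Warning: {rate:.0%} of sampled documents disagree; review for drift")
        return disagreements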

 

Tammy Bilitzky

Yeah, those are all great points, but at the end of the day this technology, AI, is essential, and we're successfully using it every day. What we've really found to be the key is, as you said before, applying proven techniques to detect those hallucinations, because they exist, and you have to mitigate them. It's really that classic trust but verify, where we harness and maximize the power of AI while compensating for its shortcomings. We're cognizant that it's a moving target: the shortcomings and issues we see today will probably not be the ones we'll see tomorrow; there'll be a different set of issues as these are resolved. And we've also mentioned the danger in incorrect data sets over time. Here's where we have an opportunity to counter that, by using verified data sets that we've reviewed using a variety of different techniques that we've proven over time, and by checking the LLM as you said. We can check the LLM results on an ongoing basis and we can detect if there's model decay, if there's model drift, if the model isn't behaving the way you want, and then try to find other models and train other models to do the job more effectively.


Rich Dominelli

Sure. If you've ever seen the movie Minority Report, for example, there's the idea that we have three psychics who are all predicting the future, and if two of the three psychics believe something to be the truth, then that is the predicted future that's going to happen.


24:09

That one person out there with a contradictory version of the truth, well, their opinion doesn't matter so much. You're starting to see something a bit like that with LLMs. But it's not just LLMs. You can use other machine learning technologies, like spaCy and entity recognition and things like that, to vet the results that the LLM is producing. So when we're doing work where we're extracting citations or entities within documents, we can use, I don't want to say more trustworthy models, but more tried and true methods to test. We can even use regular expressions to test and vet the results that the LLM is giving us, to keep that trust level there. And then, like we said, getting back to reality, this new data that we're getting in we can feed into a training set and make it part of our model going forward.
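
For instance, a minimal sketch of cross-checking LLM-extracted organization names against spaCy's entity recognizer plus a plain regular expression (the en_core_web_sm model and the agreement rule are illustrative choices):

    import re
    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English model; assumes it is installed

    def vet_entities(source_text, llm_entities):
        """Keep only the LLM-extracted names that another method can also find."""
        spacy_orgs = {ent.text for ent in nlp(source_text).ents if ent.label_ == "ORG"}
        vetted = []
        for name in llm_entities:
            appears_verbatim = re.search(re.escape(name), source_text) is not None
            if appears_verbatim and name in spacy_orgs:
                vetted.append(name)   # both methods agree with the LLM
            # anything else goes to human review rather than being trusted blindly
        return vetted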

 

Tammy Bilitzky

Right. Those are good points. I also love the term consensus-based approach, because it's really the same technique everyone's been using for ages, just with a new, fancier name. Under the covers it's essentially just a polling engine, like we used for OCR engines and some of the less well-defined content conversions. But here it's comparing the results of different models and has multiple AIs acting as a jury. Would you agree that that's one of the more effective techniques, consensus-based?
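
A minimal sketch of that kind of polling engine, assuming each "juror" (two LLMs and a rule-based extractor, say) has already returned its answer; the two-out-of-three threshold and the sample answers are illustrative:

    from collections import Counter

    def consensus(answers, min_votes=2):
        """Return the answer at least min_votes jurors agree on, else None."""
        if not answers:
            return None
        value, votes = Counter(answers).most_common(1)[0]
        return value if votes >= min_votes else None

    print(consensus(["Jane Smith", "Jane Smith", "J. Smith"]))  # majority wins
    print(consensus(["Jane Smith", "John Doe", "J. Smith"]))    # no consensus: human review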


Rich Dominelli

It is definitely one of the more effective techniques. The issue you hit is obviously that you're consuming more computational power to do that. But the important thing is that you're doing something, that you have some sort of check, that you're not just blindly assuming, like poor Steven Schwartz in the Avianca lawsuit, that whatever ChatGPT produces is obviously correct. You're not just trusting the AI to be the final version of truth. On a related note, we have an upcoming webinar about AI and copyrighted information.


One of the things that was recently published was a paper posted on arXiv entitled "Who's Harry Potter?" The interesting thing about this is the task that was given. Large language models have grown to the point where they take a while to train. It's a lot of computational power, it's a lot of time to ingest everything, and it's a lot of time to actually perform the computations to build that statistical model. Nobody wants to go back and retrain GPT-4, because that took months and months and months to do. Now, behind the covers, people are starting to use large language models to train their successors, and it's faster. You're seeing that in the open source community more and more, where training is done from large foundational models, as they're calling them. So you're using GPT-4 or Llama 2 to train these smaller open-source LLMs.


But a more interesting thing, and this is what the paper is about, is how do you make a large language model forget something? Not retrain it from scratch, because again, that's very expensive, but instead have it forget anything about a particular copyrighted work. So in this particular case, they made Llama 2, because Llama 2 is open source, forget who Harry Potter was, and the name of the paper is "Who's Harry Potter?"


28:00

I encourage you to read it. It's actually a pretty approachable paper; it's not deep in math or anything like that. What they found is that Llama 2 essentially knew the texts of all seven Harry Potter books and the movies and who Harry Potter was. So they went through a series of questions at the start of the exercise to establish that: yes, the model knew who Harry Potter was, could predict what actually occurs in certain books, where he went to school, who his best friends are, and so forth. Then they took a sanitized version of the text of the books and used that to retrain the model, to fine tune it, to apply additional information to it, so that it no longer knew that Harry Potter went to a school for wizardry or that his best friends were Ron and Hermione or anything like that. And they were successful in making the model forget who Harry Potter was, which is a fascinating, if slightly scary, aspect of this whole AI reality that we're living in.


Tammy Bilitzky

I also found that paper very readable and very fascinating. But it was also a little bit concerning, because I think this is where we're crossing the boundary into ethical AI. When you take a publicly available LLM, you assume that it's been trained on a wide range of public data and that it's knowledgeable. To think that it can be made to unlearn things, to say for example that New York doesn't exist or George Washington was never president, and then draw conclusions based on that without the model consumers necessarily realizing what was unlearned, has very concerning implications.


Rich Dominelli

Well, you have to understand, though, that right now our vision of reality is pretty much determined by Google anyway. If Google doesn't respond a certain way, people don't think it's real. This is really just the next step down the path. These artificial intelligence large language models, which are more and more becoming replacements for the search engines out there, really will determine what people think reality is.


Tammy Bilitzky

Yeah, that's true, and that's all been super helpful for DCL and it's contributed to the success of so many of our projects, largely due to your expertise and that of our CTO and other AI engineers. I know that over the past years DCL has met this challenge head on. We harness the power of AI while we compensate for its current shortcomings, and we do that by working smart. The key is that we have to strike the right balance. We weave in other proven technologies and custom process software that we've developed and refined across thousands of production projects. And then we use different types of AI with logic-based software, as you mentioned before, to balance and assess it. That way we're able to perform the high-quality, high-volume content analysis and transformation projects for our clients.


We do that by starting with the classification and verification of the data. We have to make sure that we direct the right documents to the right LLM based on their characteristics. All the content we process is not equal. It's the same way we used to route data to optimized OCR engines, because certain documents did better with certain OCR engines: if it had tables, if it had heavy math, we knew which engines dealt better with which types of scenarios. And we're applying the same thing to LLMs. They're also using more-


31:56

Rich Dominelli

Yeah, we're constantly looking for new technologies, but relying on a lot of the stuff that we find, that internally we vet and trust. We've looked at image classification all along. We do a lot of work with OpenCV. We do a tremendous amount of work with things like spaCy and GATE. And all along, maybe we're not on the bleeding edge, but the stuff that we can prove is working internally for analyzing this content, we adopt, and adopt quickly. But like you said, trust but verify. You need to make sure that whatever it's outputting is accurate.


Tammy Bilitzky

Right, right. Absolutely. And also, we've developed a real expertise in creating these labeled training data sets. We can take those training sets now and use them to counter what we discussed before as far as incorrect training data, and that's something we even help our clients create. Then on top of that, we'll layer the statistical analysis and the XML-based QC. So at the end of the day, we can't wait for generative AI to be perfect, but in our content delivery role we also can't risk our quality. I think we've all agreed that for the time being it's going to remain a hybrid approach on most of our engagements. We need to keep advancing the bar, doing exactly what we're doing now, integrating AI of all different forms into our pipelines, and doing it in a smart, effective way where we're constantly assessing the quality of our results. Trust but verify.


Rich Dominelli

Absolutely. One of the things I always find interesting, especially looking at these machine learning statistical models, is that a confidence score of 80 to 90% is considered fantastic when you're looking at these tools. It's never a perfect "this is 100%, absolutely the ground truth." Instead, it's "well, it's 80% likely that this particular item is a table," or "this particular item is a red ball in an image," or something like that. And that is great for the purposes of academic machine learning, but for the purposes of content transformation, which is our bread and butter, it's not enough. You really need to double check.

 

Tammy Bilitzky

Absolutely. Thank you. Marianne, back to you. 


Marianne Calilhanna

Thank you so much. Okay, we do have some questions that are coming in. And I seem to have lost my little question dialogue box, so just give me a second to find it. Okay. One was, I guess it was just a little bit unclear, "Content structure really does help address model drift and decay. Can you speak a little bit more to that?" 


Tammy Bilitzky

You mean structured content? Is that the question? 


Marianne Calilhanna

Yes, yes. Structured content. 


Tammy Bilitzky

Well, we did talk about the fact that when you're training an LLM, you train it on general publicly available data, and then you can train it on more specific structured data sets. So at the end of the day, what you're coming up with is an LLM that is trained and optimized for your particular purpose. And Rich, do you want to talk about how you would use that when you already have the existing model in production and you want to address the drift and decay?


35:59

Rich Dominelli

Sure. The difference between structured and unstructured content, obviously, is how easy the structured content is to use and process. What we frequently do is hand tag, hand convert, or convert in a more traditional fashion a sample set that we can use to vet the results of the conversion the machine learning pipeline is producing. If we have a document that we've hand converted, that we know is ground truth, we can start using that to assess how different the results we're getting from the machine learning model are, and then start tweaking and retraining based on that.
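
A minimal sketch of that kind of ground truth comparison, assuming the hand-converted sample and the model output have both been reduced to comparable field dictionaries (the field names and the 95% threshold are illustrative):

    def agreement(ground_truth, model_output):
        """Fraction of hand-tagged fields the model reproduced exactly."""
        if not ground_truth:
            return 1.0
        matches = sum(1 for k, v in ground_truth.items() if model_output.get(k) == v)
        return matches / len(ground_truth)

    hand_tagged = {"author_1": "J. Smith", "affiliation_1": "Example University"}
    from_model  = {"author_1": "J. Smith", "affiliation_1": "Example Institute"}

    score = agreement(hand_tagged, from_model)
    print(f"Agreement with ground truth: {score:.0%}")
    if score < 0.95:  # illustrative threshold
        print("Below threshold: time to investigate, tweak, or retrain")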

 

Tammy Bilitzky

So what we're saying is that you use it upfront to help train the LLM and then use it to assess the quality of the LLM, and if you see that there's any drift, you also use it to retrain? 


Rich Dominelli

Exactly. 


Marianne Calilhanna

Okay. With the author affiliations experiment, can you please explain again, did the results get worse the more you ran it? 


Rich Dominelli

It wasn't a question of getting worse the more we ran it. It was a question of, again, not surviving reality once we turned it on. We trained it on a set of documents which had a fairly large variety of formatting as far as authors and affiliations go, or so we thought. But the real world is a fountain of infinite possibilities, so there were other items out there that we did not accommodate when we did our initial training, and it bit us. At the end of this, what we discovered is that our initial set continued to work great; our subsequent material started to fall short of what we would accept as quality.


Tammy Bilitzky

I think it was also an awakening for us, because this is what we do on a regular basis. We'll get a sample in from the client, what they say is a representative sample, and we develop our software, which is a combination of AI and logic-based software, in a very flexible way to meet their requirements. Then when you start to get the production data in, sometimes it is nothing like what you saw at the beginning, what you developed around, and you have to very quickly and very nimbly readjust and handle that. We've got that down pat with logic-based software. And now what we've been doing is developing those same techniques and applying them to the AI models that we're using, so that we can continually expand and refine and handle all the different scenarios that are out there.

 

Marianne Calilhanna

All right, thank you. I do just want to take a moment and point out that in the chat box we shared a link to our next lunch and learn webinar, which is sort of a continuation of this AI education. The next one is going to be around AI and copyright, so I think it's going to be really informative, particularly for those of our customers in the publishing industry. Okay, the next question we have: "Would you speak to the complications of perspectives versus verifiable facts?" I guess that's about training sets, is my understanding of that question.


Rich Dominelli

You want to tackle this one, Tammy? 


Tammy Bilitzky

Okay, so let's get this straight. So I was sure Rich was going to handle it. So perspectives versus verifiable facts.


40:00

At the end of the day, what we're talking about is that when you're running an LLM, it's doing prediction; it's predicting what a person is thinking. And it has its own perspective. It's like the famous line that history is "his story." Everything has a perspective, and it's all based on what information the model was fed and what it was trained on, and potentially, if this starts to pick up, what it was made to unlearn. They say you can ask 20 bystanders at an event and you won't get the same story from any two of them, because everybody comes to something with their baggage and their perspective. It's no different when you're dealing with these models: it's all based on the information they were fed and how it was interpreted by that model.


And again, as Rich mentioned, under the covers we don't necessarily know how all these models operate and what their underpinnings are. You're relying on them. So the only way you can rely on them is to do what we discussed earlier, which is to feed them verified data sources. When you build a training set, you take a certain amount of information: a certain percentage of the files is fed in to train the model, and a certain percentage is held out to verify the model. And we use labeled training sets so we can assess, for our specific purposes: these were the items on the 10-K, these are the authors and affiliations and how they were linked, this is how a reference was decomposed. Did it get it right, or did its perspective, or the way it was trained, skew the results? Rich, do you want to add to that one, or...?
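
A minimal sketch of that split, using scikit-learn's train_test_split (the file names are placeholders, and the 80/20 ratio is just a common convention):

    from sklearn.model_selection import train_test_split

    # Pairs of (document, expected output) built from vetted, labeled data.
    labeled_files = [("doc1.xml", "labels1"), ("doc2.xml", "labels2"),
                     ("doc3.xml", "labels3"), ("doc4.xml", "labels4"),
                     ("doc5.xml", "labels5")]

    # Hold out 20% of the labeled files purely for verifying the model.
    train_set, verify_set = train_test_split(labeled_files, test_size=0.2, random_state=42)

    print(f"{len(train_set)} files for training, {len(verify_set)} held out for verification")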

 

Rich Dominelli

Yeah, part of this is that you have to be careful when you're building your training sets that you're not introducing implicit biases. This is kind of like the example that Tammy gave earlier about the engineer résumés. There was a similar situation where facial recognition tools did not handle darker skin tones correctly, because the training material they were initially fed when building the model did not include them, and therefore the model didn't expect them. So yes, you have to take a step back. Thankfully, the majority of the work that we do is not perspective oriented.

 

It's hard to have a perspective on whether or not somebody should appear on an academic paper, or whether or not you should be able to extract a particular executive off a 10-K form. But it's definitely something that you have to think about. And if you're using other people's models, which increasingly we are, and so is everybody, it's definitely something you have to worry about, because you're never quite sure what's behind the covers. Maybe this model thought that all engineers were male, or maybe it picked up a particular political opinion or a particular political bias, and it's hard. Unfortunately, it's part of the day-to-day life of dealing with these large language models now. All you can do is think about it and try not to make it worse.


Marianne Calilhanna

Okay, another question, "If we embed our content into our chat model, which uses the ChatGPT API, is there still a risk the output will contain copyrighted or false material that the LLM learned from sources outside of our embedded content?" 


43:55

Rich Dominelli

Of course. When you're using that embedded model, you are taking along the baggage of however many months and billions or trillions of tokens it was trained with before the content repository that you fed into it. There are some tools out there now, some machine learning tools, that allow you to say, "Okay, please constrain your results to information contained in this repository only." Or you can train your own; there are certainly a number of tools out there for building your own LLM at this point. It won't be as detailed, but it won't contain any copyrighted material.
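
A minimal sketch of constraining answers to your own repository, assuming you supply your own retrieval function over that content (the retrieve callable, the prompt wording, and the model name are illustrative):

    from openai import OpenAI

    client = OpenAI()

    def answer_from_repository(question, retrieve):
        """retrieve(question) is a caller-supplied search over your own content store."""
        passages = retrieve(question)          # e.g., the top-ranked passages
        context = "\n\n".join(passages)
        prompt = (
            "Answer the question using ONLY the passages below. "
            'If the answer is not in the passages, reply "Not found in repository."\n\n'
            f"Passages:\n{context}\n\nQuestion: {question}"
        )
        response = client.chat.completions.create(
            model="gpt-4",   # illustrative
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        return response.choices[0].message.content

A prompt-level constraint like this reduces, but does not eliminate, the chance of material from the base model's original training leaking into the output, which is exactly the baggage described above.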


Marianne Calilhanna

Okay. Okay. Got another one. They keep rolling in. Please continue to submit your questions; we still have some time. "Does writing style or document structure impact how well an LLM can consume content?"


Tammy Bilitzky

I would say... well, let's make sure we've got the question. The document structure, and what was the first part?


Marianne Calilhanna

Writing style. 


Tammy Bilitzky

Writing style. Okay. I'm going to tackle the document structure, and I would say document structure, absolutely, because the question is, depending on the LLM, how much does it take from the document structure and how much does it infer while ignoring the document structure? We've had cases where what we want to do is strip whatever structure comes in with the document and feed an unstructured document to the LLM, because you don't want to introduce a bias. A lot of the content we deal with is developed without any templates. It's kind of the Wild West: the author writes it the way they want to. There are some guidelines, but they can choose. Some organizations, some publishers, require them to use guidelines; a lot of the ones we service do not.


Even when there is a structure to go by, there's usually the front matter, there's an abstract, there's back matter, we have found it far more effective to strip the structure and train the model to determine the structure based on the content itself and the characteristics of the content, rather than the document structure. Because we find that the document structure introduces inherent bias and we get lower quality results. Do you want to tackle the other point, Rich?


Rich Dominelli

Sure. Writing style –


Tammy Bilitzky

I gave you an easy one. 


Rich Dominelli

Again, under the covers, what these models are trying to do is predict, based on text, what the next most likely answer should be. If your writing style is such that you have long run-on sentences that are difficult for a human to interpret, I can safely say that your LLM is probably not going to fare much better. Short declarative sentences will be parsed more easily by the AI.


Marianne Calilhanna

Just like our own brains. 


Rich Dominelli

Yeah. 


Tammy Bilitzky

Because that's how they're trained. 


Marianne Calilhanna

"Have you started working with the newer domain-specific LLMs?"


Rich Dominelli

Yes. We have a project where we're trying to dissect some technical documents, and part of that dissection is recognizing the formulas within them. So, we are doing some work using AI tools that are geared to identifying mathematical formulas on a scanned image and extracting those correctly.


48:03

And the same for figures and that type of thing, so that we are able to get clean OCR and feed those mathematical formulas to the appropriate tool for converting them into MathML.


Marianne Calilhanna

Yeah. When I've seen the work that DCL has done on that, that to me really feels like magic. That's pretty wild stuff. 


Tammy Bilitzky

We've got a very powerful pipeline for that right now. We've put a lot of expertise and energy into that.


Marianne Calilhanna

Yeah. 


Rich Dominelli

Briscoe, who just decided to join us, finally. 


Marianne Calilhanna

That's funny because my dog just started stirring too. "You seem to be saying that the LLMs can't handle novel cases. Doesn't seem very intelligent to me. Can you comment on that?"



Rich Dominelli

LLMs are not intelligent. They are prediction engines, plain and simple. Under the covers it's difficult to understand what they're doing, but they are attempting to predict what you want the answer to be. It's back to that three-year-old trying to give you the answer you want. They are not reasoning, despite what the engineer from Google decided to come out and say about us having hit the AGI point. They're getting closer and closer, and there are all sorts of philosophical arguments you can have in the AI community right now about whether or not they're really thinking in that regard. But they're not truly reasoning or understanding what they're doing. They're predicting what you want to see based on previous input.

 

Tammy Bilitzky

Based on what they've been trained on. 


Rich Dominelli

Yeah. 


Tammy Bilitzky

It's only going to be able to predict as well as the content it's been trained on. And that's what we find: the more we train it and the more good data we feed it (and again, it has to be unbiased data), the better the LLMs behave.


Marianne Calilhanna

Right. Well, I think that's a great place to stop. Thank you, everyone who's spent their time with us today. The DCL Learning Series comprises webinars such as this, as well as a monthly newsletter and our blog. You can access many other webinars related to content structure, XML standards, XML conversion, data conversion and more from the on-demand webinar section of our website at dataconversionlaboratory.com. We hope to see you at future webinars, and have a great rest of your day. This concludes today's broadcast.


Rich Dominelli

Bye, everyone. 


Tammy Bilitzky

Okay, bye-bye. 


