DCL Learning Series
Tables are Tough: Perfecting an AI Model to Automate Table-to-XML Extraction
Marianne Calilhanna
Hello everyone, and welcome to the DCL Learning Series. Today's webinar is titled "Tables are Tough: Perfecting an AI Model to Automate Table-to-XML Extraction." My name is Marianne Calilhanna; I'm the Vice President of Marketing here at Data Conversion Laboratory. Before we begin, I just want to share that this broadcast is being recorded and it will be available in the on-demand section of our website at dataconversionlaboratory.com. Also, I invite you to submit questions at any time during our conversation today; we've saved some time at the end to answer anything you've submitted. And quickly, before we begin, I would like to take a moment to introduce DCL and Fusemachines so that you have a bit of context around this subject. DCL's services and solutions are all about converting, structuring and enriching content and metadata. We are the leading provider of XML conversion services, DITA conversion services, structured product labeling, or SPL, conversion, and S1000D conversion.
While we are best known for our excellent content conversion services, we also do a lot of work in other areas that are listed here on this slide. Semantic enrichment, entity extraction, data harvesting, third-party validation of previously converted content, content reuse analysis, and structured content delivery to industry platforms. No matter the service we provide, we harness the latest innovations in artificial intelligence, including machine learning and natural language processing, to help businesses organize and structure data and content for modern technologies and platforms.
Fusemachines is part of the DCL Partnership Laboratory, and Fusemachines is a leading provider of data and AI talent solutions for enterprises that are undergoing digital transformation. Fusemachines helps train and build high-capacity distributed data and AI teams using its proprietary platforms and network. And they grow their network by running AI education programs in underserved global communities.
So today I'm pleased to introduce Mark Gross. Mark is President of Data Conversion Laboratory. And Isu Shrestha. Isu is Senior Machine Learning Engineer at Fusemachines. Both of these gentlemen have spent their careers working with content and data structure, computer vision technology, AI, NLP, and other related services that help organizations support digital transformation. Welcome to both of you. And Mark, I'm going to turn it over to you.
Mark Gross
Okay, thank you very much, Marianne. It's a pleasure to be speaking to all of you today. There it is. Okay. So today, this is a topic near and dear to my heart; it seems that I've been working on trying to figure out how to get data out of tables since the beginning, over 40 years ago, when DCL began.
3:56
One of our first big projects was trying to figure out how to get tabular financial data out of these ASCII files that were all over the place and into a spreadsheet; this was before Excel, I think it was VisiCalc at that time, and we worked with some tools to do that. So we've been doing that for over 40 years. And it's getting more and more interesting, with more and more different kinds of tables. Today what we're talking about is some techniques that we've been working on together with Fusemachines on how to use artificial intelligence and computer vision technology to pull information out of tables. And it's important. I mean, first of all, tables are complicated, and Isu will get into that soon, but let's go ahead to the next slide. There we go. The reason it's important is because tables just contain a lot of information, and it's not just text that you read; it shows all kinds of interrelationships.
And first of all, some of these are very hard to figure out just by reading them. So this is a table out of a scientific journal talking about the physiological indexes of lettuce cultivation. I'm not sure what that is, but obviously there are relationships between what's going on the left and what's going on the right, and each of these numbers means something. Trying to put this into words would take thousands of words. So this is one kind of relationship. Let's go to the next slide. In the financial world, many of you have seen tables like this, and while that first table's more two-dimensional, this is almost a three-dimensional table; how do you relate all these?
You have to really understand what each of these columns means and how they relate to each other. So we know what the three months ended September 30th of '22 and '21 are, and the nine months ended. But there's a lot of information going on over here. And really what you're trying to do is relate that third column to the fourth column and the fifth column to the sixth column, and you really don't know what that is. There's also a lot of blank space over here. So it's not just taking apart a table. And let's go to the next slide.
And this one is for a package insert. We do a lot of package inserts in the pharma industry. Again, this is really multidimensional over here. It's not just what's going on at the top or the bottom; like that fever chart, that's a whole other table inside a table. So all of these combine to make understanding tables very difficult, and tables don't always have lines around them, they're not always clear, they might be missing lines, missing rows, all kinds of things which Isu will get into in a lot more detail. And they appear everywhere: in pharma, in the scientific world, in financials, in government documents, tables are everywhere.
And even in an annual report, you might be dealing with 30, 40, 50 tables. Let's go to the next slide. So what are the benefits over here? The benefits are obvious. And we're in a business, I'm in a business, where we're dealing with thousands and thousands and thousands of pages on a daily basis, millions of pages every month.
8:08
And what we're doing is taking information and turning it into XML, and much of that is taking tables apart and turning them into XML, much as many of you are doing.
And getting them done right is very difficult. It often requires human interaction, and human interaction multiplied by thousands is expensive and error-prone. So to the extent that we can apply techniques that have only been available over the last few years, I mean, I've been working on this for 40 years, but the techniques and the computers that are fast enough to do this only appeared recently, we can automate this and identify what's in a table using semantic information. And the benefit of all that is certainly that once you get this, you can move it across to other places. If you are collecting financial information for an industry and you're pulling it out of all the individual annual reports and putting it together, you've got to get the information from one place or another, and you've got to make it all work together.
So content interchange is very important, and identifying what's in tables is important and all those things. And that's what we've been working on, that's what we've been working on with Isu. And Isu has some very interesting approaches to doing this, which we've been working with. And with that, I'm going to turn it over to Isu.
Isu Shrestha
I'm Isu Shrestha, I'm a Senior Machine Learning Engineer here at Fusemachines. I work primarily with NLP, but also a little bit of computer vision, and I work across many different teams. We have offices based in seven countries including the US and Canada. And on this particular project we've been working with Niranjan and Prashant, who are also on the call; they're the engineers working on this project right now. So diving a little bit deeper into this, referring back to what Mark spoke about, this is obviously an important problem that we want to solve in the financial world and the business world.
But for me, as a machine learning engineer, I find this problem also fascinating in the way that it rides between the two domains of computer vision and NLP. And so let's explore why that is. If you look at this particular table from the eyes of a machine, you can start seeing some of the different challenges that make this seemingly simple problem into a complex one. So if you look at the top right, so there's the word “civilian.” And then us humans, with just minimal visual cues, we can already tell that this “civilian,” this word, is related to two different columns.
And then similarly the other word “military” and “total.” So these are related to these columns. And for a machine it's kind of difficult to do, and this is where the computer vision part of it comes in. So we need to look at the layout of the table and understand what elements or what cells are related to what other cells. And it's the same thing for the rows as well, if you look at the left side of these rows that are highlighted over there. So there are multiple rows that need to be in the same cell.
11:56
So for us humans it might be easy because we just understand that the sentence has not ended, and that's why the row has not ended at the same cell, but to a machine, if it does not understand the words or that the sentence has ended, it's hard to tell where you draw the line for the next cell.
So here in this particular example it's made a little bit simpler because of the dots that they've given us, but they might not be there. So that's the challenge of computer vision, which is looking at the layout of the table, and NLP, which is trying to understand the words. And we can't always understand the words because there are many different languages, but the system also takes that NLP perspective into account while drawing this table out. So lastly, I also want to say that on the right side there are these empty cells, and that's also quite confusing for the algorithm, because it needs to decide: is this cell merged, is this a separate cell? So that also goes on to complicate the problem more. Next slide please. Yeah, so going into the weeds of it, there are also different characters like this that we need to understand, and sometimes they can be mistaken for lines, and the computer, the algorithm, might make the mistake of thinking, oh, this is separating the two cells and dividing it. So there are things like this that also add more noise to the problem. Next slide please.
So this is an example of a recent paper on how people are looking at this problem, and this is the general workflow of how these systems interact. So I'll just walk you through that so that you have some context. So first we start off with a table that's on the top left, and then the first thing we do is pick out where the words are. So this is essentially an object-detection problem. So we just draw bounding boxes around all these words and characters, which we think are cells. So after we do that, the next step, which is B, is to represent each of these objects, so let's say all of these objects are represented by a node in a graph, and all of these nodes are connected. So we first have that representation in a graph.
And then what we can go on to do in C is have a machine learning model predict relationships between these nodes. So this is where it's trying to figure out which of the cells are straddling other cells, which of the cells are merged, which are connected and which are not. So that's happening. And finally in D, what's happening is we have some machine learning predictions, but as you probably know, machine learning predictions can obviously be off sometimes, and there's some post-processing happening there to filter out that noise and finally produce the output, which is the generated table. Next slide please.
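To make the A-to-D workflow described above a little more concrete, here is a minimal, hypothetical Python sketch of how the pieces might fit together. The function names, the dummy detections, and the heuristic "relation model" are illustrative assumptions, not the system described in the talk.

```python
# Minimal sketch of the detect -> graph -> relate -> post-process workflow.
# The "models" here are placeholder heuristics standing in for trained networks.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class CellBox:
    x0: float; y0: float; x1: float; y1: float; text: str

def detect_cell_boxes(page_image) -> list[CellBox]:
    """Step A: an object detector would return candidate cell boxes here."""
    # Dummy output standing in for real detections.
    return [CellBox(0, 0, 50, 10, "Civilian"), CellBox(60, 0, 110, 10, "Military"),
            CellBox(0, 12, 50, 22, "1,234"),   CellBox(60, 12, 110, 22, "567")]

def build_graph(boxes: list[CellBox]) -> list[tuple[int, int]]:
    """Step B: connect pairs of candidate cells as edges of a graph."""
    return [(i, j) for i, j in combinations(range(len(boxes)), 2)]

def predict_relation(a: CellBox, b: CellBox) -> str:
    """Step C: a learned classifier would label each edge; here, a heuristic."""
    same_row = abs(a.y0 - b.y0) < 5
    same_col = abs(a.x0 - b.x0) < 5
    return "same-row" if same_row else "same-column" if same_col else "none"

def extract_table(page_image):
    boxes = detect_cell_boxes(page_image)
    edges = build_graph(boxes)
    relations = {(i, j): predict_relation(boxes[i], boxes[j]) for i, j in edges}
    # Step D: post-processing would resolve conflicts and emit rows/columns.
    return [(boxes[i].text, boxes[j].text, r) for (i, j), r in relations.items() if r != "none"]

print(extract_table(page_image=None))
```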
So that was how we can look at just extracting the structure of a table from an engineering perspective, but from a business standpoint, what happens is that there's still more things that add to the complexity of this problem.
15:57
So here I've listed some examples for that. So the first one is, tables are usually, they don't come on their own, they're usually embedded in a document. So we need to understand that this part of the document is not a table, this is a table, and we need to be able to distinguish that and actually hone in on the table itself so we have a nice crop of a table. So that's what we need to do. Can we go to the next slide, please?
So just to look at these problems in a practical way, we break them down into four different parts. The first would be, like I said, finding the tables, figuring out which parts of the document are actually tables. And then the second step would be to isolate the tables that are a little bit tricky from the tables that are quite simple, so that we can run different sets of algorithms on them; for the simpler tables it runs without taking too much compute time, and also with higher accuracy, and for the more complex ones, we use other algorithms. So we have classification going on there.
And then the third step would be post-processing, which is the part where we filter out the noise from the predictions and make the predictions a little bit simpler. And the fourth one is feedback and training, which we haven't talked about until now, but we want these systems to get better and better over time. So we need to have a feedback loop where, whenever the machine learning model makes an error, a human correction is recorded so that we can teach the machine learning model that it has made a mistake, so that it's less likely to make that mistake in the future. Can you move to the next slide, please?
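A rough sketch of the four-part flow just described (find the tables, triage by complexity, post-process, record feedback). The function names, labels, and placeholder extractors are assumptions for illustration only, not the production pipeline.

```python
# Illustrative routing of detected tables by complexity, with a feedback hook.
# Every "model" call below is a placeholder for a trained component.

def classify_complexity(table_image) -> str:
    """Stand-in for the classifier that labels a cropped table."""
    return "fully-bordered"  # could also be "semi-bordered" or "borderless"

def extract_simple(table_image) -> str:
    return "<table>...</table>"  # fast, line-based extraction

def extract_complex(table_image) -> str:
    return "<table>...</table>"  # slower, graph-based extraction

corrections_log = []  # human fixes collected for later retraining

def process_table(table_image) -> str:
    """Route simple tables to the cheap path and complex ones to the heavy path."""
    kind = classify_complexity(table_image)
    if kind == "fully-bordered":
        return extract_simple(table_image)
    return extract_complex(table_image)

def record_correction(predicted_xml: str, corrected_xml: str) -> None:
    """Feedback loop: store human corrections as future training examples."""
    if predicted_xml != corrected_xml:
        corrections_log.append((predicted_xml, corrected_xml))

print(process_table(table_image=None))
```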
So now we're going to zoom in a little bit on these separate tasks that we're doing, and go through the whole workflow of what's happening. So this is a table, and this is a table inside a document, and we are honing in on exactly where the table is. So there are a couple of things that I want to say on this slide. Firstly, in the blue box, I don't know if you want to look at that again, I'm not sure how many of you see just one table there, but if you look closely, there are two tables in there.
So that's one of the problems that we have. Even we humans can make the mistake of seeing just one table there, and obviously an algorithm is more susceptible to making that mistake. So there are two tables there. And also, if you look at the block diagram above that, it's a diagram, it's not a table. So we need to be able to distinguish that, which the algorithm is doing well here, but I just wanted to mention that it's not always easy. The diagram can look like a table and the algorithm can believe it is one. So that's where that comes up. Next slide, please. So after honing in on where the table is, now we're starting to separate what type of tables they are. So the first type would be, here it's labeled semi-bordered, where there are some lines and visual cues telling us where the cells and rows end, or where they start.
20:00
But sometimes those are not included; there are no lines, or very minimal lines. So that would be a borderless table. And then there's the simpler kind, where there's a line for every cell, which is a lot easier because we can just find the lines and figure out where the cells are. Next slide, please. So for the more complex tables we then go forward and run them through our algorithm, where we find the different places where the characters are.
So all of these are candidates for a cell. So like I said, it's an object-detection problem where we draw bounding boxes on all these characters. And the next step here would be to find the relationships between these bounding boxes. Next slide, please. Finally, after we have all of that, we draw the columns and rows. After that, we have the post-processing. So the post-processing, just to go a little bit deeper into that, all it's doing is throwing out the noise; sometimes the algorithm predicts more than one cell at a given location, which is kind of impossible, so we filter out that noise. And we have confidence thresholds where, if the model is not confident about some cell existing, we throw that out. So there's this filtering going on, which gives us the end product, the clean output. Next slide, please.
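As a concrete illustration of this post-processing step, here is a minimal sketch that drops low-confidence cells and duplicate predictions at roughly the same location. The threshold value and the data layout are assumptions, not the production logic.

```python
# Sketch of post-processing: drop low-confidence cells and duplicates that land
# on (roughly) the same spot. Threshold values and data layout are illustrative.

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def inside(point, box):
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def filter_cells(predictions, conf_threshold=0.5):
    """predictions: list of ((x0, y0, x1, y1), confidence) tuples."""
    kept = []
    # Highest-confidence predictions win; a second box centred on an
    # already-kept cell is treated as a duplicate and thrown out.
    for box, conf in sorted(predictions, key=lambda p: p[1], reverse=True):
        if conf < conf_threshold:
            continue
        if not any(inside(center(box), kept_box) for kept_box, _ in kept):
            kept.append((box, conf))
    return kept

cells = [((0, 0, 50, 10), 0.95), ((2, 1, 48, 11), 0.60), ((60, 0, 110, 10), 0.30)]
print(filter_cells(cells))  # the duplicate and the low-confidence box are removed
```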
So there you have it, that's the whole workflow of extracting the table. And the end product would be this XML. Just wanted to say that this is also what we feed into the algorithm during training. So the left side, the image or the PDF, is what we feed into the algorithm, and then what we get out of the algorithm is the XML. So we'll talk more about the training procedures and data there, but I just wanted to say that those are the inputs and outputs for the system. And while training, we also do some augmentation, which we'll go into a little bit later.
Yeah, we'll talk about that in a bit. Next slide, please. So going more into the training process, as you've seen, this is a multi-layered system, there's multiple steps to the training process. So the research is heading towards combining all these processes into one process, but right now there are different subsystems working together in tandem. So I'll just go through how we're thinking about the training process. So here I just listed out how all of these subsystems are looking at how well their predictions are working.
So obviously the object detection, the computer vision part, is working like an object-detection algorithm, where it's looking at intersection over union. And so a good prediction there would be the center one, where the label for the object and the prediction coincide very well.
23:56
But a poor prediction would be where the prediction is going way off from where it should be, or it's getting cropped: the whole table is not detected, or the table is cropped. So that would be a poor prediction. An excellent prediction we don't usually get. Actually, if we get an excellent prediction, that's a very good sign that it's overfitting, if you know what I mean. And then there are other subsystems which are just doing classification. So we're just trying to classify whether this table is semi-bordered or borderless. So that would use cross-entropy. If you've done that, then you'd know.
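For readers who have not met it, intersection over union is simple to compute: the area shared by the predicted box and the labelled box, divided by the area the two boxes cover together. A minimal sketch, with made-up box coordinates:

```python
# Intersection over union: how well a predicted box overlaps the labelled box.
# 1.0 means a perfect match; values near 0 mean the prediction is way off.

def intersection_over_union(a, b):
    # Overlapping region (zero width/height if the boxes don't intersect).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

labelled  = (10, 10, 200, 120)   # ground-truth table region
predicted = (12, 14, 195, 118)   # a "good" prediction
cropped   = (10, 10, 100, 120)   # a "poor" prediction: table cut roughly in half

print(round(intersection_over_union(labelled, predicted), 2))  # close to 1
print(round(intersection_over_union(labelled, cropped), 2))    # much lower
```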
And the third one is just another example, which is, sometimes we don't have PDFs, we have images, and we need to read what the characters are in the cell, and that's the NLP part of it. And so that's the OCR as well. So I just wanted to show you all these separate subsystems, each having its own measurement of quality, of how it's working. Next slide, please. So that would be all the machine learning systems behaving on their own in a modular way, but when we look at it from a top-down view, we only really care whether it was able to capture the structure of the table and all the elements within it properly or not. We don't really care about all the little subsystems and how they're doing, we only care about the end product. So to measure the quality of the system as a whole, we use tree edit distance. So let me just explain what that means. First, if you are not aware of what that means, we'll talk a little bit about the string edit distance. So here I've listed an example of the string moon, M-O-O-N, and the string lions, L-I-O-N-S. You can see that two of the characters are the same, which is O-N, but the two characters M-O need to be changed or edited into L and I to get "lion." And then we need to add an extra S on the end to get "lions."
So there are three edits. So that would be an edit distance of three. So what this means is that the more edits a human needs to make to the prediction of the algorithm, the less useful the algorithm is. So in an ideal case, the algorithm would generate a table where the human does not need to make any edits at all. So a low edit distance is good. So translating the same thing into a table, it's the same concept, which is edit distance. I've put a table there where, in the original table, you can see that the age column is merged.
And in the prediction that merge is missed, so there's a mistake over there where the human actually has to merge those two cells. So the more edits a human has to do, the less useful the algorithm is. So that's the edit distance idea. And now coming back to tree edit distance, where the word "tree" comes in is, if you've seen the XML structure of a table, we have the rows and headers and the different table cells.
28:04
So the tree edit distance is all about how many edits we have to make to that tree structure of the HTML, if you think about it as HTML, but currently we're using XML. Can you go to the next slide, please. So that's it from my end, and I want to pass it on to Mark.
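A minimal sketch of the edit-distance idea just described, in its standard string (Levenshtein) form. The tree edit distance used to score tables applies the same counting idea to the nodes of the predicted XML tree rather than to characters; this snippet is an illustrative stand-in, not the production metric.

```python
# Levenshtein edit distance: minimum number of single-character insertions,
# deletions, or substitutions needed to turn one string into another.

def edit_distance(a: str, b: str) -> int:
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # substitute (or keep)
    return dp[len(a)][len(b)]

print(edit_distance("moon", "lions"))  # 3, as in the example above

# Tree edit distance applies the same counting to XML nodes: the fewer node
# edits a human must make to the predicted table tree, the better the model.
```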
Mark Gross
Okay, thank you Isu, that was excellent. First of all, going back to why this is important: you've got tables in all industries, and getting the information out of them is complicated and difficult, and a lot of information is stuck inside tables for all the reasons we spoke about before. And that happens in the financial community and the scientific community, certainly in pharma and other health-related areas, and in manufacturing.
We didn't talk much about that, but you've got data coming in that's reported to the government, that's come back from the government, information on well drilling and underwater kinds of things. And all this has been put together in some kind of table format, and pulling it out becomes a very difficult thing. And today there's an insatiable need for data in all kinds of applications. I think what comes out of this, and I think came out of what Isu was talking about is, largely when you start getting into machine learning and NLP and other AI processes, you're dealing with a statistical process.
And those boxes that Isu showed, you want to get them to fit as closely as possible, but if they fit too closely it might mean that you've got a problem. This is an area where you have to set your expectations on what's going to come out. And it's very nice to say I'll do everything with a machine and go all the way out there and not involve humans, but it's very hard for it to be a 100% process. You have to accept that it's going to be 99% or 98%, and ask at what level it's worth doing that way. And with many of the processes we've done, like the work we do for the US Patent Office where we're doing millions of pages a month using these techniques untouched by human hands, there's also the knowledge that it's not going to be a 100% process at the end.
And that's understood. You get 98% of what you want at one-tenth the cost, but getting from 98 to 99% becomes a very difficult process. And that's really good enough in many, many areas. It may not be good enough in the pharma world, or it may not be good enough in some of these other areas. So I think there's a lot to be done over here; we've certainly been working on it for a long time. Isu's company Fusemachines has spent a lot of effort working through these problems, and it's made a huge amount of progress over the last few years. And just a caution to all of this: while artificial intelligence can do anything, it's very hard for it to do everything. But if you identify what it is you're working on, then our experience is that you can get really good results.
32:00
And it's all a matter of knowing what you're going for and knowing what the bar is. And with that I will turn it over back to Marianne, who has disappeared off the screen, but she's going to come right back on. Okay, thank you.
Marianne Calilhanna
I'm coming back here. Thanks. So we have a question, I think we covered some of it, but Mark, I'm wondering if you can maybe dive a little bit deeper. Someone asked "Why would we want to convert a table from an image or a PDF into XML?" You know, I was thinking about one of our customers for whom we do a lot of data harvesting, a lot of that content is regulatory information found in tables, on websites and in PDFs, and we're pulling that tabular information out and transforming to XML. I think that's a really valid use case. Maybe you could speak to that and some others that DCL has been involved in.
Mark Gross
Sure. So I mean, XML is the example we use, because that's become, in data handling, the lingua franca, but you can also think about it in terms of converting information into an Excel file or into a Word file. I mean, when you have an image, and you've got an image of a table, you can't do your computations on it, there's nothing to compute, again, it's just a picture. So if you're going to do anything with that information, like do analytics on the table itself, or combine all the information, you have to get the table into a form where you pulled out the information in a uniform manner.
So at one level, let's put it in an Excel file so you can do your analytics on it, or you can move it into a Word file so that you can do those analytics, or you can move it into XML. And really, in my line of thinking, it's all the same thing: how do you get discrete elements out of that table so that you can do further analysis with them? And that's what happened in the project I described from 40 years ago, taking financial data that was coming in, in whatever files there were, standardizing it and moving it into what at that time was VisiCalc, the predecessor to Excel: getting a lot of tables together so that you have the information in a uniform format and you can combine it all.
Other areas are, you get invoice information and payments information. Well, if they're all coming in different ways and all you've got is pictures of invoices, how do you move that into your accounting system? Well, if you can pull that information out in some way and organize it, you can develop a method to take it into your computer system automatically. The alternative is to retype it. So we're all trying to get away from retyping that page. If it's one or two pages, well, just retype it, but if it's a million pages, that's not a workable solution.
Marianne Calilhanna
Right. And not to mention search and discovery: you can't search or discover content in a flat image. If you have text, which is what XML gives you, you can do search and discovery.
36:03
Mark Gross
Right. And a PDF file is someplace in between there. PDF files do often have the text in them, so you can do a search on it, but you can't do an accurate search, as we say. I mean, it will find you all the words, but a lot of times you want the words in context, or the words where they appear with something else. So again, if you have 100 pages, just searching for it is great, but if you have a million pages, then that becomes a voluminous process, like the work we do for the New York Public Library, where we've taken the copyright records.
If you want to find an author with a particular name, yes, you can search it against the original OCR materials, but that name might appear in many different places, and it might not even be the name always. So if you're looking at 100 years of data, because it's a million pages, that doesn't really work very well. Once you've moved it into an XML, you can do things like say "Give me that particular name," when it's an author's name rather than anyplace else it might appear. So classifying all this information becomes very important as we go on to larger databases.
Marianne Calilhanna
All right, thank you. We have another question here asking if you have any advice on how to better structure tables in print so they can render better in XML outputs?
Mark Gross
Well, I guess we would go back to what Isu was talking about, the things that make a table complex; if you left out those things, it would be much easier. So I think the issue is, and Isu, you may want to add to this, I mean, if you don't have lines around the table, it's very hard to figure out where the table starts and ends. A human can do it, sort of, a lot of times, but a computer has more trouble. If you have blank lines, blank columns, that becomes difficult.
If you go into three-dimensional tables, like some of those I showed, where there are multiple columns that mean different things, that becomes difficult. So to the extent you can simplify the table and make it just columns and rows, well, that's much easier, but on the other hand it takes away a lot of the reasons you make a table in the first place. A table is made so that you can organize lots of information in all kinds of ways that we haven't even thought about yet. So this is a blank canvas; it's what gives you the flexibility. And Isu, you might want to add to that.
Isu Shrestha
Yeah, sorry. I was just going to add that, yeah, I mean there are ways; for example, even if it's a complex table, if all of the columns and rows have lines in them, it's very easy to make an algorithm that can detect lines and just parse the table. But again, these tables are not designed to be consumed by an algorithm. These are designed to be consumed by a human. If a company wants to design tables that need to be consumed by an algorithm, they'll just give you the XML directly. So these are the modern companies that already have their information digitized.
39:59
So we are not worried about them because they already have the XML, you can just take the XML directly. But we are worried about companies in many parts of the world that have all their, for example, financial documents in a human-readable, pretty table format, but we don't have access to that digitally. So yeah, that's why.
Mark Gross
Yeah, that's a very good point that Isu makes. I mean, if you want financial data in the United States, for most companies, the information is released as XML, it's filed with the SEC as XBRL files and stuff like that. But if you're trying to collect that same information in Africa or in Asia, most of that information is delivered as paper, or it's delivered maybe as PDF files. And so there's a whole world of data that is not currently in XML. And so that's what we're talking about; we're talking about the other data, I guess. The easier stuff, that's already being done.
Isu Shrestha
I was going to say it comes back to the question of why do this in the first place, which is, imagine you're an investor trying to invest in a country, and you're looking at a hundred companies with all their financial data, and it's too much to go through each one manually like that. So to do that analytics, you use these algorithms. Sorry Marianne, you were saying something.
Marianne Calilhanna
Well, I'm just going to put on my editorial hat for scholarly publishing, because there are still instances where I see tables getting into journals, whether that's a PDF of a journal, print, or just digitized, where we could start with editorial too. I mean, for some tabular content, a good editorial cleanup could tackle some of the things like straddle headings that maybe aren't necessary. Maybe you have a straddle head that says "dollars"; well, you could just put that as an add-on under a row: number, comma, dollar sign. That's just an example off the top of my head. But I think starting with a clean editorial review is also helpful at times to get rid of extraneous things. So I have another question about "If you already have tables that are in XML, are there methods to extract the semantic relationship and convert to structured data?"
Isu Shrestha
So that's a very good point, actually. That's one of the training techniques, where we take examples of tables that we already have the outputs for, generate the images, and then have the machine learning model make predictions on those. So that's a very good question, and that's one of the training techniques.
Marianne Calilhanna
So another question is "Once extracted to XML, do you ever create interactive tables?"
Mark Gross
So, Isu, you're maybe looking at it from another aspect: once you've got the XML, there's software that builds interactive tables out of that, because now everything has been structured. So I mean, everybody's view is based on what they're laser-focused on.
43:59
We're laser-focused on creating the XML and having it structured and correct so that it can be used by other software, other equipment. So once it's done correctly, there is software that'll create all kinds of interactive tables.
And the software's used for all kinds of things like that. For example, a lot of the work we do is creating XML for manuals for repairing equipment of all kinds, usually large equipment and military equipment. The XML gets produced, and then it gets moved into an IETM, an interactive electronic technical manual. So that software is what will be showing the flow charts, that software will be showing the tables and doing the interactivity. What we're providing is the data, structured in such a way that that software will work; that's the way I think of it.
So yes, you take information, you structure it. One of the leading ways it's structured today is into XML. That XML then becomes the input to all kinds of other software, analytics engines, software that display things in very nice ways. That's done by the software, but the same data can be used by many different sets of software.
Marianne Calilhanna
That also answers one of the first questions about additional business cases. So IETMs are a great example. This question is for Isu: "How do you deal with hybrid tables, that is, tables that have text and images or drawings?" Can you speak to that?
Isu Shrestha
We don't have many of those in real life, but yeah, that is a problem if it does come up. That's when the algorithm is not able to decide whether it's a table or an image, so that's probably where it might not handle things very well, because one of the distinctions the algorithm is trying to make is whether it's capturing a table or text or different sections of a document; it's trying to hone in on the tables only. And if I understand your description correctly, you are saying there's an image inside of the table. So usually if it's just a watermark or something that's behind the table and not intended to be read, the algorithm does fine on that, but if it's something that's part of the table, then it might not capture it, or it might make mistakes on it.
Mark Gross
I would just add that that's a different problem; it's true most tables don't have that, but some tables do. And certainly I can imagine, I've seen product catalogs where it's a long table showing different models and designs, and then there'll be pictures of the device inside. So there are different techniques to pull that out. Usually it'll be a pre-process before it gets to Isu's software, that would go and find the images and pull them out, and put them somewhere temporarily while the rest of the analysis gets done. So the analysis we're talking about here is really designed to deal with text and formulas and things like that that might be inside a table.
48:00
The images really should be done separately, and then they might be merged in later. That's really a specialized version of what we're talking about over here, I think. So I think it goes back to, people can put anything into a table; it's not always really a table.
Marianne Calilhanna
Thank you. So someone asked if maybe Isu can speak a little bit more on the accuracy of the model at this point. The person went on to say that they understand this depends on the complexity of any given table, but are you actively performing these AI conversions already, or are you still developing the accuracy of the model before rolling it out to businesses and industries?
Isu Shrestha
Yeah, you're correct in saying that the accuracy depends on the type of table. So when we separate it out, usually with the simpler tables, where there are a lot of visual cues, we can get the accuracy pretty high, low 90s or 95, in that range. And again, when I say accuracy I mean the tree edit distance measure, so even a few edits lower the accuracy significantly. So that's for simpler tables. For mid-range tables, it goes a little bit lower. And at the lowest, for the most complex tables, it hovers around 75, or 70 in the worst cases. So that's what we see right now. Yeah, so that's the tree edit distance we see so far.
Marianne Calilhanna
Thank you. And also, Isu, what are the core logics used for the graph relation prediction step, is that something you can speak to?
Isu Shrestha
So it's basically taking each cell as a node on the graph. So you can imagine all of these nodes connected, and then our job is to predict which nodes are related; from the graph, it already knows which nodes are adjacent, and the algorithm is predicting whether the connected node actually has a relationship to the next column or not. In other words, is it a merged column that spans many other columns, or is it only one singular column? So that's what it's trying to predict in the graph.
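A toy sketch of that idea: each detected cell becomes a graph node, adjacent nodes are connected, and a classifier decides the relation on each edge. The overlap feature and the 0.5 threshold here are assumptions standing in for the trained relation model.

```python
# Toy relation prediction over a cell graph. Real systems use a learned
# classifier; here a simple overlap heuristic stands in so the structure of
# the computation is visible.

def horizontal_overlap(a, b):
    """Fraction of the narrower box's width shared by boxes a and b."""
    left, right = max(a[0], b[0]), min(a[2], b[2])
    narrower = min(a[2] - a[0], b[2] - b[0])
    return max(0.0, right - left) / narrower if narrower else 0.0

def predict_edge(node_a, node_b):
    """Classify the relation between two vertically adjacent cell boxes."""
    overlap = horizontal_overlap(node_a, node_b)
    # A header that straddles a column below it overlaps that column heavily.
    return "spans" if overlap > 0.5 else "separate"

header   = (0, 0, 120, 10)     # e.g. a straddling "Civilian" header
col_left = (0, 12, 55, 22)     # column directly underneath
col_far  = (130, 12, 180, 22)  # a column off to the right

print(predict_edge(header, col_left))  # "spans"
print(predict_edge(header, col_far))   # "separate"
```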
Marianne Calilhanna
All right. Another question: "Can AI be used to extract context from the surrounding text that explains the purpose of the table or discusses aspects of the table data?" Maybe they mean the table title.
Isu Shrestha
Yeah. From what I understand, what you're saying is that from the surrounding text we know that it's a product catalog, so we know what to expect, what type of table to expect, and then therefore tweak our algorithm in the same way. So that's an interesting point. Currently, we don't support that. Currently, our goal is to make the algorithm as general as possible and able to support financial documents, scientific documents, or commercial documents.
52:02
But yeah, that's a good point, and maybe a next step on how to improve the algorithm.
Mark Gross
Right. I would just add that, I mean, what we're describing over here is again laser-focused on how you take apart a table and figure out what the pieces are. This is a tool in our toolset, in DCL's toolset, which does have more capability to take information and try to combine it and pull it in from the surrounding text, from databases and all kinds of things. So it's not like in the general case you can just pull information without thinking about it, but certainly within specific use cases, that is the kind of thing we do.
We'll see what's happening above the table, below the table, we'll keep track of what's in the summary of the article perhaps, and use that as clues that would feed the semantic analysis engine. So the answer is a qualified yes, this is the kind of thing we do. I would caution to say this is not one of those push the button and it'll do anything in the world. With all the hype around AI, there's no silver bullets, but there's some very useful tools to do things that we really were never able to do before.
Marianne Calilhanna
So are you saying we can't just ask ChatGPT to turn tables into XML, to spin tables into golden XML?
Mark Gross
That's a whole other webinar. [Laughs]
Marianne Calilhanna
So another question: "How much data was involved for the training sets?"
Isu Shrestha
So there are multiple steps, as you can see, multiple different algorithms at work. So there's a pre-training step where we train it on the order of half a million tables. And then we go on to fine-tune it on different sets of data. So you can think of it as half a million, or close to a million, tables.
Mark Gross
And before we scare everybody off, I think, Isu, you're talking about that generic case of trying to take any table and organize it. And so that's a big task. If you could break it down to certain kinds of tables, certain kinds of analysis, you probably can do much smaller training sets, but that would be a more specific tool you're building for a specific application, a specific use case. So you have to keep in mind what it is you're trying to do. My approach is always, let's try to put a line around the problem we're trying to solve and try to solve that one. As you make it larger, it requires more and more, and even half a million cases may not be enough. So asking how big a training set you need is like asking how much a car costs; it depends on what you're trying to do.
Isu Shrestha
You're absolutely right. Thank you, Mark. The half a million tables is what we, in some sense, primed the algorithm with. So we train it on half a million tables. And then, if we want to specialize, we take that model that's trained on half a million tables, which has learned a lot from that, and then we, we call it fine-tuning, specialize it for a particular client or a particular job that it needs to do. So we can do that, and we wouldn't need as much training data to do it.
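To illustrate the pre-train-then-fine-tune pattern just described, here is a hedged PyTorch-style sketch: reuse weights learned on a large generic table corpus, freeze the shared layers, and retrain only a small head on client-specific data. The model class, layer names, and the commented-out checkpoint file are hypothetical stand-ins, not the actual system.

```python
# Hedged sketch of "pre-train, then fine-tune". Names and shapes are illustrative.
import torch
from torch import nn

class TableModel(nn.Module):               # stand-in for the real detector
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
        self.head = nn.Linear(64, 3)       # e.g. bordered / semi-bordered / borderless

    def forward(self, x):
        return self.head(self.backbone(x))

model = TableModel()
# model.load_state_dict(torch.load("pretrained_on_500k_tables.pt"))  # hypothetical checkpoint

# Freeze the generic backbone; only the task-specific head keeps learning.
for p in model.backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One illustrative fine-tuning step on made-up client data.
features = torch.randn(8, 128)
labels = torch.randint(0, 3, (8,))
loss = loss_fn(model(features), labels)
loss.backward()
optimizer.step()
```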
56:07
Marianne Calilhanna
Okay. And then last question: "So I assume that because this is machine learning, the algorithm will continue to improve over time?" Is this correct? Is that a correct assumption?
Isu Shrestha
Yeah. So that's where the feedback mechanism comes into play. I mean, it wouldn't just get better on its own; it would just make the same errors, the same predictions. What we need to do is feed it with corrected versions of where it's making these errors. So if it's making a certain type of error, we fix it. We first identify that, okay, it's making this kind of error. Then we give it examples of that type of data, more and more of that type of data, so that it makes less of that mistake.
Mark Gross
You also have to get to the point of retraining it periodically to make sure you incorporate all the information. It's not like a fourth-grader who learns everything on his own without somebody teaching him; you really have to be proactive, collect information, and retrain over time, and then it gets better over time.
Marianne Calilhanna
Well, thank you both. Thank you to everyone who's taken the time this afternoon, this evening, this morning, wherever you are in the world, to be with us. The DCL Learning Series comprises webinars such as this. We also have a blog and a monthly newsletter. And you can access many other webinars related to topics like this, and content structure, XML standards and more from the on-demand webinar section on our website at dataconversionlaboratory.com. We just pushed that URL out via the chat. We hope to see you at future webinars, and I hope you enjoy the rest of your day today. Thanks so much.
Mark Gross
Thank you.
Marianne Calilhanna
This concludes today's webinar.