
DCL Learning Series

Tables are Tough: Perfecting an AI Model to Automate Table-to-XML Extraction


Marianne Calilhanna

Hello everyone, and welcome to the DCL Learning Series. Today's webinar is titled "Tables are Tough: Perfecting an AI Model to Automate Table-to-XML Extraction." My name is Marianne Calilhanna; I'm the Vice President of Marketing here at Data Conversion Laboratory. Before we begin, I just want to share that this broadcast is being recorded and it will be available in the on-demand section of our website at dataconversionlaboratory.com. Also, I invite you to submit questions at any time during our conversation today; we've saved some time at the end to answer anything you've submitted. And quickly, before we begin, I would like to take a moment to introduce DCL and Fusemachines so that you have a bit of context around this subject. DCL's services and solutions are all about converting, structuring and enriching content and metadata. We are the leading provider of XML conversion services, DITA conversion services, structured product labeling, or SPL, conversion, and S1000D conversion.


While we are best known for our excellent content conversion services, we also do a lot of work in other areas that are listed here on this slide. Semantic enrichment, entity extraction, data harvesting, third-party validation of previously converted content, content reuse analysis, and structured content delivery to industry platforms. No matter the service we provide, we harness the latest innovations in artificial intelligence, including machine learning and natural language processing, to help businesses organize and structure data and content for modern technologies and platforms.

Fusemachines is part of the DCL Partnership Laboratory, and Fusemachines is a leading provider of data and AI talent solutions for enterprises that are undergoing digital transformation. Fusemachines helps train and build high-capacity distributed data and AI teams using its proprietary platforms and network. And they grow their network by running AI education programs in underserved global communities.


So today I'm pleased to introduce Mark Gross. Mark is President of Data Conversion Laboratory. And Isu Shrestha. Isu is Senior Machine Learning Engineer at Fusemachines. Both of these gentlemen have spent their careers working with content and data structure, computer vision technology, AI, NLP, and other related services that help organizations support digital transformation. Welcome to both of you. And Mark, I'm going to turn it over to you.


Mark Gross

Okay, thank you very much, Marianne. It's a pleasure to be speaking to all of you today. There it is. Okay. So today, this is a topic near and dear to my heart; it seems that I've been working on trying to figure out how to get data out of tables since the beginning, over 40 years ago, at the beginning of DCL.


3:56

One of our first big projects was trying to figure out how to get tabular financial data out of these ASCII files that were all over the place into a spreadsheet; this was before Excel, I think it was VisiCalc at the time, and we worked with some tools to do that. So we've been doing that for over 40 years. And it's getting more and more interesting, with more and more different kinds of tables. Today what we're talking about is some techniques that we've been working on together with Fusemachines on how to use artificial intelligence and computer vision technology to be able to pull information out of tables. And it's important. I mean, first of all, tables are complicated, and Isu will get into that soon, but let's go ahead to the next slide. There we go. The reason it's important is because tables just contain a lot of information, and it's not just text that you read, but it shows all kinds of interrelationships.


And first of all, some of these are very hard to figure out, even what's going on, just by reading them. So this is a table out of a scientific journal talking about the physiological indexes of lettuce cultivation. I'm not sure what that is, but obviously there are relationships between what's going on on the left and what's going on on the right, and each of these numbers means something. Trying to put this into words would take thousands of words. So this is one kind of relationship. Let's go to the next slide. In the financial world, many of you have seen tables like this, and while that first table's more two-dimensional, this is almost a three-dimensional table; how do you relate all these?


You have to really understand what each of these columns means and how they relate to each other. So we know what the three months ended September 30th of '22 and '21 are, and the nine months ended. But there's a lot of information going on over here. And really what you're trying to do is relate that third column to the fourth column and the fifth column to the sixth column, and you really don't know what that is. There's also a lot of blank space over here. So it's not just taking apart a table. And let's go to the next slide.


And this one is from a package insert. We do a lot of package inserts in the pharma industry. Again, this is really multi-dimensional over here. It's not just what's going on at the top or the bottom; like that fever chart, that's a whole other table inside a table. So all of these combine to make understanding tables very difficult, and tables don't always have lines around them, they're not always clear, they might be missing lines, missing rows, all kinds of things, which Isu will get into in a lot more detail. And they appear everywhere: in pharma, in the scientific world, in financials, in government documents, tables are everywhere.


And even in an annual report, you might be dealing with 30, 40, 50 tables. Let's go to the next slide. So what are the benefits over here? The benefits are obvious. We're in a business where we're dealing with thousands and thousands of pages on a daily basis, millions of pages every month.


8:08

And what we're doing is taking information and turning it into XML, and much of that is taking tables apart and turning them into XML, much as many of you are doing.


And getting them done right is very difficult; it often requires human interaction, and human interaction multiplied by thousands is expensive and error-prone. So there's value to the extent that we can apply techniques that have only been available over the last few years, I mean, I've been working on this for 40 years, but the techniques and the computers fast enough to do this only appeared recently, to be able to automate this, to identify what's in a table using semantic information. And the benefit of all that is certainly that once you have this, you can move it across to other places. If you are collecting financial information for an industry and you're pulling it out of all the individual annual reports and putting it together, you've got to get the information from one place or another, and you've got to make it all work together.


So content interchange is very important, and identifying what's in tables is important and all those things. And that's what we've been working on, that's what we've been working on with Isu. And Isu has some very interesting approaches to doing this, which we've been working with. And with that, I'm going to turn it over to Isu.


Isu Shrestha

I'm Isu Shrestha, I'm a Senior Machine Learning Engineer here at Fusemachines. I work primarily with NLP, but also a little bit of computer vision, and I work across many different teams. We have offices based in seven countries, including the US and Canada. And on this particular project we've been working with Niranjan and Prashant, who are also on the call; they're the engineers working on this project right now. So diving a little bit deeper into this, referring back to what Mark spoke about, this is obviously an important problem that we want to solve in the financial world and the business world.


But for me, as a machine learning engineer, I find this problem also fascinating in the way that it rides between the two domains of computer vision and NLP. And so let's explore why that is. If you look at this particular table from the eyes of a machine, you can start seeing some of the different challenges that make this seemingly simple problem into a complex one. So if you look at the top right, so there's the word “civilian.” And then us humans, with just minimal visual cues, we can already tell that this “civilian,” this word, is related to two different columns.


And similarly the other words, "military" and "total," are related to those columns. And for a machine that's kind of difficult to do, and this is where the computer vision part of it comes in. So we need to look at the layout of the table and understand what elements or what cells are related to what other cells. And it's the same thing for the rows as well, if you look at the left side of these rows that are highlighted over there. So there are multiple rows that need to be in the same cell.


11:56

So for us humans it might be easy because we just understand that the sentence has not ended, and that's why the row has not ended at the same cell, but to a machine, if it does not understand the words or that the sentence has ended, it's hard to tell where you draw the line for the next cell.


So here in this particular example it's made a little bit simpler because of the dots that they've given us, but they might not be there. So that's the challenge of computer vision, which is looking at the layout of the table, and of NLP, which is trying to understand the words. And we can't always understand the words because there are many different languages, but the system also takes that NLP perspective into account while drawing this table out. So lastly, I also want to say that on the right side there are these empty cells, and that's also quite confusing for the algorithm, because it needs to make these decisions: is this cell merged, is this a separate cell? So that also goes on to complicate the problem more. Next slide please. Yeah, so going into the weeds of it, there are also different characters like this that we need to understand, and sometimes they can be mistaken for lines, and the computer, the algorithm, might make the mistake that, oh, this is separating the two cells and dividing them. So there are things like this that also add more noise to the problem. Next slide please.


So this is an example from a recent paper on how people are looking at this problem, and this is the general workflow of how these systems interact. So I'll just walk you through that so that you have some context. So first we start off with a table, that's on the top left, and then the first thing we do is pick out where the words are. So this is essentially an object-detection problem. So we just draw bounding boxes around all these words and characters, which we think are cells. So after we do that, the next step, which is B, is to represent each of these, so let's say all of these objects are represented by a node in a graph, and all of these nodes are connected. So we first have that representation in a graph.


And then what we can go on to do in C is have a machine learning model predict relationships between these nodes. So this is where it's trying to figure out which of the cells are straddling other cells, which of the cells are merged, which are connected and which are not. So that's happening. And finally, in D, what's happening is we have some machine learning predictions, but as you probably know, machine learning predictions can obviously be off sometimes, and there's some post-processing happening there to filter out that noise and finally produce the output, which is the generated table. Next slide please.
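
A minimal sketch of steps A and B as described here, under stated assumptions rather than the actual model: detected word boxes become graph nodes, and boxes that roughly share a row band or a column band become candidate edges that a learned model would later classify. The coordinates and the simple overlap rules below are illustrative only.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""

def horizontal_overlap(a: Box, b: Box) -> float:
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))

def vertical_overlap(a: Box, b: Box) -> float:
    return max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))

def candidate_edges(boxes: list[Box]) -> list[tuple[int, int, str]]:
    """Link boxes that share a row band or a column band; in the real system
    a learned model labels each edge (same cell, same row, same column, none)."""
    edges = []
    for i, j in combinations(range(len(boxes)), 2):
        a, b = boxes[i], boxes[j]
        if vertical_overlap(a, b) > 0:      # roughly on the same text line
            edges.append((i, j, "row-candidate"))
        if horizontal_overlap(a, b) > 0:    # roughly in the same column band
            edges.append((i, j, "column-candidate"))
    return edges

boxes = [Box(10, 10, 60, 22, "Civilian"), Box(70, 10, 120, 22, "Military"),
         Box(10, 30, 60, 42, "1,234"),    Box(70, 30, 120, 42, "567")]
print(candidate_edges(boxes))
```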


So that was how we can look at just extracting the structure of a table from an engineering perspective, but from a business standpoint, what happens is that there's still more things that add to the complexity of this problem.


15:57

So here I've listed some examples for that. So the first one is, tables are usually, they don't come on their own, they're usually embedded in a document. So we need to understand that this part of the document is not a table, this is a table, and we need to be able to distinguish that and actually hone in on the table itself so we have a nice crop of a table. So that's what we need to do. Can we go to the next slide, please?


So just to look at these problems in a practical way, we break them down into four different parts. The first would be, like I said, finding the table, identifying what is a table. And then the second step would be to separate out the tables that are a little bit tricky from the tables that are quite simple, so that we can run different sets of algorithms on them, so that the simpler tables run without taking too much compute time, and also with higher accuracy. And then with the more complex ones, we use other algorithms. So we have classification going on there.
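
To make that routing idea concrete, here is a minimal sketch with invented thresholds of how a table crop might be sent either to a cheap rule-based parser or to the heavier learned model. It illustrates the classification step in general terms, not Fusemachines' actual classifier.

```python
def classify_table(n_rows: int, n_cols: int,
                   n_horizontal_lines: int, n_vertical_lines: int) -> str:
    """Route a table crop based on how completely ruling lines delimit the grid.
    The rules and thresholds are illustrative assumptions."""
    fully_ruled = (n_horizontal_lines >= n_rows + 1 and
                   n_vertical_lines >= n_cols + 1)
    if fully_ruled:
        return "bordered"          # simple: split on the detected lines
    if n_horizontal_lines == 0 and n_vertical_lines == 0:
        return "borderless"        # hardest: needs the full learned model
    return "semi-bordered"

print(classify_table(n_rows=5, n_cols=3, n_horizontal_lines=6, n_vertical_lines=4))
```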


And then the third step would be post-processing, which is the part where we filter out the noise from the predictions and make the predictions a little bit simpler. And the fourth one is feedback and training, which we haven't talked about until now, but we want these systems to get better and better over time. So we need to have a feedback loop where, whenever the machine learning model makes an error, a human correction is recorded so that we can teach the machine learning model it has made a mistake, so that it's less likely to make that mistake in the future. Can you move to the next slide, please?
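
A small sketch of the feedback-and-training idea under stated assumptions: every human correction is logged as a (predicted, corrected) pair so the model can later be retrained on its own mistakes. The file format and function name here are hypothetical, not part of the system described in the webinar.

```python
import json
from pathlib import Path

def record_correction(table_id: str, predicted_xml: str, corrected_xml: str,
                      log_path: str = "corrections.jsonl") -> None:
    """Append a correction example to a JSON-lines file for later retraining."""
    entry = {"table_id": table_id,
             "predicted": predicted_xml,
             "corrected": corrected_xml,
             "needs_review": predicted_xml != corrected_xml}
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

record_correction("tbl-001",
                  "<table><row/></table>",
                  "<table><row><cell/></row></table>")
```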


So now we're going to zoom in a little bit on these separate tasks that we're doing, and go through the whole workflow of what's happening. So this is a table, and this is a table inside a document, and we are honing in on exactly where the table is. So there's a couple of things that I want to say on this slide. Firstly, in the blue box, I don't know if you want to look at that again, I'm not sure how many of you can see just one table there, but if you look closely, there's two tables in there.


So that's one of the problems that we have. Even us humans, who can see that there are two, can make that mistake, and obviously an algorithm is more susceptible to making it. So there are two tables there. And also, if you look at the block diagram above that, it's a diagram, it's not a table. So we need to be able to distinguish that, which the algorithm is doing well here, but I just wanted to mention that it's not always easy. The diagram can look like a table, and the algorithm can believe it's one. So that's where that comes up. Next slide, please. So after honing in on where the table is, now we're starting to separate what type of table they are. So the first type would be, here it's labeled semi-bordered, where there are some lines and visual cues telling us where the cells and rows end, or where they start.


20:00

But sometimes they are not included, there are no lines, or very minimal lines. So that would be a borderless table. And then there's the simpler kind, where there's a line for every cell, which is a lot easier because we can just find the lines and figure out where the cells are. Next slide, please. So for the more complex tables, we then go forward and run them through our algorithm, where we find the different places where the characters are.


So all of these are candidates for a cell. So like I said, it's an object-detection problem where we draw bounding boxes on all these characters. And the next step here would be to find the relationships between these bounding boxes. Next slide, please. Finally, after we have all of that, we draw the columns and rows. After that, we have the post-processing. So the post-processing, just to go a little bit deeper into that: all that it's doing is throwing out the noise. Sometimes the algorithm predicts two cells, more than one cell, at a given location, which is kind of impossible, so we filter out that noise. And we have confidence thresholds where, if the model is not confident about some cell existing, we throw that out. So there's this filtering going on, which produces the end product, the clean output. Next slide, please.
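
A minimal sketch of the post-processing filters just described, with illustrative thresholds rather than the production values: low-confidence cells are dropped, and when two predicted cells overlap heavily at the same location, only the more confident one is kept.

```python
def iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    ix = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    iy = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = ix * iy
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0

def filter_cells(predictions, min_conf=0.5, max_overlap=0.7):
    """predictions: list of (box, confidence); keep confident, non-duplicate cells."""
    kept = []
    for box, conf in sorted(predictions, key=lambda p: -p[1]):
        if conf < min_conf:
            continue                      # confidence threshold
        if any(iou(box, k) > max_overlap for k, _ in kept):
            continue                      # two cells predicted at the same location
        kept.append((box, conf))
    return kept

preds = [((0, 0, 50, 20), 0.95), ((2, 1, 51, 21), 0.60), ((60, 0, 110, 20), 0.40)]
print(filter_cells(preds))
```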


So there you have it, that's the whole workflow of extracting the table. And the end product would be this XML. I just wanted to say that this is also what we feed into the algorithm. So the left side, the image or the PDF, is what we feed into the algorithm. And then what we get out of the algorithm is the XML. So we'll talk more about the training procedures and data there, but I just wanted to say that those are the input and output for the system. And while training, we also do some augmentation, which we'll go into a little bit later.


Yeah, we'll talk about that in a bit. Next slide, please. So going more into the training process: as you've seen, this is a multi-layered system; there are multiple steps to the training process. The research is heading towards combining all these processes into one, but right now there are different subsystems working together in tandem. So I'll just go through how we're thinking about the training process. Here I've listed out how each of these subsystems measures how well its predictions are working.


So obviously the object detection, the computer vision part, works like any object-detection algorithm, where it's looking at intersection over union. And so a good prediction there would be the center one, where the label for the object and the prediction coincide very well.


23:56

But a poor prediction would be where the prediction goes way off from what it should be, or the whole table is not detected, or the table is cropped. So that would be a poor prediction. An excellent prediction we don't usually get. Actually, if we get an excellent prediction, that's a very good sign that it's overfitting, if you know what I mean. And then there are other subsystems which are just classification. So we're just trying to classify whether this table is semi-bordered or borderless. So that would use cross-entropy. If you've worked with that, then you'd know.
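
For readers who have not worked with cross-entropy, here is a small worked example of the loss for a bordered / semi-bordered / borderless classifier; the probabilities are invented for illustration.

```python
import math

def cross_entropy(predicted_probs, true_index):
    """Negative log-probability the model assigned to the correct class."""
    return -math.log(predicted_probs[true_index])

classes = ["bordered", "semi-bordered", "borderless"]
confident_right = [0.05, 0.90, 0.05]   # model is sure, and correct
unsure          = [0.40, 0.35, 0.25]   # model is hedging
true_class = 1                          # ground truth: semi-bordered

print(cross_entropy(confident_right, true_class))  # ~0.105 (low loss, good)
print(cross_entropy(unsure, true_class))           # ~1.05  (higher loss, worse)
```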


And the third one is just another example, which is: sometimes we don't have PDFs, we have images, and we need to read what the characters are in the cell, and that's the NLP part of it. And so that's the OCR as well. So I just wanted to show you all these separate subsystems, each having its own measurement of quality, of how it's working. Next slide, please. So that would be all the machine learning systems behaving on their own in a modular way, but when we look at it from a top-down view, we only really care whether it was able to capture the structure of the table and all the elements within it properly or not. We don't really care about all the little subsystems and how they're doing, we only care about the end product. So to measure the quality of the system as a whole, we use tree edit distance. So let me just explain what that means. First, if you are not aware of what that means, we'll talk a little bit about the string edit distance. So here I've listed an example of the string "moon," M-O-O-N, and the string "lions," L-I-O-N-S. You can see that two of the characters are the same, which is O-N, but the two characters M-O need to be changed, or edited, into L and I to get "lion." And then we need to add an extra S on the end to get "lions."


So there are three edits. So that would be an edit distance of three. So what this means is that the more edits the algorithm, or let's say a human, needs to make to the prediction of the algorithm, the less useful the algorithm is. So in an ideal case, the algorithm would generate a table where the human does not need to make any edits at all. So a low edit distance is good. So translating the same thing into a table, it's the same concept, which is edit distance. I've put a table there, where in the original table you can see that the age column is merged.
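
A minimal Levenshtein (string edit distance) implementation makes the "moon" to "lions" example concrete; it reports the three edits described above.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

print(edit_distance("moon", "lions"))  # 3
```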


And in the prediction, there's a mistake with that merge, where the human actually has to go in and merge those two cells. So the more edits a human has to do, the less useful the algorithm is. So that's the edit distance idea. And now coming back to tree edit distance, where the word "tree" comes in is, if you've seen the XML structure of a table, we have the rows and headers and the different table cells.


28:04

So the tree edit distance is all about how many edits we have to make to that tree structure, of the HTML if you think about it as HTML, but currently we're using XML. Can you go to the next slide, please? So that's it from my end, and I want to pass it on to Mark.
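
To make the tree edit distance idea concrete, here is a heavily simplified sketch, not a real tree edit distance algorithm such as Zhang-Shasha: each table's XML is flattened into a sequence of tags and cell texts, and a plain edit distance is computed on those sequences. The XML snippets are invented, but the intuition matches what is described above: the fewer edits needed to turn the prediction into the correct table, the better.

```python
import xml.etree.ElementTree as ET

def flatten(xml_text: str) -> list[str]:
    """Flatten an XML tree into a sequence of tags and non-empty cell texts."""
    tokens = []
    for elem in ET.fromstring(xml_text).iter():
        tokens.append(elem.tag)
        if elem.text and elem.text.strip():
            tokens.append(elem.text.strip())
    return tokens

def seq_edit_distance(a: list, b: list) -> int:
    """Levenshtein distance over token sequences instead of characters."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, start=1):
        curr = [i]
        for j, tb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ta != tb)))
        prev = curr
    return prev[-1]

truth      = "<table><row><cell>Age</cell><cell>30</cell></row></table>"
prediction = "<table><row><cell>Age</cell></row><row><cell>30</cell></row></table>"
print(seq_edit_distance(flatten(truth), flatten(prediction)))  # edits needed to fix the prediction
```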


Mark Gross

Okay, thank you Isu, that was excellent. First of all, going back to why this is important is you've got tables in all industries, and getting the information out of them is complicated and difficult, and a lot of information is stuck inside tables for all the reasons we spoke about before. And that happens in a financial community and the scientific community, certainly in pharma and other health-related areas, in manufacturing.


We didn't talk much about that, but you've got data coming in that's reported to the government, that's come back from the government, information on well drilling and underwater kinds of things. And all this has been put together in some kind of table format, and pulling it out becomes a very difficult thing. And today there's an insatiable need for data in all kinds of applications. I think what comes out of this, and I think came out of what Isu was talking about is, largely when you start getting into machine learning and NLP and other AI processes, you're dealing with a statistical process.


And those boxes that Isu showed, you want to get the fit as close as possible, but if it fits too closely it might mean that you've got a problem. This is an area where you have to set your expectations for what's going to come out. And it's very nice to say I'll do everything with a machine and go all the way out there and not involve humans, but it's very hard for it to be a 100% process. You really have to ask, is it going to be 99% or 98%, and at what level is it worth doing that way? And like many processes we've done, like the work we do for the US Patent Office where we're doing millions of pages a month using these techniques untouched by human hands, there's also the understanding that it's not going to be a 100% process at the end.


And that's understood. You get 98% of what you want at one-tenth the cost, but getting from 98 to 99% becomes a very difficult process. But that's really good enough in many, many areas. It may not be good enough in the pharma world, or it may not be good enough in some of these other areas. So I think there's a lot to be done over here; we've certainly been working on it for a long time. Isu's company Fusemachines has spent a lot of effort working through these problems, and it's made a huge amount of progress over the last few years. And just a caution to all of this: while artificial intelligence can do anything, it's very hard for it to do everything. But if you identify what it is you're working on, then our experience is that you can get really good results.


32:00

And it's all a matter of knowing what you're going for and knowing what the bar is. And with that I will turn it over back to Marianne, who has disappeared off the screen, but she's going to come right back on. Okay, thank you.


Marianne Calilhanna

I'm coming back here. Thanks. So we have a question, I think we covered some of it, but Mark, I'm wondering if you can maybe dive a little bit deeper. Someone asked "Why would we want to convert a table from an image or a PDF into XML?" You know, I was thinking about one of our customers for whom we do a lot of data harvesting, a lot of that content is regulatory information found in tables, on websites and in PDFs, and we're pulling that tabular information out and transforming to XML. I think that's a really valid use case. Maybe you could speak to that and some others that DCL has been involved in.


Mark Gross

Sure. So I mean, XML is the example we use, because that's become, in data handling, the lingua franca, but you can also think about it in terms of converting information into an Excel file or into a Word file. I mean, when you have an image, and you've got an image of a table, you can't do your computations on it, there's nothing to compute, again, it's just a picture. So if you're going to do anything with that information, like do analytics on the table itself, or combine all the information, you have to get the table into a form where you pulled out the information in a uniform manner.


So at one level, let's put it in an Excel file, so you can do your analytics on it, or you can move it into a Word file so that you can do those analytics, or you can move it into XML. And really, in my line of thinking, it's all the same thing: how do you get discrete elements out of that table so that you can do further analysis with them? And that's what happened in the description I gave from 40 years ago, of taking financial data that was coming in, in whatever files there were, and standardizing it and moving it into what at that time was VisiCalc, the predecessor to Excel. It was just that: get a lot of tables together so that you have the information in a uniform format, so that you can combine it all.


Other areas are, you get invoice information and payments information. Well, if they're all coming in different ways and all you've got is pictures of invoices, how do you move that into your accounting system? Well, if you can pull that information out in some way and organize it, you can develop a method to take it into your computer system automatically. The alternative is to retype it. So we're all trying to get away from retyping that page. If it's one or two pages, well, just retype it, but if it's a million pages, that's not a workable solution.


Marianne Calilhanna

Right. And not to mention search and discovery: you can't search or discover a flat image. If you have text, such as XML, you can do some search and discovery.


36:03

Mark Gross

Right. And a PDF file is someplace in between. PDF files do often have the text in them, so you can do a search on it, but you can't do an accurate search, as we say. I mean, it will find you all the words, but a lot of times you want the words in context, or the words where they appear with something else. So again, if you have 100 pages, just searching for it is great, but if you have a million pages, then that becomes a voluminous process, like the work we do for the New York Public Library, where we've taken the copyright records.


If you want to find an author with a particular name, yes, you can search it against the original OCR materials, but that name might appear in many different places, and it might not even always be a name. So if you're looking at 100 years of data, because it's a million pages, that doesn't really work very well. Once you've moved it into XML, you can do things like say "Give me that particular name," when it's an author's name rather than anyplace else it might appear. So classifying all this information becomes very important as we go on to larger databases.


Marianne Calilhanna

All right, thank you. We have another question here asking if you have any advice on how to better structure tables in print so they can render better in XML outputs?


Mark Gross

Well, I guess we would go back to what Isu was talking about, the things that make a table complex; if you left out those things, it would be much easier. So I think the issue is, and Isu, you may want to add to this, I mean, if you don't have lines around the table, that makes it very hard to figure out where the table starts and ends. A human can do it, sort of, a lot of times, but a computer has more trouble. If you have blank lines, blank columns, that becomes difficult.


If you go into three-dimensional tables, I showed some of those, where there are multiple columns that mean different things, that becomes difficult. So to the extent you can simplify the table and make it just columns and rows, well, that's much easier, but on the other hand it takes away a lot of the reasons you make a table in the first place. A table is made so that you can organize lots of information in all kinds of ways that we haven't even thought about yet. So it's a blank canvas; that's what gives you the flexibility. And Isu, you might want to add to that.


Isu Shrestha

Yeah, sorry. I was just going to add that, yeah, I mean there are ways; for example, even if it's a complex table, if all of the columns and rows have lines around them, it's very easy to make an algorithm that can detect the lines and just parse the table. But again, these tables are not designed to be consumed by an algorithm. They are designed to be consumed by a human. If a company wants to design tables that need to be consumed by an algorithm, they'll just give you the XML directly. So those are the modern companies that already have their information digitized.


39:59

So we are not worried about them because they already have the XML, you can just take the XML directly. But we are worried about companies in many parts of the world that have all their, for example, financial documents in a human-readable, pretty table format, but we don't have access to that digitally. So yeah, that's why.


Mark Gross

Yeah, that's a very good point that Isu makes. I mean, if you want financial data in the United States, for most companies, the information is released as XML, it's filed with the SEC as XBRL files and stuff like that. But if you're trying to collect that same information in Africa or in Asia, most of that information is delivered as paper, or it's delivered maybe as PDF files. And so there's a whole world of data that is not currently in XML. And so that's what we're talking about; we're talking about the other data, I guess. The easier stuff, that's already being done.


Isu Shrestha

I was going to say it comes back to the question of why do this in the first place, which is, imagine you're an investor trying to invest in a country, and you're looking at a hundred companies with all their financial data, and it's too much to go through each one manually like that. So to do that analytics, you use these algorithms. Sorry Marianne, you were saying something.


Marianne Calilhanna

Well, I'm just going to put on my editorial hat for scholarly publishing, because there are still instances where I see tables getting into journals, whether that's a PDF of a journal, print, or just digitized, where we start with editorial too. I mean, for some tabular content, a good editorial cleanup could tackle some of the things like straddle headings that maybe aren't necessary. Maybe you have a straddle head that says "dollars"; well, you could just put that as an add-on under a row: number, comma, dollar sign. That's just an example off the top of my head. But I think starting with a clean editorial review is also helpful at times to get rid of extraneous things. So I have another question: "If you already have tables that are in XML, are there methods to extract the semantic relationship and convert to structured data?"


Isu Shrestha

So that's a very good point, actually. That's one of the training techniques, where we take examples of tables that we already have the outputs for, generate the images, and then have the machine learning model make predictions on those. So that's a very good question, and that's one of the training techniques.


Marianne Calilhanna

So another question is "Once extracted to XML, do you ever create interactive tables?"


Mark Gross

So, Isu, you're maybe looking at it from another aspect. Once you've got the XML, there's software that builds interactive tables out of it, because now everything has been structured. So I mean, everybody's view is based on what they're laser-focused on.


43:59

We're laser-focused on creating the XML and having it structured and be correct so that it can be used by other software, other equipment. So once it's done correctly, there is software that'll create all kinds of interactive tables.


And the software's used for all kinds of things like that. For example, a lot of the work we do with creating XML for repairing equipment of all kinds, usually large equipment and military equipment, the XML gets produced, and then it gets moved into an IETM, an interactive electronic technical manual. So that software is what will be showing the flow charts, that software will be showing the tables and doing the interactivity. What we're providing is the data structure in such a way so that that software will work, that's the way I think of it.


So yes, you take information, you structure it. One of the leading ways it's structured today is into XML. That XML then becomes the input to all kinds of other software, analytics engines, software that display things in very nice ways. That's done by the software, but the same data can be used by many different sets of software.


Marianne Calilhanna

That also answers one of the earlier questions about additional business cases. So IETMs are a great example. This question is for Isu: "How do you deal with hybrid tables, that is, tables that have text and images or drawings?" Can you speak to that?


Isu Shrestha

We don't see many of those in real life, but yeah, that is a problem if it does come up. That's when the algorithm is not able to decide whether it's a table or an image. So that's probably where it might not record things very well, because one of the distinctions the algorithm is trying to make is whether it's capturing a table or text or different sections of a document; it's trying to hone in on the tables only. And if I understand your description correctly, you are saying there's an image inside of the table. So usually if it's just a watermark or something that's behind the table and not intended to be read, the algorithm does fine on that, but if it's something that's part of the table, then it might not record that, or it might make mistakes on it.


Mark Gross

I would just add that that's a different problem. I mean, it's true most tables don't have that, but some tables do. And certainly I can imagine, I've seen product catalogs where it's a long table showing different models and designs, and then there will be pictures of the device inside. So there are different techniques to pull that out. Usually it would be a pre-process before it gets to Isu's software, which would go and find the images and pull them out, and put them somewhere temporarily while the rest of the analysis gets done. So the analysis we're talking about here is really designed to deal with text and formulas and things like that that might be inside a table.


48:00

The images really should be done separately, and then they might be merged in later. That's really a specialized version of what we're talking about over here, I think. So I think it goes back to, people can put anything into a table; it's not always really a table.


Marianne Calilhanna

Thank you. So someone asked if maybe Isu can speak a little bit more on the accuracy of the model at this point. The person went on to say that they understand this depends on the complexity of any given table, but are you actively performing these AI conversions already, or are you still developing the accuracy of the model before rolling it out to businesses and industries?


Isu Shrestha

Yeah, you're correct in saying that the accuracy depends on the type of table. So when we separate it out, usually with the simpler tables, where there are a lot of visual cues, we can get the accuracy pretty high, low 90s or 95, in that range. And again, when I say accuracy I mean tree edit distance, so even a few edits lower the accuracy significantly. So that's for simpler tables. For mid-range tables, it goes a little bit lower. And for the lowest, the most complex tables, it hovers around 75, or 70 in the worst cases. So that's what we see right now. Yeah, so that's the tree edit distance we see so far.


Marianne Calilhanna

Thank you. And also, Isu, what are the core logics used for the graph relation prediction step, is that something you can speak to?


Isu Shrestha

So it's basically taking each cell as a node on the graph. So if you can imagine all of these nodes connected to one another, our job is to predict the relationships between them. From the graph, the model already knows which nodes are adjacent to a given node, and the algorithm is predicting whether a connected node actually has a relationship to the next column or not. So in other words, is it a merged column that is merged across many other columns, or is it only one singular column? So that's what it's trying to predict in the graph.
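
An illustrative sketch of that relation-prediction idea, not Fusemachines' actual model: for a pair of detected boxes, a simple geometric rule stands in for the learned classifier that decides whether a straddling header relates to a given column. The coordinates and threshold are invented.

```python
def same_column(box_a, box_b, min_overlap_ratio=0.5):
    """box = (x0, y0, x1, y1). A learned edge classifier would replace this rule."""
    ax0, _, ax1, _ = box_a
    bx0, _, bx1, _ = box_b
    overlap = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    narrower = min(ax1 - ax0, bx1 - bx0)
    return overlap / narrower >= min_overlap_ratio if narrower else False

header  = (100, 10, 300, 30)   # a straddling header like "Civilian"
col_one = (100, 40, 190, 60)
col_two = (210, 40, 300, 60)
# True, True: the header relates to both columns, i.e. it straddles them
print(same_column(header, col_one), same_column(header, col_two))
```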


Marianne Calilhanna

All right. Another question: "Can AI be used to extract context from the surrounding text that explains the purpose of the table or discusses aspects of the table data?" Maybe they mean the table title.


Isu Shrestha

Yeah. From what I understand, what you're saying is that from the surrounding text we know that it's a product catalog, so we know what to expect, what type of table to expect, and then therefore tweak our algorithm in the same way. So that's an interesting point. Currently, we don't support that. Currently, our goal is to make the algorithm as general as possible and able to support financial documents, scientific documents, or commercial documents.


52:02

But yeah, that's a good point, and maybe a next step on how to improve the algorithm.


Mark Gross

Right. I would just add that, I mean, what we're describing over here is, again, laser-focused on how you take apart a table and figure out what the pieces are. This is a tool in our toolset, in DCL's toolset, which does have more capability to take information and try to combine it and pull it in from the surrounding text, from databases and all kinds of things. So it's not as though in the general case you can just pull information without thinking about it, but certainly within specific use cases, that is the kind of thing we do.


We'll see what's happening above the table, below the table, we'll keep track of what's in the summary of the article perhaps, and use that as clues to feed the semantic analysis engine. So the answer is a qualified yes, this is the kind of thing we do. I would caution that this is not one of those push-the-button-and-it-does-anything-in-the-world things. With all the hype around AI, there are no silver bullets, but there are some very useful tools to do things that we really were never able to do before.


Marianne Calilhanna

So are you saying we can't just ask ChatGPT to turn tables into XML, to spin tables into golden XML?


Mark Gross

That's a whole other webinar. [Laughs]


Marianne Calilhanna

So another question: "How much data was involved for the training sets?"


Isu Shrestha

So there are multiple steps, as you can see; there are multiple different algorithms at work. So there's a pre-training step where we train it on the order of half a million tables. And then we go on to fine-tune it on different sets of data. So you can think of it as half a million, or close to a million, tables.


Mark Gross

And before we scare everybody off, I think, Isu, you're talking about that generic case of trying to take any table and organize it. And so that's a big task. If you could break it down to certain kinds of tables, certain kinds of analysis, you probably could do much smaller training sets, but that would be a more specific tool you're building for a specific application, a specific use case. So you have to keep in mind what it is you're trying to do. My approach is always: let's try to put a line around the problem we're trying to solve and try to solve that one. As you make it larger, it requires more and more, and even a half a million cases may not be enough. So asking how big a training set you need is like asking how much a car costs; it depends on what you're trying to do.


Isu Shrestha

You're absolutely right. Thank you, Mark. The half a million tables is what we, in some sense, primed the algorithm with. So we train it on half a million tables. And then, if we wanted to specialize, we take that model that's trained on half a million tables, which has learned a lot from that, and then we, we call it fine-tuning, take that model and specialize it for a particular client or particular job that it needs to do. So we can do that, and we wouldn't need as much training data to do it.


56:07

Marianne Calilhanna

Okay. And then the last question: "So I assume that because this is machine learning, the algorithm will continue to improve over time?" Is that a correct assumption?


Isu Shrestha

Yeah. So that's where the feedback mechanism comes into play. I mean, it wouldn't just get better on its own; it would just make the same errors, the same predictions. What we need to do is feed it with the corrected versions of where it's making these errors. So if it's making a certain type of error, we fix it. We first identify that, okay, it is making this kind of error. Then we give it examples of that type of data, more and more of that type of data, so that it makes less of that mistake.


Mark Gross

You also have to get to the point of retraining it periodically to make sure you incorporate all the information. It's not like a fourth-grader who learns everything on their own; you really have to be proactive and collect information, retraining over time, and then it gets better over time.


Marianne Calilhanna

Well, thank you both. Thank you to everyone who's taken the time this afternoon, this evening, this morning, wherever you are in the world, to be with us. The DCL Learning Series comprises webinars such as this. We also have a blog and a monthly newsletter. And you can access many other webinars related to topics like this, and content structure, XML standards and more from the on-demand webinar section on our website at dataconversionlaboratory.com. We just pushed that URL out via the chat. We hope to see you at future webinars, and I hope you enjoy the rest of your day today. Thanks so much.


Mark Gross

Thank you.


Marianne Calilhanna

This concludes today's webinar.

