Semantics & Structure: Unlocking Content to Reach More Global Customers
[Marianne Calilhanna] Hello, everyone, and welcome to the DCL Learning Series. Today's webinar is titled "Semantics and Structure: Unlocking Content to Reach More Global Customers."
My name is Marianne Calilhanna. I'm the Vice President of Marketing at Data Conversion Laboratory and I'll be your moderator today. Just a couple of quick things before we begin. This webinar is being recorded and it will be available in the on-demand section of our website at www.dataconversionlaboratory.com.
Now, before I introduce our panelists, I'd like to provide a short introduction on Data Conversion Laboratory, or DCL, as we're also known. Our mission is to structure the world's content.
DCL's services and solutions are all about converting, structuring, and enriching content and data. We're one of the leading providers of XML conversion, DITA conversion, S1000D conversion, and SPL conversion services. While we are best known for our quality and our customer service, we also do a lot of work with semantic enrichment, entity extraction, data harvesting, third-party quality assurance validation, content re-use analysis, and structured content delivery to industry platforms. If you have complex content and data challenges, we can help.
I'm so pleased to have today's panelists here with all of us today. We have Mark Gross, President of Data Conversion Laboratory, Marjorie Hlava, President at Access Innovations, and Jay Ven Eman, CEO of Access Innovations.
Margie and Jay, can you share a little bit about Access Innovations with us?
[Marjorie Hlava] I can. So Access Innovations is particularly dedicated to enriching metadata. And we do that through semantic enrichment, as it's often called. We like to use things that are reproducible and consistent and that we can improve the accuracy on. So we depend heavily on controlled vocabularies of different kinds, including taxonomies, thesauri, and ontologies depending on what the consuming applications are. This is normally considered in subject metadata, but we also do other kinds of metadata, encompassing basically the entire bibliographic citation, as they used to be known.
We find that our clients have a lot of very big collections, often, and so we've applied automated means to this indexing using a human-aided intelligence system that gets us into the upper nineties in accuracy.
It is rule-based, because we like, as I said, to be predictable and consistent, and be able to prove how we got to where we're getting, because after all, we're handling other people's data. In order to do that, along the way we have developed a number of software tools to support the services and consulting that we provide for clients. Back to you.
[Marianne Calilhanna] Thank you, Margie.
Mark, can you take it away?
[Mark Gross] OK, thank you, Marianne. First, Margie and Jay, I just, it's really an honor and a pleasure to have both of you on this program today. We've known each other a long time, and there's always things to talk about, and it's really nice to see you here, although I'm in rainy New York and you're in sunny Albuquerque.
Always sunny in Albuquerque, or most of the time. First, I want to just lay the groundwork. You've been, you've been at the forefront of these technologies for a long time, actually, a little longer than I have been. Just, just, can we go back a little bit to that?
It sure is different than, I think in some ways, but why did you get into it? Why did you think this was a good idea?
[Marjorie Hlava] Well, we, Jay and I were both working at a NASA installation at the University of New Mexico when we met.
And we were using a lot of the early databases, and searching; I was searching about 20 hours a week online. And they just fascinated me, how they were put together and the coverage of areas, and then every now and then, somebody would ask us to do a database for them, maybe put it up on DIALOG, or SDC ORBIT, or one of the early systems, POP INFORM, for example, or even the NASA Recon system.
And it became increasingly obvious that, instead of just searching those databases, we could build them. And that's how we started. We formed the business in 1978, really to build, build well-formed database collections of people's articles and information.
[Mark Gross] OK. And it's more important now than ever, because of all the content that's out there. And I think there are two things. On the economic side, there are reports that book sales could be 75% higher when they have more complete metadata.
And, and more than that, things can be found. They couldn't be found before. I mean, we have some, some stories about, you know, you've been doing some stories about things that couldn't be found before. [Jay Ven Eman] No, really, it's a real challenge.
I'd like to go off-topic for just a second, for those in the audience that are US-based: today is Veterans Day in America. For those who are maybe dialing in from Europe late in the evening, I'd like to do a shout-out to the veterans in the audience and thank them very much.
And I will segue that into the fact that there are a lot of veterans in the US, a lot of veterans' services that are available to the vets. Uh, how do they find those services?
Well, maybe they have access to a computer at the library or their own, and, and they go and try to search for things, and it's really very, very difficult, even, even today, with much more advanced search tools and Google to find resources that they need. And of course, in a pandemic, we're all looking for answers, and it's much easier to find when you're dealing with well-structured, well-semantically enriched content.
If I have a moment, I just have a quote from, again, the US Centers for Disease Control, and they wrote that the disease name was subsequently recommended as COVID-19 by the World Health Organization.
Meanwhile, 2019-nCoV was renamed SARS-CoV-2 by the International Committee on Taxonomy of Viruses. So if you're looking for answers, you need to know all that. Or do you? And that's where taxonomies and semantics can really help guide you to the information you need.
Aspirin: acetylsalicylic acid. That's a quick example.
[Marjorie Hlava] I could, I could give a couple of more. So part of the – we found that the industry has grown up along two different columns: we have library science and we have information science, and they don't talk to each other all that much.
But what we've found is that library collections tend to have three keywords that they use, usually from an established list, like the Library of Congress Subject Headings. And so we did a project where we took a university's special collections.
They were getting about 40% accuracy in search off of their subject headings from the LCSH, and we indexed it with the JSTOR collection, their taxonomy, and did the same search queries and found that we got over 85% accuracy in returning the results. And so, that's a real-world example of how metadata has improved accessibility. We had another adventure some years ago with the Russian information community, and we were going to bring in open-source information.
And because that information was, it was an adventure that the Maxwell enterprise had sponsored and nobody was quite trusting each other, and he was going to trade technology for information.
So he was providing microfilm machines and photocopying machines in exchange for the tapes from the all Soviet information scientific and technical information network. And so, Maxwell, they didn't quite trust him, so he didn't send the cameras and the microfilm machines, and he didn't send the batteries for the photocopiers. And in exchange, the Russians provided the full text of all of their documents, but did not include the metadata.
So when we were eventually able to straighten that out, we found that retrieval on those collections without metadata was about 20%. And with metadata was about 90%. So in terms of trying to have real information exchange, the metadata is crucial. And we could go on with lots of examples, but those are the kinds of things that, that happen in real, real-world instances. And the metadata is pretty, pretty important.
One of the things that happened as we go through these election times: there was the e-mail server thing with Hillary Clinton, and wherever you stand on all of that, the interesting thing to me about it was when she said, "Well, it's just metadata." Like everybody knew what metadata was, and suddenly "metadata" as a word was catapulted into a whole new environment and people started to know more about what we're doing. And it has changed over the years. I mean, I used to say "I build databases," and people would say "What are those?" "We do metadata enrichment." "Huh?"
And now, by and large, people, people tend to know what that is and, and they want some too. And you're right, Mark, that it's a whole lot more important. And we've done a whole lot more of that kind of work now than, than ever before.
[Mark Gross] So, no, I think it's pretty amazing when, like what we're – esoteric terms suddenly come out on the front page of The Wall Street Journal, or The New York Times, I always find that pretty amazing. And, you know, when people, in national interviews, a term like "metadata," as if people understand it, it's quite amazing. I want to go back, folks, to something you said about that, that without the metadata, the search effectiveness was 40%.
That means that if there are five articles out there that are relevant, you only find two of them, and three of them are not found. [Marjorie Hlava] That's right.
[Mark Gross] That's amazing. I mean, it's not just, oh, it's a little bit of a difference. It means that you haven't found significant amounts of the literature.
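The "two out of five" arithmetic above is what information retrieval calls recall. A minimal sketch of that calculation follows, alongside precision (the share of returned items that are actually relevant). The article ids are made up for illustration and are not from any collection discussed here.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant items that the search actually returned."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


def precision(retrieved, relevant):
    """Fraction of the returned items that are actually relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


# Hypothetical ids: five relevant articles exist, the search finds two of
# them plus one article that is off topic.
RELEVANT = {"a1", "a2", "a3", "a4", "a5"}
RETRIEVED = {"a1", "a2", "x9"}

print(recall(RETRIEVED, RELEVANT))     # 2 of the 5 relevant articles found
print(precision(RETRIEVED, RELEVANT))  # 2 of the 3 returned articles relevant
```

Recall measures what was missed; precision measures how much noise came back with the hits.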
[Marjorie Hlava] And it's compounded, Mark, by the fact that, like Jay referred to the COVID example earlier, we've built a little COVID taxonomy, which we're happy to give to anybody. And the reason is, because, at current count, there are at least 19 different synonyms published in the literature for COVID. And if you really want to do a good search, you have to either put them all in or you have to have a search engine that will use the synonymy.
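The idea of a search engine that "uses the synonymy" can be sketched as query expansion: the user types one term and the system searches for every known variant. This is only an illustration under stated assumptions. The synonym list is a handful of example names (it is not the COVID taxonomy mentioned above, and it lumps virus and disease names together purely for retrieval), and the documents are invented titles.

```python
# Hypothetical synonym ring for illustration only.
SYNONYMS = {
    "covid-19": ["covid-19", "covid", "sars-cov-2", "2019-ncov", "novel coronavirus"],
}


def expand_query(term):
    """Return every known variant of a term, or just the term itself."""
    key = term.lower()
    for preferred, variants in SYNONYMS.items():
        if key == preferred or key in variants:
            return list(variants)
    return [key]


def plain_search(docs, term):
    """Match only the literal term the user typed."""
    return [doc_id for doc_id, text in docs.items() if term.lower() in text.lower()]


def synonym_search(docs, term):
    """Match any variant of the term found in the synonym ring."""
    variants = expand_query(term)
    return [doc_id for doc_id, text in docs.items()
            if any(v in text.lower() for v in variants)]


# Invented document titles.
DOCS = {
    "a1": "Clinical features of COVID-19 in adults",
    "a2": "Structure of the SARS-CoV-2 spike protein",
    "a3": "Early reports on 2019-nCoV transmission",
    "a4": "Seasonal influenza vaccination rates",
}

print(plain_search(DOCS, "covid"))    # only the article that uses that word
print(synonym_search(DOCS, "covid"))  # also the articles using other names
```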
[Mark Gross] Right. So, so, so, well, this is where just a search or, or a Google search, I guess, to use the vernacular, how many of those are going to be included when you, when you do that? I mean, is synonymy included there, or it's really, totally outside of that?
[Marjorie Hlava] Yeah, you have to put them in yourself in Google.
[Mark Gross] Right, So, so that means that if you put in and just do a search on COVID, you'll find maybe 10% of the articles, or 15 or 20%, and miss the overwhelming number out there. [Marjorie Hlava] Right. And – [Jay Ven Eman] Right. And the –
Oh, go ahead, Marge, I'm sorry. [Marjorie Hlava] The people that are using your term will be the people that you're talking to, and everybody else who's having a conversation using a different term, you won't be part of that conversation. And it starts, it puts up all kinds of barriers.
[Jay Ven Eman] And on the COVID thing you mentioned, Mark, if, if your discovery platform is properly designed, then what will happen if you're looking for information on COVID and all of its synonyms?
Not only will you get a better chance of surfacing them, but you'll also get better relevance. Because if a term is assigned to an article, by humans or by an automated system like ours, it means the article is about COVID, as opposed to an article just mentioning it in passing.
So many articles today have nothing to do with COVID and aren't going to provide you with the solution to anything related to it, like treatment, or help a scientist looking for information on it. But if an article just mentions COVID, a full-text search will surface it because the word is in the text. When you have semantically enriched content, the assigned term can be weighted more heavily, so the article rises to the first or second page of your search results because it's about COVID, not simply a mention of COVID.
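The weighting described here can be sketched as a scoring function in which an assigned subject term counts for more than a word that merely appears in the text. The weights (3.0 and 1.0), the field names, and the sample records are all assumptions chosen for illustration; they are not how Data Harmony or any particular discovery platform scores results.

```python
def score(doc, term, assigned_weight=3.0, mention_weight=1.0):
    """Score a document: an assigned subject term outweighs a text mention."""
    total = 0.0
    if term in doc["subject_terms"]:
        total += assigned_weight
    if term.lower() in doc["text"].lower():
        total += mention_weight
    return total


def rank(docs, term):
    """Return ids of matching documents, highest score first."""
    scored = [(score(d, term), d["id"]) for d in docs]
    ordered = sorted(scored, key=lambda pair: -pair[0])
    return [doc_id for s, doc_id in ordered if s > 0]


# Invented records: d1 only mentions the term, d2 is indexed with it,
# d3 is indexed with it and mentions it, d4 is unrelated.
DOCS = [
    {"id": "d1", "subject_terms": ["Supply chains"],
     "text": "Shipping delays worsened during COVID-19."},
    {"id": "d2", "subject_terms": ["COVID-19"],
     "text": "Antiviral treatment outcomes for the novel coronavirus."},
    {"id": "d3", "subject_terms": ["COVID-19"],
     "text": "COVID-19 treatment protocols in intensive care."},
    {"id": "d4", "subject_terms": ["Influenza"],
     "text": "Seasonal flu vaccination."},
]

print(rank(DOCS, "COVID-19"))
```

Note that d2 is found even though its text never uses the literal term, and the passing mention in d1 ranks last.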
[Mark Gross] So we're talking about two different issues over here which go against each other or work together. One is being able to find everything.
The other is to discriminate and only leave, bring up to the surface the things that are really relevant, even though they're found. And in a world where you have, I don't know how many tens of thousands of articles that came out about COVID, that's, that's, or hundreds of thousands, that, that becomes, that's really a big issue. So, so, the, so, you mentioned that you built a taxonomy really just to deal with COVID or build out these synonyms.
But that's just a piece of what a typical, let's say we're dealing with journals in the medical space. They're going to be building, they're gonna be building out, what does it take to build a taxonomy like that? I mean...
I'm putting you on the spot. [Marjorie Hlava] You mean how do we put it together from the beginning?
[Mark Gross] Yeah. [Marjorie Hlava] So, normally, we look at the content, because, well, and, and I'm gonna put a placeholder in precision and recall.
But to build a taxonomy, what we do is we, we look at the individual collection of content that the taxonomy is going to be used for. So, some people have a really broad multi-disciplinary, everything's-there kind of collection. But when speaking of medical publishers, for example, they might have a database that's primarily about nursing, or primarily about clinical care, or primarily about cancer, or any number of other things. And if you apply just a very broad-based taxonomy, like the medical subject headings, you aren't going to get to the really specific level of the nature of the content itself. So, we, we build the taxonomy to the content.
And we, we like to get the whole content. Then, we get everything that people have used to describe their own content in the past, because, in some cases, medical insurance and their point of view is a whole lot different than that of clinical care or diagnosis. And so, we're looking at the articles themselves. We mine them for, for the terms, using a number of different technologies. And then once we have those, we gather them into arrays. So, kind of think of little piles of concepts that kind of go together and try to see which ones nest with each other, which ones are exactly the same concepts stated differently, and which ones are, which ones are related to each other, which ones nest into a hierarchical level. This is a narrower sense of this broader concept. And when we have that hierarchy, the associative relationships, which are related terms and the synonymy or equivalence terms, then we have a good basis for a word base. It's a database of words on a particular topic and then we, we run those against the content again to make sure that every article has a taxonomy term and, or preferably more than one, and that some don't get too many. And we look at the level that some terms need to be split and so on. So it's a, it's a, it's a fun analysis of the words that are used to describe a specialty or an area.
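The word base described above (a hierarchy, related terms, and synonyms, then a check that every article gets at least one term and none gets too many) can be sketched as a small data structure plus a coverage report. Everything here is hypothetical: the terms, the articles, the `max_terms` threshold, and the naive substring matching stand in for the real term mining and rule base.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    broader: list = field(default_factory=list)   # hierarchical parents
    related: list = field(default_factory=list)   # associative relationships
    synonyms: list = field(default_factory=list)  # equivalence terms


def terms_for(text, word_base):
    """Return labels of terms whose label or a synonym occurs in the text."""
    lowered = text.lower()
    hits = []
    for term in word_base:
        names = [term.label] + term.synonyms
        if any(name.lower() in lowered for name in names):
            hits.append(term.label)
    return hits


def coverage_report(articles, word_base, max_terms=5):
    """List articles with no terms and articles with more than max_terms."""
    untagged, overtagged = [], []
    for art_id, text in articles.items():
        hits = terms_for(text, word_base)
        if not hits:
            untagged.append(art_id)
        elif len(hits) > max_terms:
            overtagged.append(art_id)
    return untagged, overtagged


# Invented word base and articles.
WORD_BASE = [
    Term("Neoplasms", synonyms=["cancer", "tumor"]),
    Term("Lung neoplasms", broader=["Neoplasms"], related=["Smoking"],
         synonyms=["lung cancer"]),
    Term("Nursing care", synonyms=["nursing"]),
]
ARTICLES = {
    "n1": "Lung cancer screening in smokers",
    "n2": "Hospital parking fees",
}

print(coverage_report(ARTICLES, WORD_BASE, max_terms=1))
```

An untagged article suggests the word base is missing a concept; an overtagged one suggests a term that may need to be split or narrowed.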
[Mark Gross] So it requires some level of subject matter expertise, I would guess, to be able to really make those kind of decisions?
[Marjorie Hlava] It requires two things.
It requires some lexicography, some understanding of the words themselves, and it takes understanding of how those words fit together in a subject area. So, we have a, we have a staff of people who are taxonomists or lexicographers, and we depend on the client for the subject matter expertise, for the most part, because we want to be sure that we are doing it the way they want it done. And that subject matter expert review is a really, really crucial part of making sure that it works correctly.
[Jay Ven Eman] Yeah, I mentioned, I read that quote from CDC, and I mentioned the International Committee on Taxonomy of Viruses. That's a pretty narrow committee, but there are hundreds of such committees in all kinds of areas, and we certainly use those sources, of course subject to intellectual property rights.
Most of them are public domain, when they're that kind of a committee, to validate and curate terms in something like a specialty taxonomy on COVID.
You also get into very generalized terms, like, you know, the "quarantine," or do you use the word "quarantine" or "shelter in place," or "locked down" or "lockup"... Go ahead, Marge, excuse me.
[Marjorie Hlava] I was just gonna say that in using the subject matter experts, we have a great many learned societies and scholarly publishers as clients, and so, they have among their memberships a lot of people who have some rather strong opinions about how a word should be used. And, we need to honor that.
[Mark Gross] I've met some of those people over time. Let's talk about, a little bit about what happens, You know, you've built a taxonomy in one area and now there's a closely related area, and they want to merge the collection, merge the taxonomies.
What happens with that? Is that doable? What does that involve?
[Marjorie Hlava] It is doable.
What we do is, with the new collection, the second one, we run the taxonomy using that automated rule base against their content to see how well it matches. And then we know if we need to add a bunch more terms, or if we can just use it flat out, or not.
And I'd say in about 25% of the times, if it's a closely allied field, we could use it.
Although we have, we have two associations that are clients of ours that deal with cancer research. And they take a really different perspective, so we can't use the same taxonomy for both of them, because their orientation is different. And, but, but in some others, we have a nursing application where one of them is for the workbench in the, in the hospital, the nurses' desk, and the other is for the use cases for a bunch of journal articles. And in that case, the, the content is very, very similar and so we were able to use the same one. So it depends. So the best, best thing to do, is to do a test initially and see how well the content matches the target taxonomy.
Yeah, you have, of course, COVID seems to be a good example. The research being done is very multi-disciplinary. So you have folks up at Los Alamos National Labs here in New Mexico, which is where we're located, using particle physics theories, et cetera.
They do a lot of work in medicine up there, and then you have chemistry, biology, all working together on these solutions, and the terms that they use have to be disambiguated and related. This concept is related to that concept: that guides you to the better material.
And getting out of science into e-commerce, that's a great way to improve revenue, is to use related-term concepts, and it's being used now.
If you go and look for "bicycle," on the sidebar, usually on the right, it'll come up with all kinds of other things you might need for your bike. You might want to think about lights, helmets, accident insurance, you know, that kind of thing.
So those, you know, cross-platform, cross-discipline, the semantics, and the structure really helps guide the user to information that they are looking for and then with related-term concepts, things that might be of interest that they hadn't thought about. And that's very true in a multi-disciplinary environment, like in science.
[Mark Gross] You're using the term "disambiguation"; I'm not sure everybody knows what that means. Maybe talk a little bit about that. [Marjorie Hlava] Well, English is such a fun language, and we are rich in puns, partially because words can have different meanings in different use cases.
So if you take the word L-E-A-D, that might be to lead a team. It might also be lead, the metal. And obviously, those are very different. It also might be the piece of water that leads into a bay.
And it might be a, uh, piece of rope that you use for leading a dog or a cow. And so it has lots and lots of different meanings. And as we get more into science, something like mercury can be a god, or it can be a car, it can be an element, it can be a planet. And we need to disambiguate how we use that.
Cell is another one. It could be my cell phone, it could be a prison cell, it could be a terrorist cell, it could be a biological cell, all kinds of different meanings for exactly the same spelling.
And so what we do is, we need to say in this context, this word means that, and we usually do that by adding adjectives, biological cell, prison cell, and so on. But it's not always easy to do, in which case, we have to add a bit more oomph to them, or use a different word.
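Saying "in this context, this word means that" can be sketched as a small rule table: each sense of an ambiguous word gets a qualified label and a set of context clues, and the sense with the most clues present in the text wins. The rules and clue words below are invented for illustration; a production rule base would be far richer.

```python
import re

# Hypothetical rules: ambiguous word -> list of (qualified label, clue words).
RULES = {
    "mercury": [
        ("Mercury (planet)", {"orbit", "planet", "solar"}),
        ("Mercury (element)", {"toxicity", "thermometer", "metal"}),
    ],
    "cell": [
        ("Cell (biology)", {"membrane", "nucleus", "tissue"}),
        ("Cell (prison)", {"inmate", "prison", "guard"}),
    ],
}


def disambiguate(word, text):
    """Pick the qualified label whose clue words overlap the text the most.

    Returns None when the word has no rules or no clue word is present.
    """
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    best_label, best_hits = None, 0
    for label, clues in RULES.get(word.lower(), []):
        hits = len(tokens & clues)
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label


print(disambiguate("mercury", "Mercury has the shortest orbit of any planet"))
print(disambiguate("cell", "The cell membrane regulates ion transport"))
print(disambiguate("cell", "The weather was mild"))
```

Returning None rather than guessing leaves the unclear cases for a human editor to resolve.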
[Mark Gross] Right, the, so, let's, let's, so, building taxonomies and talking about them is all a lot of work. What do you say to all the hype about artificial intelligence and how, that's gonna, why not just use AI and machine learning, ML, to just do it all for you? What's – I'm sure you have strong opinions about that.
[Jay Ven Eman] Yes, we do, the, uh, a little history: Margie talked about the origins of our company, and back in 1978, storage costs were actually higher than the cost of a little box machine.
We didn't have small machines in '78 and so you had, you did not have full text. You had a bibliographic citation with an abstract and subject terms and then if you found something of interest, you gave it to your corporate librarian or your neighborhood librarian to get the full text.
Now we have full text, and along in the eighties came full-text search engines, which are a combination of different kinds of AI tools: vector analysis, co-occurrence. What they found was that they certainly retrieved content, but the relevance was very, very low. In other words, you got a lot of stuff that just was not useful.
And it took up a lot of the knowledge worker's time. A recent study showed that about 35 to 36% of a knowledge worker's time is spent looking for content.
And very little time using it. And about, another statistic was about 56% of the time, they don't find what they're looking for. But with AI, the problem there is, it's highly statistical and not semantic.
So, it doesn't necessarily handle meaning. It will come, it will happen. They're doing great things with AI. But, in terms of discovery, um, they found that, you know, full-text search engines didn't work that well, so taxonomies came back into the vogue and they're still there.
It's faster, cheaper, and it's easier to maintain. That outfit in California, I have to look up their name: they're OpenAI LP. They put out GPT-3; they analyzed 300 billion words to come up with 175 billion of what they call parameters, and all it does is anticipate language. It doesn't really do reasoning.
But they did say that you could add weights to those parameters to improve the accuracy of the system.
You gotta think about how you're going to deal with 175 billion parameters, and how do you get around to weighting those?
[Marjorie Hlava] And they did them – [Jay Ven Eman] A taxonomy – oh, go ahead, Margie.
[Marjorie Hlava] They did them all by hand. Well, the, some of them, the whole, whole premise is, is coming, it will work. But that's, it's down the road a ways.
And there have been some interesting recent spectacular flops, like IBM Watson did really well on Jeopardy, and if you look at the, the pictures of it, there's a little computer screen on the, on the podium.
But then in the room next door, there's a huge number of machines running parallel processing. And it's a huge dictionary lookup system.
So what they did, when they tried to do the MD Anderson implementation, Jay, do you want to go into the details on that one?
[Jay Ven Eman] Well, sure. MD Anderson, for those who aren't familiar, is a major cancer center in the United States, in Houston, part of the University of Texas, and they had a four-year project using Watson, first to look at leukemia; then they switched halfway through to looking at lung cancer.
It takes a lot of data and a lot of training, and if you jump in the middle, you've got to entirely retrain the system. It's one thing to train a system to recognize a picture of a cat. It's another thing to ferret out a lung tumor in an X-ray.
And the problem there is the cost: you can run pictures of cats and have humans vet the results at two and a half to five cents an image, but an X-ray is going to cost you 2,000 bucks an image to vet. It was working sort of well, but the problem they ran into, after $62 million invested over four years, is that it wasn't working, and it was mostly a matter of project management.
You had grant PhDs who knew how to do artificial intelligence and knew nothing about project management in the real world. So it was a failure of a very long-established, very well-known process, that is, project management.
It had nothing to do, necessarily, with the AI.
[Mark Gross] Which brings me to the point you made: it will happen, but it'll take a while. And personally, I mean, the places we're successful using AI and machine learning are where you can reduce the size of the problem so it's not the whole world. Just feeding in millions of articles about cancer and hoping it comes back with an answer is trying to solve a really big problem all at once, whereas if you can sectionalize it, it's something that may be more reasonable. Which I think takes us back to, I mean, DCL's whole world revolves around structure, right? Marianne said at the beginning: our business is structuring the world's data. So the thought is, if you can structure it and put in some information about what's in that article or how it's formulated, suddenly you can reduce the size of the problem you're looking at.
[Jay Ven Eman] Absolutely. [Marjorie Hlava] Right.
The – [Jay Ven Eman] Go ahead. [Marjorie Hlava] and you find that if the data has, is clean and has structure, and then you add a taxonomy to that, then many of the algorithms for artificial intelligence work a whole lot better.
They're much easier to work with and they implement a lot more quickly. I've made a diagram, actually: in our own software, we use about 22 different algorithms, any one of which might be called artificial intelligence. But there are some that we don't depend on in the automatic indexing system, although we do use them in the harvesting of terms. Those are the ones that are more statistically based: Bayesian, neural nets, vector analysis, or n-grams, where we go gather things and bring them back. But then we run them through a human process and put them into a taxonomy, a thesaurus, or an ontology, to be used in structuring the data.
And in that case, the artificial intelligence operations work very well. So it's just, when it's, when it's purely "dump it in the top of the machine and everything'll come out perfect," I don't think that works.
[Mark Gross] Right, it doesn't work.
[Jay Ven Eman] Yeah. I wondered if this was a good time to show that little diagram as to what, what does structure look like when you're talking about a document-oriented content and the semantics? There's that little diagram.
[Mark Gross] Yeah, that's a good idea. Marianne, can you put that up? There we go.
[Jay Ven Eman] OK, so, slides, do you want to do definitions, Marge?
[Marjorie Hlava] Well, it might, it might be helpful.
[Jay Ven Eman] OK, well, you can back up, she can back up, I guess, right? To the last slide. Sorry, we're running you around, Marianne, thank you.
[Marianne Calilhanna] No problem. [Marjorie Hlava] One of the things that people talk about is, what is a taxonomy? What is a thesaurus? What's an ontology, what's semantics?
And so that's what this diagram tries to outline, is that there are a lot of standards in the, in the area for these, for these, and so we talk about all these things, and we throw them around kind of interchangeably, but they are not, they are not exactly the same. And each of them has a, has a definition.
So if you start with a taxonomy, that's a collection of terms that are organized in a hierarchical or nested structure. And they can have more than one parent, but they need to be clearly, clearly stated. When you move to a thesaurus, you add a few additional fields, like relationships and synonyms. And relationships and synonyms are technically not held in a taxonomy.
And when you move to an ontology, and we can send these definitions to anybody who wants them, when you extend the controlled vocabulary into an ontology, you've added a bunch of, really, instances or specifics that show, via definitions using classes or properties, the relationships between those terms and instances of the terms themselves.
And they are frequently stated in a format which is called a triple.
And the triple consists of a subject, a predicate, and an object, with the predicate in the middle connecting the other two. And that's the challenge in building an ontology: stating those connections and how you want them to be done.
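A triple can be sketched in code as a three-part tuple, subject first, predicate in the middle, object last. The predicate names below (`causes`, `is_a`, `has_synonym`) are made up for illustration; a real ontology would normally use identifiers from a published vocabulary.

```python
# Each triple is (subject, predicate, object). Predicate names are hypothetical.
TRIPLES = [
    ("SARS-CoV-2", "causes", "COVID-19"),
    ("COVID-19", "is_a", "Viral disease"),
    ("Aspirin", "has_synonym", "Acetylsalicylic acid"),
]


def objects_of(subject, predicate, triples=TRIPLES):
    """Return every object linked to the subject by the given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects_of(predicate, obj, triples=TRIPLES):
    """Return every subject linked to the object by the given predicate."""
    return [s for s, p, o in triples if p == predicate and o == obj]


print(objects_of("SARS-CoV-2", "causes"))
print(subjects_of("has_synonym", "Acetylsalicylic acid"))
```

Because every statement has the same shape, the same two lookups work for any relationship the ontology states.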
And then semantics is an area of study; it combines linguistics and logic so that we can determine the meaning of a particular information object. And so, that's it; I just wanted to do the quick definitions and we could flip over to Jay's diagram.
[Jay Ven Eman] Yeah, thank you.
[Marianne Calilhanna] Jay, I'm popping in because we are getting some questions, and I think one will be answered as you walk through this next slide. Someone asked: they'd love to hear a little bit more about the synergies between DCL and Access Innovations. [Jay Ven Eman] OK. [Marianne Calilhanna] You know, and how our partnership really speaks to, to, to this. So I'll flip over to that slide and let you walk everyone through that.
[Jay Ven Eman] I think that'll, that'll help, because it does kind of bring together both structure and semantics. That first little box there is just something I kludged up for today, an article, a title, date, my name, and the start of a sentence. And it's, typically, that's going to be a Word document.
Or perhaps even worse, a PDF. And that's considered, in the world of document-oriented databases, that's considered unstructured. That's one thing to point out.
We're kind of focused here today on, on document collections, and I talked to the question about artificial intelligence, machine learning, deep learning. And we mentioned kind of a failure.
But AI has many, many, many success stories in a lot of different areas, not necessarily here, in terms of working with unstructured document content, but certainly in a lot of, I mean, robotics, et cetera, and self driving cars.
But at any rate, how, how do we get to a document that can be used reliably by different software agents?
This is where DCL comes in to add structure, and between Data Harmony, Access Innovations' Data Harmony software, and the DCL structuring capabilities, you can add what we call XML. Extensible Markup Language is the markup of choice these days.
There are a lot of different standards: EPUB for books, JATS, the Journal Article Tag Suite, for journals, and so on. There are a lot of different markup vocabularies, but the markup itself adds structure in the case of a document. So those tags are metadata.
So you see in the middle there, there's a tag for author, the start of an author field, and then author first name, author middle name, author last name. I don't have a middle name, so there's no data there.
But it does show that my last name is two words. Artificial intelligence would most likely have problems with that if it is analyzing a million documents with names in them.
The V, V as in Victor, E-N would end up in as a middle name, not the last name. And I can tell you that happens, because when I go to check in to a hotel, I have a lot of problems. They aren't aware of what my name is. So that tells you a lot. It also tells you in terms of entities that, that I am, that name is an author name.
How do you tell the difference between whether Mark Twain wrote an article or the article is about Mark Twain? Well, the meta– metadata, in terms of the structure that DCL can add, would say, AU, last name, Twain. So he's the author. But if it says subject term Twain, Mark Twain, then you know that it's ABOUT Mark Twain.
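The author-versus-subject distinction described above can be sketched in a few lines of Python. This is only an illustration: the element names (`author`, `last-name`, `subject-term`) and the sample record are hypothetical, and are not taken from JATS or from any schema DCL or Access Innovations actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical record. Element names are illustrative only.
RECORD = """
<article>
  <title>An example article</title>
  <author>
    <first-name>Jay</first-name>
    <middle-name/>
    <last-name>Ven Eman</last-name>
  </author>
  <subject-term>Twain, Mark</subject-term>
</article>
"""


def roles_of(last_name: str, xml_text: str) -> set[str]:
    """Return the roles ("author", "subject") in which a last name appears."""
    root = ET.fromstring(xml_text)
    roles = set()
    for author in root.iter("author"):
        # The tag itself says which words form the last name, so a
        # two-word surname needs no guessing.
        if author.findtext("last-name", default="").strip() == last_name:
            roles.add("author")
    for term in root.iter("subject-term"):
        # Assumption: subject terms for people are stored as "Last, First".
        if (term.text or "").split(",")[0].strip() == last_name:
            roles.add("subject")
    return roles


print(roles_of("Twain", RECORD))
print(roles_of("Ven Eman", RECORD))
```

Because the role is carried by the tag rather than inferred from the text, a search for articles by Twain and a search for articles about Twain can be answered separately.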
So all of that structure really helps in retrieval. In the yellow box are the subject terms that our Data Harmony software would add. You'll note that some of the terms that are assigned don't occur in the article, and that helps retrieval and tells you something about the article, which helps you because it's been assigned, you give it a higher weight.
So if you do a search, and you get back, 50 thou– in Google, or on your discovery platform, and get back 500 items, the most important rise to the top, because they have these category, they're called "category" there in my example.
But they can be called subject terms, or taxonomy terms. The other thing to consider is those taxonomy terms tell you something about me. So they could, if you're creating an author database or a membership database, those terms could be assigned to my record.
You have not violated any GDPR. Any personal – because that's an article I want published. So they're going to know what I'm writing about. And I prefer that those tags be assigned because it's more accurate about my interests and what I wrote about, so it, it tells you something about the author as well as something about the article.
So there's the relationship: the structure is added by DCL. They extract entities. We add semantic enrichment. It tells you the "aboutness" of the article, the "aboutness" of the author, the "aboutness" of the institution that the author belongs to.
You can add that to an institution record and that improve, improves retrieval, discovery, navigation, analytics, visual display, creating fast topics such as COVID, so it gets deposited into a discovery platform.
I think that kind of covers it. Did that, do you think that answered their question?
[Mark Gross] Yeah, I think it did answer the question very well. I like the way you put that. And really, that kind of tagging or structuring is, is what DCL does. Using a whole bunch of tools to get there.
But I just want to point out, again, that you could see that if that information is in there, that gives a pretty good cue upfront to all the artificial intelligence engines so they get that kind of information right, and they can focus on, on the pieces of the question, that, that, that AI can solve very well.
And also, I think, another point, I think, on AI, we were talking about, we've spoken about this, before about it, it's a statistical process. Whenever you talk to data scientists about an AI solution, and you say I need this 100%, they'll tell you can't get to 100%, because it is a statistical process. It can get to 85% or 90% or 95%. But what this means is that the things that are really important, you can get to 100%, like getting, like getting Jay's name right when he checks into a hotel. And the other pieces, which are more, are, are prose-related or descriptions, those could probably not need to be at 100%; that combination is, is, it gets you to a pretty good solution, I think. So I think that's, I think that puts, I think that slide actually puts it together very nicely.
Let me just ask: Marianne, we're sort of getting towards the end of the hour. Do you want to go to some questions, or?
[Marianne Calilhanna] Yes, we do have some questions I'd like to share with you, and I'll take a moment just to invite our attendees: now's the time to ask our industry experts any questions you might have around semantics, metadata, structure.
And we have had a couple of questions about the slides, about Margie's definitions. The recording will be on the DCL website, and I'm happy to send the slide with definitions to anyone who is interested.
Um, so in the beginning of your conversation today, you know, you were talking about this combination, semantics and structure, and how that improves search results. So, someone asked, you know, I get that you are able to improve the number of items found in a search; how do you control false positives?
[Marjorie Hlava] Yeah, so I, earlier I had said I want to put my finger in precision and recall. And the other thing is relevance. So there are a number of ways to measure how well search does, and those three main parameters: precision, which is giving the, returning to the query, giving the searcher exactly what they wanted. Period. Every document in there exactly matches what they were looking for.
That's precision. Recall is making sure that you get everything in the database that has to do with their query. And then relevance is based on a lot of profiling and personalization and so on so that I can, I, the search engine, can give to you what I think you want. So, it's a, it's a confidence valuation, whereas precision and recall are, are fairly straightforward to measure. Relevance is not. Relevance is what you get from a search engine like Google.
Precision and recall are what you get from Endeca or FAST, MongoDB or Lucene, something like that. So, there, there are different purposes for those systems. If you want a quick and dirty discovery, I'll just jump in and get those top 10 hits. That's one thing. But if you're working on a research project or a dissertation, you want to be sure that you got everything having to do with the topic. And when you're looking for clear answers, you want precision, so precision versus recall is kind of a healthy tension. You don't want to give people what we call noise, but you want to be able to give them a fairly comprehensive answer to what they need.
[Jay Ven Eman] Right. Noise, you would say would be equivalent to a false positive, is that correct, Margie? [Marjorie Hlava] That's right. [Jay Ven Eman] It's retrieved as supposedly relevant but it's not. So it's noisy.
[Marjorie Hlava] So when we, when we do those measures, we use what we call hit, miss, and noise statistics, and a hit is exactly what a human would have applied. A miss is something that a human would have applied and the computer didn't get. And that can be serious if you're overlooking some important research. And noise is something that the computer supplied that the human is really not interested in: "Where'd that come from?" And so if you go back to the COVID, or let's use a different example. Let's use breast cancer. So somebody has breast cancer.
There are a number of names for the same thing. You can have stage four breast cancer, you can have metastatic breast cancer, for example, and if you only put in one of those words, you're not necessarily going to get all, all of the stuff from a different, different word, and the words don't look the same at all.
So if you don't supply the synonyms and get those embraced by the search engine that you're using or the search software that you're using, then you're only going to get a partial answer. So we're, we, we really want people to get complete answers to the, to the questions they ask because that's what gets good retrieval.
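The effect of supplying synonyms can be sketched as a simple query expansion. The synonym ring, the document texts, and the function names below are all made up for illustration; a real search platform would handle this inside its own analyzer or thesaurus configuration.

```python
# Hypothetical synonym ring: preferred term -> variant phrasings.
SYNONYMS = {
    "metastatic breast cancer": ["stage four breast cancer"],
}

# Hypothetical mini-collection.
DOCS = {
    "doc1": "Outcomes in stage four breast cancer",
    "doc2": "Treating metastatic breast cancer",
    "doc3": "Measuring rainfall",
}


def equivalent_phrases(query: str, rings: dict[str, list[str]]) -> set[str]:
    """Return the query plus every phrase in the same synonym ring."""
    q = query.lower()
    for preferred, variants in rings.items():
        ring = {preferred.lower(), *(v.lower() for v in variants)}
        if q in ring:
            return ring
    return {q}


def search(query: str, docs: dict[str, str], rings: dict[str, list[str]]) -> list[str]:
    """Return ids of documents containing the query or any of its synonyms."""
    phrases = equivalent_phrases(query, rings)
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if any(p in text.lower() for p in phrases)
    )


print(search("metastatic breast cancer", DOCS, {}))        # no synonyms supplied
print(search("metastatic breast cancer", DOCS, SYNONYMS))  # synonyms supplied
```

Without the ring, only the document using the exact phrase comes back; with it, both phrasings are retrieved, which is the "complete answer" being described.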
[Mark Gross] Now just flip around; the definition of recall is, recall means that you don't want to miss anything. That's really, you don't want to miss anything and the tension is, well, if I don't want to miss anything, I'm going to have a bunch of things I have to plow through. I don't want to plow through too many.
So that's that healthy tension over there, right? I think that's...
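The hit, miss, and noise counts described above map onto the standard definitions of precision and recall. The sketch below assumes the comparison is between the set of terms a human indexer applied and the set the software applied; the sample terms are invented, and this is not presented as the way Access Innovations computes its own statistics.

```python
def hit_miss_noise(human_terms, machine_terms):
    """Compare machine-assigned terms against a human indexer's terms."""
    human, machine = set(human_terms), set(machine_terms)
    hits = human & machine    # both human and machine applied the term
    misses = human - machine  # human applied it, machine did not
    noise = machine - human   # machine applied it, human would not have
    # Precision: share of what the machine returned that was wanted.
    precision = len(hits) / len(machine) if machine else 0.0
    # Recall: share of what was wanted that the machine returned.
    recall = len(hits) / len(human) if human else 0.0
    return {
        "hits": len(hits),
        "misses": len(misses),
        "noise": len(noise),
        "precision": precision,
        "recall": recall,
    }


# Invented example terms.
human = ["breast cancer", "metastasis", "chemotherapy", "clinical trials"]
machine = ["breast cancer", "metastasis", "chemotherapy", "weather"]
print(hit_miss_noise(human, machine))
```

Driving noise down raises precision and driving misses down raises recall, and improving one often costs something on the other.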
[Marianne Calilhanna] Thank you. You did touch on this question at the, at the beginning of the conversation, but I think it really bears just a little further conversation. It's relevant to everyone. So, how do you accommodate new content in your taxonomy without compromising assets that have already been described?
[Marjorie Hlava] Well, you can do that a couple of different ways. It used to be, in the old days, to rerun your back file was a horrendous, painful months-long process. Now it takes 2 to 10 hours. So it's not, it's not the big problem that it used to be to update your taxonomy and rerun the data.
Another thing that people do is they add additional synonymy to the terms. And so at the front end, it's, it's taken care of by the synonymy as opposed to by the tagging. So when we talk about tagging, we're talking about adding the terms to the, to the record, to the XML.
So the preferred subject term gets applied and is added to the XML. And usually it's, usually seven of them, as opposed to the standard library indexing of three OPAC items and, and in that way, we're able to get it to travel forward. So if you want to update the XML with the taxonomy, you have to rerun the backfile, and usually people don't do that until they have at least 10% change in the vernacular. So, if there's been a 10% change, then the debate needs to be had whether you're going to do the re-indexing or not.
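The 10% rule of thumb can be sketched as a simple threshold check. How "change" is measured is an assumption here: the sketch counts terms added to or removed from the taxonomy, relative to the size of the old taxonomy. The function names and the sample terms are hypothetical.

```python
def changed_fraction(old_terms, new_terms) -> float:
    """Fraction of the old taxonomy affected by added or removed terms."""
    old, new = set(old_terms), set(new_terms)
    if not old:
        return 1.0 if new else 0.0
    changed = old ^ new  # symmetric difference: terms added or removed
    return len(changed) / len(old)


def should_debate_reindex(old_terms, new_terms, threshold: float = 0.10) -> bool:
    """True once the change reaches the threshold (10% by default)."""
    return changed_fraction(old_terms, new_terms) >= threshold


# Invented taxonomy of 20 placeholder terms.
old = [f"term{i}" for i in range(20)]
small_change = old + ["new term"]                            # 1 change in 20
large_change = old[2:] + ["new term"]                        # 3 changes in 20

print(changed_fraction(old, small_change), should_debate_reindex(old, small_change))
print(changed_fraction(old, large_change), should_debate_reindex(old, large_change))
```

Reaching the threshold only opens the debate about rerunning the backfile; the decision itself stays with the people maintaining the system.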
[Jay Ven Eman] Right, and, you have, and this process that she described may sound complex, but it isn't. It's fairly straightforward. It's, it's, it's built into the routine of maintaining your system. And then you have new content coming in, so you have new terms, and you want to get those tagged and structured and added to your repository. Where AI has a little problem still at this point, is, you know, do you really, you now have 400 billion terms.
Do you want to rerun all of that and add weights to the now probably 225 billion parameters? Right now, that's not very feasible, it's not feasible at all. So this is much simpler, actually, and fairly straightforward.
[Mark Gross] Right. And, and, it probably has more reliability built into the process. But I think that's a useful rule of thumb, that seven terms, as what you, was sort of like the right kind of place to be on these. And what you said about 10% change is really also a rule of thumb, which, which really goes way back. I remember visiting Merriam Webster probably 30 or 40 years ago where they were building dictionaries, and they go and constantly are updating their definitions. And they had, they had people sitting there reading newspapers and magazines to see if definitions changed. And the concept was when 10% of the terms needed modification, that's when they went to a new print run. So things haven't changed that much.
[Marjorie Hlava] It's still a good rule of thumb. [Mark Gross] It's still a good rule of thumb, but it's much easier to run it through a computer than to try to print a few hundred thousand copies, so... [Marjorie Hlava] And distribute them. [Mark Gross] And distribute them. And distribute them, right.
[Marianne Calilhanna] So, we have a question here about other industries. Could, this is both for DCL and Access Innovations. Could you, could you speak to other industries where you're seeing growth in semantics and structure besides publishing or medicine?
[Marjorie Hlava] I can do that.
[Mark Gross] Every place. But you probably can be more specific than that. [Marjorie Hlava] E-commerce is one of them. There are a number of worldwide systems that people use and most of you are probably familiar with looking at the barcode on a doc– an item that's scanned at checkout, and behind that are a number of coded systems that are used. And these coded systems, like the UNSPSC and the ECLASS in Europe are broadly there to map between big stores.
Either online stores like Amazon or eBay, or Mercado Libre, or any number of places, depending on the nation that you're in. And they need to have those coded in there so that people can search their platforms, look at things, and then get them delivered, and they need to talk between each other.
If you're looking at the distributor database, and then you're looking at the retail store database, they are using different code sets. And so, we've had a fair amount of success working with those different code sets and mapping them to a central taxonomy, central word base so that they can be more evenly applied.
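Mapping different code sets to a central taxonomy can be sketched as a crosswalk table. Every code and label below is invented for illustration; none of them are real UNSPSC or ECLASS entries, and the function names are hypothetical.

```python
# Hypothetical crosswalk: (source system, local code) -> central taxonomy term.
CROSSWALK = {
    ("distributor", "D-1001"): "Rain jackets",
    ("retailer", "R-77"): "Rain jackets",
    ("retailer", "R-78"): "Umbrellas",
}


def to_central(system: str, code: str):
    """Return the central taxonomy term for a local code, or None if unmapped."""
    return CROSSWALK.get((system, code))


def same_concept(a: tuple[str, str], b: tuple[str, str]) -> bool:
    """True when two local codes map to the same central term."""
    term_a, term_b = to_central(*a), to_central(*b)
    return term_a is not None and term_a == term_b


print(to_central("distributor", "D-1001"))
print(same_concept(("distributor", "D-1001"), ("retailer", "R-77")))
print(same_concept(("distributor", "D-1001"), ("retailer", "R-78")))
```

With each system mapped once to the central term, the distributor and retailer databases can be compared without maintaining a separate mapping for every pair of code sets.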
And it works in all kinds of purchasing. People that might use SAP Ariba, for example, is, or Spotify. Those are different systems that use those kinds of things. So, e-commerce is one of them. I can tell you that, for, for websites, all kinds of organizations are very quickly embracing taxonomies. And we get, we get very interesting use cases submitted for those.
[Jay Ven Eman] Well, the Weath– Weather Channel, Margie.
[Marjorie Hlava] Oh, so the Weather Channel is an interesting use case because they are very time-sensitive. So if somebody comes in and says, I need a clip of a woman running in rain and I'm on the air in 15 minutes. So, you can't just say, Oh, well, I'll get back to you on that. You know, it needs to be pretty automated and pretty detailed, and it turned out that they have at least 15 different words that mean "rain."
So, those are synonyms, and there's really quite a few that you could use for woman, girl, female, et cetera, and so, to get those quickly to them, we used an extensive synonymy, and that cut their search time... [Jay Ven Eman] In half. [Marjorie Hlava] More than 50%, which was a good use case. So, there are, there are many industries that are embracing taxonomies besides publishing. [Jay Ven Eman] Yeah, Nike, DuPont...
[Mark Gross] It's not just, we think of e-commerce as things that you buy, you know, personally, but, but, but it really applies also to the industrial world, to airplane parts, and military parts, and standards for various kinds of things. Those are all using different terminology, but they're coming to– coming together, they're coming together. Of course, countries, people, you know, for, as an airplane gets built in 30 or 40 different countries, and each one is using language in a different way. So, those taxonomies and tagging those things are, so that's relevant. In legal and legislative information, or an organization like – [Jay Ven Eman] Yeah, government. [Mark Gross] Legislation, in 100 different countries, and how they relate, they use those, information is used in different ways. How do you make sure your policies are consistent across countries that have different customs, different languages? So all those require a way of matching information.
Everywhere. [Marjorie Hlava] There's a good use case in the European Union: they have an office of official publications in Luxembourg, and it publishes in 27 different languages and four different character sets, a huge vocabulary called the European vocabulary, or the EuroVoc. And it supports all the rules and regulations of the European Union, and you cannot just do an automatic translation of those. Because the cognates are different. I mean it, it means the, the words mean different things and some of them are pretty hilarious if you just do a straight translation. It's sometimes vulgar in different languages.
We, we live in the American southwest and there's a very popular restaurant name in the US that actually in Mexico means a woman's breasts. And so every time they see it in Mexico, they think this is hilarious. What are those Americans thinking? And you just need to be very, very careful about that kind of stuff. So, when you do those international translations, and they are important to do, and I don't care what the use case is, if it's commerce or regulation, or whatever, you need to be sure that you're going for the concept, and not for the term.
[Mark Gross] That could be a whole other webinar with good stories about things that got mistranslated. [Marianne Calilhanna] That's a great idea.
And in terms of structure, I always think I, I, at a former job, we used to joke that XML is like air. You know that it's everywhere around us, we don't see it, it just is. So I think about that in terms of the broader notion of structure, all industries. We have come to the top of the hour. I'd like to thank everyone for attending this webinar.
The DCL Learning Series comprises webinars, a monthly newsletter, and our blog. You can access many other webinars related to content structure, XML standards, and more from the on-demand webinars section of our website at dataconversionlaboratory.com.
I hope you have a great day. This concludes today's broadcast. [Mark Gross] Thanks, Margie and Jay, it was great having you on.
[Marjorie Hlava, Jay Ven Eman] Thank you.