DCL Learning Series

The Demise of Documents - Structuring Content for AI

Marianne Calilhanna
Hello, everyone. Welcome to the DCL learning series. Today's webinar is titled “The Demise of Documents? Structuring Content for Artificial Intelligence.” My name is Marianne Calilhanna, and I'm the Vice President of Marketing here at Data Conversion Laboratory. I'm just going to give a quick introduction before we jump into our Lunch and Learn conversation. A couple quick things. This webinar is being recorded and you can have access to that from the on-demand webinar section at dataconversionlaboratory.com. We'll invite you to share that with any of your friends or colleagues. Second, we'd like this to be as organic and natural and conversational as possible, so please, if you have a question or a comment at any time, you can submit it via the appropriate box on the GoToWebinar control panel.

I think we all recognize that a major shift in content consumption has transpired over the years across many different industries. For those in the scholarly publishing market, that transition from journal-based content, journal-based collections, to article-based content really forced digital transformation for so many organizations. We're now experiencing another shift. Content is no longer designed and optimized for human consumption first. For content to be found, it really must be served up semantically tagged and structured in an open digital format.

And who better to speak about structure and semantics than our panelists here? I am happy to introduce Mark Gross, President at Data Conversion Laboratory, Marjorie Hlava, President at Access Innovations, and Jay Ven Eman, CEO at Access Innovations. Mark, why don't you tell us just a little bit about our history? Then we can turn it over to Margie and Jay. I think Mark is coming on shortly.

Jay Ven Eman
He lost his video, looks like.

Marianne Calilhanna
Mark has gone offline, so I'm going to give a quick introduction to DCL while we're waiting for Mark to pop back on. We are having some storms on the east coast, so I hope we didn't lose Mark. We'll find out momentarily. DCL, we like to say our mission is to structure the world's content. We offer a number of services you can see here. We really are one of the industry experts in terms of XML conversion, DITA conversion, SPL conversion, S1000D conversion, and anything that requires sort of complex transformation from one format to another. If you have any complex content or data challenges in which structure and metadata are involved or required, we can help. Margie, Jay, could you tell us a little bit about Access Innovations?

Marjorie Hlava
Sure.

Jay Ven Eman
Go ahead.

4:07
Marjorie Hlava
Access Innovations is well known for structuring data but particularly subject metadata and other kinds of things that make data findable. The reason we have a partnership with Data Conversion Lab is that we are able to add automatically a lot of human-aided intelligence to the content so that it's structured and ready and findable, or as we like to say, we want to change search to found. We have a tool set to help us do that, but it's really the human intelligence behind that that powers the automatic – the artificial intelligence that we have for all these services.

Marianne Calilhanna
Thanks. So, Mark has lost his internet connection. I don't know if it's related to, like I said, some of these east coast storms. But, you know, let's just start. This is a Lunch and Learn. This is a conversational event. I do see that Mark is coming back, but I'm going to sign off and let you guys just start the conversation.

Jay Ven Eman
Thank you, Marianne. I'd like to start, if I might jump in, because you have to –

Mark Gross
Hello there.

Jay Ven Eman
...start with the beginning of this conversation, and that's to say, just ask a question. I'm not the moderator here, we're all kind of equal, but I'll ask the question of Marjorie and Mark. What is a document? I mean, is it a monograph, a book? Is War and Peace a document? Is an email a document? How about a memo, a first-class letter, are these documents? What do we mean by a document? Are we really talking about digital objects? So I throw that question out to my fellow panelists to get us going.

Mark Gross
Am I on? Can you hear me?

Marjorie Hlava
Yes, you are.

Mark Gross
Okay. Nothing like your internet going down 30 seconds before this starts. Not only my internet, but it's everything. Hello, everybody. Sorry I wasn't there for the intros. Glad to have everybody there. I just really wanted to maybe take off from, Marianne loves to put Ray Kurzweil quotes here because I love Ray Kurzweil and just the thinking behind it. I think besides what a document is, there's also the question of what do we mean by understanding a document?

I think that relates to the question of what a document is. Because much of the discussion today really about computers understanding documents is really, practically speaking, is to be able to find that document, whatever that is, that you need, rather than really trying to figure out what the real meaning of a document is. I mean, I don't think a lawyer depends on a computer to figure out what a particular phrase means, or I don't think we're expecting a computer to figure out, back to War and Peace, what is irony, or what is satire, or what is sarcasm.

7:57
I mean, there's been attempts at that, but I think it's really just being able to find what you're looking for, and we've expanded that to finding it not just based on what the word is but based on contextual things and syntactic things that are going on. Documents, I think once you look at it that way, documents are like any collection, well, I would say any collection of words. A snippet is a way, if you do a search you'll find one paragraph or one sentence, is that the document? That's a piece of a document. You also want to be able to find YouTubes, are those documents? The source is – The word “document” may be obsolete. Though we're going to use it forever, but it may be obsolete.

Marjorie Hlava
I agree with you, Mark. I would think that a document is any identifiable piece of text, as opposed to an item, or a unit, or a record, which could be an image, or a videotape, or whatever. But if it has a textual layer, to me it's a document and we can tag it. It might be that you break War and Peace or anything like that into individual paragraphs. War and Peace doesn't have images, but you could take the captions, and the figures, and the individual hunks of anything and divide it into subsidiary documents, and that's frequently done, I think, with scholarly articles or textbooks, for example.

You don't want to take the whole textbook because that's a rather large piece to ingest. Although you can tag a textbook at a high, broad level, I think it's probably more important to tag it to the chapter, the section, the image, the formula, and so on. In that way you get much better retrievability, much better precision, much better recall in search. So the idea of taking a document, and we talked about structured versus unstructured, to me any blob of text is really already structured if it has a file name and some sort of identifying date stamp and so on. But in order for it to be retrievable, it needs to be separated out and ingestible into a computer program.
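
Editorial illustration, not part of the spoken remarks. Precision and recall, mentioned above, have standard definitions: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved. A minimal Python sketch follows; the item ids are invented for the example.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall) for one search result set."""
    hits = len(retrieved & relevant)  # items that are both retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical section-level ids, made up for illustration only.
retrieved = {"ch1-sec2", "ch3-sec1", "ch4-fig2", "ch9-sec4"}
relevant = {"ch1-sec2", "ch3-sec1", "ch7-sec3"}

print(precision_recall(retrieved, relevant))
```

Here two of the four retrieved items are relevant, and two of the three relevant items were found.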

Mark Gross
What about, so I understand a video may not be, but what about a transcript of a video?

Marjorie Hlava
Absolutely. Transcripts are, and their connection to the time stamp is very important when you're structuring video. So yes, I suppose.

Mark Gross
Isn't the transcript just a stepping stone to the video itself? It's an artifact that we need to use now because that's the way we're doing our searching, but the content is really in the video or it's in the backdrop to a PowerPoint presentation or any of those things.

Jay Ven Eman
It’s a stepping – it's two parts. One, it's a stepping stone. The transcript is a stepping stone to the video, the YouTube itself, but it also could be a form of analytics. Because if you have, I mean, YouTube has billions, certainly multiple millions of videos, and then you have private organizations with hundreds of thousands of videos. You want to find it, you want to discover it, but it also can provide analytics by analyzing all of the transcripts and adding some semantic enrichment to be able to determine what is this video about and how many do we have on that particular subject? That kind of thing, as well as discovery. So you're absolutely right.

12:22
Marjorie Hlava
There probably also needs to be some indication of version control, because digital documents are not static, they're easily editable, they're retracted. Something that's in a document that an author cited may no longer be in the article when the article goes through a secondary review, like if you've used preprints, for example, by the time it's actually published in a refereed journal it may no longer contain that information that was originally cited, for example. So I think we need to be aware of, particularly, the date stamp or the version of an article or a document when we're tagging it so we know that it's this particular instance of a document that we're dealing with, not five revisions later.

Jay Ven Eman
Yeah. I think, you mentioned earlier, Marjorie, about being able to tag down to the paragraph or, for example, image level of a document. Then in scholarly publishing they have the concept of the article of record. This is deconstruction, which you have to be able to reconstruct, using the metadata to determine: is this the article of record, or is this what the authors considered to be their complete document?

Marjorie Hlava
Yeah. Well that's true. You also have the ability now to tag in line. Even JATS lets you add contributed metadata in line in the document, so you can go to exactly, in a 30-page article for example, where that concept was considered and discussed. I think that's an important addition. Then we probably ought to think about whether that document is the same document after it's been in line tagged or not.
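
Editorial illustration, not part of the spoken remarks. One way inline tagging can look is sketched below in Python, assuming a made-up fragment in the style of JATS. The `<named-content>` element and its `content-type` attribute are JATS constructs; the paragraph ids, the sample text, and the `subject-term` value are invented here and do not come from any real article.

```python
import xml.etree.ElementTree as ET

# Invented sample in the style of JATS; only the element and attribute
# names are taken from the standard.
SAMPLE = """
<body>
  <sec id="s1">
    <p id="p1">We review prior work on <named-content content-type="subject-term">automatic indexing</named-content>.</p>
    <p id="p2">Results for <named-content content-type="subject-term">cross-language retrieval</named-content> follow.</p>
  </sec>
</body>
"""


def inline_terms(xml_text: str) -> list[tuple[str, str]]:
    """Return (paragraph id, tagged term) pairs for every inline subject tag."""
    root = ET.fromstring(xml_text)
    pairs = []
    for para in root.iter("p"):
        for tag in para.iter("named-content"):
            if tag.get("content-type") == "subject-term":
                pairs.append((para.get("id"), tag.text))
    return pairs


print(inline_terms(SAMPLE))
```

Because each term is tied to a paragraph id, a search hit can point to the exact place in a long article where the concept is discussed.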

Jay Ven Eman
Yeah. JATS is the Journal Article Tag Suite, for those who aren't into these acronyms. It's a standard for tagging journals.

Mark Gross
Right, so that's two pieces we're talking about here. One is finding the right article because you've added the right metadata and you've had the right synonyms so you're finding things. The other is finding the right version, which may shift. There's more dynamics to material these days than that. I mean, snippets of information go out. There are quotes that are taken out of context. All those things are out there. Also, I mean, when everything was in print, we used to think of static documents, but once you get into this electronic form of document, anything goes, it seems.

So, back to what is a document? Is it that snippet that goes out there? Is it a quote? Is it a quote taken out of context? How do you figure that out? Going back to part of what I think we wanted to talk about, the idea is that, more and more, there's so much information out there that nobody can digest it all, so we're writing for these engines that will help us find it. The idea is that we're writing for a computer to read this information to be able to pull it together.

Marjorie Hlava
Right.

16:08
Mark Gross
We may never look at the original document once it – so, how do you, how do you – What kind of problems is that going to cause now and down the line?

Marjorie Hlava
One of the problems it causes I think is an awful lot of misinformation. I mean, somebody made a comment and then they retracted it and they rewrote it, but the original comment is still in the electronic record and it's nearly impossible to expel that stuff from the document. It leads to an awful lot of misinformation, particularly if somebody's using a figure of speech, or, here's an example of what is absolutely wrong, and then that's the piece that's quoted as attributed to this author. I think particularly since we already deal with a tremendous amount of fake news, the capability for misinformation is a bit scary actually, it's a little difficult. I'm not quite sure how we handle that, but it's hard to call it back once it's out on the web, that's for sure.

Mark Gross
Just adding to that is language translation, automatic language translation, and what goes out of there, especially figures of speech. I mean, the old computer translation story, it probably needs to be updated, but a computer translating “the spirit is strong, but the flesh is weak” into Russian, and then translating it back, it comes back as “the vodka was good, but the meat was bad.” It's all figures of speech back and forth, and how much of that actually happens as misinformation? I mean, today in the scientific world many articles are in English, but more and more are coming out in Russian, in Chinese, in other languages, and what's going to happen over there?

Jay Ven Eman
Yeah. I think just a comment you made earlier Mark, and Margie too. It's calling out of context, for example. I think one of the issues there in terms of adding structure and intelligence to what we consider this unit as a document, however you define that, is linking. To the extent that we can encourage and actually do as a service, it's what we do, add linking so that a quote has that link back to the full quote document, the full article. I think that's one of the things that will help with misinformation or quoting out of context a person, is that linking back to the source document.

Then the other thing that's going, Margie and Mark too, is, turn that around about computer first. How do you write for a computer first, reading the document? I think part of the answer is, well you don't, per se. It is the structuring, it is the semantics, it is the metadata that can allow a software agent to make sense out of it. Well, we already know that computers can write extremely passable articles or documents by themselves, or with the input of a few keywords, so the computer is actually writing the content as well.

20:09
Marjorie Hlava
I think there's a lot of that – Well, you touch on GPT and SCIgen-generated articles, which is rampant. It's kind of a good thing that Microsoft bought GPT and won't give it to just everybody, because otherwise we'd all be reading whatever they wanted to generate and nothing much else. But for a scholarly publisher, that's a really serious problem when you have programmatically written articles and hybrid articles, where part of it is programmatically generated and somebody puts a nice wrapper around it which kind of foils detection of the syntax.

We've been working on that a lot lately in our company to detect programmatic articles and trying to weed them out of a corpus. We're having, unfortunately, very high success rates. Another place that this kind of thing happens is people are writing for SEO, I think. They're trying to game the search engines so that they can get higher rankings and scores on internet, and Bing, and other things like that. There they are specifically writing for the computer algorithms.

Mark Gross
Right. I think there's a midpoint though. I mean, knowing that people and computers are really looking at the abstract or the front of an article more than the rest of it, that's where you would try to pack the information in that's relevant. That's not really trying to game the system, that's just thinking in terms of where the goal is. We do put in keywords and other words in there to try to do that.

To some extent, and we probably should try to avoid figures of speech and other things that– I remember once giving a presentation to a Japanese audience, but I was speaking in English and there was an interpreter doing the interpretation as we were going along. I welcomed them to the Big Apple and then realized how does that translate into Japanese? I mean, it's an English figure of speech to begin with, and it's only local to New York. As you speak, you realize that you're using that, and –

Jay Ven Eman
But Mark, doesn't that make the reading of a document more colorful, more interesting? I mean, of course I'm now thinking a little bit away from a scientific article or a report generated inside a large company, a research report, a study that's in-house only. I mentioned War and Peace, that is a document, if you will, a monograph. The creativity issues.

Mark Gross
That's where I think before we spoke about, I think there's two. There's the computer understanding something for the purpose of being able to find it and there you're dealing with technical articles or with legal. Then there's the, does a computer understand the finesse in an article? I think we're pretty far away from that still. I don't –

Marjorie Hlava
Well –

Mark Gross
Writing an article that has that kind of color to it, I don't know if it's there either.

23:55
Marjorie Hlava
Well the whole, you reference Ray Kurzweil in that lovely little circle, and he's been working on natural language processing, natural language understanding for years. When I first met him, it was at an automatic translation conference in Stanford, and that was a long time ago. The debate there, or the synergy between automatic translation, automatic indexing, cross-language retrieval, all depends on the main tenets of natural language processing, but also one of those tenets is figures of speech and common sense.

You build those into the engines and you build as many of them as you can into the engine to promote that understanding by the computer as well as by people. Because if we go down the road of no figures of speech, no examples, erroneous or not, plus you need to add some woke vocabulary to it, it's going to be really dry reading, I think. For the computer, I really think the primary audience of most documents is no longer people, it's the computer. The people are secondary, even tertiary.

I can remember, just seems like five years ago, doesn't seem like very long ago, that publishers really did not want to have their stuff crawled by Google. That would just be the great satan would steal their data and blah, blah, blah. Now everybody wants to be sure that their data is crawled by Google because that's the first place that their users are going, usually because the publisher's search engine is rotten and they don't use a taxonomy, just saying, so the stuff is not discoverable except on Google. I know some very large publishers where the biggest type of search is known-article searches, known-title searches, where they put in the title exactly as it's written on the document so they can get to it. That's really a shame. That's too bad because you can make things findable with subject metadata.

Mark Gross
Right.

Marianne Calilhanna
Well, I'd like to jump in with, we have a lot of really great questions and comments. I'd like to throw one out to the three of you that I think is pretty interesting. Does a document need to be tagged a document? Doesn't the process of a document make it a document? This would also require that a level of granularity of the tagging would be needed to call something a document.

Marjorie Hlava
Well, it's certainly popular for people to tag by document type. It's a scholarly article, it's an obituary, it's a letter to the editor, it's a note, it's a book, it's all kinds of document types. The computer doesn't care what kind of a document it is, but I think the user cares what kind of a document it is.

Jay Ven Eman
I'd say I don't think it has to be tagged or even particularly granularly tagged to be a document. One of the premises of the conversation at this lunch hour is that the computer reads it first, and while AI is getting more powerful, we just said it can write articles, it can generate some understanding, you're going to get much better results in using that document down the road when you do have it tagged with metadata and marked up.

27:46
It doesn't have to be, though. When you save a Word document with a simple title in Word, that's a document, it qualifies as a document. But if you're going to add it to a repository of eight million other documents or eight billion other documents, adding a little more information about the document, in the form of metadata and structure, will help the machine more reliably utilize that document. I think that's one of the things we're talking about, is discoverability. The reliability of your discovery engine improves these processes. Mark?

Mark Gross
Yeah. To Jay's point, I think that's really – I mean, artificial intelligence, and search, and machine learning all have a great deal to do with this, but if you can give some cues, if you tag it and give it some cues like the metadata and other kinds of keywords and stuff, that just helps move it along so that the machine doesn't have to learn as much and can be much more reliable. You're not dealing with 90% accuracy or 92%, you can bump it up by using the two of them. It's really, we're talking about improving the process by just adding a little bit more information.

Marianne Calilhanna
Here's another interesting question. If I tweet something online, does that become a document? Of course, someone said, "It does if you say the wrong thing."

Marjorie Hlava
Well, I think it's a blob of text. It can be indexed as a document.

Jay Ven Eman
Also, we want it to be a document. You want it to have survivability beyond the trillions of other tweets.

Marjorie Hlava
It can be something somebody desperately wants to recall after a while.

Mark Gross
Right. I think the term of art might be blob of text. A document is sort of what we think of as a document, blob of text now is anything that's out there, and the question is whether it's retrievable in a way that you can find it, way that you can get rid of it, way you can delete it from the internet, all those things. Today certainly these things are findable, either straight through or through all these engines that save everything and let you find them later and things like that. All these blobs of text are findable at some level. If they don't have metadata, they're harder to find.

Jay Ven Eman
Yeah, and it's not new either. I mean, do those notes in grade school that you passed to your friends, are those documents? Are they more like old-fashioned tweets, where they're written on paper?

Mark Gross
Well, there were always –

Jay Ven Eman
“This class is dumb,” you know, you pass it.

Mark Gross
They're documents that just get lost. I mean, I was at the Library of Congress, the manuscript librarian was showing off some of his favorite things, and one of them was Abraham Lincoln's notebook from third grade, which had little cartoons of his teacher. That's the equivalent of what you're describing. It is now a document and a very valuable document.

Marianne Calilhanna
Historical document. That's really interesting to think about. Someone posed the question, "Is there a format that machine search engines prefer? Is it structured in XML, HTML, other formats that are easier to find?" Maybe you could comment on that.

31:42
Marjorie Hlava
Well, I would say the term of art is XML at the moment. As for HTML and XHTML, XHTML is a little easier for machines to ingest than HTML, but all of them are ingestible. Certainly lots of other formats are there as well. There are formats that are not quite as broadly embraced anymore that are highly structured and very rich in metadata, like the MARC format, for example, which is the lingua franca for libraries and for sharing of information. You have the ONIX format on your little diagram here now. I really think that XML is probably the most versatile at the moment.

Mark Gross
Right, and the most accurate. You can work with it, but in terms of what's findable, I mean, you can find things in PDF documents also, it's just not as accurate probably because the words may not be complete and things like that and people have OCR'd original documents. There's lots of OCR in the background there, which is not accurate, like in discoveries, in the legal discovery system, those are paper documents that got scanned and then automatically OCR'd and searches are made against those. They're still valuable, it's just not as accurate. Sometimes the only thing you can do is go against these less accurate collections of information. We think in terms of scholarly materials and then in XML and you could get really precise and you put in the metadata, but sometimes you have 10 million documents that came in, in a court proceeding, and nobody's going to XML those but you could still do some searches against that.

Marjorie Hlava
Well, and I think there's a certain percentage that's probably allowable. I mean, when you count accuracy in an OCR conversion, you're counting every character, so 98% accuracy means there are two typos, or spacing errors, or something, for every hundred characters. Well, you still get a pretty good search against – The US Patent Office uses that kind of an algorithm and it works fine for them.
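
Editorial illustration, not part of the spoken remarks. The character-accuracy arithmetic can be worked through in a few lines of Python. The sketch assumes errors are independent and evenly spread across characters, which real OCR output only approximates, and the word lengths are arbitrary examples.

```python
def expected_errors(char_accuracy: float, n_chars: int) -> float:
    """Expected number of misread characters in n_chars of OCR output."""
    return (1.0 - char_accuracy) * n_chars


def clean_word_probability(char_accuracy: float, word_length: int) -> float:
    """Chance that every character of a word was read correctly,
    under the independence assumption stated above."""
    return char_accuracy ** word_length


accuracy = 0.98
print(expected_errors(accuracy, 100))  # roughly two errors per hundred characters
for length in (4, 8, 12):
    print(length, round(clean_word_probability(accuracy, length), 3))
```

Under these assumptions longer words are the likeliest to contain an error, so an exact-match search on a long term is where a 98% collection will most often miss.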

Jay Ven Eman
I would say you want to look at, for the person who asked the question, are you in a particular industry, or discipline, or vertical market? Because sometimes not formal standards but quasi standards are developed by certain industries or vertical markets, or they develop an XML-based standard for exchanging of documents or information, if you will. You might look within an industry to see, are they doing something? Margie mentioned MARC, which is a library standard for the exchange of data. We mentioned JATS earlier, which is the Journal Article Tag Suite, that's appropriate for scholarly articles, articles that go into scientific publications. For B2B publishers or within your organization's internal document creation and storage there may be some standards or quasi standards that you could use for tagging, and marking, and structuring.

Mark Gross
Well, like, for example, the standards industry has its own. It's basically an expansion of JATS, but it's its own XML format to keep track of things that are going on over there. Journal articles are considered to be, I guess, static, but this is an expansion of XML that keeps track of changes and modifications, so I think, Jay, your point is very well taken. Industries have different vocabularies and different interests that need to be taken care of, so it often pays to have a specialized version for an industry.

35:58
Marianne Calilhanna
Here's a really fascinating thing to think about, one that gets me a little – induces a bit of fear as someone who creates a lot of content. Do you think for technical content and maybe more there could be a point where there are essentially two versions required, one for machine and one for human? Further, potentially metadata tagged within the content which does include bits of the actual content and the full content for the human reader. Wow.

Marjorie Hlava
Well I think that gets back probably to the article of record. A number of organizations, most scholarly publishers but also legal and regulatory governmental agencies, for example, do have a version of record, and that one needs to be the same on computer or in print, for example, or PDF, or whatever the distribution format. The distribution format should not matter. The document needs to be exactly the same, even down to the pagination in those cases because they're citing a page in reference in that document. Anything that's a standard, a regulation, a law, for example, and there are other examples, need to be citable in that fashion. Other popular material might be able to have a computer consumable version and a human readable version, assuming in that case that the human readable version might have images, for example, whereas the computer one does not.

Mark Gross
Right, which already makes them different. In most organizations that use XML, the PDF version still gets done first and then the XML gets produced from that, so by definition you already have different versions as it goes along. I mean, so there are different versions out there, but they should be the same. I think you're right, but what is a document of record? I mean, in those journals the PDF version probably is the document of record, and then what got distributed is not. You have to go back to the document of record, I would think. I think the goal is it should be the same thing.

Marjorie Hlava
Well some organizations, I know that IEEE, for example, keeps an article of record, which is a metadata enriched copy or version of the article. It's not exactly the same as the one that's on their search system, but it is considered their article of record and it is in XML, it's not in PDF.

Jay Ven Eman
Yeah, again, you can go back historically. I mentioned War and Peace. You can always read the CliffsNotes version. Woody Allen had his famous joke, he took speed reading, he read War and Peace: "It's about Russia." That's the best he could do, in his joke. Yeah, there are probably multiple versions, but that gets back to the metadata and versioning, being able to reconstruct what you had at a particular point in time.

Marianne Calilhanna
Right. Here's another question. How long do you expect the work of tagging metadata to stay in primarily the domain of human production, and how soon, if ever, will tagging be done entirely programmatically? What barriers still exist and what advantages do humans have? Mark, I think that's probably a great one for you, from the DCL perspective.

40:13
Mark Gross
Well, I think the answer to it, it depends on what the material is. I mean, it depends how people are willing to produce information. It's certainly possible to do much of the tagging as it's being entered so that it's part of the process. The authors and the people putting together information are not necessarily working in the same systems or not thinking that way. Do you stop a scientist who is trying to get his thoughts down on paper, saying, "Wait a minute. You can't do that now. You've got to put all your tags in." Or do you let them get their thoughts down on paper and then do it as a process afterwards?

I don't think it's a matter of whether it's possible, I think it's possible, the question is where is it best done and how many people do you want to train on that? I think going forward, I mean, it's going back to that Kurzweil quote, if we're saying that computers are understanding everything as it's coming in, at some point they won't need the tagging possibly, but I think we're still not quite there yet. I can see a day like that, but there's still billions of pages coming out that are not being done that way.

Marjorie Hlava
I kind of think it depends on what you're tagging. If you're tagging a document fully into full XML markup, I'm not so sure we're there yet with that. If you're talking about the subject metadata tagging, I do think we're already there. I think we have a great many customers that tag all their incoming data automatically, fully automatically. They sample it to make sure that there aren't new concepts and so on coming out that they should be paying attention to.

I mean, in the medical business three years ago there was no COVID-19 and all its many variations to be considered, but it's certainly front and center now. You have to keep an eye on that process and that's why we say we are human intelligence-assisted or human-assisted intelligence, because you do have to keep your eye on the new terms coming forward. For subject metadata I think for the most part we're already there, and for metadata extraction, getting the title, the author, the pagination, the date of publication, the geographic location and so on, I also think that's fairly reliable extraction. But I think full XML coding of an original document really does need a fine hand to get it finished off.

Mark Gross
Right. A lot of them, I mean we do now automate – I mean, we get to the point where many articles, I don't know whether it's 80% or 85%, are automatically tagged. I mean, there's still the ones that don't get there, but in terms of tagging scientific articles and articles in specific areas, I think they do get automatically tagged. But there's outliers, and part of what a computer needs to tell you is when something doesn't look right and needs to get pulled out of the system so it can be manually reviewed.
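
The routing Mark describes, where most articles pass straight through and the system pulls out anything that doesn't look right, might be sketched roughly as follows. This is only an illustration: the `TaggedArticle` class, the `confidence` score, and the 0.9 threshold are all assumptions made for the example, not a description of DCL's actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class TaggedArticle:
    article_id: str
    tags: list[str]
    confidence: float  # hypothetical score in [0, 1] reported by an auto-tagger


def route(articles, threshold=0.9):
    """Split articles into auto-accepted and flagged-for-manual-review lists."""
    accepted, review = [], []
    for article in articles:
        # An article with no tags at all, or with a low score, is treated
        # as an outlier and pulled out for a person to look at.
        if article.tags and article.confidence >= threshold:
            accepted.append(article)
        else:
            review.append(article)
    return accepted, review


batch = [
    TaggedArticle("a1", ["virology"], 0.97),
    TaggedArticle("a2", ["oncology"], 0.62),
    TaggedArticle("a3", [], 0.99),
]
accepted, review = route(batch)
print("accepted:", [a.article_id for a in accepted])
print("review:  ", [a.article_id for a in review])
```

The threshold is arbitrary here; in practice it would be tuned by sampling the accepted pile and checking how often a reviewer disagrees.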

43:51
Jay Ven Eman
Right. I think of an XML-first workflow, which is getting to be more and more common. Or take your scientist, Mark: you don't want them worrying about that sort of stuff. Instead, the tagging is hidden; it's behind the scenes, if you will. They're prompted “please enter the – ” That sort of thing. It's the same thing even in corporations and government agencies where they're producing a lot of documents: you have a template that you design, and the template reflects your XML, but you don't have tags there, et cetera. Part of it is automated that way in the actual creation of the document so that it's immediately machine readable.
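
A template that "reflects your XML" without showing tags to the author could look something like the sketch below. It assumes a hypothetical form with `title`, `authors`, and `abstract` fields and a made-up element layout; it is not any particular publisher's schema. It uses only Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET


def form_to_xml(fields):
    """Turn plain form fields into XML; the author never sees a tag."""
    # Element names below are illustrative, not a real DTD or schema.
    article = ET.Element("article")
    front = ET.SubElement(article, "front")
    ET.SubElement(front, "title").text = fields["title"]
    for name in fields["authors"]:
        ET.SubElement(front, "author").text = name
    ET.SubElement(front, "abstract").text = fields["abstract"]
    # encoding="unicode" returns a str; special characters are escaped for us.
    return ET.tostring(article, encoding="unicode")


# What the author typed into the prompts ("please enter the title", ...)
form = {
    "title": "Tags & Templates",
    "authors": ["A. Author", "B. Writer"],
    "abstract": "A short summary entered in a plain text box.",
}
xml_text = form_to_xml(form)
print(xml_text)
```

The point of the design is that the structure lives in the template, so the document is machine readable from the moment it is created.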

Marianne Calilhanna
Here's a good question. Is document structure influencing AI or is AI influencing document structure? Is bias in AI influencing finding appropriate documents?

Jay Ven Eman
Bias is an interesting issue. I think it's both, I mean, it goes both directions. They're influencing each other. Of course, bias is a very complex thing when you're talking about this sort of a situation of massive amounts of content. If your AI system produces an output, it might show bias because that's the way the real world is at the time, or whatever the data was that they analyzed at the time. To try to compensate for bias, that's difficult to do. You may want to try to measure it to see if it's there.

Marjorie Hlava
I think there is a tremendous amount of bias that can be input into artificial intelligence, particularly when you get into the semantics. I think there's a huge amount of semantic censorship that's going on at the moment, where one group calls it X and another group calls it Y. You put in the word X, you only get the information from the group that calls it X, you don't get the information from the group that calls it Y because it's not coded that way. You put in a synonymy and you try to find a neutral term and one side or the other is going to explode.

For example, and it's only an example, it's not to show my personal bias. If you wanted to look at the early information coming out on COVID, you might want to use the word Wuhan or Wuhan virus. Well, that is absolutely not allowable, acceptable or proper anymore from the politically correct police, and therefore if you've tagged the document early on with Wuhan virus, you better go back and recode it with one of the many aliases for COVID-19, CoV-SARS-19, or whatever you want to call it. Because there are over 100 ways to name that virus at the moment, and if you want to get everything you have to use all of those words, not the words from one camp or another camp.

Mark Gross
Right.

Marjorie Hlava
It does really change the way things are retrieved and people are only going to get what the intelligence engine has been fed.

47:58
Mark Gross
I think that's the point. That's not a bias in artificial intelligence, that's a bias in the engine that's been put in there to make those decisions. It's not a natural requirement of artificial intelligence or machine learning, it's something that people put in. We're sort of backing away from, it's not that computers are reading it that way, it's that people are making the decisions on how it should be read, which is I think a little bit different.

Marjorie Hlava
Yes. I agree. That's not the artificial intelligence causing the bias, but everybody who's making a point, making an argument, has a bias one way or another when they write the article. I advance this theory based on these items, and it's my story and I'm sticking to it. What I called it might not be the same as what somebody else called it. In order to get all of that stuff, we need to have a really good synonymy or a really good inferential engine that'll point up all that stuff and give us those results in the same search.
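
The kind of synonymy Margie describes, where a search on one camp's term also returns documents coded with the other camp's term, can be illustrated with a very small sketch. The synonym rings, the `expand` and `search` helpers, and the sample documents are all hypothetical; a production thesaurus would be far larger and would match on indexed terms rather than raw substrings.

```python
# Each set is a "synonym ring": any member should retrieve all the others.
SYNONYM_RINGS = [
    {"covid-19", "sars-cov-2", "novel coronavirus", "wuhan virus"},
    {"dishwasher", "washing-up machine"},
]


def expand(term):
    """Return every known variant of a term, or just the term itself."""
    term = term.lower()
    for ring in SYNONYM_RINGS:
        if term in ring:
            return ring
    return {term}


def search(term, docs):
    """Return ids of documents mentioning the term or any of its variants."""
    variants = expand(term)
    return [
        doc_id
        for doc_id, text in docs.items()
        if any(variant in text.lower() for variant in variants)
    ]


docs = {
    "d1": "Early reports on the Wuhan virus outbreak",
    "d2": "COVID-19 vaccine trial results",
    "d3": "Advertisement for a new washing-up machine",
}
covid_hits = search("COVID-19", docs)
appliance_hits = search("dishwasher", docs)
print(covid_hits, appliance_hits)
```

Because the expansion happens at query time, the rings can be reworked periodically without touching the documents themselves.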

Mark Gross
You make a good point, that really you may have to go and rework the synonyms or rework the keywords periodically to still make sense.

Marjorie Hlava
Well, a lot of publishers have a policy to re-index their corpus every five years or so. It's exactly because of that, because the old words are no longer used. I mean, nobody says washing-up machine anymore, they all call them dishwashers. We need to be able to accommodate all of those.

Mark Gross
“Washing-up machine”?

Marjorie Hlava
That's what they were called when they first came out. Look at the early ads.

Marianne Calilhanna
Here's a good comment. Authors are told what they need in an article to be processed for publication, but should we start making authors aware of how to write articles to be savvy to the parsing and tagging of the article for optimal searchability?

Marjorie Hlava
No. I think a large part of the scholarly record and what's so exciting about it is that people are free to express themselves in many, many ways. If we make that too narrow and too prescribed, you're going to limit the capability for innovation and public discourse.

Mark Gross
That doesn't mean that authors shouldn't be aware that if they change the title it'll sound more interesting even to people. I mean, so we do some of that anyway, right? You write the title and you write the abstract so that it's interesting to people. The question is, do you also make it "interesting" to a computer? That part I don't know.

Marjorie Hlava
Well, and one of the challenges is when we index scientific articles, science, technology, medicine, engineering, those titles are pretty descriptive. You can depend on those. When you're in the social sciences, they are often written to be provocative, to get people to read the article, and they have nothing to do at all with the content. Well they do, but it's a hook to get people to read it. Same with news articles. They might put “Biden did XYZ” in the title, and you read the article and, well, that wasn't what I expected to be reading when I read the title. It's just in news and social sciences I think it's very true that titles are often written to be provocative.

52:18
Jay Ven Eman
I think part of the question is, and I think it's a good idea, as people are getting their education: everybody has to write. No matter what your job is, you have to communicate, even if you're communicating via just a text message, or as we said earlier, Twitter. People learn; it's been known that you can scam search engine optimization algorithms by putting the right kinds of metadata in the header of an article that you're putting up to improve its ranking.

Authors who publish regularly know, as we were just talking about biases here, the bias of different journals and their publishers. They kind of know, they learn what the journal publisher's bias is, and they'll write to kind of lean towards that particular bias to try to get their articles published. A lot of people already sort of intuitively know this, but to get some basic instruction on how to improve discoverability of your article, again whether it's published commercially or in-house, I think is a good idea. I mean, they used to have, I think they still do, library orientation sessions for incoming freshmen. I don't see any reason not to include something like that in the course of one's education.

Marianne Calilhanna
I would like to, this is a little bit of a different question, but it goes back to something Margie said and I would like to pose this to you before we start to wrap up. How accurate is current automated language translation? That's a big question.

Marjorie Hlava
Well, a lot of people are using automatic language translation all the time. "Hey, Siri." "Hey, Alexa."

Marianne Calilhanna
Sure.

Marjorie Hlava
Well, I just booted up Alexa now.

Alexa
Not exactly, but I offer no resistance.

Marjorie Hlava
Oh, stop it. We're using that all the time and the same thing when you dictate on your telephone. All of that is automatic language trans– Well, it's speech-to-text translation. Then the next stage is taking something written in one language and translating it to another, and there it depends on which language you're going to and from. For example, some people think that English is the same worldwide. It's not. We recognize four distinct kinds of English. There's British English, American English, Australian English and Indian English, and they are very different. Their cognates, their use of words, their definitions are different, so you need to know which English you're working in.

Marianne Calilhanna
Spelling.

Marjorie Hlava
Then translations to Spanish, German, Russian, Arabic, to a large extent Mandarin Chinese, and Japanese are also fairly good. When you get into other languages, like Marshallese or something, there the time has not been spent putting in automatic translation activities and so it's not as accurate.

56:07
Mark Gross
Yeah. It's probably also true, Margie, you probably know better than me, those languages for which there's a lot of material available publicly with dual translations, like the European languages which have a lot of material that comes out of the EU bodies would have more information. I guess in Canada materials have to be done in both English and French, or Canadian French, which is probably not the same as Parisian French. Those probably have, because a lot of this is being done with machine learning, so those probably do a good job. In other places where you don't have that it's just probably there's not enough material to build training sets with yet.

Marjorie Hlava
Right. I had some rather amusing, or unfortunate, depending on how you view them, options with Spanish because people who think Spanish is the same are the same as those people that think English is the same. There's Mexican, there's Castilian, there's Barcelonan, there's what they speak in Chile. They are all different dialects and they have huge populations. We live in a border state, in New Mexico, and there's a restaurant chain which named itself what it thought was a cute name, and then when it tried to expand into Mexico it realized that the name it had chosen for its restaurant chain actually meant a woman's breast. They ran into all kinds of troubles. Other companies have a similar kind of challenge. What's socially acceptable in one country is not necessarily the same as what's socially acceptable in another, and the translations mirror that.

Jay Ven Eman
As Mark said, at least the vodka's good.

Marjorie Hlava
That's right.

Mark Gross
That's an important consideration.

Marianne Calilhanna
Well, we're coming to the end of our hour. We have a lot more comments, questions. Perhaps one time in the future we can all have an in-real-life event where we can really continue this dialogue. I do want to mention one thing, it's something I noticed at the beginning of our conversation and my friend Barry commented as well. We noticed that all three of you have a whole bunch of books behind you. How do you find content in those books?

Marjorie Hlava
Well, this particular set of bookcases behind me at the moment is my fiction collection and things are filed alphabetically by author.

Mark Gross
I file them by color.

Marjorie Hlava
Stop it, you do not.

Jay Ven Eman
Yes, one of the things I have in my bookcase, talking about language, et cetera, is a classic Roget's Thesaurus, and a dictionary.

Mark Gross
Yes, I have one of those too.

Jay Ven Eman
Can't do without it.

Marianne Calilhanna
I just picked my Words in Print off the shelf and thought, I wonder if it's time to retire this book?

Mark Gross
I need some more space for books, and I have, from my son who moved out quite a few years ago, an original Collier's, the last edition ever printed of Collier's Encyclopedia. I asked him if he wanted it and he said, "What for? The information is 30 years old."

Marjorie Hlava
Well, but I have a copy of a 1957 Encyclopedia Britannica set, and I keep it because every now and then some obscure question of British history comes up around the dinner table and it's most likely to be answered in that edition, fairly reliable. It's peer reviewed, you know.

Marianne Calilhanna
Fascinating. Well, we have come to the end of the hour. Thank you all. This has been really great conversation. Thank everyone who stayed with us. I hope you enjoyed your lunch. I hope you learned a little something. This is part of the DCL learning series. The series comprises webinars, blog posts, and we have a monthly newsletter. I invite you to be sure to subscribe to that. This recording will be up on the dataconversionlaboratory.com website in probably about two days. You can find it in the on-demand webinar section. Thank you so much. We hope to see everyone back at a future Lunch and Learn conversation.

Mark Gross
Thanks. Thank you, Marianne.

Marianne Calilhanna
This concludes today's program.

Jay Ven Eman
Thank you.

Mark Gross
Great to see you.

Marjorie Hlava
Okay.

Jay Ven Eman
Thank you. Bye.