DCL Learning Series
Converting Data for Ultimate Customer Success: Hits & Misses
And now I'd just like to go ahead and introduce our speaker, Mark Gross, the President of Data Conversion Laboratory. Mark, you want to go get started?
Yes, and there's my picture and everything. Okay, thank you very much, and it's wonderful to be here with this webinar with PTC. Just a few words on DCL for those of you who don't know us. Data Conversion Laboratory: we convert data; that's, I guess, our first and middle name, and that's what we've been doing for 40 years, actually going into our 41st year soon. Our business originally was to convert information and data and documents from almost anything, so Word, PDF, Frame, DITA, all those other kinds of things that are out there.
But, of course, along with data comes a lot of other things that people don't think about that much. We do a lot of semantic enrichment of content, adding information on to what's already there, using artificial intelligence, machine learning and all the latest technologies. Especially when you're dealing with large amounts of content, you need to use as much automation technology as you can and that's been our, the way we operate. A lot of entity extraction, which is becoming more important now as people are building larger data sets, data mining, all kinds of custom algorithms that will find information inside, inside information. Becoming also more important is data harvesting and finding information on the web, a lot of content, a lot of data is sitting on the web these days, especially with regulatory content, all kinds of public information, we've developed technology that lets us, well, find things on the web; to harvest things off the web.
Also very important: as you're dealing with larger and larger amounts of data, it's hard to proofread and hard to check it all, so we've built a lot of software to do validation of information, to do QA, to check results. We do a lot of work actually with checking information that was collected by other people, not just us. Content reuse, and I'll talk a little bit about that, is becoming more and more important: as people build larger collections, there's a lot of redundancy, and it's important, especially when you convert information or even just to check up on what you have, to find all that redundant content and remove it. We have tools that help do that. There's structured content delivery, delivering things just the way you need them, with a lot of data streaming and information delivered all over the world. And training sets are the byword of artificial intelligence; you don't have anything until you've got data. We've developed quite a practice in building training sets that are used for training algorithms, for things that we do but also for things that other people are doing. And of course content analysis, checking up on content, and specifically Content Clarity, which we'll talk about a little bit later.
So the topic is data and digital transformation; that's the byword today. In a 2017 survey, Gartner said about 42% of CEOs had embarked on a digital business transformation. The reality is it's a lot more now, because things have happened in the world that compressed a lot of that. COVID, people working at home, reduced workforces, and having to do a lot more with less have made it more and more important to be able to deal with whatever work is out there, and specifically data work: it has become more and more important to have automation, to have the data, and to have it structured. And those companies that had made their start on it in 2017 found that when COVID hit and they suddenly had to start working from home, they were in a better position than others who hadn't done that.
So digital transformation involves a lot of money: thousands, hundreds of thousands, probably millions of dollars for many companies, spent on purchasing new tools and technology. The hope and intent is that it'll have all these benefits. Increased sales, because they're able to display their wares more easily, and people are able to find the information more easily; people buy by looking information up on the web a lot of times, or by getting personalized information sent to them. Better customer experiences; all of you have experienced the web chats out there that are either very good or really terrible. Personalization: getting information just the way you want it, or just the way you're expected to want it, and that, of course, can be good and bad. If it's really personalized and it's things that you want, it's terrific; if it's just getting ads over and over again on things you looked up 10 minutes ago, or three days ago, it's annoying and terrible. So there's a flip side to everything, but in general it's improved service that we're trying to provide, and I think we've all experienced that some organizations are able to provide much better service because of this digital transformation.
So a critical part of that is content transformation. No matter what new tools and systems you bring in, and the budgets you throw at it, data is critical; we often say you have nothing until you have good data. Another survey found that poor data is the leading cause of failure for IT projects. It comes out of a study that's referenced below, which puts the cost of poor data quality at $600 billion. I think it's really a lot more, especially since $600 billion doesn't sound like so much now that we're talking about trillion-dollar kinds of spending. But poor data is a leading cause of IT project failures, and a leading cause of the difficulties in being able to get together and build good systems.
And a big piece of it is just being able to take the data that you already have and work with it. The keys to a lot of these transformative benefits lie in well-structured and well-tagged content components. You know, we all have information, we know what it's like, and many of you have gotten past this point already, but if you're trying to put together a marketing piece or a proposal or a specification document or anything else like that, very often you're dependent on material you've already done. Sometimes you feel like, Gee, I've written this a hundred times already. How come I have to write it again? And the key to all this, to being able to make better use of information you already have, and information that's coming into your system all the time, is to be able to find it: to structure it in a way that you can find the pieces of it and pull it back together rather than having to rewrite everything.
So I'm going to talk a little bit about three steps to getting to that point. One is, when you convert lots of data, there's a step of pre-conversion to do a content analysis. I think it's a very important step and we'll see a little more about that. I think there's a step to find all your redundant content. And there's a step to understand what your information model is, which is a fancy way of saying to know how all your information fits together so that you'll be able to find that easily and economically. So I'm going to use, I'm just going to introduce this concept of architecture of information with DITA. DITA is one architecture, how you handle information.
So that means taking all the information you've got over here and tagging it in a way so that you know what it is. DITA stands for - thank you for putting this on here - Darwin Information Typing Architecture. It is an architecture originally put together by IBM in the late 1990s; by 2004 they donated it to OASIS and it became a public standard, which meant it could be used everywhere for certain kinds of information. It's ideally suited for information that has lots of components that you want to put together in different ways, which is very important in technical information, maintenance information, and educational kinds of information.
There are other architectures that are being used for other things. For scientific articles and books, there's a standard called the JATS standard. For aviation and military use, there's a standard called the S1000D standard. All of these are different ways of pulling information together. In the pharmaceutical industry, something called SPL for, for product labeling and all that information that you get in those, you know, when you get prescription drugs, you get those long megillahs that come along with it? That's put together in SPL, which is the standard way of doing that. So, so I'll just focus a little bit on DITA and we'll move on, because there are other things that really work in other places. I didn't realize, I didn't know till recently why it was called Darwin.
Darwin, of course, is the author of the concept of inheritance and, and species inheritance, and one of the, one of the features of DITA, and other kinds, and other of these specializations is the ability that, once you define something at a high level, you can build lower levels below it. And they inherit the properties of the top. But I'm not going to go into any of that in any detail; I just thought that was interesting. It's a way of typing information so you know that this particular item is a topic, this particular item is a list, this particular item is a list of tools, things like that, so that, that you're not just, you know when you do a Google search, you find lots of extraneous information because you haven't typed information in any particular way. But if you've identified information, new information, to say that this is a tool, or that this is a step in a process, then you can say things like Find me - do a search only on these particular kind of items. And, and all this is really a benefit, and once you've pulled together, it's better - you can build better documents, you can use the documents in better ways, because they're a better type and you can, and you can reuse elements and lots of other benefits like that.
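To make the point about typed search concrete, here is a minimal sketch in Python. It assumes a heavily simplified DITA-style task topic; the sample content and the helper name `search_by_type` are hypothetical and exist only for illustration. It shows how a search can be limited to one kind of element, such as steps, instead of matching the keyword anywhere in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified DITA-style task topic for illustration.
SAMPLE = """
<task id="replace-filter">
  <title>Replace the oil filter</title>
  <taskbody>
    <prereq>You need a filter wrench and a drain pan.</prereq>
    <steps>
      <step><cmd>Drain the oil into the pan.</cmd></step>
      <step><cmd>Remove the old filter with the wrench.</cmd></step>
    </steps>
  </taskbody>
</task>
"""

def search_by_type(xml_text, element_name, keyword):
    """Return the text of every element of one type that mentions keyword."""
    root = ET.fromstring(xml_text)
    hits = []
    for element in root.iter(element_name):
        # Gather all nested text and collapse runs of whitespace.
        text = " ".join("".join(element.itertext()).split())
        if keyword.lower() in text.lower():
            hits.append(text)
    return hits

# Search only the steps, ignoring the prerequisite that also mentions a wrench.
step_hits = search_by_type(SAMPLE, "step", "wrench")
print(step_hits)
```

An untyped keyword search would return both the prerequisite and the step; because the content is typed, the query can ask for steps only.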
And we talked about, so moving on, we've talked about architecture, and DITA is one used, I think many of the people, many of the people in this audience are probably, this is probably one that's looked at, and then there's others besides that; I think it's a blessing that there are these almost out-of-the-box kind of architectures available today because it makes it very easy to get started and to find tools to work with them. And back in 2000 or 2005, if you wanted to use XML, you had to, you had to build your own structure or an architecture which you know, had startup fees of like 20,000, $30,000 and it was a bit of a hump to get started, so it's great to have something that you can start with.
So let's talk a little bit about pre-conversion analysis. So, one of the things that I think is very important to do is inventory your content. And the reason for that is because most people don't really know what they have, and what they have that's in good shape and what's, what's obsolete. And, and what is, you know, what is redundant, and what you don't really need any more. Very often, we start with a conversion discussion, it's like "Well I've got 18 terabytes of content." "Well, how much of it do you really need?" "Oh, I need all of it." But, but there is a cost to moving content over. Just like when you move houses across the country, there's a cost to moving everything with you; you try to cull out the things you don't need anymore, the things that are not valuable anymore. Likewise, going through your materials, inventorying is very important. Identifying variables and redundancy, I'll talk a little bit more about that in the next step. Think about what kind of information you have over there. You know, is it, is it information on maintenance information and what's involved in that, is it mostly marketing information and what's involved in that, uh, the crossovers including your marketing information and your technical information, that, that's all part of that pre-content analysis that should be done up front. And also it's important to find, know the special characteristics about your content: is there, is it mostly textual material, is there a lot of tables in there, math equations, all kinds of things like that, just to know what you're gonna have to make sure you're, you're supporting as you go further.
And don't make assumptions, really, because until you really look at it, you probably don't know what you have back there. Many times we hear, well, let's not convert anything because we're going to have to rewrite it anyway. I would say don't devalue your legacy content, because some of it is very valuable; it's 10 years, 20 years, 30 years of material that's been worked on. And rewriting is a tremendous undertaking. It's very easy to underestimate what it's going to take to rewrite stuff, to vet it, to go through legal review, to do marketing review, all of those. Don't ignore the metadata. We haven't really introduced it: metadata is the data about the data, about your content, so information like when it was written, who wrote it, and what the title of that particular section is. Very frequently, when things get moved over, a lot of that information can get lost. And I think you know I'm a big proponent, of course, of automation. But don't underestimate the manual processes that are involved; this is not something that people do in their spare time, it's usually a major undertaking.
And, and also don't underestimate the cost of managing this kind of project, so if building the process, the systems and processes to make sure you've handled everything, there is a cost to that. Just, I just wanted to show you a tool that, that we use for some of our upfront analysis. And there are other tools like this, this is a tool that we use that will go through a large collection of information and come back. This is just the headshot over here of where it's finding how many, how many items you have and how big it is and how many pages there are and all kinds of other information about the collection of information is. Behind this, there's detailed information that tells you about how many things, how many things that are referenced in the collection are, are not found, and what they are. How many, how many links to images, how many, how much validation problems there are, and it identifies where they are. It sort of gives you an inventory of what should be looked at before you ever undertake a conversion because many times fixing those things up front is the easiest way to make sure that you've got what you need, and then the rest of the steps are going to be much easier.
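As a rough illustration of the kind of upfront inventory described here, the following Python sketch counts the files in a collection by type and lists references that point at files that are not in the collection. This is not the tool shown on the slide; the collection contents, the `href` convention, and the function name `inventory` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical mini-collection: file name -> file content.
COLLECTION = {
    "pump.xml": '<topic><p>See <image href="pump.png"/></p>'
                '<p>Also <xref href="valve.xml"/></p></topic>',
    "valve.xml": '<topic><p>Diagram: <image href="valve.png"/></p></topic>',
    "pump.png": "",
}

# Assumption: references to other files appear as href="..." attributes.
HREF = re.compile(r'href="([^"]+)"')

def inventory(collection):
    """Count files by extension and list references that point at nothing."""
    by_type = Counter(name.rsplit(".", 1)[-1] for name in collection)
    missing = []
    for name, content in collection.items():
        for target in HREF.findall(content):
            if target not in collection:
                missing.append((name, target))
    return {"files": len(collection), "by_type": dict(by_type), "missing": missing}

report = inventory(COLLECTION)
print(report)
```

Here the report flags that `valve.xml` refers to an image that was never delivered, which is the sort of problem that is cheaper to fix before conversion starts.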
Talking about content redundancy, which I've mentioned a few times: very often you've got a lot of information that's just redundant across your collection, because as you go, you cut and paste, and you pull things together. Your proposals have a lot of information that crosses over with your marketing information, and your technical information does the same. I mean, we've done collections, extreme collections, where you've got like 20 models of the same kind of product and manuals for each one of them, and you might find that 80, even 90% of the content is redundant. Which means every time you update something, you're editing it multiple times; every time you translate it, you're translating it multiple times; all kinds of things like that. So we think it's very important to try to eliminate that redundant content up front, so that later on, when you're doing your conversion, you have a lot less to convert. Sometimes it's half the material, which reduces the total budget just by being able to do this. There's also a lot of nearly redundant material out there, which means somebody took a paragraph and put in a semicolon, or a colon, or a comma, or changed one word; it's really redundant content, but it's a little bit different and a little harder to find.
So that's that's a piece, and the "Don't" of this is an easy one: just don't forgo this step. I think it's, it's very important for many, many people. This is sort of the, we have a product called Harmonizer, which actually does that kind of analysis. It'll go find, go through a large collection of information, you put in 50,000, 100,000 pages, it'll come back and tell you what percentage of it is found identically in other places, what is nearly identical in some parameters around that, and, and what parts are actually unique. It comes across - probably should have made this larger, I apologize - but it'll come back with reports that show you all the paragraphs that are identical to each other, and where they are found in your document set so that you can go back, and you have a checklist over here, things that you may want to correct early on, or change early on, so all these redundancies can pop out of the system as a next step.
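The distinction between identical, nearly identical, and unique paragraphs can be sketched in a few lines of Python. This is only an illustration of the idea using the standard `difflib` module, not a description of how Harmonizer works; the normalization rule, the 0.85 threshold, and the function names are assumptions chosen for the example.

```python
import difflib
import re

def normalize(paragraph):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", paragraph.lower()).split())

def classify(paragraphs, near_threshold=0.85):
    """Label each paragraph 'unique', 'identical', or 'near' versus earlier ones."""
    seen = []    # (raw text, normalized text) of paragraphs already processed
    labels = []
    for paragraph in paragraphs:
        norm = normalize(paragraph)
        label = "unique"
        for earlier_raw, earlier_norm in seen:
            if paragraph == earlier_raw:
                label = "identical"
                break
            ratio = difflib.SequenceMatcher(None, norm, earlier_norm).ratio()
            if ratio >= near_threshold:
                label = "near"
        seen.append((paragraph, norm))
        labels.append(label)
    return labels

PARAGRAPHS = [
    "Turn the valve clockwise until it stops.",
    "Turn the valve clockwise until it stops.",
    "Turn the valve clockwise, until it stops!",
    "Store the unit in a dry place.",
]
labels = classify(PARAGRAPHS)
print(labels)
```

Comparing every paragraph against every earlier one is quadratic, so a sketch like this is only practical for small samples; a collection of 50,000 or 100,000 pages would need indexing or hashing to narrow the candidates first.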
And this report, which you can't see but I think the slides will be made available so you can look at them more carefully, or we can certainly provide information, more information, this shows you that there are certain items that are nearly identical to each other so, that first, we call the match group, will show you that there are, the top, the top over here shows you the composite paragraph and all the things that are near to each other, that if you correct, if you change them, you would have those all match together. And then shows each of the variations; it's a little hard to see, but it, you know, it says "if you are a 'non-computerist'," which is a word that's used in some parts of the world, but 'if you're not a 'computer user'" is another way of doing it. There are different ways of looking at that same information. You can see exactly what it is, and you can make a decision on whether you want to maintain separate collections over here, or you want to make them all the same. Sometimes you find errors over here that may have existed in the documents set forever. Right, so.
We once did a, we once did a, an analysis on a set of, of engine manuals for airplanes and there were similar models across the collection. And in 19 out of 20 cases it was "turn this clockwise," and on one of them, it was "counterclockwise," which was obviously an error, and had been in the collection for 20 years. So that's, that's sort of a side thing that happens when you clean up your, your materials, and also this kind of analysis lets you find the things that are really variables in your content. So, it might be the same paragraph but has a different model number in it. You would find that in this kind of analysis, so that when you design your system, you can, you can, if you're in DITA, or if you're in S1000D or in many other architectures, you can have variables there, so you can have the same paragraph, but a different, but just change that one word or that one title, and this is a way to find all those opportunities. All these opportunities are tools that lets you simplify your collections so that instead of handling 100,000 pages, maybe you're handling 50,000 pages or 30,000 pages.
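Finding the places where a paragraph is the same except for a model number can be sketched as a word-level comparison. The Python below is a minimal illustration under stated assumptions: it handles only the case where two paragraphs differ in exactly one run of words, and the function name `find_variable` and the `{model}` placeholder are hypothetical.

```python
import difflib

def find_variable(paragraph_a, paragraph_b):
    """If two paragraphs differ in exactly one run of words, return
    (template, value_a, value_b); otherwise return None."""
    words_a, words_b = paragraph_a.split(), paragraph_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    changes = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if len(changes) != 1 or changes[0][0] != "replace":
        return None
    _, a1, a2, b1, b2 = changes[0]
    # Replace the differing words with a placeholder to form a shared template.
    template = " ".join(words_a[:a1] + ["{model}"] + words_a[a2:])
    return template, " ".join(words_a[a1:a2]), " ".join(words_b[b1:b2])

result = find_variable(
    "The X100 pump requires weekly inspection.",
    "The X200 pump requires weekly inspection.",
)
print(result)
```

The template is the single paragraph that would be maintained, and the two values are what a variable mechanism in an architecture such as DITA or S1000D would substitute for each product.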
So I just want to talk a little bit about information models. The, and, and when you're doing information, information models are really telling you how your information is structured together. So, for example, in a technical manual you might have steps in a procedure, and the steps go together and in a certain order. That's, that's an information model. Of course it gets much more complicated because there are lots of other things in there, but it's basically how you think of documents and how they come together, so that when you pull together a document they'll all come together correctly. So going back to DITA for a minute, the, it lets you, the, one, one of the things you're using it, you're using it for is that you, you're able to take reusable chunks of data that you're combining, going to combine together and use over and over again in different places.
So if you have a warning, maybe you've got the same phrasing for a warning that you're going to pull into many different places. How do you know how to pull those things in? There is, I would call it, a table of contents that pulls everything together; it's called different things in different collections and different architectures, but how you pull things together is part of that information model. There's the metadata that you want to collect; that's part of your data model: what kind of structure it is, where it fits into the collection, who wrote it, when it was vetted, when it was revised. It's all information about the content. It's not the content itself, it's information about the content. And depending on what you're doing, you may have more or less of that kind of information involved.
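The idea of a warning written once and pulled into many documents through a table-of-contents structure can be sketched as follows. This is a simplified Python illustration, not DITA itself; the `Topic` class, the topic ids, and the `MAPS` dictionary are hypothetical names, and the metadata fields are just the examples mentioned in the talk.

```python
from dataclasses import dataclass

@dataclass
class Topic:
    topic_id: str
    body: str
    author: str    # metadata: who wrote it
    revised: str   # metadata: when it was last revised (ISO date)

# Hypothetical content store; the warning is written once and reused.
TOPICS = {
    "warn-power": Topic("warn-power", "WARNING: Disconnect power before servicing.",
                        "A. Writer", "2020-03-01"),
    "pump-service": Topic("pump-service", "Remove the pump cover and inspect the seal.",
                          "B. Writer", "2020-05-12"),
    "fan-service": Topic("fan-service", "Remove the fan guard and clean the blades.",
                         "B. Writer", "2020-06-02"),
}

# Each map is an ordered list of topic ids, playing the table-of-contents role.
MAPS = {
    "pump-manual": ["warn-power", "pump-service"],
    "fan-manual": ["warn-power", "fan-service"],
}

def assemble(map_name):
    """Build a document by pulling topic bodies in map order."""
    return "\n".join(TOPICS[topic_id].body for topic_id in MAPS[map_name])

pump_manual = assemble("pump-manual")
fan_manual = assemble("fan-manual")
print(pump_manual)
```

Because both manuals reference the same `warn-power` topic, a change to the warning is made once and shows up in every document that pulls it in.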
If you're, if you're collecting, if you're, if you're doing scientific articles, which is a whole different, right, and using the JATS structure, well your, your metadata might be different, your metadata there might be who's the author. Who is this, who's the, what affiliation does that author have? It might be all the images that you have in there and what, you know, what kind of licensing is involved on that particular image; when can you use it? So there's a lot of information collected about it, which is important; different industries, different collections have different information that's needed.
So it's very important, when you're pulling this together, to get feedback from the user community about what's in the information model. There's internal and external use of your content: how are you going to be handling things inside the internal processes for authoring, for collecting it, for pulling it together, and then how are people outside going to be using it? Some of that metadata might be saying this is available only inside the company, not outside the company, or this is only licensed for this particular use, not outside use.
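As a small illustration of metadata controlling internal versus external use, the Python sketch below filters content units by an audience field. The field name `audience`, its two values, and the function name `publishable` are assumptions made for the example rather than part of any particular standard.

```python
# Hypothetical content units with an "audience" metadata field.
UNITS = [
    {"id": "overview", "audience": "external", "text": "Product overview."},
    {"id": "pricing-notes", "audience": "internal", "text": "Internal pricing guidance."},
    {"id": "install", "audience": "external", "text": "Installation steps."},
]

def publishable(units, reader):
    """Internal readers see everything; external readers see only external units."""
    if reader == "internal":
        return [unit["id"] for unit in units]
    return [unit["id"] for unit in units if unit["audience"] == "external"]

external_view = publishable(UNITS, "external")
internal_view = publishable(UNITS, "internal")
print(external_view)
```

The content itself is unchanged; only the metadata decides which units appear in a given delivery.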
And it's important to have this information model before you choose the tools that you're going to be pulling in, to make sure that those tools will be able to handle your specific needs. Or you can decide that some of those specific needs are not so needed after all. So it goes both ways. Don't forget about author input; authors know your information better than anybody else. They can inform the workflow, they can inform the metadata, and they can answer questions so that you don't have to research them.
And again, don't overlook your legacy content. I think I may have said some of this already, but this is a nice target picture of how to think about what's out there. The content unit is the content itself, and that's the center; that's what people usually think about. The information types are what you're going to be using to start: what kind of information is it, is it internal, is it technical, is it descriptive, can you use it in marketing? All those kinds of things. And then surrounding that is information on reusability. Once you start using reusable content, the authors will need to think about that; they won't say Look at the image below. They will say Look at the image in something that can be defined elsewhere.
And much of your content is already like that when you do the reuse analysis. Other things will need to be fixed. And that part is part of what can be defined over there, and then the linking and how it comes together. The, you know, what images belong with which piece of content, and what tables belong with which kind of content. You know, what tables that might have been in the back of a book someplace should really fit together with the content now, and the mapping is what I was calling table of contents, how does this information all map together?
And that's in a nutshell, or target, is, is an information model, without getting too much into the technical details, and I know I always simplify, I always get accused of oversimplifying, but I think this is a pretty good picture of what it is. And I just want to finish with, you know, don't underestimate, again, and we talked a little bit before about don't underestimate how much manual work is involved in all this, and I, and I said before, I'm a big proponent of automation where you possibly can, and just, this is a workflow for, a particular, particular workflow that we use internally, but I think it's fairly typical of what we might be doing and segregating what can be automated and what can not be automated.
So there is a piece upfront that really, I think, needs to be human: how content is delivered to us and what we do with it. There's quite a bit of vetting and prioritizing of the materials that are coming in and making sure they're correct at the beginning. We have many ongoing clients where this is also an automated step, because we've set it up so that there will be ongoing streams of information. We talk about ourselves as Data Conversion Laboratory, but really more than half our work, probably about 70% of our work, is ongoing material that's coming in from all over the world and gets converted automatically; that would be fed in as a stream coming automatically. But this is talking about a conversion of your legacy content: how would that happen? There's an ingestion point, which is an automated process to take it in and log it in and pull it together, all those things you might do manually, but if you do it manually it's labor-intensive and leads to a lot of errors. The source conversion is automated, and there are a lot of automated cleanup utilities that we put in over there to make sure things are right upfront. Then it usually goes to a manual step, and if you look at this, this is really the only manual process here: once it's come out of the automated steps, there's a vetting process to make sure everything looks okay.
And there are some things that computers can't figure out because, like I always say, never underestimate the creativity of the content authors when they were trying to put things together and get them to fit in a certain way or work in a certain way. So some things are not going to have worked correctly, and fixing them is a step that humans usually have to do. Following that there's an automated quality control, which is checking things like: is everything tagged, is everything linked together, all those kinds of things. At that point, conversion to XML and validation would be an automated step, and then it gets converted to the specific format that a client is going to need, and off it goes.
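The split between automated and manual steps in a workflow like this can be sketched as a small pipeline. The Python below is a simplified illustration, not the workflow on the slide: every function name is hypothetical, the "conversion" is a stand-in, and it assumes that only items flagged by an automated step are routed to people, whereas the workflow described above puts a manual vetting pass after the automated steps.

```python
def ingest(raw):
    """Automated: log the item and normalize whitespace."""
    return {"id": raw["id"], "text": " ".join(raw["text"].split()), "flags": []}

def convert(item):
    """Automated: stand-in for source conversion; flags what it cannot interpret."""
    if "???" in item["text"]:
        item["flags"].append("unrecognized construct")
    return item

def automated_qc(item):
    """Automated: check that something survived conversion."""
    if not item["text"]:
        item["flags"].append("empty content")
    return item

def run_pipeline(raw_items):
    """Run the automated steps; route flagged items to a manual review queue."""
    done, manual_review = [], []
    for raw in raw_items:
        item = automated_qc(convert(ingest(raw)))
        (manual_review if item["flags"] else done).append(item["id"])
    return done, manual_review

done, manual_review = run_pipeline([
    {"id": "a", "text": "Normal   paragraph."},
    {"id": "b", "text": "Table ??? layout"},
    {"id": "c", "text": "   "},
])
print(done, manual_review)
```

Keeping each step as its own function makes it easy to see which parts run unattended and where human attention is actually spent.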
So the point of this is that a lot of the steps, if you segregate them, can be automated. There are some steps that are going to be manual; it's different in every case and for every kind of material, but the focus is on getting 70, 80% of it done in an automated manner. So, moving to digital transformation: it's not just a chore. It is a chore, there's a lot of work involved a lot of times, but it's also an opportunity to improve the quality of the content that you do have and make some decisions on what you need and what you don't need.
It's an opportunity to reduce all that duplicated content, which will pay dividends as you go, you know, in the future, since you're going to need to maintain less, you need to vet less, legal review less, translate less, and also you have better control of what's there so that you know exactly where everything is, and one thing about, about organized data and getting content in such a way it's going to prepare you for the next data evolution. I don't for a minute think that this is the last time that data will need to be looked at, and there'll be new technologies that will do different things.
But setting up this way will prepare you so that if you need to make changes later on or evolve later on it'll be much, much easier, and meanwhile you get all the benefits of your new, improved data collection. And thank you very much.