
DCL Learning Series

Harmonization of Content: Before, During, After Migration


Patrick Bosek

Hey, everyone. Welcome to WinWithContent by Heretto. That's where we talk about big wins in content projects with experts or people who have been on these projects. I'm Patrick Bosek and today with me is Mark Gross from Data Conversion Laboratory, sometimes known as DCL. Welcome, Mark, glad to have you. How are you today, Mark?


Mark Gross

Thank you, Pat. It's great to see you. 


Patrick Bosek

It's great to see you too. I gotta say, you have a very interesting background there. It's like you are in a library. Did you build that yourself, or...? 


Mark Gross

This is, you know, when, when, I didn't realize, when I built – this is my study, and I built it about ten years ago – didn't realize I was building a set for all these meetings that I'm going to be doing in video. What – it's ironic; maybe I spend 95% of my time in digital, but I love books and I and I have a pretty extensive library. 


Patrick Bosek

That's awesome. That's a, well, it's a fantastic background and it seems like it's probably served you well through these very difficult times that we've been in, with all the Zoom and everything going digital. 


Mark Gross

Yep.


Patrick Bosek

Cool. So we've got a really interesting case study today that we want to talk about. It's a big project that DCL worked on a few years back. So you want to just kind of give us the high level of who you are, who DCL is, what you guys work on, before we jump into the, talking about the situation? 


Mark Gross

Sure. So we're Data Conversion Laboratory, where our business is converting content and it's really taking unstructured information and turning it into structured information, which is really the, we've been doing this for over 40 years, well before XML and before DITA, and before any of these structuring technologies. But there were things back then also that needed to be done and that's what we, and as industry developed and as the world advanced in terms of data, that's really what we've been doing. And when we started out, people said "Gee, what are you gonna do when everything is converted the next year or two?" But we all know that's not happened, because there's more and more and more content every day, and we used to talk about megabytes, and now we talk about terabytes, and structured information is more important than ever, and that's why we have systems like Heretto and content management systems because there's just too much to handle otherwise. So that's what we do. And there are a lot of tools for working with that, and that's somewhat what we'll be talking about today. 


Patrick Bosek

Yeah. I was wondering if I'm one of the people who's asked you, many years back, what you were gonna do when everything was converted. But probably not. So, all right. So before we jump into this situation today, I just want to remind everyone that if you want to, you can ask us a question at any time. There's a button that says "Ask a question"; go ahead and click that. We can't write you back, but we can read your questions and answer them live on the show. So we do appreciate that. It's always a lot of fun. I'll also just remind everybody that this show is recorded. So after the fact we encourage you to use the same link that you came here to come and see us to share this with your friends, colleagues, family, coworkers, whoever you think might be interested in the topic that we have today. Okay, great. So with that, let's actually jump in and talk a little bit about the situation that you guys were brought in to help with. We got a slide here, if it'll load for us. There we go. All right. Yeah, so you want to give us an overview of what you were brought in, what the situation was when you guys started working with this organization?


4:05

Mark Gross

Sure. So this I think this is about a year and a half ago that we started working on this project and it's not un-typical. I mean, this was a manufacturing company. The problem they have is their documentation was InDesign, they had PDF documents, and they, it was getting more and more and more unwieldy. Not an uncommon situation because as you build new products and you add new products, you add more manuals, you have more, more content. And today people are moving away from paper, from paper content, or from PDF documents. So, it's like some of my background may be obsolete in the commercial world, but, and it was taking longer and longer to produce all this. You produce a new product, you gotta update it, you've got to maintain it, and you got to keep going with that. 


And I think the biggest game changer has been translating the languages, because now, instead of just dealing with one set of documents, you suddenly have to deal with multiple documents, and every time something changes, you gotta, and pretty soon you're dealing with not, the ten documents you started with, but a hundred documents, and 200 documents, then you multiply it by languages, you've got thousands of documents. And as we become a more global world, many, many organizations are faced with producing documents in multiple languages. So all that led to a situation where it's becoming unwieldy. The costs of producing documents are going high and the costs of, of translating, were going high and everything else was going high. So they were looking for a solution where they can, that uses "build once, use many" so they can produce many different versions of documents. 


Patrick Bosek

Yeah, so this is –


Mark Gross

Basic, I think overall here.


Patrick Bosek

This is something I think that we've both run into just a number of times over the course of our career, right? So you have an organization that's still doing everything by hand, like that is the definition of InDesign, right? So if you're, if you're producing your output in something like InDesign, where you're laying it out, you have a direct linear equation to a number of people, to the amount of content you can put out. And you come in and this isn't scalable anymore, right? 


Mark Gross

Right. It doesn't scale because you've got everything, you've got to do everything one at a time, and every change means you've got to change it in dozens of places. So that just, just balloons as you, as you move along. And it's all that luggage you're carrying from year to year to year, just gets larger and larger. It's a bigger, bigger pile of information that you've constantly been dealing with. So it comes down to: how can you scale that? And how can you, how can you set up something that if you change it in one place, it'll, all the change, it'll, it'll ricochet through the entire document set so that it's all there, all the time. And timely, right? Nobody wants to wait a year for their, for their document.


Patrick Bosek

Yeah. And the other thing I think is amazing about this, that it still exists, is that InDesign is one experience, right? So like, when you create something in InDesign or something like that, and Word, anything that's hard copy like that, that's the only experience you get. And, the average consumer at, what is it, it's like, it's like 5 to 7 different channels that they'll check or they'll interact with in the process of buying or using something through a life cycle as a product today. So it's just, you are by definition unable to accomplish your goals. So this is the classic case, right? You come in, traditional desktop publishing process, there's just cost overruns, really painful manual process. So we're really just looking at the really traditional business situation that, it's got people starting to think about structured content. That sound about right? 


8:09

Mark Gross

That sounds right. I think every manufacturing company has that issue. It grows, grows across all kinds of industry groupings. What we're talking about here today is a, is manufacturing companies, but everybody who's got lots of content is, is doing that. And I think we all know that our research, and where we're looking for information, is on the web, and that's where we start getting information, and you expect it to be available on your handheld device, you expect to have it in lots of different places. So that, if you don't, if you don't have the information on those places, you're, you're missing opportunities, both in making sales, but perhaps more importantly in customer service, that you're not able to answer people's questions and nobody's, nobody's waiting 48 or 72 hours for an answer these days. Everybody, everybody's expecting things to be there right away.


Patrick Bosek

You've seen the research that has come out from, like, customer effort score and customer experience as people relate it, as it relates to brand, brand loyalty. And these are the biggest determining factors of brand loyalty today. And like you said, like people who have to wait, they're not loyal, that's just how things work. So let's, let's move on to the solution because I think everybody knows where the solution's going here, but there's some really interesting aspects of it because of the technology you guys employed and the way that you guys go about this. So I really, I want to give you the broad solution, but then I want to talk a little bit more about Harmonizer. So, talk us through this.


Mark Gross

Okay, so I mean, so yeah, so the broad solution's: how do you convert information from, into DITA? How do you take all your documents, and so, DITA by itself, DITA is, is a component management process. You want to take all the little bits of information you've got in your, in your document and, and establish them in a way that you can reuse them in many different ways. Instead of having a 50-page manual, you end up with a hundred or 200 segments that you're going to be using. So that way you can interrelate and collect them in different ways. And also, you can, you can move around. That's what a system like Heretto allows you to do. The, so that is what we do. We do all that, all that stuff of taking things apart and turning them into DITA, into DITA components.


The, I think the specific thing we were gonna talk about today is, is one thing, one thing you gain by doing, besides having the components, is that you can reuse your information. If the same paragraph or two paragraphs of information appears in a hundred different places, you only have to have it once and it will appear automatically in those different places, which means not only do you save and manage your information, you – it's a lot of information you need to deal with – but also, also, if you have one paragraph to change, and that change will, will flow across your entire stream of everything you're doing. So that's really what you'll want to do. 


But finding all the, all that information in your document set where, that is reusable, used to be a really burdensome process, and people would sit there with their pages spread out all over the conference table with lots of little yellow stickies and what things are there. And so, what, what we provide, what we, what we built is just to help us get through this and help our customers get through. This is a, is a product called Harmonizer which goes through an entire dataset and identifies all the, all the sections that are identical to each other. That's the easy one.


12:08

But it also identifies the things that are similar to each other. So, as we all know, as you, if you're, if you're building things, if you're building documents one at a time, even if you're copying and pasting, you're making little changes as you go along. So just looking for identical things is not going to find all those things. That little comma that got changed to a semicolon, and a word or two that got changed, or if it got changed for a British version of something and now using a different word in that paragraph, all those things would not be found.


So Harmonizer was built to find all those similar sections and build reports that identify where exactly in the document does something exist. This paragraph exists 38 times and here's where they exist. This appears in those places, and that provides a tool that lets you work on these documents in advance. So that's what, that's what Harmonizer does. I mean, it lets you remove duplicates, which is very nice. The other thing is that there are other effects that we didn't even realize when we first built this. It lets you find typos and misprints because sometimes a word got changed and it's the wrong word. And so if you have twelve, a paragraph appearing twelve times, and eleven times it's one way and the twelfth time it's another way, there's probably wrong with that, something wrong with that twelfth paragraph.


So this is, this is a process that helps us do that, and also, by categorizing this information this way, it gives us something that are, that, especially in a manufacturing environment, you're building sophisticated, complicated equipment, you need your subject matter experts to be able to look through things. This simplifies the process. They don't have to look at hundreds of thousands of different places. They have to look at the one item that, the one version of it that they need to be alluding. So these are, it's a tool that lets you, that lets you build the process more, more quickly. And the nice thing about it is it's something you can do in parallel to everything else as you're, you're installing systems, you're implementing systems, there's a lot going on. This, this piece of it can be done independently of everything else. You can get a head start. So when you're ready to implement, everything goes in at once.
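The identical-versus-similar distinction described above can be illustrated with a short Python sketch. This is not how Harmonizer works internally (the talk does not say); it is a minimal illustration that swaps in the standard library's `difflib.SequenceMatcher` as the similarity measure. The function names, the sample paragraphs, the location labels, and the 0.9 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def find_matches(paragraphs, threshold=0.9):
    """Return (location_a, location_b, kind, score) for matching paragraph pairs.

    paragraphs: list of (location, text) tuples; location is any label.
    threshold: hypothetical similarity cutoff, not a value from the talk.
    """
    normalized = [normalize(text) for _, text in paragraphs]
    matches = []
    # Pairwise comparison is quadratic; fine for a demo, not for 100,000 pages.
    for i in range(len(normalized)):
        for j in range(i + 1, len(normalized)):
            loc_a, loc_b = paragraphs[i][0], paragraphs[j][0]
            if normalized[i] == normalized[j]:
                matches.append((loc_a, loc_b, "identical", 1.0))
                continue
            score = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
            if score >= threshold:
                matches.append((loc_a, loc_b, "similar", round(score, 3)))
    return matches


# Made-up sample data: one exact repeat and one copy with changed punctuation.
docs = [
    ("manual-a.p3", "Disconnect the power cord before servicing the unit."),
    ("manual-b.p7", "Disconnect the power cord before servicing the unit."),
    ("manual-c.p2", "Disconnect the power cord before servicing the unit;"),
    ("manual-c.p9", "Store the device in a cool, dry place."),
]
for match in find_matches(docs):
    print(match)
```

In this toy data the exact repeat is reported as identical, and the copy whose final period became a semicolon is reported as similar to both of the others, which is the kind of small drift the "eleven one way, the twelfth another way" example refers to.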


Patrick Bosek

So, there's a lot here to unpack, Mark; let's, let's see if I can, if I can pull a couple of things out of there; we can always highlight them. So you guys found that the solution was gonna be implementing structured content, in this case using the DITA standard, and adding the right level of metadata to it with taxonomy. So that's where, that was where you wanted to be. Right? 


Mark Gross

Yeah.


Patrick Bosek

Yeah. And that was gonna be what you needed to really drive the next set of efficiencies and the ROI, what we're gonna talk about a minute here, which was great, so you tracked all that stuff. But in getting there, you realize that there was going to be a lot of work to find where there was similar content. You really had to upgrade the content, right? So you had to be able to take this content and pull out the sections that were gonna be similar and link them. Structure and properly harmonize them, really. I'm assuming that's where the name of the product comes from. It's not a big jump, right? So, you guys did this, you came in, you took the source of content, you identified where it was going to go, and then you ran this process that was able to really standardize and normalize the content but also, in the process, upgraded by picking out the similar sections.


16:09

Mark Gross

Correct. Right. I mean it's, it's exactly what it is. I don't know if I have to re-word much of what you said. That's exactly what, what happened over here. And the process is one where Harmonizer can take the whole stack of doc- okay, or 100,000 pages, and just process it once and provide all those reports, and also provides that, this ROI analysis. It can show you that 50% of your paragraphs in your documents are identical to each other. So right away it lets you figure, you have a good sense of where your ROI is going to be, if you can reduce the amount of content you have to manage in half or even more than that. That's, that should be helpful in understanding how much money their company's gonna save over time.
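To make the reuse number concrete, here is a minimal Python sketch of how a share of repeated paragraphs could be computed from a paragraph list. It counts exact repeats only (after lowercasing and collapsing whitespace) and is not the ROI analysis Harmonizer produces; the function name, the dictionary keys, and the sample paragraphs are assumptions for illustration.

```python
from collections import Counter


def reuse_summary(paragraphs):
    """Summarize how much of a paragraph list is exact repetition."""
    # Normalize case and whitespace so trivially different copies count as one.
    counts = Counter(" ".join(p.lower().split()) for p in paragraphs)
    total = sum(counts.values())
    unique = len(counts)
    redundant = total - unique  # copies beyond the first of each paragraph
    return {
        "total": total,
        "unique": unique,
        "redundant": redundant,
        "redundant_share": redundant / total if total else 0.0,
    }


# Made-up sample data: three copies of one paragraph plus one unique paragraph.
sample = [
    "Read all safety instructions before use.",
    "Read all safety instructions before use.",
    "read all  safety instructions before use.",
    "Tighten the four mounting bolts.",
]
print(reuse_summary(sample))
```

For this sample, two of the four paragraphs are redundant copies, so the redundant share is 0.5. Run against an organization's own content rather than an industry average, a figure like this is one input to the reuse side of an ROI estimate.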


Patrick Bosek

So as a component of the solution, you're actually able to get to a reuse number, which is a major component of an ROI. And then analysis happens automatically through this process. 


Mark Gross

Right. Right.


Patrick Bosek

That's really interesting because that's something that a lot of organizations struggle with, is that when they get into this and they're doing their ROIs, they really find themselves having a hard time putting the actual numbers in place until they get way down the road, right, and when you're looking at annual budgeting cycles or quarterly reporting, not being able to nail down an aspect of ROI quickly is often really, really challenging, so being able to do that, I think, is really helpful and that's a really important point.


Mark Gross

Right. And it lets you do it against their actual content rather than using global numbers and that's where it really matters. 


Patrick Bosek

Got it. Great. So what did, what did this organization have to do to get ready for you guys to help them implement this solution? 


Mark Gross

So, I mean in this case, I mean once they get these reports back, I mean, it's, this gives them a tool that they can go and clean out their content and figure out, it's like that moving analogy: when you're moving stuff from one system to another, you don't want to move everything over; you want to be able to cut it down and be able to work with. So they start early enough in the process, they can, they can go and clean up their content before they do that. So they can go in and wherever things are, are similar to each other, they can go in and decide if that similarity is critical, if it's really similar or they really need different versions of it, and that gives them the ability to figure out what they need to design, what they need to do in their DITA conversion to be able to produce what they need. So it's a tool that really lets them start really early on and clean up their content and say "This is what we're going to end up with," and then the rest of it can be done in a more automated manner. 


Patrick Bosek

So what is the company, this company is thinking of going through this process, how much time should they be budgeting for curating the content? Like, what is the allocation to that going to be as a part of the solution? 


Mark Gross

Right, so it depends on how much material they have, but, but my sense is if, let's say a company has 100,000 pages of material, they really should be allotting, they really should start as early as possible and ideally allow something like six months to curate their material. I have to say that most companies don't provide that much and then you sort of have to finesse it a little bit. But my concern always is that if they, if it's an afterthought and they're not doing anything until the very end, they end up doing the conversion.


19:59

It'll work, but they're not, they're not going to get all the benefits of the, of, of the reuse, and many times they don't get their benefits of reuse. They get all the other benefits of, of having everything in components. So, so I think giving themselves time, and I think for the 100,000-page kind of content, I think a six-month, six-month time to curate is the right kind of time.


Patrick Bosek

Is it, is that FTE for six months or is it, I mean like, what's the level of, like, effort, do you think, there? 


Mark Gross

So, so I mean once we produce, for the Harmonizer piece, once we produce it, which really gets done over, over a couple of weeks, then they have the material they need to be able to work with it, and then we can start converting over time and make it a more leisurely process. I mean frequently a whole conversion is done in two to three months. So it's, you know, and then it becomes more of a process. So we want the ideal. It shouldn't be an afterthought because then you sort of don't get the maximum benefits that you'd like to have. 


Patrick Bosek

Yeah, I got it. Okay, so we want to move on to the, on to the results here because these are great results, but we do have a question that I want to grab from the audience before we move on. So, somebody wants to know: this time around, this organization seems like they had mostly InDesign, but what if you've got a bunch of different types of contents? We've got some stuff in HTML and some stuff in this, some stuff in that, InDesign, et cetera, et cetera. What kind of complexity does that create in the process of going through and standardizing everything, especially as it relates to the conversion?


Mark Gross

So, so in terms of identifying, in terms of identifying the materials, it doesn't add complexity at all, really. I mean, our process handles all kinds of XML content, Word content, InDesign content, even PDF content. With PDF we, we usually do some pre-processing because PDF is not really a print for– it's not really – it's really a print format rather than a data format, so not everything is, pulls out very well, so we do that, but otherwise almost everything will – can be handled in the Harmonizer process and provide all that information. So there's a little more complexity, but it's all designed to do that, and we do that all the time. And people typically have two or three or four different formats that they're being, that they're using.

 

Patrick Bosek

So, pretty standard, but it's probably gonna make it just a little bit more complex.


Mark Gross

Yeah.


Patrick Bosek

Okay, great. Yeah, let's talk about how this turned out. The – so here's the, here's "The Win"; talk through this.


Mark Gross

Okay, so, so actually it's, I thought this company was very, very impressive and that they really collected a lot of information because they really wanted to have some good, some good post-analysis on all this. But they, they, what, what they did is, we were able to increase tremendously the amount of work they were doing. And at the same time reduced their spend, which is the kind of, the kind of results you want to be able to show.


Patrick Bosek

They actually doubled their volume. 


Mark Gross

They, right, they just about doubled their volume in projects completed. They went from something like 60 or 70 projects they were doing to something like a hundred, 120, 130 projects. So over a period of that year and a half, I guess two years, they were able to increase, double, double the work they were doing.


24:07

The, one of the ROIs, one of the claims they made in their ROI is that they were able to almost completely reduce their desktop publishing costs because they were going into this kind of process that would automatically do the publishing. And in fact they, that's what happened, a 94% decrease. So they went from a fairly significant amount of desktop publishing they were, of course they were doing, to virtually eliminating it. And what they had left, things that really needed to be done in a particular way, which proves the case. I mean a lot of people are always thinking in terms of: how can I automatically publish things? It won't look exactly the way I want it. I won't, I won't have an art director moving an image to the top corner instead of the bottom corner. And the reality is that you don't really need that for most kinds of publications; the automation works really well. There may be some, some leftover sheets that have to be in a particular way, but otherwise it's not really needed.


Patrick Bosek

That's a really common result, is that desktop publishing is either, like, eliminated or virtually eliminated. And I think that's a really important thing to note on that too, is that a lot of organizations that are coming from, what's kind of a print or a digital print type of a background, when they get to structured content, part of that process is actually moving to web delivery. So they're going to a place where, instead of having the PDF for the printed PDF be the primary way that people are accessing their content, the primary way is now through a website, or through an app, or through their software, or whatever it might be. And the reality is that desktop publishing is designed to make print look pretty, and if print isn't the primary thing you're doing anymore, you don't need as much of it. So it naturally fades into the background. And your users don't want print anyway. Your users want web. They want in-app. They want digital, content that is digitally closer and more available than a PDF.


Mark Gross

Right. I mean, I think we see that in our everyday lives by the car, the car I bought right before Covid has – the big manual is not really available anymore. It's a small manual and everything else is online on the computer screen in the car. The only way that's possible is that you've gone to this kind of component system where you're able to do those kinds of things. So, and the other thing we said about it is their, their spend on translation stuff has gone down dramatically. It's gone down by 40% or something like that, even though they're doing more translation, they're, they're doing more products and more projects. But since there's so, there's less content that needs to be translated each time because they've already done it before, their, their spend on translation has gone down dramatically. So they're really happy and it's, and they've got the numbers to prove it.


Patrick Bosek

Yeah, I think that that's a fantastic result. And the thing that I really like about, about these projects: the savings on translation costs, like, it's really, it's, it's great, like it's always it's easy to make an ROI out of it, but I really feel like the savings on translation costs are what pay for the ability to do the big fun things. And I always, whenever we talk about translation costs, I always want to remind people that, like, it's just the thing you use to pay for the ability to build a truly modern digital-first customer experience, because that, in my mind, that's what's compelling for your customer, that's what's compelling for your employees, and I don't want to ever lose sight of that.


28:04

So we've got some questions that have come in. I want to grab a couple of these for you. So the first one, I'm gonna read it here, it says, I may have missed this when you're touting the benefits of Harmonizer, as you did very well. Does – "as you did very well" was me, by the way – does Harmonizer analyze graphics to identify screenshots that are used in multiple source files? That's a great question. I'm actually really curious to hear the answer about this. 


Mark Gross

Okay, so Harmonizer itself works with text, so it doesn't do those, those screens, those screenshots. So, I, I suppose you could OCR it and then have text and then you can work with it, but that's not, that's not what it's doing. We've done work with comparing of image analysis, which might be transferable to this, but it was more to identify plagiarism and stuff like that in, in technical documents. So we can identify images that are similar to each other, but we're not really that, I think the scope of, of Harmonizer the way we have it constructed now is to deal with text. Very large volumes of text.


Patrick Bosek

So I'll add something in here that I think is interesting. So, one of the things that you can do when you have your images in a CCMS, like Heretto, is that they'll be checksummed, which means that the image itself is run through a filter that gives a number out the other side that says this image is this. So if your images are exactly the same, it's exactly the same screenshot, that checksum will be exactly the same. And you can run a search to see, you know, do I have the same image uploaded in a bunch of places, by the fact that metadata is the same in saying it has the same checksum. So if that's the case you're solving for, there is a way to solve for that that I think is pretty easy, frankly. 
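The checksum idea can be sketched outside any particular CCMS. The Python below is a minimal sketch that assumes the images sit in a local folder and uses SHA-256 from the standard library; it does not reflect how Heretto computes or stores checksums, and the function names and the list of file extensions are assumptions for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large images are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_images(folder: Path, suffixes=(".png", ".jpg", ".jpeg", ".gif")):
    """Group image files under folder by checksum; keep groups of two or more."""
    groups = defaultdict(list)
    for path in sorted(folder.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            groups[file_checksum(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]


# Example call (the folder name is hypothetical):
# for group in find_duplicate_images(Path("exported-images")):
#     print([str(p) for p in group])
```

A checksum only matches byte-for-byte identical files. The same screenshot re-saved at a different size or compression level produces a different checksum, so this finds exact re-uploads and not visually similar images.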


Mark Gross

Right. Right. 


Patrick Bosek

So we're running over time here, but we've got a, we've had another question which is more of a comment, that I want to grab. It said: our education department has a website, but all the content is in PDF files, which I think we've all seen before, and is not ideal, as a creator of that website or a user of that website. So I think the underlying question here for this, this user is, you know, what if I want a better experience? What, what's the path forward on that?


Mark Gross

So, the obvious path forward is to move it all into, into components and into a DITA system. I, I think it's very difficult to deal with while it's still in PDF; I think it needs to be moved out of, out of PDF to be able to work with it, and that is what we do. I mean, we do literally millions of pages of materials in PDF that moves into some format of XML. And so, and I think that's the way, that would be the way to go forward if the budget is there to do it, obviously. It's, I think while it's in PDF it's very difficult to, to actually work with it, although if there's, depending on what kind of PDF it is, it's probably things that, that can be done. I mean, I think it gets into a whole complicated, it's probably another, a whole other segment of a show like this to talk about, what do you do with PDF files if you don't have all your budget yet to move everything over?


Patrick Bosek

Yeah, I hear you. The, I think some people have to move them over in pieces. Some people have to prioritize the things that are the best for the customer experience. And, you know, the inverse of what we were saying a second ago about the localization costs paying for the good stuff? The inverse of that is, is also a little bit true, is that sometimes it's harder to do a back-of-the-napkin ROI, the hard one without localization costs. I think it holds a lot of organizations back. So what you have to do if you don't have the hard ROI from the localization costs is you have to go and get your ROI from customer retention, customer experience, and then support costs. So here's the thing that's really crazy, is that the cost for solving a problem with a, with a help site, it's like 1500% less than with a phone call. So there's, there's another way to get to your ROI.


But we're running up on our time. We try to keep this really short and tidy and get everybody back to their day. Not take your entire lunch hour, if you're on the east coast, I guess, that is. So I want to thank you for being here, Mark. It was really a pleasure having you and learning more about Data Conversion Laboratory and this really cool project that you guys did. I want to thank everybody else for attending and joining us today. Please leave us a rating, leave us any other questions. We love feedback, comments, anything you can provide us. So I'm Patrick Bosek for Heretto and WinWithContent. Thanks again for being here, and we'll talk to you all next time.


Mark Gross

Thank you. Okay. Bye-bye.


