DCL Learning Series
How Healthy is Your Structured Content? Diagnose Content Reuse Issues with Harmonizer
Hi. Welcome to our webinar; this is CIDM. We have Chris Hill with us today on "How Healthy is Your Structured Content? Diagnose Content Reuse Issues with Harmonizer." Just a little bit about ComTech and CIDM: ComTech has been helping clients produce better technical content since 1978. Whether you're looking for consulting services or you need to train your team, ComTech can help. We can help with a number of things: content strategy, information modeling, taxonomy development, development process maturity assessments, user studies, benchmark studies, and DITA implementation. You can contact us at the address on our slide, comtech-serv.com.
Also, for CIDM, a membership organization facilitating the sharing of information about current trends, best practices, and developments within the information development industry. We do have networking opportunities, manage roundtable discussions, member Slack workspaces, face-to-face and online conferences, and we have a best-practice newsletter we publish quarterly, as well as the bi-week– bi-monthly "CIDM Matters" newsletter. So, we also have different membership levels to fit your specific needs. So, on the next slide we have Best Practices conference; this is coming up in September, 19th through 21st; it's in Baltimore. We welcome you to join us; it should be a very fun, interactive workshop style, and talking about a seat at the table, to get that from a manager perspective and how we can get everybody to join in the conversation. So, the upcoming workshops. We have a number of them: minimalism, editing essentials, publishing for DITA, and you can sign up for those, and our webinar coming up is from our survey, August 25th. Please stay informed with us; link us with CIDM, the Center for Information-Development Management. Please join us on social media.
So, for today's webinar, "How Healthy is Your Structured Content? Diagnose Content Reuse Issues with Harmonizer." And I'd like to introduce Christopher Hill, and he's from DCL, and he'll be letting us know all about this. So, Christopher, I'm going to stop sharing my screen and let you say hello to us.
Hello. Thank you, Trish, very much. Let me get this sharing going here, and we should be off. All right, so I assume you can see my slides there.
Great. So, as Trish mentioned, I'm Christopher Hill; I'm a Technical Product Manager as well as a Project Manager for Data Conversion Laboratory, and I'll talk a little bit more about what my company does a little bit later in the webinar. But the first thing you're going to see is, really, the product that I primarily manage, which is called Harmonizer.
And just by way of my background, I've been working in the content management world really since the early 2000s. I started doing training courses and then moved into some of the sales and marketing activities, and then actually got into the programming, and now I'm doing more of the product and project management functions. So I've kind of been around doing a lot of different things with a lot of different products around content management, conversion, and analysis.
And so what we're going to look at today is a particular product called Harmonizer, and its focus is really to help you deal with the idea of content redundancy. So it's to look at your content and help you manage that redundant content, and hopefully turn it into reusable content. One of the big challenges when you go to some of these formats like DITA that allow you to do reuse is even just knowing what is available to you. So it's great to come up with a reuse strategy and have that going forward, but most organizations don't have the luxury where everything is new to them. Instead they really have to go through and bring over legacy content or existing content that they've created. Sometimes they have mergers and they need to bring new content in, sometimes they acquire new products, or sometimes they just go through their archives and slowly want to bring things into a new way of doing management. And that's a big part of my company's function: helping you with that transformation and conversion of content. But a big challenge is always, well, how do I even know where to start with the reuse piece? And that's really what Harmonizer evolved to do.
So, we'll be talking a lot about redundancy and reuse; we'll talk about how important it is to get some metrics around this. So one of the big challenges is, you know, usually you know you have a lot of redundant content, but you really don't know how much. And quantifying that could take you, you know, forever to do. You'd have to go through every document by hand and try to figure out what's redundant, and to do that just by human means, you probably have to be familiar with all of it. So, Harmonizer also helps you with those metrics. And then, really, we can also then point you to the exact and the close matches in your content. So one of the difficult things is that you can find exact matches using, you know, even just a find function or something like that, but what about those close matches where your authors have written things in two different ways, and they might be very similar, but they may not say things in exactly the same way? And that's another challenge that Harmonizer really will help you with.
Then we'll wind up today's session with some examples from Harmonizer so you can actually see what a report looks like, and maybe get some ideas of how it could be usable for you. And then I'll close again by just talking a little bit more in general about the company. So, so that's what we're going to do here. And let's dive right in. So, as far as content redundancy and reuse, they're really kind of two sides of the same coin, but they do have – there is a shade of difference between them.
So, we talk about managing multiple versions of the same content, and that tends to, to be where we focus on that redundancy piece. So that's where I have maybe two product manuals, and maybe I have to say the exact same thing in several different product manuals even though they're different products. So, you know, at the very simplest level, if you have an introduction to your company, or you have a copyright statement, or maybe you have other legalese that goes at the beginning or the end of your manuals, those are very obvious examples of redundant content where you might even just, in the old days, copy – or maybe in the current days – copy and paste the, the paragraphs between your manuals, and that's what we talk about when we talk about redundancy.
But where we get some really powerful ability is in that reuse piece, and that's where formats like DITA or S1000D come in; there's a lot of different XML formats that employ a reuse strategy. And what those allow you to do, first, is eliminate the redundant content, so that I can keep one copy of, say, that boilerplate copyright statement, and use it everywhere. That way, if I need to update it next year with the new year or, or new legal requirements, I can update it in one place and it can be reflected everywhere. So that's, that's really important. But you can go a lot further with these vocabularies; you can find things like, maybe there are paragraphs where you describe some procedure and the only difference is the name of the product, or maybe the name of the button or the label or the LED light that's on the product. So when there's small variations like that, you often want to find those as well, because most sophisticated formats like DITA allow you to even do some kind of variable text that you can put in there, and sometimes that's called applicability or, or publishing variables; it depends on the tool or the format you're using, but in modern languages there's almost always some way to do that.
So that's another area where you need to go look for things that aren't exactly reused, the redundant stuff, but you also want to reuse it across things, and then you might want to reuse it across channels. So, I may want to produce a print manual, but I also want to produce my web content, and maybe I want a PDF that I can put on a USB stick or maybe embed in the product, even, some of the data. So that's another example of that reuse across production channels where, traditionally, you'd have different people working on the website, a different person doing the print pages, and they would have to either copy and paste or whatever. If you start thinking of reuse in those terms, that also becomes an important area where you have to consider how you're creating the content. So, we'll look at some examples of, of how Harmonizer helps you find those reuse opportunities that go beyond just duplicated or redundant content. So it's really important at the beginning of a project to come up with some metrics around this – for most people. Sometimes you have the luxury of having an organization where you know you're either starting from scratch or where, already, the management has said, hey, we need to do – put our content in a format that's reusable.
But a lot of times you don't have that luxury; you're inheriting a bunch of stuff, or there's a bunch of existing stuff, or there are existing processes: we're writing everything in Word, or we're writing everything and publishing it to PDF files and we've got piles of PDFs. So redundancy metrics are usually where you start if you need to do a project around this. A redundancy metric is basically trying to come up with: how much opportunity is there for reuse? How much content is duplicated or similar in my content set? If you have that number, you can help justify the cost of the project, in fact. So if I find out that 50% of my content is redundant or almost redundant, then I have a ballpark figure to know that if I move to something like DITA, where I can do some real reuse, I'm able to put some rough estimates around that. So let's say I need to update one of the manuals, and let's say it's taking me maybe a week to do the updates when a new version of the product is released.
Well, right there I have a week of somebody's time, so I calculate that out, and then I multiply that for every other manual that might be updated or every other channel that I may need to update, and pretty soon I see that I've got, like – oops, sorry about that – I've got, like, some real numbers around how much time is someone taking to update the website, versus updating the print manual, versus updating, maybe, the embedded product files that actually feed the online help, or something like that. If I can get numbers around those, then I could start coming up with estimates of, if I was to single-source publish, that, that is, I create the content once, and I publish it to those channels so that future updates only have to be made on one copy, and then it'll automatically show up in the print and the online and those help files, then I can immediately start putting some real figures to, to my cost savings. This is also really useful if you do translation or localization, so you probably have a pretty simple calculation there because most organizations, you send out a manual and they charge you by the paragraph or by the word or whatever your, your charge is, and you can estimate then, if I was to translate this content once instead of multiple times for all the formats across all the languages, how much is that going to cost me for that translation?
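The back-of-the-envelope estimate described here can be written out as a short Python sketch. This is only an illustration of the arithmetic; the function names and every figure in it (hours, hourly rate, number of channels, releases per year) are hypothetical and are not taken from any Harmonizer report.

```python
def annual_update_cost(hours_per_update, hourly_rate, copies, updates_per_year):
    """Cost of making the same update separately in every copy (manual or channel)."""
    return hours_per_update * hourly_rate * copies * updates_per_year


def single_source_savings(hours_per_update, hourly_rate, copies, updates_per_year):
    """Savings if each update is made once and published to every copy."""
    current = annual_update_cost(hours_per_update, hourly_rate, copies, updates_per_year)
    single = annual_update_cost(hours_per_update, hourly_rate, 1, updates_per_year)
    return current - single


# Hypothetical figures: a one-week (40-hour) update at $50/hour,
# repeated across 3 channels (print, web, online help), 2 releases a year.
print(single_source_savings(40, 50, 3, 2))  # 8000
```

The same shape of calculation applies to translation: replace the hourly figures with a per-word or per-paragraph rate and multiply by the number of target languages.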
So those are the places where you use redundancy metrics to try to come up with some hard numbers that you could then take to management and say, Hey, I realize I'm asking you to spend, you know, tens of thousands of dollars on a content management project, but at the end of the day we're going to save hundreds of thousands over this period of time. And you can start putting together those metrics. So that justification of cost is sometimes the entire use of Harmonizer. So some people will use our report for one function only – and I'll cover that in a minute and show you what it looks like – just to come up with those numbers at the beginning of, say, a project. Then later in the project, you may want to do some cleanup of your content before you do a conversion. So one of the things is, if you have, say, all your content in Word or PDFs or some, some format, and you want to move them into a new content management system where you get all these fancy reuse features, you can certainly do that.
But it can sometimes help to, to early on get some insight into those pre-conversion cleanup tasks. So, you can get started, even if you don't have a budget, you can start looking at your existing content and say, What are the things we're doing that might prevent us from easy reuse in the future? For instance, if I'm in my content and I see the only difference in a paragraph is a page number reference, because I maybe use the same paragraph by copy/paste across multiple manuals, and then I say, you know, see whatever section on page three, that would tell me I could start writing in a way where I was using references instead of page numbers.
Then when I move to a format where I'm publishing to different formats, I don't have to worry about translating page numbers into URLs or something else. If you start figuring out a strategy for those references, you could start cleaning that up even before you ever have a budget for a project, or you might see things where your writers are arbitrarily using different language or, or not using consistent terminology. You can start cleaning that stuff up and maybe even setting some, some authoring guidelines early on, even before you, you know you can move to a new architecture. It also helps when you actually want to design your information architecture. So when you move to DITA, yeah, you can convert everything to DITA, and it can be DITA but you can still have bad DITA. You can have DITA with lots of copies of things, so DITA in and of itself, it provides you the framework for reuse, but it doesn't magically make everything reusable. You have to do that, and as a part of that, you have to come up with what's called an information architecture, and that includes some of those reuse scenarios.
So when you're coming up with that information architecture, it can be very informative to know: why is it that our content is close but varying? Is it because of references, mostly? Is it because we don't use the same term? Maybe in some of our manuals we talk about our "Notebook"; in others we talk about our "computer," and in others we say "PC." If we're using those three different terms interchangeably but we use them all willy-nilly, maybe we need to come up with a single one and start writing things in the same way. So, as you go through, you find those things that can influence how you come up with an architecture for implementing your DITA environment, or whatever other environment you're implementing, actually. The analysis can also sometimes be used during the conversion process. So sometimes knowing that there's a whole bunch of reusable stuff, you can identify that in advance, and as part of the conversion, you can come through the conversion with already some, some – you can think of them as, like, warehouse topics or boilerplate topics that you can have reused at the start, so that you already start getting benefits right after the conversion.
And Harmonizer is a tool that can help you find that redundancy in advance. That can also be reducing the amount of content to be converted, because if you've got that reused during conversion, then if, if the system or the conversion process detects that some block of text or some, some topic has already been converted, they can skip it the next time and just refer to that already converted topic as part of that conversion. So that can also make it more manageable in the conversion process.
But even if you don't do any of that, at the end of the day, even if you're using DITA, as I mentioned, just because you have the ability to reuse doesn't mean that your authors are actually reusing, or that they know what to reuse, so sometimes what happens is, and, and this almost always happens to some degree or another, you have a little bit of entropy sneaking into your content as you move forward. So sometimes it can pay to do ongoing analysis, like, every few months, or every quarter, every year, or whatever, depending on how fast your content evolves. You might want to stop and assess whether or not you're getting the full advantage of your reuse opportunities in DITA or whatever XML you're using, and Harmonizer can help you go through that stuff as you're moving forward. And say, Do we see a lot of duplication, still, even after we've moved to DITA and cleaned things up? Maybe they're still copying and pasting text, which is still possible in DITA, and you can catch that stuff and determine if maybe there's a training issue, or maybe the tool isn't doing something you need it to do, or maybe there's some other way that you can improve that going forward. So there's a lot of places through the life cycle of content where those redundancy metrics are really important.
So, so let's just look at Harmonizer and what it does. So this is an example of the front page of a Harmonizer report. And what Harmonizer does is it's actually a server-based tool, so we host it on our own systems, and we can feed it lots and lots of content, and Harmonizer will break that content up into text blocks. What a text block is depends on what you're analyzing and why you're analyzing it; I'll talk a little bit more about those text blocks later. But those text blocks then get compared to each other. Every single one of them. So every text block is compared to every other text block within and between all of the documents you feed into Harmonizer, and then Harmonizer churns away at that and it says, I'm going to look for all the text blocks where the text is exactly the same. So that's what this green part of the chart is going to tell you. It's going to tell us in this example that 56% of the content is exactly the same. So there's no variation in the text. Now, maybe there's different formatting, maybe there's different tagging, you might have different metadata, or something else if you're using XML, or, or in Word, you might have bold on, on some of the content but not on others. But Harmonizer isn't going to worry about that; it's going to say, I'm going to point you to where the text is the same. That helps you go to those files, then, and look at those source documents. And you'll know where to look and what to look for as the exact duplicate.
Then you can go look at it in detail if you want, and either clean it up or, or not, depending on what you're trying to do. So those exact matches, that's just a straight, you know, string comparison of the full text, and it basically ignores spaces and tabs and things like that. It can take some of that into account, but usually you don't want it to; you just want to know, is what this thing is actually communicating exactly the same? And if it is, Harmonizer puts it in this bucket and presents it to you. So in this example, and this is from a real project; I've changed the name to protect the innocent, but these numbers are not atypical.
In this example, there were almost 6,000 text blocks analyzed, and over half of them are exactly duplicated somewhere else in the content. This blue section represents, really, the hardest work that Harmonizer does for you. So what it does is a close match, and it's a fuzzy match based on a natural language processing algorithm, so it's got a little bit of smarts, and that means that it can look for text that is similar but not exactly the same, and the way it looks is language-independent, so it doesn't need to know what language you're in as long as it's a character-based language where comparing characters makes sense. Then it will work. And Harmonizer goes through and it looks for all of those similar things regardless of the position of the characters.
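As a rough picture of what an exact-match pass of this kind could look like, here is a minimal Python sketch. It is not Harmonizer's implementation; the function names are hypothetical, and the only behavior carried over from the description above is that differences in spaces and tabs are ignored when comparing the full text of each block.

```python
from collections import defaultdict


def normalize(text):
    """Collapse runs of spaces, tabs, and newlines so spacing does not matter."""
    return " ".join(text.split())


def exact_match_groups(blocks):
    """Return lists of block indexes whose normalized text is identical."""
    groups = defaultdict(list)
    for index, block in enumerate(blocks):
        groups[normalize(block)].append(index)
    # Only groups with more than one member are duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


blocks = [
    "Disconnect the power.",
    "Disconnect  the\tpower.",   # same text, different spacing
    "Replace the battery.",
]
print(exact_match_groups(blocks))  # [[0, 1]]
```

Grouping by normalized text avoids comparing every pair directly for the exact case; the pairwise comparison is only needed for the close matches.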
So, for example, let's say I wrote one document and I put "Before servicing the machine, disconnect the power." Okay, so that's one way I might have written that sentence, but somewhere else, someone says "Well, the power is the most important thing; you really should say that first, so I can say 'Disconnect the power before servicing the machine.'" Okay, those are almost exactly the same; I've only inverted part of it, so even though in a straight text comparison none of the characters match up, Harmonizer knows that that's basically the same character sequences, just in a different order, so it will give that a very high match rating. That would probably be 90 percent similar or above, even though none of the characters are in the same place. And so that puts it in this dark blue bucket, which is our close match bucket, and this tells you that there are 1,677 of those text blocks you told it to analyze that are close matches, and then anything else that's found is a unique match, and the uniques are exactly what you think: we didn't detect anything that was close enough to anything else, so these are, are unique to their particular location, so they occur once. So that's what Harmonizer does at a high level, on that metrics level, is it will just tell you here very quickly how redundant is the content that you're looking at.
Now, one of the things about this is we break it down here in this table, but then we also give you what's called the potentially redundant number. And what that is saying is, okay, if I was to eliminate all of the duplicates in this green, so all of the exact matches, and all of the close matches, so I was to make all the close matches the same, which I realize is unrealistic, but we're talking about a perfect world, so if I was to get rid of all the redundancy and everything would fall into the gray category, how many of these paragraphs would disappear? Well, if I had five paragraphs that were exactly the same, four of them would disappear, and I'd end up with one copy referenced five times in five different manuals, right? That would be our reuse. So in that example, I would get rid of four and keep one so that I could reference it everywhere it shows up. Harmonizer does all that calculating for you and tells you how many paragraphs you would have to keep.
In this content, 997 paragraphs absolutely would have to stay around, so if you did perfect reuse, you would keep those 997 paragraphs and add them to the uniques. You'd end up with all uniques and you would get rid of 4,001 paragraphs. So that figure, this red figure here, tells you that it's possible that 4,001 paragraphs could go away in a perfect world. Now, it's not a perfect world; there are reasons why some things are similar but not the same, and some of them you probably wouldn't want to get rid of, so that number is going to vary, but you can see this is a basis for a real estimate of cost savings. So in this case, let's say it cost me X dollars to do translation on this content, and I know that every time I make a change in the content I have to have it retranslated. I could take that translation cost and then I could say, well, let's say maybe 20% of this 67% were actually eliminated. If I just get rid of 20%, how impactful is that? So I got rid of 800 instead of 4,001, or maybe I get rid of 80, or whatever it is. You can play with these numbers and get an idea of, how impactful would getting rid of some of this content be? So how impactful would it be if you had 30% less content to deal with, and is 30% realistic? Maybe, depending on what these redundancies are. So that's really what you're looking at from a metric level with Harmonizer.
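To see why a reordered sentence can still score very high, here is a small Python sketch of one order-insensitive comparison. Harmonizer's actual algorithm is not described in this session, so this is a swapped-in technique chosen for illustration: it counts overlapping three-character sequences (character trigrams) and computes a Dice coefficient over them. The function names are hypothetical.

```python
from collections import Counter


def trigrams(text):
    """Count three-character sequences after dropping case and punctuation."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    cleaned = " ".join(kept.split())
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def similarity(a, b):
    """Dice coefficient over trigram counts; position in the text is ignored."""
    ta, tb = trigrams(a), trigrams(b)
    total = sum(ta.values()) + sum(tb.values())
    if total == 0:
        return 0.0
    shared = sum((ta & tb).values())
    return 2 * shared / total


first = "Before servicing the machine, disconnect the power."
second = "Disconnect the power before servicing the machine."
print(similarity(first, second) > 0.9)  # True
```

Because the comparison works on character sequences rather than a dictionary, it does not need to know which language the text is in, which matches the language-independent behavior described above.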
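The keep-one-copy arithmetic behind the potentially redundant number can be sketched in a few lines of Python. The grouping of matches into sets is an assumption made for illustration, and the function name is hypothetical; the 4,001 figure and the 20% scenario are the ones quoted above.

```python
def potentially_redundant(group_sizes):
    """For each group of matching blocks, keep one copy and drop the rest.

    group_sizes: number of blocks in each group of exact or close matches.
    Returns (blocks kept, blocks that could be removed).
    """
    kept = len(group_sizes)
    removable = sum(size - 1 for size in group_sizes)
    return kept, removable


# Five identical paragraphs: keep one, remove four.
print(potentially_redundant([5]))  # (1, 4)

# The report's perfect-world figure, scaled down to a more cautious scenario.
removable_in_report = 4001
print(round(removable_in_report * 0.20))  # 800
```

Multiplying the scaled-down count by a per-paragraph translation or update cost gives a conservative savings figure to take to management.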
You're really getting an idea of what you can, what you can eliminate and, and what matches there are. And, and this is that high-level figure. Now, as I mentioned, there are people that will do a Harmonizer report just for this, so they have so much content that there's no way they're ever going to go through it one by one right now, but they're trying to justify an investment so that they can have a system where they could go through it and start making a dent in this. And to get started, they want to know, you know, how do I talk to my management about this? Well, you could take and run every document you have through Harmonizer and come up with lots and lots of redundancy here, and look at that figure, and it would at least give you an idea of if we have a problem, right? You probably know you have a problem with reuse if you're asking the question; you anecdotally know that. This lets you know in some very tangible terms how big the problem actually is. So, let's jump here – so, that's the overview screen of Harmonizer, and that's the first thing we show.
Now, we also break it down for you a little bit more, so when you run this report, so, in this case, this is an example of some old Gateway computer manuals that I just downloaded the PDFs off the web and just ran them through Harmonizer. So, I never worked for Gateway; Gateway is, as far as I know, out of business; they don't sell Gateway laptops anymore, but when they did, these were some Notebooks that they were selling in the late '90s, early 2000s, I guess, and I grabbed a bunch of these and I just fed eight of the PDFs in just to see what happened. And I told it to look at all the paragraphs. So it found almost 21,000 paragraphs. It – I told it to ignore paragraphs that were shorter than five words long. So, I didn't want to look at the really short things that might be just a heading or a number in a table or something like that.
So I just told it I want to look at the paragraphs that are at least five words long, and I want to set the similarity threshold to 70%. So one of the things Harmonizer lets you do is, in that, when I say a close match, you can say, well, how does it know how close is close? And that similarity threshold is how you can change and tell Harmonizer how close do you want them to be.
So if you tell Harmonizer 70% similar, then 30% of the character sequences can be different, and Harmonizer will still tell you it's close. If you raise that number, and I could set it to 90% if I wanted, you get fewer matches, but they're really closer in matching, so the text is more similar. Or I could lower that number. Now, in practicality, you really don't want to lower that probably below 50 or 60% or you get basically lots of matches that don't mean much because the text is so different. So usually we start around a 70% similarity threshold. So then what this table does is, it tells you, I looked at these documents you gave me and I analyzed all the text in them as you said. I broke them up into blocks according to how you configured it. So a block could be a paragraph, but if you have a vocabulary where you can identify blocks as bigger things, then it could be, like, let's say in, in DITA you could do this at a topic level, you could do it at a file level, you could do it at some sub level, you could look at just notes and warnings and cautions if you wanted; if you have an XML format as the source, you can tell Harmonizer all kinds of things to break the content down into. For PDF, you're pretty limited because it's just text-based. So in this case, we're just looking at the paragraphs, and that's a fine way to look at it as well.
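The two settings described here, a minimum block length of five words and a 70% similarity threshold, can be pictured with the following Python sketch. It is a simplified illustration, not Harmonizer's code: `classify_blocks` and `word_overlap` are hypothetical names, and `word_overlap` is a crude stand-in similarity (shared words, not character sequences) used only so the example runs on its own.

```python
def word_overlap(a, b):
    """Stand-in similarity: Dice coefficient over lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return 2 * len(wa & wb) / (len(wa) + len(wb))


def classify_blocks(blocks, similarity, min_words=5, threshold=0.70):
    """Label each block 'ignored', 'exact', 'close', or 'unique'."""
    eligible = [len(block.split()) >= min_words for block in blocks]
    labels = []
    for i, block in enumerate(blocks):
        if not eligible[i]:
            labels.append("ignored")  # too short to be worth comparing
            continue
        # Compare against every other block that was not ignored.
        others = [b for j, b in enumerate(blocks) if j != i and eligible[j]]
        if any(block.split() == other.split() for other in others):
            labels.append("exact")
        elif any(similarity(block, other) >= threshold for other in others):
            labels.append("close")
        else:
            labels.append("unique")
    return labels


blocks = [
    "Press the power button to turn on your notebook.",
    "Press the power button to turn on your notebook.",
    "Press the power button to turn on your computer.",
    "Chapter 3",
    "Recycle the battery according to local regulations.",
]
print(classify_blocks(blocks, word_overlap))
# ['exact', 'exact', 'close', 'ignored', 'unique']
```

Raising `threshold` to 0.90 in this example would move the third block from close to unique, which is the trade-off described above: fewer matches, but more similar ones.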
So you can see, for each of these PDFs, you can see how many blocks were examined, how many were ignored because they were less than five words, but then what's really important is it tells you how many exact close and unique paragraphs occurred in these documents. And what that does is, even if you don't care about going through the matches right now on this run because maybe it's too many, this will tell you: which documents should I start with or start looking at more closely? So you see, if there's 1,453 exact matches in this Gateway 350 user manual, and the Gateway 305 has 1,419, those look like pretty good places to start. So if I was going to really want to just start on this, I'd be looking at these ones in the thousands as my starting point for my reuse. This one here, this Gateway 500 manual, or the 520 manual, they only have 400 – here we have 410 exact matches and only 142 here, and close matches are pretty low as well. So that would tell me those manuals, maybe I don't want to focus on those first because if I'm really trying to justify reuse, I'm not going to get much bang for the buck out of those. There's really a lot of reuse going on in these. So these are the ones I probably should be looking at.
Now, just as an example, I included one of the – actually didn't know this at the time, but I included a PDF that I got off the internet that was just a scanned one where they had scanned the document and they were all basically image files, so we don't do the character recognition. I could have converted that file using character recognition but it actually became my test file to make sure that if there was nothing in there, everything showed up correctly.
So, so that file actually had just scanned pages that hadn't been OCR'd, so Harmonizer didn't find any text. That's all that's telling you. Anyway, a lot of times customers will come to us just to figure out: what documents should we even start with? And again, this part of the report gives you those metrics that help you figure that out. So, so far we've seen how Harmonizer can just give you a big picture if you're just talking in general terms and trying to come up with maybe a rough cost savings or some cost justification for, for a DITA project or something like that. And then here, it can give your, your editors an idea of which manuals really do have all the reuse going on, and maybe we start with those because those are going to be where we have the biggest opportunity for impact, right?
Now, how do you get this report? So, I'll get into a little more detail, by the way, on those matches, but I want to save that. So, if you want to do this, there's a couple ways to, to run a Harmonizer report. The first way is, you call me, I do a project with you, you – I'll probably set up a secure FTP site or something, you'll get me your, your files, whether they're PDF, they can be FrameMaker, they can be Word, they can be any XML or text format. Really, we're a conversion company so even in the worst case scenario we can do a conversion. So, like, we can do InDesign, for instance, but we don't automate InDesign; that's a manual process that we do, but we can do all these formats. FrameMaker, we do a lot of FrameMaker projects, all of those formats. We can take your source files, run them through Harmonizer, and get you those metrics right away, and that's easy to do. If you have some of the the more approachable, I guess I'd say, or, or less proprietary formats. So if you have Word, PDF, any XML vocabulary, so DITA, S1000D, or any other XML format you might have, or text files, then you can use our self-service portal. And our self-service portal allows you to go in, and when you log in you'll see all the reports you've run, which is what you're seeing there, so this is just a window into four of the reports that I've run, and it's got the little graph over there so I can kind of visually see which ones are, have a lot of reuse just without even opening the report, and, and then I can click here to view the reports or download them if I want.
The reports are formatted as, in two formats. They come to you in HTML, which means that each page of the report is kind of like an HTML page, and those are static pages so you can download those as a zip file, unzip them to your local drive, and use them locally. You don't have to be connected to a server; it's not, not tied to our server once the report is created. You also get, if you download it, a version of this report in Excel, which I'll talk about a little bit later, and those formats you can open locally on your machine or, or host them on your own shared server at your company, and, and use them as long as you need to. So, once you create a Harmonizer report, it's kind of like "Wheel of Fortune": once you buy the prize, it's yours to keep; well, once you get the report, it's yours forever. It's not tied to any, any server or anything. So this portal is where you can go set up and upload your own jobs. So here's just the simple form you fill out to upload the job.
And when you upload it, Harmonizer will churn away. Sometimes it can take minutes; sometimes, if you've got a lot of content, it can take hours. The latest one I did, which was the biggest one I've done, took about three days to process. For those bigger ones, there's a limit to what the shared server will let you do, and those tend to be projects that we do internally, but for most people they tend to be in the minutes or hours. So Harmonizer will let you set up the job, the job gets queued on our server, and then you get an email when the job's finished, and you can go download or view the report from then on without waiting. So that's our portal that allows you to run these Harmonizer jobs yourself.
Then what, what else is in the report? Well, you get that screen with the overview, you get the screen with the metrics, you get the screen with the, the document breakdown to show you where all the reuse is, and then you actually get screens that show you the reuse. So this is a visual view of all of the text that was pulled out of your document, and you see that on the left there, so this left side is showing you part of a document, line by line, all the text we pulled out, all the text blocks, depending on how you configured it; that depends on what the text block is, so in this case these are paragraphs because it was a PDF, and you'll see over here on the left just a listing of the first part of each paragraph that we pulled out, and that helps you orient yourself so you can follow along in your own editor if you wanted to.
So, I could open on a second window, say, Word or whatever this was written in, and I can follow this along in order, and then I can click these to view what matched. So if I want to see what matched for information about connecting to a wired or wireless ethernet, whatever, I click that little number next to it, and over here you'll see the match. So in this case you see a match that has some text: the black text, this is called a close match, and in this case you'll see that there are different variations of this paragraph that we found, and they all closely match. That means they're within that 70% threshold that I set. So 70% of this text is considered similar, or the same.
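As a rough illustration of what a 70% close-match threshold means, here is a minimal sketch that scores two paragraphs by word overlap using Python's standard `difflib`. This is not Harmonizer's actual algorithm, which the talk does not describe; the function names and the choice of `SequenceMatcher` are assumptions made only for illustration.

```python
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1], computed over case-folded word lists."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def is_close_match(a: str, b: str, threshold: float = 0.70) -> bool:
    """True when two text blocks are at least `threshold` similar."""
    return word_similarity(a, b) >= threshold


first = "Your computer includes help and support"
second = "Your Notebook includes help and support"
print(is_close_match(first, second))
```

Here five of the six words line up, so the pair clears the threshold; a paragraph with no words in common would score zero.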
So if I look, this black text is all the text that doesn't vary between any of the variations. So this black text showed up in all of these documents, and you'll see there's two documents here: the Gateway M305 and the M350 manuals, that has this paragraph in it. And this paragraph is "Your computer includes help and support..." and then continues on, right? And they use the word "computer" here, and up here it's showing you, here's "computer"; it's in red because that means it didn't appear in all these variations; it appears sometimes. Sometimes in this position the word "Notebook" appears, and if I look at variant B, you'll see that's where the word "Notebook" appeared. So in the 520 user guide, Gateway did not use "computer," they used the word "Notebook." And "Notebook," right, if you look at – there's this in Windows XP that sometimes shows up at the beginning instead of starting with "Your," and in this case we're ignoring, we're being case insensitive, which usually is what you want. So in this case, in Windows XP then, "Your Notebook includes help and support," so it's the same as this, but it has "In Windows XP" at the beginning. And again, they're using "Notebook" instead of "computer."
And then you'll see there's a variation here; this is actually just a variation – they have a little bit of extra text or different text here at the end, so whoever wrote this, when they wrote it for the 305 and the 350, they were a little wordier here at the end. They didn't just end with "...Windows Notebook"; they said "...Windows and help you quickly discover and use many features of your Gateway computer." So, so this helped you find all the places where this kind of paragraph occurred; it shows you where all the variations are, and that way, as an author or as an analyst, I could go through and say, Okay, why are we using "computers" sometimes and "Notebooks" sometimes? That's blocking us from reusing this. Or if we're doing translation, that means that this gets translated at a different, an additional translation.
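The black-text and red-text view described above can be approximated with a word-level diff. The sketch below splits two variants into the words they share and the spots where they differ. It compares only two variants at a time and uses Python's `difflib`, which is an assumption for illustration and not a description of how the report is actually built.

```python
from difflib import SequenceMatcher


def diff_words(a: str, b: str):
    """Return (common_words, varying_pairs) for two variants, ignoring case."""
    wa, wb = a.split(), b.split()
    matcher = SequenceMatcher(
        None, [w.lower() for w in wa], [w.lower() for w in wb]
    )
    common, varying = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Words present in both variants (the "black text").
            common.extend(wa[i1:i2])
        else:
            # Words that differ between variants (the "red text").
            varying.append((" ".join(wa[i1:i2]), " ".join(wb[j1:j2])))
    return common, varying


common, varying = diff_words(
    "Your computer includes help and support",
    "In Windows XP, your Notebook includes help and support",
)
print(common)
print(varying)
```

An editor could scan the varying pairs to spot terminology drift such as "computer" versus "Notebook".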
Most translation vendors, as an aside here, use translation memory, and translation memory means that the system will detect if there is a paragraph that has already been translated once, and it won't charge you if it comes up again. Well, in this case, this isn't the same paragraph, so you're getting charged twice. Actually you're getting charged three times or four times. And if you look, there are actually two more off the screen here, so six times I'm being charged for this similar paragraph. If I could write it once the same, I've just saved five translations of that paragraph. So right there, you can put a dollar figure on this pretty easily. So that's what Harmonizer does: it shows you all that stuff, and as I mentioned, it can show you this as Excel or as HTML.
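To make the dollar figure concrete, here is a small back-of-the-envelope sketch. It assumes, as in the scenario described, that every variant beyond the first is charged in full. The word count and per-word rate are invented numbers, and the function name is hypothetical.

```python
def redundant_translation_cost(variant_count, words_per_variant, rate_per_word):
    """Cost of translating every variant beyond the first one."""
    redundant = max(variant_count - 1, 0)
    return redundant * words_per_variant * rate_per_word


# Six near-identical variants, as in the example above. The word count and
# the per-word rate are made-up numbers for illustration only.
cost = redundant_translation_cost(
    variant_count=6, words_per_variant=40, rate_per_word=0.20
)
print(f"Avoidable translation spend per language: ${cost:.2f}")
```

Multiply the result by the number of target languages and by the number of such paragraphs in the report to get a rough savings estimate.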
There's another function of Harmonizer which I'll talk about a little bit later, but they're sequence matches. Harmonizer also gives you one more page where it shows you when there's multiple paragraphs in a row that match, and that's an attempt, if you don't have a format like DITA where you can identify a topic, for instance, or some other subsection besides a paragraph where it tries to find you bigger blocks of reuse. So again, if you've got a big reuse problem and a lot of content to go through, sometimes you'll use those bigger blocks to start with: clean those up and then maybe rerun the report and see what's left. All right, so let's look at a couple examples from – some real world examples of Harmonizer: so this is just a basic analysis I did on S1000D content, and this is actually the example content they give you with the S1000D toolkit. And S1000D is usually used for, like, it's used a lot in aviation manuals or military equipment or stuff like that.
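The idea of a sequence match, several paragraphs in a row that all match, can be sketched as follows. This simplified version only finds runs of exactly matching paragraphs (after normalizing case and whitespace) between two documents, and it relies on `difflib`, so matches that appear in a different order are not reported. How Harmonizer itself detects sequences is not covered in the talk; the names here are illustrative.

```python
from difflib import SequenceMatcher


def normalize(paragraph: str) -> str:
    """Fold case and collapse whitespace so trivial differences are ignored."""
    return " ".join(paragraph.lower().split())


def sequence_matches(doc_a, doc_b, min_run=2):
    """Return (start_in_a, start_in_b, length) for runs of matching paragraphs."""
    a = [normalize(p) for p in doc_a]
    b = [normalize(p) for p in doc_b]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_run
    ]


manual_a = [
    "Introduction to model A.",
    "Turn off the power.",
    "Remove the battery.",
    "Close the cover.",
    "Contact support.",
]
manual_b = [
    "A different introduction.",
    "Turn off the power.",
    "Remove the  battery.",
    "Close the cover.",
]
print(sequence_matches(manual_a, manual_b))
```

A run of three matching paragraphs like this one is a candidate for a single, larger reusable block.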
Here, this was for a bicycle, and what I told it to do was instead of analyzing paragraphs, I thought, you know, to start with, I'd just like an idea of – so these were already broken into what they call modules in S1000D; that's the same thing, basically, as a topic in DITA, but it's basically a reusable block that's already been created, or a big block that you can use in multiple outputs. And so this was already broken up into modules, and I thought, I wonder if there are places where they could do more reuse in these bike manuals. So what I did is, I ran it through, the bike manuals through, at the topic level; that means that every bit of text in that topic is being analyzed as a single string. So this is every last bit of text in this module for DMC brake and DMC S1000D bike.
And you'll see that this paragraph occurred or this module is exactly duplicated in two of these places, so in two of these files I found an exact duplicate of all the text, so I could get rid of one of these and reuse it wherever this reference is.
So I could clean that up right now if I wanted to just by making one of these a reference to the module wherever this one of them was used, and you'll see there's several places where that's in there and it's mostly between this brake and this bike manual, so I could immediately tell, okay, I'm already using S1000D, but I can improve on it because I've got a lot of duplication going on here. So that's an example of doing a module level text block instead of a paragraph level text block. Here's an example where I found – actually, at the module level, this is the same example, but I actually found one match where there's only one bit of text, "clean with water," and that's the only difference between these two modules. And they're used in different places, so I could go back to the source file. Now, I don't know how this is tagged, exactly; to do that, I would open up each source file and I would look at them, and I could compare them as an editor, and then I could see why is "clean with water" here and why isn't it here? Why do they mention that here? Is it truly a difference, or maybe I should add this to this one and make them the same, or maybe I should just get rid of this one and reuse that one. So that's sort of how you're actually doing the work of, of cleaning up the content.
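Finding modules whose text is exactly duplicated, as in the bicycle example, can be sketched by hashing the normalized text of each module and grouping identical hashes. The file names below are made up, and the normalization rule (fold case, collapse whitespace) is an assumption.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def find_exact_duplicates(modules: Dict[str, str]) -> List[List[str]]:
    """Group module names whose normalized text is identical."""
    groups = defaultdict(list)
    for name, text in modules.items():
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(name)
    # Keep only groups with more than one member, i.e. real duplicates.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical module files and contents, for illustration only.
modules = {
    "brake-cleaning.xml": "Remove the wheel. Clean with water.",
    "bike-cleaning.xml": "Remove the wheel.  Clean with water.",
    "bike-lights.xml": "Check the lights before riding.",
}
print(find_exact_duplicates(modules))
```

Each group it returns is a place where one module could be kept and the others replaced with a reference to it, subject to an editor's review.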
Here's an example from those Gateway manuals that I was showing you, where you'll see product variations. These are legitimate variations in the text: all these PDFs say the LED light should be green to tell you the battery is charged on these Notebooks, but on these two, the LED turns off if the battery is fully charged, and on this one the LED turns blue if it's fully charged, so this could be a legitimate difference. So I could look at that, and one, I could double-check: is it true that this is the case for these manuals? Did I get it right, or did we maybe copy some text wrong, and maybe one of these is supposed to be in this other slot? So it actually could help me verify that the product is documented correctly. But beyond that, if I was looking at reusing this or the block around it, I would think, well, maybe I need to use some kind of a variable here, where at publish time, based on the product, I insert whichever value for the LED needs to be put in there: blue, off, or green, right? So that's a way you could look at reusing that. And then again, you have another LED match down here with orange on, or purple to tell you it's charging, so I could look at those product variations and maybe come up with a reuse strategy that would accommodate that, so I could write this once and put the right value in at publish time.
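The "write it once, fill in the value at publish time" idea can be sketched with a plain template and a per-product lookup table. The product names and the sentence wording are hypothetical; in a DITA project this role would typically be played by key references or conditional processing, not by Python.

```python
from string import Template

# One shared sentence with a placeholder where products legitimately differ.
CHARGED_TEXT = Template(
    "When the battery is fully charged, the LED $charged_state."
)

# Hypothetical product names mapped to their product-specific values.
PRODUCT_VALUES = {
    "model-a": {"charged_state": "turns green"},
    "model-b": {"charged_state": "turns off"},
    "model-c": {"charged_state": "turns blue"},
}


def publish(product: str) -> str:
    """Render the shared sentence for one product at publish time."""
    return CHARGED_TEXT.substitute(PRODUCT_VALUES[product])


for product in PRODUCT_VALUES:
    print(product, "->", publish(product))
```

The sentence is written and translated once, and only the small value table varies by product.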
Here's just that same example I already went over, so I'm not going to go over that, but it, it shows you how they're using "computer" versus "Notebook," so they don't have standard terminology going on here. This example is what I mentioned earlier about page numbers: if you put page numbers into manuals, you're going to have a problem when you try to publish to the web, right, unless you publish as PDF with pages, which most people kind of hate, so, so usually you want to look at then coming up with a strategy for getting rid of page numbers or inserting the page numbers at publish time. So, so putting a reference here in your, your XML, or your source, and then that reference gets translated to either a page number or a hyperlink or whatever it is in the target format at publish time.
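The page-number strategy, storing a reference in the source and resolving it per output format, might look like the following sketch. The function name, the wording of the rendered text, and the page table are all assumptions for illustration.

```python
def render_xref(target_id, title, output, page_numbers=None):
    """Turn one stored cross-reference into format-specific text."""
    if output == "pdf":
        # Paged output: look up the page the target landed on.
        return f'see "{title}" on page {page_numbers[target_id]}'
    if output == "html":
        # Web output: no pages, so emit a hyperlink instead.
        return f'see <a href="#{target_id}">{title}</a>'
    raise ValueError(f"unsupported output format: {output}")


print(render_xref("battery", "Charging the battery", "pdf", {"battery": 42}))
print(render_xref("battery", "Charging the battery", "html"))
```

Because the source holds only the reference, the same paragraph stays identical across manuals even when page numbers differ.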
Here you'll get some product variations. You know, some products come with a CD burner; back in the early 2000s some of them came with a CD/DVD burner if you upgraded, maybe, so some of these paragraphs talk about both. Well, maybe I should consider writing these paragraphs in a way where I don't have to add on the "and DVDs"; I could do some other phrasing of this to clean that up so it's more reusable. And then sometimes you have localization differences, so you'll have areas where maybe the way I write words is different. So, using "colour" with a "u," which maybe you're doing in Canada or the UK or something, or maybe not in Canada and maybe in the UK; this can show you those variations, but it also helps you check to make sure you're using the right one. So you'll see here, the UK edition has "colour" with a "ur," and here the Canada and the main one have just plain "-or." So that's another area that can play into your strategy.
And I won't get into localization; that's a huge topic that probably CIDM maybe even does training on, but, but when you do DITA, localization is another reason to do this, is you can reuse in a localized way, but that's a part of your information architecture strategy. So that's something that comes up a lot. Harmonizer can help you, show you, hey, this is coming up, this is why these are different, and that then you can take it into account when you're actually implementing a project. So that's really Harmonizer in a nutshell. I'll get to – I see we have some questions popping up, so I'll get to those in a second, but I want to just wrap up real quick.
You know, Harmonizer really emerged in my company as a tool we used internally, and one of the reasons I was brought into DCL was that my primary function was to be Product Manager for Harmonizer. So that's what I've been doing for the last four or five years: really productizing this, making it so that you can run your own reports, and making the reports more readable, and that's what I'm doing on a going-forward basis. But my company does a whole lot of this stuff, so we use the tool that I manage as part of our own activities. Data conversion is a big part of what we do: we'll transform and convert data so that if you want to move it from PDF or Word or FrameMaker or InDesign, or whatever you're writing it in today, if you have some format that doesn't allow you easy or standard reuse and you want to move to DITA or some reusable format, that's a big part of what we help you do. We don't tend to do the information analysis, but we work with the people that do that, so we work with the consultants that you might have to do the actual leg work of doing these transformations.
We do semantic enrichment, so sometimes we will have processes where we use machine learning or we use other techniques to be able to, maybe, auto-tag stuff, or maybe we give you a framework so that as part of a conversion you can go through and manually tag things or make sure that you have the right semantic data associated with your content. We'll do a lot of entity extraction, data harvesting, which is sometimes pulling it off of existing sites if you have stuff published in, maybe, web pages or something.
Validation – again, the reuse piece is important, which you've seen, and then we work with delivery a lot. So we do a lot of migrations between delivery platforms. Sometimes we come up with training sets for AI projects, and then we'll do content analytics as well to see if there's metadata or other things in your content that are maybe missing or not being properly identified. So there's all kinds of activities we do around content conversion and transformation, and that's why we're called Data Conversion Laboratory. We help you get to those structured formats. So that's really Harmonizer and my company in a nutshell. And I do see we have some questions, so let me see if I can pull...
I can help you with that here.
Okay, that was really, really great. We have one question about a clarification: what exactly constitutes a text block: a line, a paragraph, a section?
That's a great question. So it depends. By default, if you just feed a PDF or a Word file in, it's gonna come out as paragraphs, so we're gonna treat everything that's at what Word would call the paragraph level as a text block. Now, we're also working on a new feature where in Word it can detect headings and break the content apart by heading. That hasn't been released yet, but that's one of our road map features. But if you're already in an XML-based format, it can be any tag that you can identify. So as you saw in the example earlier, I had analyzed a bunch of S1000D content by module. You could do the same thing in DITA by topic, and that would treat the text block as all the text within a module or a topic. So, really it's flexible. XML is great because XML gives you lots of levels to play with, so you can have a lot of flexibility there. Obviously when you get into PDF or Word, you're more limited in that, but typically a text block starts off as a paragraph.
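To illustrate how the choice of tag changes what counts as a text block in XML, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample markup is invented and deliberately simple, and the function is illustrative and not Harmonizer's ingestion code.

```python
import xml.etree.ElementTree as ET


def text_blocks(xml_source: str, tag: str):
    """Return the flattened text of every element named `tag`."""
    root = ET.fromstring(xml_source)
    blocks = []
    for elem in root.iter(tag):
        # Gather text from the element and all of its descendants, then
        # collapse whitespace. Joining with a space would split a word that
        # has inline markup in the middle; acceptable for a sketch.
        text = " ".join(" ".join(elem.itertext()).split())
        if text:
            blocks.append(text)
    return blocks


SAMPLE = (
    "<topic><title>Brakes</title><body>"
    "<p>Clean the <b>brake</b> pads.</p><p>Check the cable.</p>"
    "</body></topic>"
)
print(text_blocks(SAMPLE, "p"))      # paragraph-level blocks
print(text_blocks(SAMPLE, "topic"))  # one block for the whole topic
```

Choosing the paragraph tag yields two small blocks, while choosing the topic tag yields a single block containing all of the topic's text.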
Another question: do you support markdown format?
So, we don't have formal support for markdown. It would be easy to add stuff to ignore the markdown pieces. We have never actually been asked to do a markdown project; we already support text, though, so there are a couple of ways you could do markdown. You could export the markdown as HTML, analyze the HTML, and then tie it back to your markdown files; you could do that self-service right now. Or you could export the markdown as text, and it would get rid of all the markdown elements. I think you can do that in markdown; it's been a while since I've done markdown. We could also add an ingestor for Harmonizer that could easily get rid of the known markdown syntax. So that's not a huge deal, but as I said, we support almost any format if you want to export it, so that's a good way to do markdown right now.
Okay, another one: what's the largest number of objects Harmonizer can handle or has handled?
Yeah, so I just did a project that had over 700,000 text blocks.
Oh, my goodness! Wow.
And it was, it actually does open the – you can open the report; it's hard to say in advance because it really depends on Harmonizer's presenting you the matches, and the amount of time it takes will depend on how many similar things there are, because it can do a pass where it can get rid of things that are obviously different right away with almost no overhead. So if you have 700,000 paragraphs and they all are matching, we might, I – it's possible we could not be able to handle that at some point, but in this case it handled it fine. It took, it took three days for this particular project for Harmonizer to come up with the report, but then the report was openable.
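The idea of a cheap first pass that discards obviously different pairs before doing the expensive comparison can be illustrated with Python's `difflib`, whose `real_quick_ratio()` and `quick_ratio()` methods return inexpensive upper bounds on the full `ratio()`. This is only an analogy to the behavior described; Harmonizer's internals are not covered in the talk, and this sketch still loops over every pair.

```python
from difflib import SequenceMatcher


def close_pairs(blocks, threshold=0.70):
    """Return (i, j, score) for block pairs at or above the threshold."""
    words = [b.lower().split() for b in blocks]
    pairs = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            matcher = SequenceMatcher(None, words[i], words[j], autojunk=False)
            # Cheap upper bounds first: if even the bound is below the
            # threshold, the real score cannot reach it, so skip the pair.
            if matcher.real_quick_ratio() < threshold:
                continue
            if matcher.quick_ratio() < threshold:
                continue
            score = matcher.ratio()  # the expensive comparison
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


blocks = [
    "Your computer includes help and support",
    "Your Notebook includes help and support",
    "Clean with water",
]
print(close_pairs(blocks))
```

When most pairs are obviously different, nearly all of them exit at the cheap checks, which is consistent with run time depending more on how much similar content there is than on raw volume.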
The problem we're having now is, the customer is like, well, wait a minute, how do I go through 700,000 paragraphs? And I'm like, well, that's your problem; you have a lot of content. [Laughs] But what they're using it for is, again, that last page: to look at which files have the most blocks. Now they're going to re-run some Harmonizer reports focusing on some subsections of that. So they got a big picture with their 700,000, and now we're looking at, okay, you've got to come up with some manageable way that you're actually going to be able to do something with all your content. And theirs, as an example, was a giant. It's a website, so we actually crawled all of their web pages to do this project, so this was a bigger project than just Harmonizer.
It also involved a large web crawl because they have so much content on their website that they had no idea, you know, where to even start. Where are the things that we should start worrying about reusing? So that's why they did that project. So there's no real limit to how much Harmonizer can handle; it's just dependent. There is obviously a place where we're going to run out of, you know, database space or memory, but I don't know where that is.
A long time in the future, hopefully. So, another question: how about tables analysis? Identical, similar content but in different table layout?
So, if you're using a format like HTML or XML, where your tables are actually in a table tag, there are a couple of ways to do this. If you wanted to do table analysis, you could tell it, hey, I want you to look only at the table tag, and then it'll take all the text that's in the table, including the headings and the body and everything, munch it all together into one long string, and compare that as a string, and so that'll tell you. And because Harmonizer isn't order-dependent, your similarity setting will allow you to see matches even if the table is not in the same order but the data is the same. So that would work. In a lot of cases when we do the table analysis, you want to lower the similarity setting further than you normally would, because sometimes the order is really important in a table. And if it's a table of numbers, it may not be very meaningful to run Harmonizer on the whole table.
The other thing you can do is, if your content is well-tagged, then you have thead or tbody and you could focus. You know, maybe I just want to see where the tables with the same headings are, or where all the tables with the same numbers are and maybe the headings are different, or the same body, whatever's in the body. So tables can be treated at any level of the table. They can be treated as a block, they can be treated as just the headings, or you can treat it as just the table cells. It just depends on what you want to do.
Or you could even, if you have multiple paragraphs in table cells, which is sometimes the case because a lot of websites will do formatting in tables, you can just analyze the table cells by paragraph, even. So, so there's a lot of flexibility with tables. But yeah, you can do that.
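One way to picture an order-independent comparison of flattened table text is a bag-of-words overlap score, sketched below. This is an assumption about what "not order-dependent" could look like in practice and not a description of Harmonizer's actual scoring; the table contents are invented.

```python
from collections import Counter


def bag_similarity(a: str, b: str) -> float:
    """Word-overlap score in [0, 1] that ignores word order."""
    bag_a = Counter(a.lower().split())
    bag_b = Counter(b.lower().split())
    total = sum(bag_a.values()) + sum(bag_b.values())
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    shared = sum((bag_a & bag_b).values())  # multiset intersection
    return 2 * shared / total


# Flattened text of three hypothetical tables (headings plus cells).
table_1 = "Part Qty Bolt 4 Nut 4"
table_2 = "Part Qty Nut 4 Bolt 4"     # same rows, different order
table_3 = "Part Qty Washer 2 Bolt 4"  # one row differs
print(bag_similarity(table_1, table_2))
print(bag_similarity(table_1, table_3))
```

Reordered rows score as identical under this measure, which is exactly why a table of numbers, where position carries the meaning, may not be a good candidate for whole-table comparison.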
Okay, another question: you mentioned that Harmonizer can analyze multiple file formats. Does that mean one time? For example, I have HTML files, Word and PDF files that I suspect have redundancy across them all. Are you saying Harmonizer can look at all of these at one time and compare?
Yes, so one of the things Harmonizer can do is it, it's format independent. So, we have, Harmonizer has a process called ingestion. When that happens, when you feed it content, Harmonizer can send the different types of files to different ingestor routines. So it can say, okay I'm looking at XML here, so I want to look at these tags; I have PDF here, so I'm going to look at the paragraphs. And then I'm going to put them all in the database and everything will get compared to everything.
So even the paragraphs in the PDF get compared to the tagged content, and that way you'll see across formats what's similar or different. This can be really useful if you have some DITA content already and you're looking at maybe some legacy stuff, and it's like, do we want to convert this? Do we spend time to convert all this FrameMaker stuff to DITA? Do we just save it to PDF and put it in the library? Or is there some value to bringing it in and doing some reuse on it? If you ever ask that question, Harmonizer can help with that: you feed it your DITA, you feed it the new FrameMaker files, and it'll tell you where the reuse might be. So that can be very handy.
Currently, the online version of Harmonizer only allows a user to upload one format at a time, but it's not a limitation of Harmonizer; it's a limitation of the front-end UI. It is on our road map; I am working on that as an upcoming feature to allow you to add multiple formats through the UI. But for now, if you want to do that, there are a few ways. You can do a project with us, so you can say, Hey, I want to just sign up and have you run the Harmonizer report. Or you can do a conversion where you say, Okay, I have a bunch of FrameMaker, I want to include it with my DITA; you give us the FrameMaker, we'll give you back some files along with your DITA files, and that'll let you run them all together as a single format. You can also do it on your own, self-service, if you export the stuff. So if you have FrameMaker and you can just export it all to HTML or something, and then you have DITA and export that all to HTML, you could make it one format yourself, run it all through, and then go back to your source files as you wish. So there's like three different ways you can do that right now.
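The ingestion idea, routing each file type to its own routine and then pooling every text block for comparison, can be sketched as a simple dispatch table. Only plain text and XML are handled here, since PDF or Word extraction would need third-party libraries. The file names, the default tag, and the function names are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import PurePath


def ingest_text(content: str):
    """Plain text: treat blank-line-separated chunks as paragraphs."""
    chunks = re.split(r"\n\s*\n", content)
    return [" ".join(c.split()) for c in chunks if c.strip()]


def ingest_xml(content: str, tag: str = "p"):
    """XML: treat the text of each `tag` element as a block."""
    root = ET.fromstring(content)
    return [" ".join(" ".join(e.itertext()).split()) for e in root.iter(tag)]


# File extension -> ingestor routine.
INGESTORS = {".txt": ingest_text, ".xml": ingest_xml, ".dita": ingest_xml}


def ingest_all(files):
    """Send each file to its ingestor and pool (source, block) pairs."""
    pool = []
    for name, content in files.items():
        suffix = PurePath(name).suffix.lower()
        ingestor = INGESTORS.get(suffix)
        if ingestor is None:
            raise ValueError(f"no ingestor registered for {suffix!r}")
        pool.extend((name, block) for block in ingestor(content))
    return pool


files = {
    "guide.txt": "Turn off the power.\n\nRemove the battery.",
    "topic.xml": "<topic><p>Turn off the power.</p></topic>",
}
for source, block in ingest_all(files):
    print(source, "|", block)
```

Once every block sits in one pool with its source attached, a paragraph from one format can be compared against a tagged element from another.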
So, another question: this is great, but how does my editorial team take action to clean up the redundancy?
Yeah, so I, I kind of walked through a little bit of that earlier when I showed you, you know, when I saw the word "computer" or the page numbers; that's really the kind of thing you end up doing. It isn't – you know, a lot of times, people will run Harmonizer and then they'll say, well, now can I push a button and have it do the redundant, you know, get rid of all the redundant stuff? Push a button and get rid of 4,000 paragraphs? It'd be great if we could do that; we can't because we don't know why the stuff is redundant, right? And only a subject matter expert really understands it.
In fact, during conversion we often have to consult and come up with some very specific rules about when we can reuse stuff, because sometimes you don't want something to be reused even if it's similar, or sometimes even if it's the same, it can't be reused for – sometimes there's legal reasons, or sometimes there's other reasons, business reasons, why those can't be. So Harmonizer doesn't automatically magically clean it up, but it does make it easy to find where you can go clean it up or where to look, and that's a huge part of the task. I mean, if you think about, if I had a hundred or even the eight Gateway manuals, right, I had eight Gateway manuals, they were written over a ten-year period, nobody remembers what's in all the manuals, so if I was going to go figure out what I need to clean up in these manuals, your starting point would have been to read all eight manuals, which is dreary, and then try to remember all the paragraphs or all the blocks that you read, and hopefully be able to find all the similarities. You can imagine that's not a quick or even reliable task for a person to usually do.
It's a huge time-saver to be able to do this. What sort of price point is it for this software report? I don't have a huge budget but can definitely see how it would deliver on my ROI.
Yes, so, that's one of the things that I've been really pushing on since I got on with Harmonizer. Harmonizer is often a part of our conversion projects, so for some of our customers that do conversions with us, it's very low cost or sometimes no cost if it's a big enough conversion; it's just part of the conversion. They get access to Harmonizer as just part of the deal, so it can be very low cost if you're already doing the conversion.
Obviously I can't survive on that if you aren't doing a conversion, and I wanted to make Harmonizer available to everybody, not just our conversion customers, so you can come to us with a project. Usually if it's a normal-sized project, modest to mid-sized or even on the larger side, it's a few to several thousand dollars to get a report on the medium or larger content sets, and that's having us do it. And oftentimes companies will just want us to do it because, you know, getting the report saves that much time easily. But the self-service portal came about as a way that we can reach a lower price point for people, so that I don't have to get involved and set up an FTP site and go through all this stuff of uploading it, and then having our people run it and do our quality checks, and on and on and on.
That's what adds to that cost. So to help lower that price point we've created the self-service portal, and when you use the self-service portal it can be a few hundred dollars for a report, a small report. Most of the time, those plans are a monthly subscription, so if you want to go and run Harmonizer every month, they run as low as 500 a month depending on the volume that you're allowed to upload. So it really depends. It's kind of negotiable for sure. Occasionally we've done reports for a few hundred dollars, and sometimes we've had reports that we've done ourselves in the 10 to 15,000 range. Those are those gigantic reports where we're doing multiple passes.
So we're running one that's, you know, 700,000 paragraphs or blocks, and then we're breaking it apart and we're doing a lot of interaction with the client to, to break those apart. That tends to be where you get in the tens of thousands. So that gives you at least a sense, and if you're interested and you're not sure if it's valuable for your content, we run free reports, just small ones, as part of engagement, so if you contacted us and said Hey, I'd like a demo and I'd like to see if Harmonizer is right for me, you can set up that demo, and usually at the end of the demo – well, always at the end of the demo – we'll offer, you know, if you want to send us a few hundred pages, we'll run it through and send you a report, and you can at least just look at it and see what you think, on your own content. And that way you'll get a sense of what – it'll tell you about the bigger set if you want.
Okay, I think we already answered this, but I'll go ahead: how much content can Harmonizer analyze/compare at one time? Is there a limit?
So, yeah, again, no limit that I've – that's been put in the software. Everything has a limit eventually; the computer will smoke or whatever or the world will end or something will happen to, to put a limit on it, and the reason is it's all – people always ask me, actually my boss always asks me, well, how long is this going to take? I'm like, I don't know. I don't know how long a report takes to run, and the reason you don't know is because you can imagine if the content has a lot of close matches, Harmonizer has to do a lot of that natural language processing to figure out exactly what that percentage of closeness is, and that takes a lot more time. So that'll start eating up a lot of time, and then if, if you have hardly any close matches, Harmonizer might buzz through a larger content set faster than it would a smaller one. So it really is variable.
But like I said, right now we're talking to someone where they think they have, I think it was maybe six or seven hundred thousand pages of content. Again, they're doing that high-level analysis; we should be able to handle that, but again, I don't know for sure until we do it. But yeah, so there's no limit. You're probably gonna hit your limit well before Harmonizer hits its limit.
Well, we're looking at about 15 minutes left on our clock, and just to remind everyone, this will be recorded and the link – it is being recorded – and the link is going to be sent out to all the attendees who signed up for the webinar. And also, if anybody wants to put some last-minute questions in the Q&A, we have time for those briefly. Just also a quick mention about CIDM: we do have workshops on strategies for reuse, we can help with information modeling, so all of that kind of going hand-in-hand with the best way to organize your content and really improve your, your profit –
...which is an important thing, a big ROI for everyone.
So I'm going to give another a little bit, five more minutes, because we have answered all the questions, but I also want to thank you, Christopher. You did really a lot of, a lot of great knowledge, and very interesting.
Oh, good, well, I'm glad you enjoyed it.
And so – and actually, also in the Q&A you had a compliment of that very thing. Very interesting.
That was really nice. So, we'll give it just another, maybe, minute, and we may get to wrap up early here. I also want to mention to make sure and note that all those links to get in touch with Chris and see what he can do for your company. So anything else you're, on your mind to, to let participants know?
I'll put a plug in for your shows. I always, I'm often at the CIDM shows and I really enjoy meeting people there, and, and we can do live demos and all that stuff, but I'm also happy to meet you online and set up a demo of Harmonizer anytime, so.
Oh, that's good to know. And yes, very, very wonderful to have those network, networking opportunities. For sure.
Yeah, I'm glad they're back.
Aren't we all? Glad the world is getting back to a normal – new normal, mind you.
That's true, very true.
I know. It looks like we've come to the end here, and I, again, want to thank you and DCL for this very informative webinar. Okay?
Thank you, everyone.
Hope to see you soon. Okay, bye. Thank you.