DCL Learning Series
Illuminate the Blind Spots in Your Content Strategy with Harmonizer
Stay informed, as always. And today's webinar is "Illuminate the Blind Spots in Your Content Strategy with Harmonizer," presented by Christopher Hill with DCL. Welcome.
Thank you, Trish. Let me just share my screen here. So I think you're seeing it. So today I wanted to share with you one of the tools we use at DCL in order to illuminate blind spots in your content strategies. So we have a number of techniques for doing this, but some of them we've made available outside of our organization, so they aren't just part of our own services. And I'll talk a little bit more about that later. But a little bit about me. I'm Christopher Hill, I'm a Technical Product Manager as well as Project Manager at Data Conversion Laboratory. And again, I'll share a little more about my company a little bit later on. My background really goes back to the early days of XML. I left public school teaching and went into corporate training in the late nineties and ended up in a company that did XML training courses. So for many years I developed and was on the ground floor of XML and the rise of XML. And things have changed a lot since then but I'm still talking about the same old themes. So here we are today.
So let's talk a little bit about what I'm going to be focused on today, and that's really a tool called Harmonizer. And one of the things we do is content conversion and projects around updating content. And over the years we developed tools to help us do this internally. So things like being able to automate these conversions or find things in the content that might be hard for a human to find. And Harmonizer really emerged from that effort and was productized a number of years ago. And then I have carried on with that product going forward. And it's offered in a number of ways that we'll talk about a little bit later. But it's software that basically takes any document collection.
It can work with multiple formats at one time. So that means things like Office documents, PDF files, DITA source files, HTML. You can feed it text documents if you want. And then because we're a conversion company, we can support all kinds of other formats, as well as any XML format that's out there or really any markup language. But what you do is you feed all this stuff into Harmonizer and Harmonizer will take and identify all the text blocks and then it does a massive many-to-many comparison. So it says "I'm going to look at all those text blocks and group them together by which ones are exact matches and also which ones are closely matching. So I will give you a whole bunch of match groups that give you groups of those texts that occur throughout your content that are either duplicate or near-duplicate." So what we use this for is we can identify redundant content. That can help us with migrations. Sometimes we can create some business rules around ways to, say, automatically create some topics during the migration, if you were going from something like Word to DITA, for example.
But it can also do a lot of other things that you might not immediately think of and all that really helps to improve the consistency of your content. Harmonizer is also a spotlight into: how much redundancy is there in my content? Because a lot of times we get customers who come to us and they say, "Hey, we've been hearing about DITA." Or "We've been hearing about component content management and we think this might be useful to us. We just don't really know how useful." And it's very hard to quantify that just by looking at your content. What Harmonizer does is put some real figures around that.
So in addition to locating the potentially duplicated content, which is the obvious use of Harmonizer, and being able to give you those statistics to say, hey, we need a reuse content strategy, or we need a component content management system, or we should look at DITA because we could save some money, it can also find things like inconsistent language in your content. So if you have the need maybe for some standards or style guides that need updating, or maybe need to be created in many cases, then Harmonizer will help you see where are our styles varying? What kinds of things are varying and why? And it can help you uncover some ways to create those style guides and improve the consistency of your content. And then it can just save time in general.
If you've got – either you're planning to do reuse and you want to do some work in front of that migration or you've already implemented a reuse system, it can be very hard to know if I'm actually reusing the content properly. Are my authors actually going out and finding the warehouse topics and using those or are they rewriting stuff and I don't know about it? Periodically you can use Harmonizer as a tool to check your content repository against your warehouse topics and it will uncover if there are a lot of places where things are being rewritten or written again. So all of these things really are some of the activities supported by Harmonizer. And then you'll probably think of some of your own as we go through it. So some of the questions that Harmonizer answers then are: if I'm in the early stages of looking at reuse and component content management, this can help us know that there's enough redundancy in the content to even justify the investment.
So you run one Harmonizer report, it gives you a nice overview of how much of your content is redundant or potentially redundant, and you can start thinking, okay, is this going to save us – is this effort to really move to a new system justified? It can tell you where the duplication is so you can see which files have the most duplication and maybe focus on those as you build out your topics. So it can tell you where the hotspots are so you can get focused on those right away and you don't have to go looking and trying to somehow dig through all your content and find those hotspots. Instead, you zero in very quickly. It can tell you why some of the content isn't matching. Is it because your authors use different language structures? Is it because you haven't got consistent language around your product names or other parties' product names that you may reference in your content? All of those things can help you figure out why we have those mismatches and maybe clean that up and create some rules or style guides.
Again, I mentioned earlier, is your editorial team even taking advantage of the reuse tools you're offering them? That's a big question that often is overlooked and it's hard to do that. It's hard to know without really reading everything and then trying to remember everything that you've created over the last year to know if your editorial team is generating content by duplication or if they're actually using the reuse tools available to them. Harmonizer can be a quick way to at least get a sanity check on that without doing a lot of effort. Again, content governance. Again, those inconsistencies may be pointing to a need for stronger content governance. And then finally, inconsistent terminology or style that may be detracting from the clarity of our content. Sometimes these inconsistencies can actually make the user experience poor because what happens is if you've got a lot of topics but you're using inconsistent styles across them, when you assemble them into larger documents, that can become confusing without the authors actually realizing it because they're working with those components. So Harmonizer can sometimes help suss those out a little bit and help you improve that consistency.
As I mentioned before, Harmonizer can work with a lot of different formats. And we offer Harmonizer in a variety of ways and I'll talk about that a little later. But basically, we can take really any digital format you can provide us, and we can find a way to get it into Harmonizer. That's usually not a problem. We are a content conversion company, so that's easy to say. But there are some formats that are easier for us to deal with. And out of the box, Harmonizer can deal with those in any mix or match. So you can take Word, PowerPoint, Excel, HTML, PDF, text, and then XML in any XML flavor: DITA, S1000D, DocBook. All of these formats, out of the box, Harmonizer can take and combine together. So another interesting thing that we see come up with companies that have already adopted component content management is they might have moved a bunch of their current content into their component content management system.
But then later they might realize we have a bunch of legacy content. Should we bring that over? Is it going to be useful? How much does it share? Are these older documents maybe useful to bring over and maintain now in the new system? Well, one of the things I've seen customers do is they take their current warehouse topics or their current library of content that they're reusing and they run it against a bunch of old Word documents, say, to see if there's a lot of stuff in those Word documents that's already being – matching up with stuff in your current warehouse of topics. And that might tell you that there's a lot of value in moving that Word stuff over to the new system because it would benefit, those documents would benefit from the newer language and improvements you've made through your component content strategy. So that's another nice feature of Harmonizer being really format agnostic. So it isn't just limited to a markup language.
Harmonizer can really handle anything that ultimately boils down to text somehow. How does it work? Well, it's pretty straightforward. Harmonizer takes all of those source files and it just funnels them into its own internal database. And when it does that, it breaks them apart into text blocks. So it breaks them into pieces and then it takes all those text blocks – and I use the term "text block" because for some formats, that's what you would typically think of as a text block, like a paragraph, a heading, something like that. So in a Word document, every block – or in a PDF. But in a markup language, a text block is a little more flexible. For instance, in a DITA project, you could compare all of your entire topics to all of the other entire topics, grouping them together by similarity. So you're not limited in the markup languages to just a single small piece of text or a paragraph of text. It could actually be a much larger thing you define.
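As a rough illustration of how the choice of text block changes what gets compared, here is a minimal Python sketch for XML input. It is not Harmonizer's implementation. The function name `text_blocks`, the `block_tag` parameter, and the sample markup are all invented for this example.

```python
import xml.etree.ElementTree as ET

# Invented sample markup, loosely shaped like a small topic.
SAMPLE = (
    "<topic><title>Brakes</title><body>"
    "<p>Check the <b>brake</b> pads.</p>"
    "<p>Clean with water.</p>"
    "</body></topic>"
)


def text_blocks(xml_source: str, block_tag: str) -> list[str]:
    """Return the flattened text of every <block_tag> element, one block each."""
    root = ET.fromstring(xml_source)
    blocks = []
    for element in root.iter(block_tag):
        # Join all nested text, then collapse runs of whitespace.
        text = " ".join(" ".join(element.itertext()).split())
        if text:
            blocks.append(text)
    return blocks


paragraph_blocks = text_blocks(SAMPLE, "p")    # small blocks: one per paragraph
topic_blocks = text_blocks(SAMPLE, "topic")    # one large block: the whole topic
```

Passing `"p"` yields paragraph-sized blocks, while passing `"topic"` or `"body"` yields one large block per document, which is the whole-topic or whole-page comparison described above.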
Or, for instance, if you're working with HTML, we've had customers do analysis of some of their knowledge base and they will actually compare the entire body tag. So all the text in the body of each HTML document can be compared, and that'll group your HTML documents together as documents. So there's a lot of cool things you can do in Harmonizer if you've got a markup format. So what happens is those text blocks are then fed into the database and every text block is compared to every other text block. And then they're put into groups and these groups are called match groups, and the match group will show you which text blocks are grouped together, which ones are exactly the same, and if any of them are close matches. So you'll see these groups of text blocks and then you'll see where those were in the files so that you can go back using your editor or whatever tool, content management system, and go back to those files. So those matches are then reported to you in a number of ways.
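The grouping step can be sketched in a few lines of Python. This is only a stand-in under stated assumptions: it uses the standard library's `difflib.SequenceMatcher` as the similarity measure and compares each block against one anchor per group, where the talk describes comparing every block to every other block. The names `build_match_groups` and `close_threshold`, and the sample blocks, are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def build_match_groups(blocks, close_threshold=0.8):
    """Group (file_name, text) blocks into match groups.

    Assumption: SequenceMatcher.ratio() stands in for the real similarity
    measure, which the talk does not describe.
    """
    groups = []  # each group: {"anchor": normalized text, "members": [...]}
    for file_name, text in blocks:
        norm = normalize(text)
        for group in groups:
            score = SequenceMatcher(None, group["anchor"], norm).ratio()
            if score >= close_threshold:
                kind = "exact" if norm == group["anchor"] else "close"
                group["members"].append((file_name, text, kind))
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append({"anchor": norm, "members": [(file_name, text, "anchor")]})
    return groups


sample_blocks = [
    ("a.pdf", "The LED turns green when the battery is fully charged."),
    ("b.pdf", "The LED turns green when the battery is fully charged."),
    ("c.pdf", "The LED turns blue when the battery is fully charged."),
    ("c.pdf", "Insert the disc into the DVD drive."),
]
sample_groups = build_match_groups(sample_blocks)
```

Each member keeps its file name, so a group can be traced back to the source files, and a group with a single member corresponds to a unique text block.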
We show you how many exact matches – how much of your content has one or more exact matches. And that's what's in the green area of that chart you see there. And then the blue area of the chart shows you how many of your text blocks don't have any exact matches but they have some close matches. So that's a whole different category that's blue there. And then finally you get your unique text blocks. And that can be very interesting because sometimes people are very interested in why some of the content is completely unique and doesn't match anything else we have and that gets put into that gray category. So those are also reported to you as strict numbers and it gives you an estimate of how much potentially redundant content could be eliminated if you were to get rid of all of the close matches and exact matches. So if you were to go to an extreme level of reuse, which is never really possible, but this gives you a ceiling here. So if you look at where it says "matches needed," in this particular set that I'm showing, that's telling me that 32% of the content would remain as warehouse topics theoretically. And then 58% of the content could disappear if we could reuse it all.
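The arithmetic behind a ceiling like this can be sketched as follows. The assumption, which is mine and not confirmed by the talk, is that in the best case one block per match group is kept as the reusable copy and every other block in the group is eliminated. The function name `reuse_summary` and the group sizes in the example are invented.

```python
def reuse_summary(group_sizes):
    """Summarize reuse potential from the sizes of match groups.

    group_sizes: number of text blocks in each match group; a size of 1
    means the block is unique. Assumption: one block per group is kept
    and the remaining blocks in that group could be removed.
    """
    total = sum(group_sizes)
    if total == 0:
        return {"unique_pct": 0.0, "kept_pct": 0.0, "removable_pct": 0.0}
    unique = sum(1 for n in group_sizes if n == 1)
    kept = sum(1 for n in group_sizes if n > 1)
    removable = total - unique - kept

    def pct(count):
        return round(100 * count / total, 1)

    return {
        "unique_pct": pct(unique),
        "kept_pct": pct(kept),
        "removable_pct": pct(removable),
    }


# Invented example: four match groups holding 4, 3, 2 and 1 blocks.
summary = reuse_summary([4, 3, 2, 1])
```

With these invented sizes, ten blocks reduce to three kept copies plus one unique block, so six of the ten blocks are the theoretical maximum that could be removed.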
Now that's obviously the ceiling but this gives you a really nice way to at least start building some estimates of what impact component content management might have on your organization. So Harmonizer provides you then this report that you can download and keep. So Harmonizer runs once and it runs on a heavy duty server. That's why you can't run it on your own machine. So you run the report, Harmonizer churns away at this big comparison project, and then when it's done, it gives you a set of HTML files as well as an Excel spreadsheet. And those contain the charts and all of your match groups available as data that you can then use on your own. So once you have a Harmonizer report, it's yours to keep and share as you wish. It's your property. So the server is really only involved in the generation of the report. And then there's a little bit we do, for instance, I'll talk a little bit later about our Harmonizer subscriptions, which we've provided Harmonizer with a front end on the web that allows you to go in and run these reports on your own without asking me or involving me and download them yourself. And we host those for about 90 days if you do that. So we'll keep your reports safe for a while and then you can download them and store them internally wherever you want.
So let's look at a Harmonizer report. So this is really the first page of a Harmonizer report and these screenshots are from the HTML version but the Excel version is very similar. And what you see here is we'll give you a menu across the top with the various pages of the report and then you give it a name of what your report is. You can add some notes to this page and a few things like that. But the real bulk of this page is just to give you an overview of that chart that I talked about earlier and the reuse potential table on the right. So that's giving you really a broad summary of what you're looking at now. If this was a project that I was working on and I was wondering "Do I need component content management?" I would look at this and I would say, yeah, it looks like, even in a low reuse percentage of the potentially redundant content, we still have quite a large percentage of this content available for reuse. So 84% of our content has some kind of match going on. So it's probably well worth it if I'm trying to reduce the footprint or the cost of managing this content. Or maybe I want to reduce the number of translations I have to do. It would be very worthwhile to look at a component content strategy here.
Then the other page that tells you those hotspots is the report information page. And this just tells you what's on the report. So you'll see there that in this particular example, which I actually ran by taking a bunch of old Gateway computer manuals from the early and mid-nineties I think, maybe late nineties. I just downloaded them from the Internet Archive, ran them through to see if their manuals really shared a lot of similarity. And what I found was it did, and if you look down at that table at the bottom, it's a little hard to see, but that's giving you a list of the files that you had analyzed. So in this case it's a list of the PDFs and it'll tell you how many blocks were in the PDF that we found and how many exact close and unique matches were found in that particular file.
So if I was trying to start looking at building some topics, I would use this page to identify where I should start, what manual has the most reuse or looks like it's a hotspot for reuse, and then I could focus in on that one. So you'll see some of those have thousands of matches and some of them have hundreds. And if I was to scroll down, some of them don't have very many at all. So that would tell me at least where to get started on building out some topics. Now when you look at the actual match group page, you'll see the Harmonizer matches. And what you're seeing here is down the left it's just showing you all of the text blocks that were extracted from a particular file in order and they're just numbered. Those numbers are by line number, or if it's Word or PDF, where we don't really have line numbers, they're just relative numbers to uniquely identify the block. And I can go through that document in order, and by clicking those numbers, I'll see what matched with that particular text block. So on the right you see there's a match group with some text blocks and I'll talk a little bit more about that in a minute.
All of this is offered to you through the Harmonizer portal. So our Harmonizer portal is that front end I talked about earlier that allows you to go run your own Harmonizer reports. When I started in my position as product manager for Harmonizer, Harmonizer was a service really. We had it as a tool that we used internally and then customers could come to us and say "I've got these files. I don't know what to do with them." We would have them pay us a one-time fee. We would get their files through an FTP drop, I would go run the report and then when the report was completed, I would send it back to them through FTP. It was a manual process. What the portal does, it allows you to go in and create your own Harmonizer jobs without talking to me and upload the files into the server and get back a report.
Now these reports can take a while to generate. Usually it's under an hour. In some extreme cases, if you've got a large number of matches or a large number of text blocks to analyze, it can take 24 hours or more. But typically, for the limits that are set on the self-service portal, we promise you the result within 24 hours. And the portal manages this. So it tells you where your job is, if it's being processed, and then when it's complete. And once it's complete, your list of jobs appears here and you can download them. So you can download the reports or you can view them right through the portal if it's more convenient.
So that is a service that was built up in the last three or four years since I've been here. And all you have to do to create a new Harmonizer job is you put together the files you want analyzed into a ZIP file, you choose a file type format, give it a job name. There's some optional settings that you can set. So for instance, you can control how similar the text blocks have to be to be considered a close match. You can also control how small the text blocks can be. Sometimes, if you have lots of really small text blocks, like maybe a table of numbers or data, that can create a lot of noise, so you can filter that stuff out by setting a paragraph size that you want analyzed. And then you just submit the job, you upload the file through the browser, submit the job, and Harmonizer will email you when the job is complete or you can go on the portal and check it. So that's really what Harmonizer is.
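The two optional settings mentioned here, how similar blocks must be and how small a block can be, can be pictured with a short Python sketch. The class and field names (`JobSettings`, `close_match_threshold`, `min_block_words`) and the default values are hypothetical, since the talk does not give the portal's real option names, and `difflib.SequenceMatcher` is again only a stand-in similarity measure.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class JobSettings:
    # Hypothetical names and defaults, chosen only for illustration.
    close_match_threshold: float = 0.8  # minimum similarity to count as a close match
    min_block_words: int = 4            # blocks with fewer words are ignored


def eligible_blocks(blocks, settings):
    """Drop very small blocks (table cells, stray numbers) before comparing."""
    return [b for b in blocks if len(b.split()) >= settings.min_block_words]


def is_close_match(a, b, settings):
    """True when two different blocks are similar enough under the settings."""
    if a == b:
        return False  # identical text is an exact match, not a close one
    return SequenceMatcher(None, a, b).ratio() >= settings.close_match_threshold
```

Raising the threshold makes close matches rarer and reports quieter, and raising the minimum block size removes the noise from tables of numbers, at the cost of possibly missing short repeated phrases.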
I wanted though to dig into some of those match groups so that you can see some examples of the things that Harmonizer is going to tell you in those detailed match groups. So let's look at some of those. So this is actually from the sample bike data set of the S1000D sample content. So if you get the – I think it's the toolkit, it includes some small bike files, bike maintenance files. And so I just ran those through and there were a few matches actually, duplicates in that content, which kind of surprised me at first, but it's because of the way they're structuring the example. But you'll see here that there are three match groups being shown here. These are all exact matches. So you look at match group one, this is telling me that there are two places where this exact block of text occurs. So this paragraph about the brake system. And I ran this, by the way, against an entire topic, that's why it runs together as text. But this is telling me that that topic occurs in two locations, in two files.
So you'll see there's one about the brake and one about the S1000 bike file. And they both have an exact duplicate of this text. And you'll see there's some other exact duplicates in those same files down below there. So that would be telling me, huh, if this is DITA, which this is, that would be telling me maybe I'm not doing reuse. Why do I have copies of this stuff? You can go figure that out in your content management system and maybe clean that up to reuse that particular text. Here's a close match example. So one of the things that really surprised me was there was a place in the dataset for the S1000D content where there was actually a variation in the text. I wasn't expecting that because it was an example of reuse, so I thought everything would be the same and reused, but it wasn't. So if I look, there's this statement, "clean with water" down here in this version of the procedure. And in the first variant, it doesn't say "clean with water". So they're exactly the same except for the phrase "clean with water". Harmonizer found that, and up here it gives you what's called the Harmonized paragraph.
So when there are some variations, Harmonizer will show you all the possible variations in this top version of the paragraph. This version shows you that "clean with water" is optional. It shows it's in some of the matches and not in others. So then I can go look at the variants and the variants are going to tell me this one doesn't have "clean with water", this one does, and I can see which file uses "clean with water" and which one doesn't. So this file has "clean with water", this one doesn't. You can see how that might be useful because I might tell myself – Let's say I have aviation manuals where life and death can be involved in these processes, it might be good to know that I'm missing a piece of the instructions in one of the topics or one of the versions of my content. So I can go investigate why it doesn't say "clean with water," right? We had an actual aviation customer where I was doing a demo and in the demo I happened to pull up a match where it said "Turn a knob right and turn a knob left."
And it was supposed to be right. And one of theirs had a typo in it and they were very glad to find that. In all their editorial processes, they had never noticed that, but Harmonizer sniffed that out right away because it showed an unexpected variation that they did not think should be there. Here's the Gateway computer manuals. I took some examples from that. So just to give you an idea of, as I said, I just took, I don't know, half a dozen or a dozen PDFs of Gateway computer manuals for different models. And you can see that Harmonizer found – that question mark is an unknown character. Because these are PDFs, that's probably a bullet point. And then you can see, LED "green," "off," or "blue," battery is fully charged.
So this particular phrase is being said in three different ways. It's either green, off, or blue. You can see here, variant A is where they used green there, variant B is where they used off, and variant C, you can see they used orange on purple. So that's a different one. You can go back to that source content and have a look at that. But here you can see LED green was used in these particular models of the laptop. So the Gateway 200, 305 and 350. Here, the M520, the LED turns off when the battery's fully charged. And here for the M675, the LED turns blue when the battery is fully charged. So I not only find out what manuals are using which version but I also can find out where those are. I could click that and it would line up over here with where it was found in the order of the document.
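The idea of a harmonized paragraph, one merged text that marks what is optional and what varies, can be sketched for two variants with Python's `difflib`. This is an illustration only: the bracket and brace notation is invented here, the real report handles any number of variants, and the sample sentences are made up in the spirit of the "clean with water" and LED examples.

```python
from difflib import SequenceMatcher


def harmonize(variant_a: str, variant_b: str) -> str:
    """Merge two variants into one string that marks where they differ.

    [text] marks text present in only one variant (optional text);
    {a|b} marks alternatives. The notation is invented for this sketch.
    """
    a, b = variant_a.split(), variant_b.split()
    out = []
    # Compare word by word and walk the resulting edit operations.
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        left, right = " ".join(a[i1:i2]), " ".join(b[j1:j2])
        if tag == "equal":
            out.append(left)
        elif tag == "delete":
            out.append(f"[{left}]")
        elif tag == "insert":
            out.append(f"[{right}]")
        else:  # "replace": the two variants use different words here
            out.append(f"{{{left}|{right}}}")
    return " ".join(out)


optional_example = harmonize(
    "Remove the chain. Inspect the links.",
    "Remove the chain. Clean with water. Inspect the links.",
)
alternative_example = harmonize(
    "The LED turns green when charged.",
    "The LED turns blue when charged.",
)
```

The first call marks the extra sentence as optional and the second marks the two color words as alternatives, which mirrors how the shared text stays constant while the differences stand out.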
Then I can go back to this original PDF or, hopefully, I'm actually using a source editing format and I can go back and edit that if needed or check it. So this is an interesting match also because it's actually helping me check the accuracy of the content. Because if I was a subject matter expert, I'd probably know if these are correct for these models. So that's another example that you might have there in Harmonizer. Here is one where you can see that there's a very similar word here to talk about, "Make sure that your notebook disconnects correctly from your internet account." Okay, this was in the days of dial-up, I believe. So you can see in one place they said "even if you are not using your notebook."
And in another they said "even if you are not at your notebook." So they used some different language there and I could go check out why that was if I wanted. Here, you can see that there's some information about help and support, which is pretty general across all the laptops, but I was surprised at the number of different ways they talk about this help and support and how close a lot of it was. So that black text is all the same across all these variations, but in some they say in Windows XP, in some they talk about a computer versus a notebook. Sometimes they mention instructional videos but for some they don't. Then they have this long phrase here that might appear or the shorter version here. They say about questions about Windows or questions about notebook or questions about Windows and help you quickly discover and use the many features of your Gateway computer or notebook. That was what that was.
So I can actually go here to the variants and I can see exactly which version they chose for which particular files. And this can help me decide, huh, we could probably clean that up and make that a reusable piece and forget about all this effort being done to make these minor changes to this particular content. There we go. Here is another example from the Gateways where they talk about volume, the volume controls, and it also has a page reference. Now, if I was doing a conversion project, if I was taking this content and going to turn it into DITA, one of the things you have to cope with are these page references. You can't hard code those. So Harmonizer can help you discover where those are and then you can replace those hard-coded numbers. So if I did a conversion to DITA, these numbers might just come over, but they don't have any relevance in DITA. I'm supposed to replace those with an actual link of some type that links to a topic that on rendition actually becomes a page number.
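Once a report has pointed out that page numbers are the only thing varying between blocks, a simple scan can list every hard-coded reference ahead of a conversion. The Python sketch below is a separate helper, not a Harmonizer feature. The pattern, the function name `find_page_references`, and the sample blocks are assumptions made for illustration.

```python
import re

# Matches "page 42", "Page 7", or "pages 10-12" (case-insensitive).
PAGE_REF = re.compile(r"\bpages?\s+\d+(?:\s*-\s*\d+)?", re.IGNORECASE)


def find_page_references(blocks):
    """Return (block_number, matched_text) for each hard-coded page reference."""
    hits = []
    for number, text in enumerate(blocks, start=1):
        for match in PAGE_REF.finditer(text):
            hits.append((number, match.group(0)))
    return hits


sample = [
    "Adjust the volume. See page 42.",
    "Open the lid.",
    "For details, see Pages 10-12.",
]
references = find_page_references(sample)
```

Each hit gives the block number and the matched text, so the references can be replaced with real links to topics during the migration.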
If Harmonizer is running this against DITA and you have references here, Harmonizer won't report those as differences because those are references in the markup, not in the actual text. So Harmonizer, by ignoring those references, helps you find where the text has potentially got hard-coded references and things like that. So that's another thing you can use that for. If we go back to the days of CDs and DVDs, you can see that some of the notebooks had a DVD burner apparently or reader and some of them did not. We can get a little history of computing, at least Gateway computers, and see which models had the DVD option. It looks like most of them did except for this 305, for whatever reason.
This is an example from even older computer manuals. These are from Commodore computers. So depending on what region you were reading the manual, they were using different versions of the word color. So for the countries like England or maybe Canada, they had a "u" in "color" back in the late seventies, early eighties. So that's how those were written. And you can go through, there's places where there's different key presses that you might use for different models, and that's where the variation is. If I was going to reuse this stuff as a topic, I would have to come up with a way maybe to provide these correct key press sequences as part of a publication variable that got fed in and inserted at published time based on which model I was publishing the content for. Those are all things you have to really strategize about when you're working on moving to a component content management system. So Harmonizer can help you make sure that you're starting to take those into account as you do those conversions.
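The publication-variable idea can be shown with Python's `string.Template`: one shared topic, with the model-specific or region-specific pieces filled in at publish time. This is a minimal sketch, not how any particular content management system does it. The variable names, the profile names, and the key names are invented placeholders and are not taken from any real manual.

```python
from string import Template

# One shared topic; $save_keys and $color_word are publication variables.
TOPIC = Template("Press $save_keys to save. The border $color_word changes.")

# Invented placeholder values, keyed by a hypothetical model/region profile.
PROFILES = {
    "model-a-us": {"save_keys": "KEY-1", "color_word": "color"},
    "model-b-uk": {"save_keys": "KEY-2 + KEY-3", "color_word": "colour"},
}


def publish(profile_name: str) -> str:
    """Render the shared topic for one model/region profile."""
    return TOPIC.substitute(PROFILES[profile_name])


us_text = publish("model-a-us")
uk_text = publish("model-b-uk")
```

The variations a report surfaces, such as spelling by region or key presses by model, become the list of variables the shared topic needs.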
So that's basically what Harmonizer is. That's how it works. How do I get it? Well, I mentioned this before, but I'll go over this again. We offer a traditional model where we can give you a price quote based on the size of the content. So if you're doing something really enormous, like I had one customer who was migrating their entire Confluence knowledge base, it's an internal knowledge base, it was huge.
It was developed over decades. They were going to migrate that and they just wanted to do a giant analysis that was going to be very involved. And then do a whole bunch of subset analysis and do it in a lot of different ways. They worked with us on that as a service. So they didn't want to try to manage all that themselves, plus it was too large to work in our self-service option, so they worked directly with us. We had a project manager that worked with their project manager and it was treated as an actual project. That's the traditional or our model that we were pursuing until I brought on the new models. The self-service model is a subscription model and there are a number of subscription models that we've been working with.
As we've been evolving this, we're trying to figure out which models work best for people. And as it turns out, there's some different models that work for different people. So we offer an on-demand model, we offer a reduced-cost model if you only have single formats to compare and smaller sets of content and users. Or we have an enterprise model if you want to do comparisons of many different content types and have the most flexibility. Those self-service models come to you in some different ways. We have some customers who are using on-demand, which means we've made a contract with them based on projected volume and such where we set a price for a single report. So they basically pay every time they create one of their reports and every month they get billed based on the number of reports they've run and that's just a pure dollar per report model. Then the standard model is actually annual pricing. So you get a subscription and you can have up to five user accounts who can go in anytime they want, run as many reports as they want, and they get 10 content sets per month. So there is a limit to the number of content sets but then they have unlimited cloning. Which means if I've got, say, 10 PDFs that I upload and create a report from, cloning means I want to create another version of that report with different settings but the same content. If you want to do that, that's covered under the unlimited cloning. So 10 uploads of content set per month is the limit on the standard.
And then the enterprise is really, if you have the need to compare multiple formats, you want to be able to have as many content sets a month as you want. And these tend to be priced so that they're attractive to move to the enterprise if you're going to hit over 10 content sets a month. So the standard is really for lower-volume use and enterprise for more volume. So a little bit about DCL. I'm going to close with that just so that you know we aren't just Harmonizer. As I mentioned at the beginning, Harmonizer was really created from our other work to help us internally be able to be more efficient and faster at finding and helping our customers create reuse strategies. And we are a company that has worked on structuring content and data since 1981, so we've been around forever. We were working on those old – well, I wasn't, but someone was working on those old tapes back in 1981, which still just amazes me because I guess I was 13 years old.
So I was getting started on that VIC-20 computer that we were reading the manual from right about that time. But anyway, where we've really been known or made, our name is in data conversion, so being able to take data and move it between all these formats. But we aren't just a commodity data conversion vendor. We have a lot of expertise because of our long history and the depth of our employee talent where we can enrich those conversions with a lot of other things like semantic enrichment. We do projects that bring in semantic information and do natural language processing as a part of that data conversion. They can extract entities and do all kinds of magic on that content.
We also have built a strong expertise around data harvesting, being able to go out and find data either on the web or on your internal systems and automate the process of scraping that data. We've done projects where we will go out and take public information sets, like legal information or other things, and be able to put those into formatted versions from the HTML as an example. We do a lot of sophisticated QA validation. We have a whole QA system that I also am managing that allows us to quickly assemble very sophisticated quality assurance on content that would be hard to do manually or would take a lot of time to set up.
Because we have this history, we've set up all these things in reusable ways. And then we've got platforms that allow us to support all these things of creating training sets, doing content analytics and structured content delivery, which is a big deal. These CCMS migrations, we do a lot of work in that area. So we've been doing this work. We're very knowledgeable. And additionally, we have all of these internal tools as well as some external tools like Harmonizer that we share with you. So anyway, that's really all I have. I think I hit my planned time just on the nose, but you might have some questions, so we can open that up for questions. You can also learn more about us. If you want a demo of Harmonizer, there's a number of ways we do demos. I can do a generic demo anytime for anybody. So just email us and we can set something up. That's very easy to do and quick. You can also bring some content.
Oftentimes for a demo, I'll ask you if you have some content where you think there's some reuse or where there's some reuse and you just want to see your content. I'm happy to run those demos with some custom content if it's a small amount, and I can't sit there for hours, but maybe a hundred or 200 pages of content, I'm happy to run in a demo mode and show you how it works. So feel free to contact us. You see our website there. Email is probably the way that most people will want to contact us but there's also a phone number there if you feel like calling. And with that I'll see if we have any questions.
That was wonderful, Chris. Just a few logistics. I did forget to mention that we are recording and I will send out a link of this webinar to all those who registered.
I also will post this webinar, as with all our past webinars, on the CIDM website. So that will be available to you also. Please, I saw a hand up. If you would post your questions in the Q&A and we can get to them now. So we've got quite a few questions.
Yeah, I see a few. I'll go ahead and just read the first one.
Very good question: "Are many of your users taking advantage of these features to identify changes for localization and similar purposes?" Actually, I'm glad you asked that. I mentioned it in passing, but localization is a big use of Harmonizer, actually. We have customers who aren't even considering component content management. They just localize to a lot of languages and they have all these files that they supply to their localization vendors. Their localization vendors rely on translation memory systems to be able to remember blocks of text that have already been translated. And if the block of text matches exactly, then they aren't charged basically for that second translation because it's using the same translation twice. So even if you aren't going to component content management, but you're localizing, if you're using translation vendors who use those translation memory tools, then it is very beneficial to make sure that, even in your copies of your content, you're using the same exact language.
So Harmonizer helps you find that in those Gateway notebook examples, which we could go back to here. If I could find a way – Right now, if I was to translate this thing about CDs and DVDs, I would be charged for two translations there. And so they have to translate this phrase twice. All these exact matches presumably could be reusing the translation memory. If I was to find a way to rewrite this so that I didn't have to have this text missing, then I could combine this to an exact match and get rid of a translation. Now, one translation isn't a big deal, but if you do this through the entire report and you get rid of half of the translations, that could go a long way towards reducing costs, especially if you've got multiple translations.
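To make the translation-memory arithmetic concrete, here is a minimal sketch. It assumes a vendor charges once per distinct text block per target language and nothing for exact repeats, which is a simplification of how translation memory pricing works. The sample sentences and the function name are made up for illustration and are not taken from the Gateway notebook content.

```python
from collections import Counter


def billable_translations(blocks, languages=1):
    """Count billable translations, assuming exact repeats are free.

    Simplified assumption: one charge per distinct block per language.
    """
    distinct_blocks = len(Counter(blocks))
    return distinct_blocks * languages


# Two near-duplicate sentences force two separate translations.
before = [
    "Insert the disc into the drive.",
    "You can burn CDs with this drive.",
    "You can burn CDs and DVDs with this drive.",
    "Insert the disc into the drive.",
]

# After rewriting the near-duplicates to one shared wording.
after = [
    "Insert the disc into the drive.",
    "You can burn CDs and/or DVDs with this drive.",
    "You can burn CDs and/or DVDs with this drive.",
    "Insert the disc into the drive.",
]

print(billable_translations(before, languages=5))  # 15
print(billable_translations(after, languages=5))   # 10
```

One merged block saves one translation per language, so the saving scales with both the number of near-duplicates you harmonize and the number of languages you localize into.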
So that's a good question because that's an area where we have used and customers have used Harmonizer. I see there's another question that says "Do we want to create any specific rules to classify the reusable content?" So basically, Harmonizer is just the report piece. That's what it offers. So what you do with that report and whether you want to create business rules will depend on what you're doing, why you're using the report. But it is certainly one thing we do. For instance, when we do conversions, sometimes customers will use the Harmonizer report to give our conversion team rules and say "Okay, we'll clean this up and then we'll create some business rules that'll tell you how to automatically create some topics." Harmonizer is a tool used to help create those rules. It doesn't do the rules itself.
Third question I see there is "I've worked in DITA and CMSs since 2009 as an information architect trainer, project lead, and people manager. So I'm a believer."
So I think a lot of you out there are believers or you wouldn't be involved with CIDM. You must have some knowledge and interest. That usually leads people here. But here it says, "However, I've also seen reuse, conref, topic and map level, and filtering implementations become so complex that the overhead of managing them is not worth the gains of reuse."
"Are your users using Harmonizer to steer reuse governance towards more of a balance? And if so, can you say more how that has worked?" Well, that's a really interesting question. Harmonizer certainly could play a role in that. I can see that quite obviously. But I haven't personally worked with anyone on trying to create those balances, at least not in this job since I've worked with Harmonizer. I'm more on the front end of those conversions. Our conversion team is very experienced in that balance. And they do look at all the data that they have available to them, including the Harmonizer report. The other thing you have to balance, and I am glad you said a people manager, because I think anybody in any of those roles that you listed is really ultimately a people manager. Because as you know, and as you imply here, you can create a really complicated reuse but people aren't going to use it if it's too hard for them.
And again, that's a place where Harmonizer, if you run that on your existing DITA project or your existing CCMS data, you can see where people are doing reuse and where they aren't. And looking at where they aren't might tell you that you have to adjust your reuse strategy. Maybe it's too hard to do that. For instance, yes, I could put in publish-time variables to say whether or not DVD support is in this particular computer and then add this as conditional text. It'd only get inserted if the publish-time variable tells me there's a DVD burner. Is that really worth it or is it better to just write it as CDs and/or DVDs and allow the customer to know whether or not they have a DVD burner? Maybe that's better, or maybe that's bad because a customer's going to think they have a DVD burner. All those things are considerations that you make. Again, Harmonizer is just a tool you can use to help you manage that reuse strategy and achieve that balance.
Okay: "We have a lot of data converted and configured into CSDB. Can this Harmonizer be integrated with CSDB to identify the reusable content?" I actually must admit, I don't know what CSDB is. A content services database or something? But anyway, regardless of what it is, any CMS that has data in it, Harmonizer is not integrated right now with any CMS. So because we're so format neutral, what we rely on is an export. So almost any system, you can export the data. As long as you export it into a format that we can read, we can work with it. So that can be, as I said, any of the supported formats. And I see a clarification there that I am right that CSDB is similar to a CMS. So if you can export from CSDB or any CMS, then you can run the Harmonizer report. Then you have to make sure you can go from those exported file names back to the files in your CSDB system or your CMS. We can help you with that if that's something that is complicated.
Most modern CMSs, it's very easy to export the stuff for use by an external system. And you usually want to export the source formats. You don't want to export published formats in most cases because the source formats are going to ignore those places where you have conditional text or things like that that might cause the produced renditions to mislead you about reuse.
So I think that's all the questions I see. I don't know if anyone else had any. You can type them in real quick. Otherwise, I'm also happy to answer follow-up questions. Sometimes your questions may be very specific to you and you don't want to share them publicly. I'm happy to answer those or work with you on those. So you just feel free anytime to reach out to us and someone will put you in touch. If it's about Harmonizer, you'll talk to me pretty quick. You'll go through one person to do it, but it's pretty fast that we can set something up.
I do see some questions coming in here. So real quick, "Can Harmonizer be run on non-English content?"
Oh, that's a great question. I did not even think to mention that in this presentation. Harmonizer is language independent. So let me talk a little bit about how it does the close matching. The exact matching is a character-by-character match, so it's considered an exact match if all the characters in the text block match. That's easy. Anyone could do that. You could do that with Find, although it'd still be cumbersome, but you could theoretically find every paragraph in your content and see if there's a duplicate. That's the exact match part of Harmonizer. The close match, I didn't really talk about that algorithm, but it's a very powerful algorithm that's character-based. It is not syntax-based. So it doesn't care about the language at all. It just cares that there are characters in the language. So it only works on character-based languages. But what it does is it looks for sequences of characters that are the same between two text blocks and it uses a rather sophisticated algorithm that is also position independent to some degree.
So let me give you an example. If you were to write "Before servicing the engine, disconnect the power," or if I was to write somewhere else "Disconnect the power before servicing the engine," those would be hard to find with just a straight character-by-character match. What Harmonizer does on its close match is it looks for character sequences that aren't necessarily in order but that are the same. So it will find those with a high level of matching, even though none of the exact characters appear in the same location. Now it does that in English but it can do that in any character-based language, because again, it's not looking at the syntax, it just needs to have a language. Most western languages and some of the eastern languages also are character-based and those will work just fine in Harmonizer. It can even mix them. So for one project I did a series of English textbooks for German students, so they had a mix of German and English in them, and Harmonizer doesn't bat an eye. It's totally fine with having a report that has German, English, anything else in it.
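The talk does not disclose Harmonizer's close-match algorithm, so the following is not that algorithm. It is a minimal sketch of one well-known technique with the same general properties (character-based, language independent, largely position independent): comparing sets of character trigrams with the Dice coefficient. The function names and the choice of trigrams are assumptions made for illustration. Exact matching is shown alongside it as simple grouping by identical text.

```python
from collections import defaultdict


def exact_match_groups(blocks):
    """Group indices of text blocks whose characters match exactly."""
    groups = defaultdict(list)
    for index, text in enumerate(blocks):
        groups[text].append(index)
    # Only groups with more than one member are duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


def char_ngrams(text, n=3):
    """Set of lowercase character n-grams, with whitespace runs collapsed."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def close_match_score(a, b, n=3):
    """Dice coefficient over character n-gram sets, from 0.0 to 1.0.

    Stand-in technique, NOT Harmonizer's actual algorithm. Because it
    compares sets, the order in which sequences appear does not matter.
    """
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


first = "Before servicing the engine, disconnect the power."
second = "Disconnect the power before servicing the engine."
unrelated = "Insert the disc into the drive."

print(close_match_score(first, second))     # high: same sequences, different order
print(close_match_score(first, unrelated))  # low: few shared sequences
```

Nothing in the sketch looks at words or grammar, only at runs of characters, which is why this kind of approach works the same way on German, English, or a mix of both.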
Yeah, that is nice. "Is there a maximum file size for Harmonizer?" And she says she has many large files.
Yeah, there is not a maximum file size. Harmonizer theoretically can deal – we haven't put an artificial limit on the server itself. So the server, we've handled some incredible sized files. Really the limit you start to hit is what a human can cope with. If all you're interested in is the overview page, that page that gives you the nice chart, if that's all you're interested in, so you're just trying to get an overview, you can run an enormous amount of content and Harmonizer will produce that page. But if you're going to dig into the actual matches, obviously at some point, if you have a million matches, it's going to be impractical for you to go through them all. So then you have to start figuring out how to break your content down or reduce the amount you're going to look at. But there is no artificial limit to the server. There is a limit on the self-serve model as to the file upload size. So there are some size limits. They're very generous. Usually if it's at all practical for a human to use the report, you won't hit those limits for file uploads. But if you want to go over those file upload limits on our self-serve, then that's when you have to go to a project where we actually work with you in the traditional model.
Another question here. Excuse me. "Did you say Harmonizer will analyze mixed formats at the same time? I can mix PDFs and Word and HTML; it runs against all of them at the same time? And is there a limit to how many file formats it can analyze at once?"
Yeah, so it can.
Basically, the way Harmonizer works is the ingestor – there's multiple ingestors that know those formats and take the text out. And then it puts it in a database. And at that point, the database doesn't care what the original format was. So the ingestors are all responsible for that first part of getting the data out of those formats. And you can have as many ingestors activated as you want for a job. Now that's with the enterprise plan. The standard plan typically is only provided for single formats. We have some customers who occasionally do some PDFs. What they've done is they actually, when they want to compare the PDFs, they do the conversion themselves if they're on the standard plan. Our enterprise customers don't have to worry about that. They just upload everything. And any mishmash they want, as much as they want.
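The ingestor idea described here can be sketched as a small registry of per-format functions that all produce the same format-neutral records. This is an assumed design for illustration only: the function names, the record layout, and the use of file extensions to pick an ingestor are hypothetical, and an in-memory list stands in for the database. Only plain text and simple HTML paragraphs are handled.

```python
from html.parser import HTMLParser
from pathlib import Path


def ingest_text(raw: str) -> list[str]:
    """Plain text: treat blank-line-separated paragraphs as text blocks."""
    return [" ".join(p.split()) for p in raw.split("\n\n") if p.strip()]


class _ParagraphParser(HTMLParser):
    """Collects the text content of <p> elements, ignoring inline markup."""

    def __init__(self):
        super().__init__()
        self.blocks, self._buffer, self._in_p = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p, self._buffer = True, []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def ingest_html(raw: str) -> list[str]:
    parser = _ParagraphParser()
    parser.feed(raw)
    parser.close()
    return parser.blocks


# Hypothetical registry: one ingestor per file extension.
INGESTORS = {".txt": ingest_text, ".html": ingest_html}


def ingest(files: dict[str, str]) -> list[dict]:
    """Map {file name: raw content} to format-neutral text-block records."""
    store = []  # stands in for the database
    for name, raw in files.items():
        ingestor = INGESTORS[Path(name).suffix.lower()]
        for text in ingestor(raw):
            store.append({"source": name, "text": text})
    return store


files = {
    "manual.txt": "Disconnect the power.\n\nOpen the cover.",
    "help.html": "<p>Disconnect the <b>power</b>.</p><p>Close the cover.</p>",
}
for record in ingest(files):
    print(record["source"], "|", record["text"])
```

Once the blocks are in the store, any comparison step sees only text and a source file name, so a sentence from the text file and the same sentence from the HTML file can land in the same match group.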
That's great. One more here. "If we use Harmonizer to find things like page numbers and other inconsistencies, can DCL help us clean up and convert the content?"
Absolutely. That's our main business. So really Harmonizer is a product that I manage, so I'm kind of a weird fish in the company in that I have my own little side thing going on. I support our conversion services people. So they're one of my customers actually because they use Harmonizer sometimes as part of their conversion projects. But they do those conversions and they will look at that data. So the more information you can bring to a conversion vendor about the nature of your content, the better your conversion typically will be and the smoother the project will typically go. So if you can avoid surprises, like all these page references or other things, and you can strategize before you start a conversion, obviously that's going to be a lot better for everyone involved as you go through a conversion project. And again, that's where Harmonizer's roots really were. We used it to find stuff like that and to make the conversions run smoother. So it's still part of our process for our conversion practice and we certainly will help you get those taken care of.
Well, that's all the questions I have. We'll give it just a minute for anybody to think of something and quickly get it down. But of course, there's your contact information; feel free. I will also send out Chris's email to those on the link so that you will be able to have that as well. Well, this has been a great webinar full of a lot of interaction, which is always wonderful. And I want to thank you, Chris. It's been a pleasure. I've learned a lot and I'm sure all those who've attended have as well.
Yeah. I always love working with you guys, so thank you.
It's been a pleasure. So anyway, we will, as I said, this is recorded and it will be posted on our CIDM past webinars page, as well as the link provided to all those who registered. So with that, I think we'll end. And again, thank you so very much. It's been a great webinar.