  • Christopher Hill

Harmonizer: The First Step in Identifying Content Redundancy

Harmonizer, Data Conversion Laboratory's product for finding duplicates and near-duplicates in large volumes of content, is now well within its second decade of existence. Even with the wide adoption of componentized content management and XML standards like S1000D and DITA, there is still plenty of opportunity for copies and near-copies of content to appear throughout repositories. Manually finding these duplicates in even a modestly sized content collection is a real challenge. This simple premise has driven the evolution of Harmonizer.

Since taking on the role of Harmonizer Product Manager just over a year ago, I have had the opportunity to dig into the topic of content duplication and have been given the resources to apply what I have learned to the product. The development team gave Harmonizer a bold new look that made the results much easier to interpret, and added Excel output that provided more flexibility in operationalizing the data.

Harmonizer Overview Report - A Dashboard for Content Redundancy Identification

Improved Natural Language Processing

The underlying engine driving duplicate detection was reworked to take advantage of the latest in the artificial intelligence field of natural language processing. The new NLP algorithm allows Harmonizer to find a greater number of fuzzy matches between text passages, uncovering many more near-duplicates than our previous product could. The product is also much more configurable, giving us the ability to tailor Harmonizer's processing to more situations.
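To make the idea of a fuzzy match concrete, here is a minimal sketch of near-duplicate detection between paragraphs. It is not Harmonizer's algorithm, which is not described here; it swaps in the simple character-level similarity ratio from Python's standard `difflib` module. The function names, the sample paragraphs, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def find_near_duplicates(paragraphs, threshold=0.8):
    """Return (i, j, score) for every pair of paragraphs whose similarity
    ratio meets the threshold. The threshold is an illustrative assumption,
    not a Harmonizer setting."""
    cleaned = [normalize(p) for p in paragraphs]
    matches = []
    for i, j in combinations(range(len(cleaned)), 2):
        score = SequenceMatcher(None, cleaned[i], cleaned[j]).ratio()
        if score >= threshold:
            matches.append((i, j, round(score, 2)))
    return matches


# Hypothetical sample content: the first two paragraphs differ by a few words.
paragraphs = [
    "Remove the access panel before servicing the pump.",
    "Remove the access panel before you service the pump.",
    "Torque the bolts to the value given in the table.",
]

for i, j, score in find_near_duplicates(paragraphs):
    print(f"Paragraphs {i} and {j} are near-duplicates (similarity {score})")
```

Comparing every pair this way scales quadratically with the number of paragraphs, which is workable for a small sample but is one reason purpose-built tooling matters for large content collections.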

Customer feedback on the changes to Harmonizer was almost unanimously positive, but we continue to look for ways to help our customers get more out of their Harmonizer reports. Last month we released Harmonizer 3.5, which brings

  • greater clarity to the reports

  • a number of user interface refinements to better navigate the reports

  • additions to the Excel output to bring it into parity with the data offered in the HTML presentation

The most prominent change in Harmonizer 3.5 came not from a specific user request, but from our observations of user behavior. In previous releases, to see where a particular item in a match was located, the user had to navigate to the Document Map on a separate page of the browser. Users who were adept with browser tabs could set up the two reports side by side, but most stuck to a more time-consuming back-and-forth navigation between the two pages of the report.

Harmonizer 3.5 introduces an integrated sidebar to the paragraph match report that allows anyone to easily view the context for the text without navigating away from the match.

Until the sidebar is opened, Harmonizer 3.5 operates in the same familiar way as previous releases, so those accustomed to this approach can continue to work as they always have. Clicking the sidebar tab loads the integrated Document Map and Paragraph Match pages together in an easily navigated window. Once loaded, you can rapidly move through the matches in document order, or in the match group order familiar to users of previous releases.

When updating a product, it can be tempting to focus entirely on the direct requests of users to determine what to do. While those direct requests continue to be critical in driving a product roadmap, I believe that some of the best improvements come from looking beyond the direct feedback and working to understand the underlying goals and motivations of our users. A direct request from a user is an indicator of an underlying challenge they are trying to overcome. Until that challenge is understood and addressed, the product may never deliver its full value to its users.

Want to see Harmonizer in action on your content collection?


