
DCL Learning Series

From Unstructured to Structured Content: Transforming Legacy Aircraft Documentation From PDFs to DITA XML

Marianne Calihanna

Hello everyone, welcome to today's webinar. This is a joint production from Componize and Data Conversion Laboratory, or DCL, as we are also known. My name is Marianne Calihanna. I'm the VP of Marketing at Data Conversion Laboratory. And I'm here to welcome everyone today and introduce "From Unstructured to Structured Content: Transforming Legacy Aircraft Documentation from PDFs to DITA XML." Today's webinar is being recorded, and it will be available on both the Componize website at componize.com as well as the DCL website at dataconversionlaboratory.com. You will receive an email with links to both the websites. And I am so happy to introduce today's speakers. We have our partner at Componize, Dipo Ajose-Coker. He is the Product Ambassador at Componize. And my colleague, David Turner. David is our Digital Transformation Consultant at Data Conversion Laboratory. Welcome, gentlemen.


Dipo Ajose-Coker

Hi, there.


David Turner  

I think that my screen sharing went off, actually, here. So –


Marianne Calihanna  

It looks like it, yeah, we're gonna pull this back. 


David Turner

Let's try to just join, get those slides back up, and we'll get rolling there.


Marianne Calihanna

During today's conversation and presentation, if you have any questions, please feel free to submit them via the chat dialog box here in the Demio platform. We have saved time at the end of this presentation to answer anything you'd like to ask our two content structure experts. All right, gentlemen, turning it over to you.


Dipo Ajose-Coker

Thank you so much.


David Turner

All right, Dipo. Why don't you tell us a little bit about yourself, and we'll get rolling.


Dipo Ajose-Coker

Hi, everyone. Hi, yeah, um, well, Dipo Ajose-Coker. I work with Componize as a Product Ambassador. I'm an ex-English teacher from the UK, and in 2005 moved to France and took an MA in technical writing. And I've been working as a technical writer ever since until, well, a couple of years ago when I joined Componize as Product Ambassador creating content. I've got 16 years writing documentation. And like 14 of those have been spent in medtech or fintech, i.e. regulated environments. I've got experience in authoring structured and unstructured content and migrating technical documents into DITA and a bit of content strategy as well. Componize. Let me just introduce you to Componize, who are we and what do we do. We're built on an Alfresco platform, and Alfresco is an open-source CMS, so you get all the features from Alfresco including user management, permissions, customisable workflows.


And as Componize sits on top of Alfresco, you get all the benefits of that platform including security, cost efficiencies, sustainability and so much more that you get from an open-source-based platform. Componize is also based on open standards and open APIs. Open standards mean that we are able to accept pretty much any XML document type. We don't change your source code; your XML remains pure DITA, for example, so as not to lock you in to our solution. We are not about that. We are about helping our customers. And with open APIs so you can pretty much plug in anything you want into Componize. Move on to – next one.


3:58

Well, with Componize you can manage your legacy content including non-XML documents. So things like Word files, InDesign, PDFs, digital assets, CAD files, and all of that can also be version-controlled, you can have previews of this, you can have access management, make sure that certain people cannot touch certain types of files. For example, engineering can manage all their CAD files in Componize, and users can see previews, writers can then make cross-references to the version that they want. Marketing can store datasheets and other content in Componize, making it easy to collaborate and reference that kind of information. The advanced search capabilities of Componize also take advantage of the metadata stored within the files as it was applied to files, so when you search within Componize, you know, you can find pretty much anything you want.


You've got fine-tuned permissions and access controls. With the metadata management, I just mentioned that for a second, and it's basically you are able to extract all your metadata, work on it in your taxonomy tool, and then inject it back at the right level to the right file. So, if your taxonomy's changed, you don't need to worry about, like, having to apply it one by one to individual files; Componize can handle that for you. As I mentioned before, so it's pure XML, and we can handle S1000D, DocBook, Markdown, all of that sort of thing. And if you use your content across many different platforms, it's an ideal solution, because you can, like, for example, plug your chatbot into segmented information, and have that chatbot interact with your end user. So if you want to keep the flexibility and cross-platform ability of your DITA XML, well, Componize is your best solution, without having to write special rules just to control that.


David Turner

Outstanding, outstanding. Well, quickly about me. If I sound funny, it's because I'm from Texas, I live in the Dallas area in a suburb called Sachse. And you can see here a little bit about my background. I've been in the structured content digital publishing world for a little over 15 years after kind of stumbling into that. I'm very pleased to get to be a part. And I guess the thing I'm most proud of, though, is that I've been married to actually the most beautiful girl in the entire world for more than 30 years. So, but that's a topic for another time.


Dipo Ajose-Coker

You tell a lie. That's my wife. [Laughs]


David Turner

[Laughs] All right, but let me tell you quickly about DCL. Basically, our job is to solve problems with content formats. And we've been doing this for more than 40 years. So if you have content that has lost flexibility because it's in Word or it's in PDF, or it's in InDesign, we help get you into smarter and more flexible formats: HTML, XML, DITA, S1000D, DocBook, whatever. Instead of a do-it-yourself kind of a solution where you're faced with doing the content prep, the cleanup, the steep learning curve, all that kind of thing, we actually offer a full-service solution. And so that includes, you know, identifying and implementing your new smarter content through a reuse analysis and reuse services. It's conducting really rigorous QA. We're trying to minimize the work that you have to do. So you'll get a better sense of that as we talk a little bit about the project itself, and I think that's what you're here for, to hear about that project. So let's get in, let's talk about the story. Dipo, what's the background here?


Dipo Ajose-Coker

Okay, so, well, we thought let's do a webinar, let's show people how to convert content, but then what everyone else does is, like, you know, they take the best kind of format of content possible, something that's, like, you know, going to be really easy to show off.


8:01

We thought, let's not do that, let's take a real-world scenario. And so we contacted Liftify. It's an aircraft manufacturer, and they provided us with two types of manuals: a service manual (a maintenance manual) and a user manual. So that's, you know, pretty representative. Now, if we've got a real manufacturer's manual, that means we've got real content to convert.


And as such, we're able to test the limits of, like, you know, the process, the limits of the tools. And, you know, we've not been provided with proper tables that have been set up to work exactly right for our tools. So we've got two large documents and they're legacy PDF. Why PDF? Well, that's one of the hardest forms of content to convert; you know, it's maybe a little bit easier if your content is already in XML. And so, well, we said, let's go from the PDF, so that we can show you that even if you've lost the source documents, for example, in this case, Liftify had taken over, had bought the company from another manufacturer, and they didn't have all of the source files, and we were ready to show them that you don't have to have that, you just have to have, like, you know, the PDF files, and we're able to then convert that. So this demonstration is focusing on something that does not have semantic context in it. In Word files, you use styles to show that this is a paragraph and so on, so you've got a little bit of semantics going on in there. With your PDF it's just, like, you know, straight text, and it's often the most difficult content.


David Turner

Yeah. So for those of you who haven't done this, and I know there's a lot of experts that are attending; if you haven't been here before, you know, when we talked about structured content versus unstructured content, a PDF document is really an unstructured document. It's, it's got text, but it doesn't really, you know, tell the computer all the things that it needs to do. So what we're going to do is we're going to try to convert it into a structured format. And there's, I mean, there's just a ton of benefits here, if you just look, you know, industry-wide, you can version at the component level inside a document, instead of just, you know, the full document, you can get multiple output formats from a single source. A big, big benefit is this whole idea of, you know, translation, and we'll talk more about that individually, because that did mean something here.


Another big one is search, right? So with a PDF, you can search for text and the computer, you know, knows, it can recognize a text if you search it, but it doesn't know that, you know, this particular piece of text is a detail, it doesn't know that it's a, you know, this is a model number or this is, you know, whatever. And so we're going to add tags around them. But I think the biggest one, and Dipo, I'm gonna pass it back to you here in just a second here, is this idea of content reuse, right?


So a lot of times you have, you know, content that's the exact same across multiple documents. In this instance, we had two documents, it was relatively small, but you know, when you think about it, a lot of organizations have, you know, 15, 20, 100 documents that all might have similar text components in there. And when you change something at the source level, you have to go through all those individual documents and find it. Whereas with a structured content format, with reuse, you can make the change and automatically propagate it. So, anyway, Dipo, I mentioned you had a specific example here from this group, so I'm gonna pass it back to you here to talk.


Dipo Ajose-Coker

Yes, I mean, one of the advantages with switching from Word or any other unstructured content format is the improved ability to share content, he just said that, now you can granularize your components as far or as big or as small as you want.


12:00

And in this example here, we've got, like, you know, in a document, you can have modules, sections, components, all the way down to paragraphs, figures and tables. And so that granularity depends on you. And one of the major things that you would share across documents would be, like, figures, images, tables as well. Like, you know, if you've got a big table with, like, you know, 253 lines in it, and you've got to change lines 78 and 34, how do you do that across the five documents where you've copied and pasted it in Word?


Well, if you, if you've got that content as reusable content, you're referencing one table document. And so you just make the change in that one document. And the next time you publish, all your documents have been pulled up to date. Another very, very important one, and this is coming from my years working in a regulated sector, is safety notifications. Now, in this example that you can see on there, you can see that there's a turning radius of 5.5 meters on there. Now, as a manufacturer, you're always trying to make improvements. 


And if you then make a change to that, to the equipment, that reduces that turning radius to 3 meters, for example, you would have to go into each of those Word documents to change it. And you could end up missing one, or you could end up typing, you know, 3.2, or you know, mistakes happen. If you're reusing safety information, you just have to make that change once. And the next time you publish, everything's done for you. So, you know, those are all kinds of reuse mechanisms that you can put into place. And safety notifications is one that I advise everyone to always reuse. Do not ever copy and paste it; even if you're in Word, Word does have a very rudimentary reuse mechanism. But the main thing is to, like, you know, single-source: make the change once and have it replicate.
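The single-sourcing idea Dipo describes here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Componize or any DITA processor works internally: the `{{key}}` placeholder syntax, the notice wording, and the names `SHARED_NOTICES`, `DOCUMENTS`, `publish` and `publish_all` are all hypothetical. In DITA itself the same effect is normally achieved with content references (conref).

```python
import re

# Hypothetical shared library: one entry per reusable safety notice.
SHARED_NOTICES = {
    "turning-radius": "Minimum turning radius: 5.5 meters.",
}

# Two hypothetical manuals that reference the notice by key
# instead of pasting the text.
DOCUMENTS = {
    "flight_manual": "Taxiing. {{turning-radius}} Keep clear of obstacles.",
    "maintenance_manual": "Towing. {{turning-radius}}",
}

PLACEHOLDER = re.compile(r"\{\{([a-z0-9-]+)\}\}")


def publish(text, notices):
    """Replace each {{key}} placeholder with the current notice text."""
    # A missing key raises KeyError, so a broken reference fails loudly.
    return PLACEHOLDER.sub(lambda match: notices[match.group(1)], text)


def publish_all(documents, notices):
    """Publish every document against the same set of shared notices."""
    return {name: publish(text, notices) for name, text in documents.items()}


if __name__ == "__main__":
    # Change the value once, republish, and every manual picks it up.
    updated = dict(SHARED_NOTICES)
    updated["turning-radius"] = "Minimum turning radius: 3 meters."
    for name, text in publish_all(DOCUMENTS, updated).items():
        print(name, "->", text)
```

The point of the design is that the value lives in exactly one place, so there is no copy that can be missed or mistyped when it changes.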


David Turner  

Yeah, absolutely. Well, talk to us, too, about this other benefit that Liftify was looking for, the paperless cockpit.


Dipo Ajose-Coker  

Yeah, so like, the problem with PDF and print is that you know, it costs a lot of money to keep these things in the plane. A dozen manuals are heavy. Take a look at that bag on there. You know, there's like, you know, what, 6, 8, 10 manuals in there. A lot of, you've got to save weight. Weight is a big factor in an airline and the aircraft manuals that we actually converted, the XA41/42, has only 10 additional kilos of baggage allowed on there. If five of those are taken up with your manuals, that means you've only got room for your underpants and like you know, maybe a couple of changes of shirt in your baggage on there. 


So, in 2011 the FAA approved the use of the Apple iPad and Other Suitable Tablet Computing Devices as Electronic Flight Bags (Class 1 EFB) so that pilots can use them to display approach plates, terminal procedures, and airport diagrams. And then, like, you know, in 2017, that was actually generalized, you know, so a lot more companies were able to start transforming their paper documentation into electronic formats. So by using electronic charts and manuals, safety and efficiency on the flight deck is significantly enhanced. It's one of the advantages of moving over into paperless documentation: your searches will be faster, you can show contextualized information as and when you want it, rather than having to flip through pages. I don't think I want my pilot flipping through, like, you know, 600 pages to find out how to safely land on water.


David Turner

Well, and you know, the naysayer out there would say, well, okay, so you got it into PDF, you can read a PDF on a tablet. But the truth is that you're still having to page, right, and it makes it more difficult.


16:00

And what if you have a smaller device, you know, what if you have, you know, more like a mobile phone or something like that? So, you know, that really takes us into, it's really a lot of industry specific, you know, benefits here, first of all, you know, users need the same content in different contexts, you know, a pilot might need just that paragraph on a mobile display, right, he might need to know, hey, because this error is going on here is the, here's the solution, and it needs to be prompted. Structured content lets you do that; PDF doesn't. But the maintenance crew might use that same text on the ground, and they may want a page-based version, they may want an iPad version, or they may want a print version; it may make sense to have it laid out in the larger context. 


I think another big thing is, you know, the idea that different users have similar content, but they also require very specific things. You know, in the past, if you have, you know, one version of the manual for this particular model, but this other model, it's all the same except for these few things, well, you're still keeping up an entirely different manual, when illustrations and examples might be the same throughout most of it and just are different in a few places. And so the structured content is going to let them have that ability to reuse. I think another big thing is voice assistants, text-to-speech applications, chatbots, all of those kinds of things; they're driven by being able to have these structured content components.


And then of course, the translation, because, you know, airplanes are, you know, flown by people around the world. A lot of times you have to get things translated into many different languages. And if you're doing it document by document, you're translating the same content over and over. So anyway, let's dig in, let's actually talk about the project and how we approached this and how we would approach a similar project. Here's the high-level overview. You know, basically, we're going to do some steps on the front end that are, you know, analysis related. And then we're going to start actually taking care of the conversion, get it delivered. And we'll walk through each of these step by step. So let's just jump in and let's hit the start and talk a little bit about the actual requirements that we got from the client.


Dipo Ajose-Coker

Okay, so like, project requirements, basically, what did we have to set up, and what did we need to do on our side? And so one thing we want to be able to do is for DCL to be able to deliver the conversions directly into the Componize repository. We want to stop all that sending of zip files by email, or having to send me FTP details, and then I log in and everything. So we could set up a Componize repository for DCL to directly connect to us, remember, open APIs and so on. We want to be able to identify each delivery, we want to be able to validate XML content as and when it comes into the repository. 


And so if there's an error in one of them, that gets flagged, and you know, we're able to quickly request, while the pudding's still hot, like, you know, a redelivery. We want to be able to extract and apply metadata, so things like the aircraft model, keywords and tags, so that you're able to, like, you know, optimise the use of that, and we want to be able to manage access rights and permissions. So only certain people are allowed into the project at this stage, because we don't want somebody messing things up.


David Turner

So you've got a fantastic tool to be able to help meet these requirements. But, you know, it takes more than just a tool. It takes some other steps. And so we were really pleased to get to be brought in to kind of start the process. And so really the first step, and the first step in all the projects that we do, is we do some analysis and we include our tool Harmonizer that I'll talk about more here in just a second, but, but really, you know, I think the difference between us as a full service vendor,


20:00

what helps make our projects really successful, is all the analysis and configuration and customization that goes into the projects. I was telling somebody yesterday, you know, you can take, there are transforms all over the internet, you know, going to DITA, you got DITA Open Toolkit, you can create valid XML pretty easily. But the truth is, how useful is it going to be? 


You know, one size really doesn't fit all. I mean, yes, you can get XML, but are you going to get the XML that you need, that's going to actually drive your results? So we spent a lot of time, you know, digging in and talking about, you know, what are the different document types? In this case, we had a flight manual, and we had a maintenance manual. In other instances, we might have, you know, 10, 15, 20 different kinds of documents. What are the formats we're working with? You know, in this example, we had one format, PDF, but in several of our projects we'll do multiple formats, you know, we'll have Word or we'll have InDesign or we'll have other flavors of XML. I think for one of our clients, we did 13; actually, the project we did with you a couple of years ago, I think, was more than one format. And, you know, the different formats, they're all handled a little differently, right.


So if something's a PDF image, or it's been scanned, you know, how we're going to extract the data from that and get it tagged, it's gonna be a little different than, like, say, a regular PDF. Obviously, how languages are approached is going to be different. The level of QA that's needed, complexities, math, tables, cross references, compliance, all of that. So we're going to spend a lot of time on that. And then we're going to dig in, and we're going to analyze the potential reuse, because reuse is that big use case that we were talking about, right. And so for that, we have this tool called Harmonizer. It's an industry-leading tool. And it does really two key things, right, first of all, it summarizes the reuse of text blocks across large repositories of documents.


And it's really useful for organizations that are trying to build a business case for moving to a Component Content Management System, or moving to a structured content management type of solution. So many organizations have this idea, you know, I know we've got a lot of reuse, and they try to sell that to management, but they're really just, you know, kind of putting their finger up, you know, trying to, you know, make a guess. Harmonizer comes in and tells you, hey, this is how much is in this repository. And you can start to build some real numbers with that. But it doesn't just tell you, you know, hey, here, you've got some reuse, it actually also shows you, within each document, where that reuse is, both exact matches and close matches, so you can make the decisions about how you want to take advantage of this, right.
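To make the distinction between exact and close matches concrete, here is a minimal Python sketch. It is not Harmonizer's algorithm, which is not described in this session: it simply compares normalized text blocks pairwise with the standard library's `difflib.SequenceMatcher`. The sample sentences, the 0.9 threshold, and the names `normalize`, `find_reuse` and `BLOCKS` are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(block):
    """Lowercase and collapse whitespace so line breaks do not hide matches."""
    return " ".join(block.lower().split())


def find_reuse(blocks, close_threshold=0.9):
    """Compare every pair of (document, text) blocks.

    Returns tuples of (doc_a, doc_b, kind, similarity), where kind is
    "exact" or "close". Pairs below the threshold are not reported.
    """
    matches = []
    for (doc_a, text_a), (doc_b, text_b) in combinations(blocks, 2):
        a, b = normalize(text_a), normalize(text_b)
        if a == b:
            matches.append((doc_a, doc_b, "exact", 1.0))
            continue
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= close_threshold:
            matches.append((doc_a, doc_b, "close", round(ratio, 2)))
    return matches


# Made-up sample blocks; the second differs only by the missing period.
BLOCKS = [
    ("flight_manual", "Do not operate the engine with the cowling removed."),
    ("maintenance_manual", "Do not operate the engine with the cowling removed"),
    ("maintenance_manual", "Check tire pressure before each flight."),
]

if __name__ == "__main__":
    for match in find_reuse(BLOCKS):
        print(match)
```

A pairwise comparison like this grows quadratically with the number of blocks, so it only suits small sets; a tool built for hundreds or thousands of documents would need a more scalable approach.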


So as an example, I worked with some banking documents a few years ago, in another company, and we had these four documents that are only about 20 pages each, there was one for each geographic area. Some of the sentences across the four documents were exactly the same, some of the paragraphs, some entire sections were exactly the same. But in other cases, these different you know, text blocks had minor variations. For example, in the US, maybe you had to have this legal statement that wasn't in the other three, or in Mexico, it used a different, you know, product name, or something like that. 


So we sat down, and we laid out all the documents on a desk, and we got a spreadsheet out. And it took me two solid weeks to analyze this content manually. And this is four documents, 80 pages, right, total. And it's something that Harmonizer, using its algorithm and its character sequences, could have done in less than a minute.


23:58

And it could have done 400 documents, or 4,000 documents at once, and not just four. So anyway, for this particular client, we ran the report, and between those two documents, it didn't have a whole lot of reuse. But the reuse that it did have was pretty useful.


So a couple examples from Harmonizer. This first one, what this is showing you is there's a paragraph here. It's actually, what, three sentences, and these three sentences are used in the same manual in four different places. Right? And so this is an important thing because, you know, this, this could be a text block that we might use as its own component. Or it might have a paragraph before, a paragraph after, and we could create a topic or something like that. This is the kind of thing that –


Dipo Ajose-Coker

Sorry, looking at it, I mean, like, why would you want to have four versions of this paragraph floating around in the same manual? Basically, you're looking for trouble here. And if this was, like, you know, crucial information, and you had to change something on there, it is just so easy to forget to change one of the instances, you know, even using the search that's available within whatever, in a PDF: you search for one word, you get, like, 2,300 examples, and then you try and make it more precise. And because of the way PDFs might, you know, chop off the end of sentences, if you happen to take something that has a line return in it, then it won't find it, even though it does exist in there. So, like, this is a perfect example of, like, you know, a paragraph that is reused four times; well, you might as well just create one block.


David Turner

And this is in the same document.


Dipo Ajose-Coker

Yeah.


David Turner

The next example shows you the same sentence that's actually in two different documents. And this is more of a warning operating procedure thing. Now, this one here, you can see there's a little bit of red and green there, that's indicating that it's slightly different. The only difference here is that in the maintenance manual, this block is part of a bulleted list. So there's no punctuation at the end, there's no period. And in the flight manual, it actually is part of another section, another sentence. And so it does have a period. But again, this is something where –


Dipo Ajose-Coker

A full stop is an example of varied content, isn't it?


David Turner

Yeah. So this kind of thing, you know, this is the kind of thing that a lot of times you see will affect your chatbots and things later, because you don't have standardized content. One writer writes in one way, another writer writes in another way, you know, somebody, you know, they shorthand one word, or, you know, they change another one. But this is the kind of example that we might find here. Again, this is just a sentence; sometimes it's an entire paragraph that you can look at. And if you just contact me, I can give you a Harmonizer demo, and really go through these in detail. We only have so much time today.


The third example I have here, these are two things that are in the same manual, but they're for different procedures: one's for takeoff, one's for landing. Sometimes we see this where you'll see a procedure where it talks about, you know, turn the screwdriver clockwise. And then there's another one that says, you know, turn it counterclockwise. It's very, very useful for capturing those things, and helping you to figure out where maybe you can use conditional content. Or to check to make sure that, you know, if you have an A, you have a B, you know, in a set of steps, or to make sure a block is there if you're supposed to have it in all your documents. One client in a regulated industry a few years ago sent me 17 documents, and we ran a Harmonizer analysis and we found an exact match.


28:00

It was there, like, 16 different times. And they were like, yeah, that's in all of our documents. And I said, oh, well, you sent me 17 documents, and there's only 16 here. And sure enough, one of their documents didn't have it. You know, and so that was something I had to go fix. We had another example where, you know, a procedure in 40 different instances for another aircraft manufacturer all said turn the dial clockwise. And one of them said, turn the dial counterclockwise, you know, that would have been, you know, a significant –


Dipo Ajose-Coker

The one that didn't come back!


David Turner 

Yeah! So you know, this tool can really help with that. So it's a tool that we use to really try to set the stage, and it really sets us up for this next step, which is, you know, putting together the strategy. So with this strategy, we're going to take that initial content analysis, our understanding of the formats, our understanding of the use cases, right, what metadata is needed to be able to drive your use case. And we're going to take this content reuse and talk about how we're going to apply it, how we're going to implement it, what's our reuse strategy going to be? Why don't you talk a little bit here about the annotated topic list, and then we'll jump in, I'll talk about the conversion configuration.


Dipo Ajose-Coker

Yeah, the annotated topic list is something I discovered actually, when I, like you said, you know, we've done some work together when I was at GE, or when I was in charge of the migration there. And the annotated topic list is basically a way for you to help DCL set up the conversion matrix, i.e., you start making rules, so that you can identify, create a topic for every time you meet a chapter or every time you come to a paragraph. That will be like, you know, really tiny segments, but also things like if you encounter a list, and it's a numbered list, then most likely this is going to be a task. So take that segment of content and create a task topic from it. 


And so once the, like, you know, first analysis is done, and, like, you know, you've produced that, it's an Excel file, basically, the annotated topic list, it then gets sent back to you as the client, and you go through it and either confirm that, yes, these are all topics, these are all task topics, these are all concepts, these are all reference types, you know, and so on and so forth, or correct them. So that you then standardize the conversion: you always make sure that anything that's a numbered list is most likely a task, and make sure that the task DTD is applied to it, and that, you know, in the header, the doctype is task, and so on and so forth. And that way, you're starting to set up your content to be structured with semantic meaning.
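The kind of rule Dipo describes for the annotated topic list can be sketched as follows. The only rule taken from the discussion is "a numbered list is most likely a task"; the table heuristic for reference topics, the default of concept, the sample segments, and the names `classify_segment`, `annotate` and `SEGMENTS` are assumptions for illustration, not DCL's actual conversion rules.

```python
import re

# A line that starts with "1." or "1)" is treated as a numbered step.
NUMBERED_ITEM = re.compile(r"^\s*\d+[.)]\s+")


def classify_segment(lines):
    """Propose a DITA topic type for one segment of extracted text."""
    body = [line for line in lines if line.strip()]
    numbered = sum(1 for line in body if NUMBERED_ITEM.match(line))
    # Rule from the discussion: a numbered list most likely means a task.
    if numbered >= 2:
        return "task"
    # Assumed rule: mostly tab- or pipe-separated lines suggest a reference.
    tabular = sum(1 for line in body if "\t" in line or " | " in line)
    if body and tabular > len(body) / 2:
        return "reference"
    # Assumed default: everything else is proposed as a concept.
    return "concept"


def annotate(segments):
    """Build review rows; the client confirms or corrects each proposal."""
    return [
        {"title": title, "proposed_type": classify_segment(lines), "confirmed": ""}
        for title, lines in segments
    ]


# Made-up segments standing in for text extracted from a manual.
SEGMENTS = [
    ("Replacing the battery",
     ["1. Open the access panel.", "2. Disconnect the battery.", "3. Remove the battery."]),
    ("Fuel system overview",
     ["The fuel system consists of two wing tanks."]),
    ("Tire pressures",
     ["Wheel | Pressure", "Main | 2.0 bar", "Nose | 1.8 bar"]),
]

if __name__ == "__main__":
    for row in annotate(SEGMENTS):
        print(row)
```

The empty `confirmed` column reflects the review loop in the discussion: the rules only propose a type, and the client has the final say before the conversion is standardized.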


David Turner

So the idea here is that we're, instead of just, you know, kind of taking a software, create some XML so we can get it into their system. We're really creating this, this plan, we spend a lot of time on the front end getting things right, because, as our friend, Regina from Content Rules says, you can't just shove your content into a new tool without a plan, and still expect to have a smooth and flawless transition, right. So at this point, you know, we take all this stuff, and we're going to start taking, you know, configuring the automation, we're going to get a conversion spec together, we're going to run samples with the client, we're going to make sure that it looks right, that's what we did here, make sure that what we were told was what is going to come out on the back end, we're going to, you know, address special things that we expect to come up, we're going to create a whole QA plan, we're gonna create QA software, we're gonna know which QA software to put where, and then put that, that ramp-up plan together. And then meanwhile, you guys are going to be over there with your tool, getting this repository configured.


Dipo Ajose-Coker

Yep. And so, you know, on our end, like I said, you know, we can accept pretty much anything; however, like, you know, you don't want to just dump everything into a repository.


32:00

As the content comes in, you want to start applying rules and properties and things like that. And so what we've got to do on our side is create things like folder rules. So DCL delivers a zip file containing all the thousands of topic files; well, we want to unzip it, and then put each of the subfolders or, you know, all the different content into the correct location, and we want to apply aspects.


Now, aspects add functionality to an already existing content type, so they're like properties, and so examples would be, like, you know, to enable versioning on a particular content type, or enable XML validation so that you're able to, like, open it up and make sure that it fits and is valid according to your DTD. Enable link management, extracting metadata fields, and even sending an email: as soon as, like, DCL makes that delivery, I don't need to have them manually send me an email; the system, Componize, will just send me an email saying new content delivered, or however else you set it up.


We also want to like, you know, create publication pipelines or modify the ones that come within Componize. So we've got standard ones for PDF and HTML, we've got to make it look like the manual that we're converting. And so that means a little bit of XSL-FO and connecting it to the publication platform as well. So if we're using, for example, we partner quite a lot with Fluid Topics, you know, if you're connecting with that as a delivery platform, you've got to set up all the pipelines so that as soon as your content is complete, approved, when you hit publish, it goes directly on to your publication platform. And also several things like translatable, publishable and things like that to further, you know, make sure that you have control of your content. 


So if I take an example of, you know, the rules configuration: as soon as content comes into this source folder, I can drag my rules and make sure, you know, that I put them in the order that I want them to be applied. So here, I'm going to enable XML validation straight away, metadata and link management, and also enable versioning. And on the right, you know, I can do a little bit more setup on that, edit it a little bit more: maybe run in the background, apply to subfolders or just the folder, and so on and so forth.


Next slide, please. And so here's an example of, you know, the aspects we've got. So we've said, you know, aspects are properties that you apply to particular files. And so I set up a rule and I say, apply some aspects, and then what are the aspects? Well, I want to make a file versionable, I want to apply XML validation, allow metadata and link management. And I also want to extract the metadata; that's a particularly useful one, extracting metadata. So you can work on it in a clear place where you've got all the metadata converged. A taxonomy tool.
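
The folder-rule idea described above can be pictured with a small Python sketch. This is not Componize's or Alfresco's actual API; the class names, aspect names, and file-extension filter are all assumptions made for illustration. It only shows rules being applied, in a chosen order, to content as it arrives in a folder.

```python
from dataclasses import dataclass, field

XML_EXTENSIONS = (".dita", ".ditamap", ".xml")  # assumed filter, not from the webinar


@dataclass
class ContentItem:
    """A delivered file plus the aspects (extra properties) attached to it."""
    name: str
    aspects: list = field(default_factory=list)


@dataclass
class FolderRule:
    """Hypothetical rule: attach one aspect to files with matching extensions."""
    aspect: str
    extensions: tuple = XML_EXTENSIONS

    def applies_to(self, item: ContentItem) -> bool:
        return item.name.lower().endswith(self.extensions)


def apply_rules(items, rules):
    """Apply every rule to every item, in the order the rules are listed."""
    for item in items:
        for rule in rules:
            if rule.applies_to(item) and rule.aspect not in item.aspects:
                item.aspects.append(rule.aspect)
    return items


# Order matters: validation first, then metadata, links, and versioning.
RULES = [
    FolderRule("xml-validation"),
    FolderRule("metadata-extraction"),
    FolderRule("link-management"),
    FolderRule("versionable", extensions=XML_EXTENSIONS + (".png", ".svg")),
]

delivered = [ContentItem("high_speed_maneuvers.dita"), ContentItem("figure_12.png")]
apply_rules(delivered, RULES)
for item in delivered:
    print(item.name, item.aspects)
```

In this sketch the image only becomes versionable, while the topic file picks up all four aspects in the configured order.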


David Turner

All right, so we've got our plan set up, we've got the repository set up, now we move into production. And production doesn't necessarily have to be something where we're doing all of your content at once. In this particular example, they had two documents. So we were able to deliver that as one. But it could be in other instances that you want to divide things out, because we're going to want to run an initial publication, want to make sure that everything is the way that it's supposed to be done.


Now, we've already done some sampling along the way. So we should be in good, good shape. So when production hits, we just start taking in documents, we start running them through the automation, depending on, you know, the particular industry and this one, we did some reviews kind of in the middle,


36:00

before we actually did the DITA conversion. But then sometimes after the DITA conversion, we have XML validation, we have automated QA checks, and we actually also add some human component-type checks, because again, this is very important information. And it has to be 100% correct.


Then we package it up, put it into a zip file, and we're able to deliver directly. And one of the great things about Componize is they have a terrific system where you can load the repository folder within a site and then automatically unpack and, with that repo set up the way it is, you know, get everything into the right place, just exactly how it was planned. So we do that, and then at that point, we turn it back over to you guys to really start doing the publication, do any additional cleanup. Now, we do everything we can to minimize that cleanup, but there's always a little bit. And usually after the first round, there's some conversion that, some different steps that you might have. So walk through this and talk a little bit about some of the things that you do here and that you did on this project.


Dipo Ajose-Coker

Yeah, so um, like I mentioned earlier on, you know, you've got to first of all, you know, look at things like: when you publish with the out-of-the-box conversions, you don't get exactly what you had. If you were starting from a brand new document, fair enough, you could accept exactly what's out-of-the-box. However, if you're trying to replicate something that exists, you've got work on the pipeline to do. So, for example, working on the XSL-FO to fine-tune the PDF version. Componize comes with its own HTML5 CSS, and so you know, you can just take that file, work on it, new images that need to be referenced, new classes that need to be created, and so on, work on that, you know, as just the HTML. And then when you transform, it looks like what you want it to look like.


Oxygen, our preferred editor as well, has its own WebHelp Responsive customizations that you can do, and even the out-of-the-box one is, you know, pretty good. But again, in this case, we were converting an existing manual. And so we had to make sure that, you know, the logo was in the right place, that the image of the airplane, headers and footers, the way that, you know, footers, text, references and things like that were treated, were all, you know, standardized to make it look like the original file. For both of them, you've got to review image sizes. Now, you could apply a global approach to image resizing. I did that while I was with General Electric Healthcare, and it worked to a certain extent.


However, if you've got a group of authors, and they've not been following the style guide to the letter, and apart from having robots working for you, there's no way you could have, you know, 40-50 writers and have every single one of them respect the rule 100% of the time. And so, you know, you've still got to go into things and look at it and then create rules so that if an image is, for example, over this size and over this resolution, then reduce it to this. You can create, you know, rules like that, and then use your tools to automatically resize. You've also then got to review metadata and tagging: are there, you know, tags, metadata elements that are inside of a document that are not really useful?


Well, we want to review them and start removing them. So in this example here, we can take a look and see the actual XML and looking at it in an XML editor, you can see that everything that's a keyword has automatically been picked up by Componize and is now applied to the content unit as well as tags.


40:00

So, high-speed maneuvers, XA41, XA42. So I know that this file actually applies to both models, you can see that straight away from that, even without opening the file to look at it inside of an XML editor, things like the file name, the title, you know, the author, file IDs and language, you know, all of those other things are available to you directly within the Componize interface without having to even open the file. 


And that makes the search facilities within Componize even more powerful, because it's not just searching, you know, for content, but you can also, you know, class that search by tag. So I want to look for everything that has XA41 and has this particular word inside of it. That means rather than coming up with 600, you know, search results, I'll come up with, you know, 22.
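
A rough Python sketch of the keyword-as-tag idea described above: pull the keyword elements out of a DITA topic's prolog, then narrow a text search by tag. The sample topic, its author value, and the function names are made up for illustration; this is not how Componize implements metadata extraction or search.

```python
import xml.etree.ElementTree as ET

# Hypothetical topic; only the keyword values echo the example in the talk.
TOPIC_XML = """<concept id="high_speed_maneuvers" xml:lang="en">
  <title>High-speed maneuvers</title>
  <prolog>
    <author>Hypothetical Author</author>
    <metadata>
      <keywords>
        <keyword>XA41</keyword>
        <keyword>XA42</keyword>
      </keywords>
    </metadata>
  </prolog>
  <conbody><p>Limits that apply during high-speed maneuvers.</p></conbody>
</concept>"""


def extract_metadata(xml_text):
    """Collect id, title, keyword tags, and flattened text from one topic."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "title": root.findtext("title"),
        "tags": [k.text for k in root.iter("keyword") if k.text],
        # All text in the topic (prolog included), whitespace collapsed.
        "text": " ".join("".join(root.itertext()).split()),
    }


def search(index, tag, word):
    """Return ids of topics that carry the tag and contain the word."""
    return [
        entry["id"]
        for entry in index
        if tag in entry["tags"] and word.lower() in entry["text"].lower()
    ]


index = [extract_metadata(TOPIC_XML)]
print(index[0]["tags"])
print(search(index, "XA41", "maneuvers"))
```

Filtering on the tag first is what shrinks a broad full-text result set down to the topics that apply to one model.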


David Turner

And there's some other things to check as well, I see, here's the annotated topic list again.


Dipo Ajose-Coker 

Yep. So, again, you've got your first annotated topic list, you send that back to DCL. And, you know, they run it through again, and then you get a next version, and then you compare your first annotated topic list with the results that you've just got. And you can fine-tune that. So it's, you know, it's a long Excel sheet. But it's worth the effort, because what you're doing there is making sure that your conversion then fits with the information architecture. And that also feeds into modifying the information architecture. It's about, you know, applying the semantic doc types for topics. You know, if you find that, oh, there's some places that have been using numbering, but they're not tasks, then maybe you want to either revise your rules to say, well, certain types of numbering can be used not in a task.


So we'll change the way that the numbering is done so that it's standardized across everywhere. You want to verify cross references to tables and figures and things like that. So internal cross-referencing, you need to, you know, check all of that. Footnotes, especially table footnotes, are very difficult to handle. So you want to check that, you know, the rules are applied properly as to how to identify them and where to put them. Some special characters might not convert automatically, and equations are one of the biggest headaches in the DITA world. You're either going to have to invest in a tool that supports something like MathML, which will help you write proper equations, or you sort of, you know, do something and then take a snapshot as an SVG file.


So during that conversion, you've got to make that decision: do you want to make it an SVG file so that, you know, people can still select and pick up text from the equation or copy the equation, rather than having it as an image file, or are you going to use a specialized tool, one that supports MathML, to help write and then publish that content? Table formats as well, you've got to make decisions, individual decisions, deciding on how some table spanning might occur, you know, some table attributes like table width. This is a perfect example here. You know, how do you automate telling it there's three columns on there? However, on that first header row, there's actually two rows, dimension and unit.


So you've spanned that. And you've still got two columns on the right where conversion factor is, however, you've got two rows, and you've spanned that first one, all of that has to be done. So you see the arrangement of the complex information at the bottom, you've got spaces at the top, on the on the far right, in the middle, the content is at the top. And on the left, you've got all the content, and there's a little bit of a space in there. It's just impossible to automate this.


44:00

So it's things that you have to go into, and then sometimes work on manually.
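
Laying out a spanned table still needs a person, but the result can be sanity-checked by a program. As a hedged sketch in Python: DITA tables use the CALS model, where an entry spans columns with namest/nameend and spans rows with morerows. The check below confirms that every row covers exactly the declared number of columns once spans are counted. The sample table content is invented, and the check assumes every colspec carries a colname.

```python
import xml.etree.ElementTree as ET


def check_tgroup(tgroup):
    """Return a list of problems where a row does not fill the declared columns."""
    cols = int(tgroup.get("cols"))
    names = [c.get("colname") for c in tgroup.findall("colspec")]
    problems = []
    for section in tgroup.findall("thead") + tgroup.findall("tbody"):
        carry = [0] * cols  # rows still covered from above by a morerows span
        for row_no, row in enumerate(section.findall("row"), start=1):
            filled = [c > 0 for c in carry]
            carry = [max(c - 1, 0) for c in carry]
            cursor = 0
            for entry in row.findall("entry"):
                if entry.get("namest"):
                    start = names.index(entry.get("namest"))
                    end = names.index(entry.get("nameend"))
                else:
                    # Skip columns already taken by a span from a row above.
                    while cursor < cols and filled[cursor]:
                        cursor += 1
                    start = end = cursor
                if end >= cols or any(filled[start:end + 1]):
                    problems.append(f"{section.tag} row {row_no}: entry overflows or overlaps")
                    break
                for col in range(start, end + 1):
                    filled[col] = True
                    carry[col] = int(entry.get("morerows", "0"))
                cursor = end + 1
            else:
                if not all(filled):
                    problems.append(f"{section.tag} row {row_no}: leaves columns empty")
    return problems


# Invented three-column table with a two-row header and both kinds of span.
GOOD = ET.fromstring("""<tgroup cols="3">
  <colspec colname="c1"/><colspec colname="c2"/><colspec colname="c3"/>
  <thead>
    <row><entry morerows="1">Dimension</entry>
         <entry namest="c2" nameend="c3">Conversion factor</entry></row>
    <row><entry>Multiply</entry><entry>By</entry></row>
  </thead>
  <tbody><row><entry>Length</entry><entry>ft</entry><entry>0.3048</entry></row></tbody>
</tgroup>""")

print(check_tgroup(GOOD))
```

A check like this catches structural slips after manual editing; it says nothing about whether the spans match the original layout, which remains a human review step.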


David Turner

And we actually did a webinar just not too long ago about, you know, it's called "Tables are Tough." And it's because they are. Now, we're doing some cool things. We have some interesting partners that are doing things, but it's always an issue. I will say also about this whole cleanup and review process. Since we only had the two documents in this one, it was a little bit of a condensed process. But if we do the, if we really do the full project, right, with a larger document set, a lot of these things that we're talking about here, we actually identified during that sampling process kind of at the beginning. 


And we do a couple of test runs with small amounts of documents to get those things right. So that then we can build the QA checks, either programmatically for those things, or where it creates a workflow step that sends somebody, you know, hey, here's one of those instances; check it. Here's one of those instances; check it, as opposed to just finding it later or having to, you know, just kind of check everything. But this initial publication cleanup and review, it's really critical.


And it leads us then, you know, now that we've done this conversion, you know, how do we fine-tune that process? And just a couple of ways that you can fine-tune it: some of our clients will go back, and they'll look at their Harmonizer report again, and they'll maybe make some revisions to some of their assumptions at the beginning. You know, so that's why a lot of times we'll do the conversions in, you know, different pieces, because you'll learn things along the way. And so we can, you know, make some adjustments for the next set of content about how we're going to apply taxonomy or, you know, what terminology did we leave out here?


Dipo Ajose-Coker

And like, you know, I mean, if you take a look at that, you know, you're improving your architectural decisions by like, you know, being able to specify, take a look at, like, you know, warnings, for example, you know, reuse, where do you want to reuse it? Is this something that is unique? Because it can only happen in a workshop? Or is it something that could happen in the workshop? But also, while you're flying, you know, and you make that decision for the same warning, you could decide whether or not you want to duplicate it, or reuse it.


David Turner

Yeah, absolutely. And then another thing with Harmonizer, you can give it to us and have us run the Harmonizer report, but we can also offer it to clients as a subscription. And many of our clients are starting to take advantage of that. And so they'll run additional Harmonizer reports along the way, they'll say, okay, so we applied it, boom, let's run it, what do we get? Did we miss anything? Do we want to adjust our strategy a little bit more? You know, maybe they do it quarterly, maybe they do it weekly. Maybe they do it on different content sets. And then, you know, they start thinking, okay, well, now let's bring in maybe this next piece of content, now let's bring in this next group of content, and so then we start to convert more, and we really run through the process again, which is kind of outlined here, overall. So, Dipo, why don't you walk through this, and then we'll jump into some of the benefits.


Dipo Ajose-Coker  

I mean, so it's just, you know, the simplified workflow of what happens once content gets into Componize. You know, so DCL's Harmonizer's over there on the left, and now, you know, there's an FTP push into Componize. And then Componize, because of the rules we've set up on the delivery folder, first of all asks the question: does the file or folder exist? Does that exist? Yes? Well, create a branch and deliver in a new folder or file, or overwrite it, depending on the rules that you want. If it doesn't exist, well, deliver, and then apply the folder rules that I talked about earlier. So unzipping the file, applying folder permissions, and applying aspects to enable versioning.


So I'm not gonna go through the list again, sort of, you know, running behind on time as well, but you can see on there that there's, you know, quite a lot of automated actions, you know, and then analyze the delivery and check the log files, that's a manual part.


48:00

We take a look and see, are there errors there? Some of those, some of that analysis can be automated, if there's a file in there that is not valid XML, then shoot off an email. Remember, we said we could add in aspects and email, well, "if XML is not valid, send email to DCL, with file name," and so on, you know, set things up so that you've got this automated workflow, and if delivery's okay, then start cleanup tasks, otherwise, back to send the email or the content back to DCL, then that work gets done over there to, like, you know, fix whatever issue it is, and then it comes back into the, into the workflow.
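
The delivery workflow just described can be approximated in a few lines of Python. Treat this as a sketch under assumptions, not the Componize implementation: the function name, the ".branch" naming convention, and the report layout are invented, and the XML check here only tests well-formedness, whereas validation against a DTD would need a validating parser.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path, PurePosixPath

XML_SUFFIXES = (".dita", ".ditamap", ".xml")


def intake(zip_path, repo_dir, on_conflict="branch"):
    """Unpack a delivery, branch or overwrite existing files, flag bad XML."""
    repo_dir = Path(repo_dir)
    report = {"delivered": [], "branched": [], "invalid": []}
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            parts = PurePosixPath(member).parts
            # Skip folders and any path that tries to escape the repository.
            if member.endswith("/") or ".." in parts or member.startswith("/"):
                continue
            data = archive.read(member)
            target = repo_dir.joinpath(*parts)
            if target.exists() and on_conflict == "branch":
                # Hypothetical branch naming: topic.dita -> topic.branch.dita
                target = target.with_name(f"{target.stem}.branch{target.suffix}")
                report["branched"].append(member)
            else:
                report["delivered"].append(member)
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(data)
            if target.suffix in XML_SUFFIXES:
                try:
                    ET.fromstring(data)
                except ET.ParseError:
                    report["invalid"].append(member)
    return report


def notification(report):
    """Text of the message a real setup might email back to the vendor."""
    if not report["invalid"]:
        return "Delivery OK: start cleanup tasks."
    return "XML is not valid: " + ", ".join(report["invalid"])
```

A caller would run `intake("delivery.zip", "repo")` on each push and pass the report to `notification`; sending the actual email is left out because it depends on the mail setup.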


David Turner

So that's kind of how we approach this project. And when we do these projects, we do actually help realize several benefits. You can see here that, you know, there was definitely some reuse between the two documents. The green here shows content that's exactly the same; the yellow shows kind of that variable content. So we're able to kind of hit on that. We also were able to hit on this idea of, you know, multiple formats and channels. So talk a little about this, Dipo.


Dipo Ajose-Coker

Yeah, I mean, so if we're looking at it, we're in a new age now. Not everything's on paper; we've gone past, you know, everything being on paper. So stuff can get delivered on a PC, and that might be for, you know, studying for your pilot exams. You know, you're looking through the manual, you're looking through the maintenance manual or, you know, as a garage technician. And then as a pilot, you want your information on a tablet. Remember the electronic flight bag? Well, the same content can be published to the web, for your tablet, but also then in the workshop.


The engineer, though, the aircraft engineer will, what's the word for it again, actually he will not want to be, you know, clicking around; they've got greasy fingers, and so on. So he wants to, you know, have the print version, or you know, he just wants to be able to scroll and go from page to page. And so he wants a PDF version, and you make the choice as to what format you want. And moving to structured content gives you that ability to publish in multiple formats to multiple channels.


David Turner

And it doesn't even have to be a document, right? It doesn't have to be a full document, it can be segments.


Dipo Ajose-Coker

Yeah, you can publish content segments. And so you can, like, make it so that all the conversion factors, you know, something like, you know, just tables, you just want all the reference information, create a publication that contains just the reference information. So that flight, flight checklist that the pilot has to go through every time, you know, click, boom, you've got only flight checklists that come up. And you don't have to scroll through a description of, first of all, you know, the reasons why you have to, which you would get in a PDF document.


David Turner

And you can have cool things, too, like, you know, if you're flying in at one particular temperature and you get an error message, instead of having to look for that, you can set up your content to automatically give you what the answer is: check this. And it will be based on, you know, that temperature, but if you're in a cold temperature, maybe it gives you a different, a different result. 


Dipo Ajose-Coker

Yeah.


David Turner

You know, and maybe you're just getting just the segment that you need, or maybe you want to, you know, set up 3D things. You know, structured content gives you that ability. I mean, there's just really tons more functionality: you've got flexibility, you've got the ability to be interoperable with different systems, you know, you can connect with the other big systems that you've got, because you've got this XML. You've got, you know, the voice assistants and the chatbots and the text-to-speech applications.


52:00

You get the possibility for real-time updates, right, instead of having to wait for that, you know, content to get the new PDF and replace the old PDF, you're getting that.


Dipo Ajose-Coker

I mean, one of the use cases that I saw was, you know, creating task lists and list tower, linking them to content. So when you send an engineer to repair something that's gone wrong with a machine, they get the actual topics and procedures that they need to repair that particular item. So you've got, you know, the error code that's been added as metadata. And so that error code then pulls up these other procedures. And rather than just the procedure, because, you know, you get there, you open it up, I've got my tablet information, oh, I forgot to bring the actual tools that I need. So your parts list is also on there as well. And it'll pick up the parts that you need to replace along with the procedures, and you've got your bag ready to repair the exact problem. And that's saving time; you know, service costs are one of the highest costs that, you know, organizations incur.


David Turner  

Here, we're just about out of time. And we've got to do questions here. But I know, compliance is something that's near and dear to your heart. So take a minute here to talk about this. And then we'll move into our questions.


Dipo Ajose-Coker

Yeah, compliance and change management. 14 years in regulated industries has just, basically, got me looking at, you know, I spot error messages that are, you know, nonsense, I spot things where there shouldn't be. So, you know, "Where Used" reporting? And why would that be useful? Well, you've got a safety message. And somebody's pointed out that you need to change this. Rather than just blindly change it, what you should be asking is, where have I used this content? And make sure that if you do change it, it does apply in those other contexts as well, rather than just blindly changing a figure.


If you find that, you know, okay, well, it only applies when, when this thing is superheated, or when it's at a certain voltage, then you might want to duplicate it and leave the one that is a Danger as a Danger rather than just blindly changing. And then oh, if you were in your Word document, it's copy, paste, copy, paste, copy, paste, and like, you know, you put it everywhere, and you've got improved content auditing, as well, you know, you know what version you used to produce what. When the auditors come in, for those of you in regulated industries, you want to be able to write, you know, just lay out a sheet and say, well, this publication, these are the exact components that I use, and they were all at this version. 


Oh, was something changed? Well, an Engineering Change Request was raised, and inside of the Engineering Change Request, well, we said it has impacted this topic, because we've done a "Where Used" report: it impacted this topic, this topic and this topic, and those are the ones that we changed. Take a look; we produced it with only these. And if you compare to the last version, those are the versions on the audit form. Access controls prevent unintended change. We don't want the marketing department working on, you know, some of your, I always attack the marketing department although I work as a part of them now. But, you know, you don't want them changing stuff to make it, you know, "we're the best." You know, unless you can prove that you are the best, that is not what you want in your documentation.
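
To make the "Where Used" idea concrete, here is a small Python sketch that scans a set of DITA files for href and conref attributes pointing at a given file. It is an illustration only: the file names and content are invented, it assumes plain relative file names in a single folder, and a CCMS such as Componize tracks these links for you rather than rescanning files.

```python
import xml.etree.ElementTree as ET

# Invented content set: a shared warning, a task that reuses it, and a map.
DOCS = {
    "warnings.dita": (
        '<topic id="warnings"><title>Shared warnings</title><body>'
        '<note id="superheated" type="danger">Hypothetical danger text.</note>'
        "</body></topic>"
    ),
    "workshop.dita": (
        '<task id="workshop"><title>Workshop task</title><taskbody><context>'
        '<note conref="warnings.dita#warnings/superheated"/>'
        "</context></taskbody></task>"
    ),
    "flight.ditamap": (
        '<map><title>Flight manual</title><topicref href="workshop.dita"/></map>'
    ),
}


def where_used(documents, target):
    """List (file, element, attribute) for every reference to the target file."""
    users = []
    for name, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for element in root.iter():
            for attr in ("href", "conref"):
                ref = element.get(attr, "")
                # Compare only the file part, ignoring any #topic/element suffix.
                if ref.split("#", 1)[0] == target:
                    users.append((name, element.tag, attr))
    return users


print(where_used(DOCS, "warnings.dita"))
print(where_used(DOCS, "workshop.dita"))
```

Running the report before editing the shared warning shows which task reuses it, which is the list of topics a change request would need to name.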


David Turner

All right. Well, I know we want to have a few minutes here for questions. Marianne is back on. But you've got another webinar coming up. You want to talk about this?


Dipo Ajose-Coker

Yeah, I'd like to invite everyone who's on there. We got to be like, you know, going further, let's let you know, show you more. Now, Toyota Motors Europe uses Componize. And they use it to create service bulletins. So how do you automate and better control the production and maintenance documents with dynamic forms?


56:00

Dynamically assembled documents based on templates and variables and using data from, you know, third-party systems: you're able to just, you know, fill in a form and click a button and you create most of your service bulletin. So join us for that. It's on the 11th of May at 16:00 hours, that's Central European Time; you'll have to do the conversion to your timezone.


David Turner

Love it! All right, Marianne, what questions do we have?


Marianne Calihanna

We have a number of questions, I don't know if we're going to get through them all. But I want to let everyone know that I know between Dipo and David, they will personally follow up. So let's try to hit them. And I do also want to share both of our websites here. So I'm gonna go ahead and push that. So, Dipo, the aircraft maintenance manual is heavy on the part information. Was parts specialization an aspect of the project?


Dipo Ajose-Coker

Yeah, now, parts lists were mainly tables, and for this manual, because the actual aircraft had passed between so many manufacturers, this was about the third or fourth owner, and they didn't always have the full files, we had some of the tables as images. So we had to work with DCL on, you know, whether it was worth the effort to run those images through and extract the text, then create tables. And for most of the cases, you know, that was it. So they had to, you know, do that extra processing to convert image-based tables into, you know, proper tables of parts lists. Hope that was the answer to the question.


Marianne Calihanna

Great. And, David, can you speak to converting InDesign files? We had a question: does it generate a PDF with the same layout? Could you quickly kind of talk through that process?


David Turner

I'll give my favorite answer that you always hear on these things. And that's "it depends." Right? So InDesign is going to depend on a couple of things. We're going to look at, you know, how many documents we're dealing with, we're going to look at, you know, is there any consistency in terms of InDesign styles that are used, you know, is that something that we can use, we're going to look at, you know, how regularly the text is done, and we may go directly from the InDesign file. Sometimes, if there's not a lot of consistency, it actually is easier to just produce the PDF and convert from the PDF. So it all just really depends. I'd be happy to talk to whoever he is about it, get one of our project managers on, and we can dig deep, take a look and figure that out.


Marianne Calihanna

We have one minute, okay. Is there a content reuse sweet spot where conversion pays off? Maybe a one-word answer?


David Turner

I don't think it's the same for everybody.


Dipo Ajose-Coker

Yeah.


David Turner

I think it does.


Marianne Calihanna

It depends.


David Turner

But you know, it's something that you want to be aware of. Right. And I think you'll know, in your particular use case, you know, does it make sense or does it not, based on how much pain you're dealing with in terms of fixing things. We have some clients who take months, because of the regulatory things, to do simple changes, because they have to find every place and make sure it's right. Resubmit it. And so if you could take five months off of a process, why wouldn't you?


Dipo Ajose-Coker

Yeah, and depending, I suppose it also depends, like, you know, how many products have you got, and how often do you change it? And how many writers do you have, you know, and those are the sorts of things, you know.


59:58

If you're in a tiny little startup and there's just you there, then maybe, you know, you're managing to keep things on track. However, if you think about the future, what happens if you leave the company? Then you've left, you know, a whole load of work for the person following you. So if you had your content already structured in something that had, you know, versioning, and you know, were able to manage all of that, it might be worth making that conversion. Or you might just think, well, I keep my folder structure really well done, I add notes to things, you know, and so on. And that's enough for somebody to follow. It's really variable.


Marianne Calihanna

So we have come to time, we're a little bit over. The final question is a Harmonizer question. So I know David will reach out directly to Frederic. So thank you, everyone, for spending your time with us today. You know, these webinars are really put on to help educate and support our industry. We provide a number of different services such as webinars, newsletters, blog posts, we invite you to visit both of our websites. Please stay in touch. And we look forward to having you at our next event. This concludes today's webinar.


Dipo Ajose-Coker

Thank you very much, everyone. Thanks, David. Thanks, Marianne. And thanks to our moderators who are in the background, you know, doing all the tech work, taking a look through the chat and pointing out the questions and all of that. Don't forget to visit our websites, especially if you have more questions; feel free. You've got my LinkedIn, follow me on LinkedIn. And so you know, just get in contact. And we can always maybe schedule a session. And, well, remember the webinar on May the 11th. Thanks.


