DCL Learning Series
Automated Auditing of S1000D Aircraft Manuals: A Case Study
Hello and welcome to today's webinar, Automated Auditing of S1000D Aircraft Manuals. My name is Marianne Calilhanna. I'm the Vice President of Marketing at Data Conversion Laboratory and I'll be your moderator today.
Before we get started, I wanted to let you know we will allow time at the end of the webinar for questions and answers. So please, write your questions in the chat area as they come to mind. If we don't have time to answer all of the questions, we will get in touch with you personally after the webinar. This webinar is being recorded and you will receive a link to that recording after our broadcast.
At Data Conversion Laboratory we provide data and content transformation services and solutions. Using the latest innovations in artificial intelligence, including machine learning and natural language processing, DCL helps businesses organize and structure data and content for modern technologies and platforms. DCL works with businesses across a number of industries and we're really proud of the relationships we've developed with leading global organizations. Here on the screen are just a few of the customers with whom we've worked.
So today Naveh Greenberg will discuss automated auditing of S1000D aircraft manuals. Naveh is the Director of US Defense Development for DCL and is an industry expert in S1000D conversion. He has a long history with structured content and XML, as well as other types of conversion projects. And now I'm going to pass it over to Naveh.
Thanks Marianne. Hi everyone. So today, as you know, we're going to discuss how to apply automation during the QA process of a legacy conversion to S1000D. It's going to include how to find the right balance between automated QA and manual QA. We'll go over how to properly plan the implementation of the QA process and the methodology you should use to do it successfully. After that, we'll go over the case study and I will show some examples, but in some of the slides I obviously could not use the real data, so I ended up using samples from different projects or changing the data a little bit, but overall the idea is exactly the same. And at the end I'll probably have some time to answer some questions. And if after the webinar you think of additional questions, you can always email or call me and I'll be more than happy to answer. So next slide please.
So one question that we always get is how do you know what to look for when you're planning to implement QA, or even for any data conversion project? So regardless of the XML standard you're going to, there will always be similar challenges. There will always be schemas and business rules that are more complex than those of other standards. In the case of a more content driven schema there will probably be some additional challenges that you will have to deal with. S1000D is an example of a standard that falls under the category of being more content driven and a bit more challenging to implement and understand.
So what are some of the issues that you need to deal with? I'm actually going to start with the tables and the images because I want to first talk about the generic items that apply to any standard, and after that focus more on S1000D specific items. With any XML standard that you're dealing with, you always have to make sure the tables are tagged properly. You have the table headers, entry and row spanning, column width, borders; all of these items are just a few of the things that you need to worry about when dealing with tables. You need to determine what to visually inspect, what will be done manually, and what will be done by automation.
In some standards, like the Army MIL-STD-40051 or the NAVAIR MIL-STD-3001 and any other content driven standards, you need to take the legacy table, the CALS table (it's actually now called the OASIS table model), and restructure it into a content table. It can also be the other way around. Your legacy data can be in HTML or XML and you may need to take a content table and map it back into a regular table. The same applies to images. An image is an image regardless of the standard, but it may be tagged differently. Dimensions may be different and maybe you need CGM versus raster images. So all of that applies to any kind of migration regardless of the standard.
So going back to the content issues, my assumption is that most people here know what a DMRL is. And for the people that don't, I'm going to oversimplify here a lot for the sake of moving on, but it's basically a spreadsheet that consists of all the legacy titles or paragraph headings, where you map them into data module types and assign a data module code. And again, I'm going to oversimplify what a DMC is, and I know that people that know XML may cringe when they hear the definition, but it's basically assigning a file name that has a meaning, so that a machine and even a human can look at the file name and simply know what the file is by reading the data module code.
But we actually used the concept of a mapping spreadsheet well before S1000D. Regardless of the XML standard, we always use this concept, even with less modular standards like DocBook, MIL-STD-38784 or DITA. And we use it to do a high-level mapping. And this is very useful when you do QA because you have a way to see how the data will be mapped before you get into full production.
The second point is that you can assign the spreadsheet to a subject matter expert, who can map all the anomalies to the correct data module, work package, or whatever type of element, regardless of the legacy data structure and hierarchy. You can also add to the spreadsheet any kind of metadata that is not in the legacy data and incorporate it during the conversion. So that's a great way to enrich your existing legacy data. But most importantly, you can use automation to check the consistency of the legacy data, because doing it manually would be a very difficult and time-consuming task. A lot of automation can also be done to make sure that the appropriate data module codes and illustration control numbers were used.
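As a rough illustration of that last point, here is a minimal Python sketch that checks data module codes exported from a mapping spreadsheet for malformed and duplicated entries. It is not the tooling from the case study. The pattern is a simplified reading of the S1000D Issue 4.x code structure and should be treated as an assumption: a project's business rules normally restrict the segments further. The names `parse_dmc` and `audit_dmcs` are hypothetical, and the sample code in the usage note follows the style of the public S1000D bike sample.

```python
import re

# Simplified data module code pattern (assumption: Issue 4.x style segments;
# real projects tighten these ranges in their business rules).
DMC_PATTERN = re.compile(
    r"^DMC-"
    r"(?P<model_ident>[A-Z0-9]{2,14})-"
    r"(?P<system_diff>[A-Z0-9]{1,4})-"
    r"(?P<system>[A-Z0-9]{2,3})-"
    r"(?P<sub_system>[A-Z0-9])(?P<sub_sub_system>[A-Z0-9])-"
    r"(?P<assembly>[A-Z0-9]{2}(?:[A-Z0-9]{2})?)-"
    r"(?P<disassembly>[A-Z0-9]{2})(?P<disassembly_variant>[A-Z0-9]{1,3})-"
    r"(?P<info_code>[A-Z0-9]{3})(?P<info_code_variant>[A-Z0-9])-"
    r"(?P<item_location>[A-Z0-9])$"
)


def parse_dmc(code):
    """Return the named segments of a code, or None if it does not fit the pattern."""
    match = DMC_PATTERN.match(code)
    return match.groupdict() if match else None


def audit_dmcs(codes):
    """Return (malformed, duplicates) for the codes listed in a mapping spreadsheet."""
    malformed = [code for code in codes if parse_dmc(code) is None]
    seen, duplicates = set(), []
    for code in codes:
        if code in seen and code not in duplicates:
            duplicates.append(code)
        seen.add(code)
    return malformed, duplicates
```

For example, `parse_dmc("DMC-S1000DBIKE-AAA-DA1-00-00-00AA-041A-A")` splits the code into its segments, so a reviewer can confirm that a row mapped as a description actually carries a descriptive information code.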
Applicability usage can get a bit tricky. An incorrect applicability usage will lead to incorrect text or no text appearing when publishing. You need to invest some time building a solid plan that will capture applicability properly, and I will show some examples later on. Missing required text or extra text is always something that scares people when migrating data. Missing text is not only text dropped during the text extraction. It's also text that is missing in the legacy data but is required in S1000D, like all the preliminary requirement data.
Then there is extra text that doesn't fit the S1000D structure. For example, you may have a parts list table with multiple columns that don't fit the S1000D data structure, or text that refers to pages, chapters, or sections that no longer exist in S1000D, or even introductory paragraphs that apply to multiple procedures. When developing your business rules, you need to take all of that into account and develop a really clear rule that can be implemented as to how to handle that extra text. And I'm going to show examples in the case study.
Business rules are obviously required with any kind of conversion, but especially in S1000D. And I can definitely guarantee that the legacy data will always have issues mapping to any XML standard. So developing the business rules on how to handle all the anomalies is very critical. With S1000D it's even more critical, and S1000D actually requires you to have a level of business rules. And again, I'm not going to get too deep into that.
So the last item on the slide is automation considerations. It does relate a bit to business rules, but it can also be a standalone topic. Some of the legacy data is not required to be migrated because the style sheet can generate the text. So for example, you may not need the title "Procedure" if you tag the entire data set as a procedural data module. If your legacy data is HTML or XML, auto-generated text gets a different meaning. You can look at the published PDF of the legacy data and see that text that appears in the published PDF does not appear in the XML because the style sheet generates it. So when doing your analysis, it would be very helpful to know what element caused the style sheet to generate the text. Next slide please.
So during the analysis phase, when you start to develop your QA plan, you need to document what you are going to QA and how it will be done, especially what will be done by automation and what will be done manually. It is not that trivial to find the right balance between manual and automated QA, and it depends on a few things. For example, what exactly are you checking? If you're checking that the correct image is used, you might need to do it visually. But there are ways to also incorporate automated QA when you're dealing with images: for example, resolution and the naming convention; even hot spotting can be checked by automation.
Then there is the volume of the data. If you have only a few pages to review, it will probably not be smart to spend a lot of effort developing automation. But if you're dealing with a large amount of data, automation can be very useful to speed up the production but also to make sure that you do it consistently. The legacy format is critical because how much you'll need to invest upfront to normalize the data depends a lot on the consistency of the legacy data. Paragraph numbering, hierarchy, the styles that were used, and keywords are all things that can be checked using automation, but to accomplish that you need to review and understand the legacy structure.
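To make the paragraph numbering check concrete, here is a small Python sketch that flags numbers that do not follow from the one before them. It assumes the numbers have already been extracted from the legacy data as dotted strings such as "1.2.3"; the function name `find_numbering_gaps` is hypothetical, and real legacy data would need its own extraction and its own numbering rules.

```python
def find_numbering_gaps(numbers):
    """Flag dotted paragraph numbers that do not follow their predecessor.

    A number is accepted if it opens a new sub-level with 1, increments the
    last component, or steps back up and increments a higher level.
    """
    problems = []
    previous = None
    for raw in numbers:
        current = [int(part) for part in raw.split(".")]
        if previous is not None:
            # Going one level deeper must start at 1.
            allowed = [previous + [1]]
            # Staying at, or returning to, any existing level must increment it by 1.
            for depth in range(len(previous), 0, -1):
                allowed.append(previous[:depth - 1] + [previous[depth - 1] + 1])
            if current not in allowed:
                problems.append(raw)
        previous = current
    return problems
```

A skipped number is not always an error, since the legacy document may have deleted a paragraph on purpose, so the output is best treated as a list of places for a person to look at rather than a list of defects.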
The approach to implementing automation is very important. You can automate almost anything, but if your rules are very vague, you will miss a lot. You can't just say, "Okay, every all caps underlined text that is labeled with the Arabic numbering convention is always the title of a data module." It's not what you see is what you get. That's why you need to be very specific when you define your business rules. Creating a hand example can be very helpful in that case. It's really a way to show what the results will look like before you get into full production.
Before I move forward, I keep saying QA, and a lot of people get QA and QC confused. I'm going to continue to use QA, but I always see QA as sort of the master plan that makes sure the production is developed and run correctly. QC I see as more about the actual metrics of quality, so basically checking the converted files and finding real issues. If you detect any issues, you can go back and revise your QA plan and improve the conversion process and the QA checks. So really the main goal of QA is to improve the development and the checks during the conversion so the converted files are in good shape. And the main goal of QC is to identify issues after the data was converted but before it is released to the client. So moving forward I'm probably going to use just QA. Next slide please.
This slide I have in all my presentations. I like this slide because it applies to any type of conversion. And when people ask me what is the main reason why migration projects fail, and really any kind of conversion, I always go with poor planning. So regardless of the task, you always have to develop a robust and efficient plan. And to get to that point you have to invest a lot of time upfront. The more planning and analysis that you do upfront, the fewer issues you will have later on. It will also speed up the production process because you don't have to stop every time you find a new issue or, worse, not even be aware that there were issues and only find them after you deliver the data. So the quote that supposedly President Lincoln said, "If I had eight hours to chop down a tree, I'll spend six sharpening my ax," best describes the approach you must take to have a chance of a successful migration or a successful QA implementation.
But how do you plan properly? So first of all, you can start by asking some initial questions. Who are the stakeholders? Anyone who's involved in the project is a stakeholder. It's the management, because you need to get their buy-in and to make sure that they understand the importance of the project. It's the production team, the subject matter experts, the graphics team, the tech writers; basically anyone that has anything to do with the project is a stakeholder. The critical stakeholder that people tend to remember only late in the process is the final user. They must be involved in every step of the process.
Knowing the budget and schedule sounds like an obvious thing, but it's not that trivial. You need to be aware of other milestones that you may not know about initially but that could have an effect on the project: for example revision cycles, company shutdowns, or milestones in other ongoing projects that will take some of the resources. These are a few of the so-called external milestones that are important to take into account.
So what are some of the things that you need? Try to get as much of the legacy data as possible. You can start analyzing it and look for any inconsistencies, document them, and start developing methods to pick up these issues and other issues throughout the legacy data. What you also need is what I call the right people. It's really the team that will carry out the entire project. They don't necessarily need to be experts in S1000D, but they need to know enough to understand the new look and feel of S1000D and how the sustainment will take place in the future, because it does not look like legacy data, and data will move around from place to place.
So now we really get into planning the QA process, and agreeing on business rules is one of the most critical parts, and it involves all the stakeholders. It's not only agreeing but also making sure that everybody understands the rules the same way. And that's why you need to document as much as possible. When you're done documenting and showing examples, you will need to walk through the document, in this case the business rules or the conversion spec, and make sure it's understood by everyone the same way. It's also very useful to create a sample file that goes together with the conversion specification because it allows you a few things. First of all, you see how things will look before you get too deep into the conversion or into the project, and it also gives all the stakeholders a better chance to understand what the final data will look like. If all goes well, you can move into a small pilot and then a limited production run. Next please.
So one more slide before we get into the case study. With any project, whether it's conversion or implementation of automated QA, you need a solid process. Our project methodology uses five phases. Phase one is where you should do the analysis, where legacy data is reviewed and analyzed. You try to understand the data and find any inconsistencies. What you are going to look for depends on the XML standard you're trying to fit the legacy data into. With S1000D especially it's like trying to fit a square peg into a round hole. So you need to get creative and sometimes think ahead in order to make that happen.
In phase two you should create the conversion specification, or the project specific business rules, and create a hand example. You should go over the conversion specification with the stakeholders and make sure everybody is on the same page. Based on the conversion specification you should produce a tagged sample and show it to the stakeholders. The difference between a hand example and the sample is that the sample comes through the process.
Phase three is usually where you customize the conversion software according to any issues you find after the sample is created. And phase four is when a proof of production should occur. You should run a larger and more representative sample of the entire legacy data through the process. When everybody is happy with the converted data, you should have a sign-off and move forward. And phase five is when the project enters the production chain.
So throughout the entire process we aim for continuously improving the process. So for example, if during any of the phases you find issues, you may need to go back to the previous phase and improve the process so any future results will be of higher quality and those issues are not going to appear again. Next please.
So finally we're getting into the case study. This is a specific case study where we incorporated automation in the QA process of legacy data that was migrated, in this case, from HTML to S1000D. There are a lot more projects like this, not necessarily S1000D, but everything here still applies to any kind of automated QA, from tagged data to any kind of XML standard.
As I said before, in some of the slides I could not show the real data, so I used examples from different projects and manipulated some of the data in order not to show the real data. So in this case we have a client whose OEM converted a lot of content to S1000D, and the client wanted an independent audit to confirm the accuracy of the converted data, to report any issues with the data, to provide recommendations to improve the process, and also to efficiently correct any issues with the XML files that could not be revised through the process. In some cases the QA was done on a fully converted data set, and in other cases it was done in parallel with the conversion process.
The feedback that we gave was used to update the conversion process, and that led to converted files of a higher quality than they would have had without the QA. So in the next few slides I'm going to show some of the QA checks that we did.
I'm going to start with a big item in S1000D. Applicability is always a thing that people sort of shy away from when they go to S1000D. They think it's too difficult to implement and definitely too difficult to QA. It's not an easy thing, but really the main reason for going to S1000D is reuse, and applicability plays a big part in that. So even if you already have legacy data that is tagged and uses some sort of profiling, like ATA, which uses effectivity, you still need to take into account two things.
First, you need to make sure that the effectivity was done properly in the legacy data, and the second thing is that you can't just do a one-to-one mapping of the applicability in the legacy data to S1000D, because in S1000D applicability is not allowed everywhere. So in this slide you have a case with a parent element whose effectivity is limited to specific models. But the child element, the torque, has a totally different model effectivity. And finding these cases without automation is almost impossible. Doing a proper QA of the legacy data and finding this issue will save you a lot of headaches later on.
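As an illustration of how such a parent and child conflict can be found automatically, here is a minimal Python sketch. It assumes, purely for the example, that effectivity is carried in an `effect` attribute holding a space-separated list of models; actual legacy schemas encode effectivity differently, so the attribute name and the function `find_effectivity_conflicts` are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_effectivity_conflicts(xml_text, attr="effect"):
    """Report elements whose models are not covered by the nearest ancestor's models."""
    root = ET.fromstring(xml_text)
    conflicts = []

    def walk(elem, inherited):
        models = inherited
        value = elem.get(attr)
        if value is not None:
            models = set(value.split())
            # A child may narrow the parent's models but should not add new ones.
            if inherited is not None and not models <= inherited:
                conflicts.append((elem.tag, sorted(models - inherited)))
        for child in elem:
            walk(child, models)

    walk(root, None)
    return conflicts
```

Run over a whole legacy set, a check like this produces the kind of report described here: where the parent limits the content to certain models and where a lower element contradicts it.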
Now there's another issue here that has nothing to do with the legacy data but will actually affect the conversion. In S1000D, inline applicability is really limited, to cross references maybe, but it's not really allowed on an internal element like this. And here, unless you create some repository of the torques, and in this case it's not really even a torque, I mean they were using it to distinguish some of the data over here, the inch-pound and the SI units, really the only way to do it is to repeat the paragraph and assign the appropriate applicability to each paragraph.
Now if you migrate to S1000D, it's critical to deal with these issues upfront because fixing them later will be extremely difficult. So our report basically told you where you have those issues, looking first at the parent element and then showing you that there's a conflict in the lower element. Now the system of the legacy data may have been okay with it and may have automatically overridden any kind of inconsistency, but if you go to S1000D, after the conversion the effect will be devastating because you will miss data. Data will be missing because the applicability was not used properly. Next slide please.
So this case again can be defined as both a legacy issue and a conversion issue. What we were looking at was that in the legacy data you have two graphics. And when you look closely, you actually see that it's exactly the same graphic; in S1000D it's the same figure with multiple sheets. And if you had done the conversion to S1000D, you would have done it as two separate graphics, which is fine, but it's really one figure with two sheets over here.
And our automated software detects those kinds of inconsistencies. And it's not limited to this case. We actually check that the titles match. In some cases you may not have the mdash as an entity. You may have it as a physical dash in the XML. Then when you convert it, you actually think that they are separate graphics, but in reality it's one figure with multiple sheets. All of that has to be done upfront, and during the analysis we picked up those items. Next slide please.
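One way to sketch that title matching in Python is to normalize each graphic title before comparing, so that an mdash entity, a literal dash, and a sheet suffix do not hide the fact that two graphics belong to one figure. This is an illustration under assumptions, not the software used in the project: the input is taken to be a list of (graphic id, title) pairs already extracted from the legacy data, and the names `normalize_title` and `find_multi_sheet_candidates` are hypothetical.

```python
import re
from collections import defaultdict

# Dash variants that should compare as equal: entities, Unicode dashes, double hyphen.
DASHES = re.compile(r"&mdash;|&ndash;|[\u2012\u2013\u2014\u2015]|--")
# Sheet markers such as "(Sheet 1 of 2)" or "Sheet 2".
SHEET = re.compile(r"\(?\s*sheet\s+\d+(\s+of\s+\d+)?\s*\)?", re.IGNORECASE)


def normalize_title(title):
    """Reduce a figure title to a form that ignores dash style, sheet markers, and case."""
    title = DASHES.sub("-", title)
    title = SHEET.sub("", title)
    title = re.sub(r"\s*-\s*", " - ", title)
    return re.sub(r"\s+", " ", title).strip(" -").lower()


def find_multi_sheet_candidates(graphics):
    """Group graphic ids whose normalized titles match; groups of two or more are candidates."""
    groups = defaultdict(list)
    for graphic_id, title in graphics:
        groups[normalize_title(title)].append(graphic_id)
    return {title: ids for title, ids in groups.items() if len(ids) > 1}
```

The groups are only candidates. Someone still has to confirm that the graphics really are sheets of the same figure before the business rule merges them.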
This case is really the only sample that I show without showing any data. Sometimes legacy data is very forgiving. And S1000D can be strict; it's very content driven, and the schema limits you to what you can use. In some standards almost everything was allowed. And even the publishing or rendering software allows you to do stuff where, when you publish it, you won't see the difference. So in this case you have a step with multiple paragraphs, followed by a note. When you convert it to S1000D, there are two issues. Number one, it's really a next step, and the note probably has to go with either the next step or the step before it. But when you go to S1000D, you can't have just a note as a step.
In this case we found two issues. Number one, the conversion did it incorrectly and gave a validation error. But also, logically it doesn't make sense to have a note by itself. And this would have been an even bigger issue, and it happens in many projects, where you have a warning and a caution. When you deal with warnings and cautions and you build a warning and caution repository, if you pull out the warning and caution, you end up with basically a warning and a caution that apply in the wrong location.
And that was one of the tests that we did. And again, to pick up those things manually would have been almost impossible, because it wouldn't give you a validation error in the source, and when you viewed the published PDF, you would not see that this is just an amplifier. If you could see a note, then maybe they output. If you see over here, the next step actually has a label attribute of three, so the numbering would have been okay. So manually reviewing it you would have never picked up this issue. Next slide please.
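A check for this situation can be sketched in a few lines of Python. The sketch below looks through converted XML for `proceduralStep` elements whose only children are notes, warnings, or cautions. It is a simplified illustration that ignores namespaces and project-specific rules, and the function name `find_steps_without_action` is hypothetical.

```python
import xml.etree.ElementTree as ET

ALERTS = {"note", "warning", "caution"}


def find_steps_without_action(xml_text):
    """Return (position, id, child tags) for steps that contain only alerts."""
    root = ET.fromstring(xml_text)
    flagged = []
    for position, step in enumerate(root.iter("proceduralStep"), start=1):
        children = [child.tag for child in step]
        # A para, a nested step, or any other non-alert child counts as an action.
        has_action = any(tag not in ALERTS for tag in children)
        if children and not has_action:
            flagged.append((position, step.get("id"), children))
    return flagged
```

Each flagged step is a place where the note, warning, or caution probably belongs to the step before or after it, which is a decision for a subject matter expert and then for the conversion rules.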
In this case, this is again mapping. When you do mapping of legacy data to S1000D, there are some elements that I think are not allowed in S1000D, like a break. If you had set up the conversion process so that every time you see a break you either replace it with a space or close the previous paragraph and open a new paragraph, you wouldn't have this problem. But in this case the break element was actually dropped, and you can see in the converted XML that the space is missing.
Once this is found, you can go back to the conversion process and revise the rule so it tells you, "Okay, every time you see a break, replace it with a space or a paragraph." Or, if you are in a more content driven data module, and let's say a break in fault isolation indicates that you have a question or whatever, that can flag the conversion software to treat it differently. So this is just an example of how automated QA found a way to improve the process. Next slide please.
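The dropped break can be caught with a simple comparison, sketched here in Python. The idea is to extract the legacy text while treating each break as a space, then look for neighboring legacy words that appear glued together in the converted text. The sketch assumes the legacy fragment is well-formed XHTML with `br` elements; the function names are hypothetical, and a real check would need to allow for legitimate compounds such as "in" and "to" versus "into".

```python
import xml.etree.ElementTree as ET


def text_with_breaks(elem, break_tag="br"):
    """Extract text from an element, inserting a space wherever a break occurs."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == break_tag:
            parts.append(" ")
        else:
            parts.append(text_with_breaks(child, break_tag))
        parts.append(child.tail or "")
    return "".join(parts)


def find_fused_words(legacy_xml, converted_text):
    """Return pairs of adjacent legacy words that appear joined in the converted text."""
    legacy_words = text_with_breaks(ET.fromstring(legacy_xml)).split()
    converted_words = set(converted_text.split())
    return [
        (left, right)
        for left, right in zip(legacy_words, legacy_words[1:])
        if left + right in converted_words
    ]
```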
So content reuse analysis is a very important subject, especially in S1000D. We actually use it as part of the initial analysis. We use it to compare legacy data to converted data, and especially in S1000D we also use it to detect places where applicability might be used. So in this case, in the first box, we see a base paragraph. What the tool does is go through the entire data set and pick your base paragraph. You can define the paragraph size, whether it's 10 words or 20 words, and also define the threshold and say, "Okay, if it's no more than three words different I consider it a similar paragraph," and what it does is list any paragraphs that are similar.
So if you do this analysis upfront on your legacy data, you can find any kind of typos, or you can find similar paragraphs where, throughout the years, technical writers kept writing the same paragraph in 20 different ways instead of using one standard paragraph. You can normalize those into one paragraph, making your legacy data more consistent. At the same time, if you're preparing for S1000D and you want to see what kind of repository you want to build, whether it's warnings, cautions, or even tools, you run the content analysis tool on the data and you will see any kind of similar warnings and cautions. This is especially powerful when your legacy data is tagged data, because then you can limit it and do the run only on specific elements.
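To show the idea behind this kind of reuse analysis, here is a minimal Python sketch using the standard library's `difflib`. It is not the tool described in the webinar. It assumes the paragraphs have already been extracted as plain strings, uses a minimum length of 10 words and a threshold of three differing words as in the example above, and compares every pair, which is fine for a sample but would need indexing for a large data set. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def word_difference(a, b):
    """Count the words that were inserted, deleted, or replaced between two paragraphs."""
    words_a, words_b = a.lower().split(), b.lower().split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def find_similar_paragraphs(paragraphs, min_words=10, max_diff=3):
    """Return (index_a, index_b, difference) for paragraph pairs within the threshold."""
    candidates = [
        (index, text)
        for index, text in enumerate(paragraphs)
        if len(text.split()) >= min_words
    ]
    similar = []
    for (i, a), (j, b) in combinations(candidates, 2):
        difference = word_difference(a, b)
        if difference <= max_diff:
            similar.append((i, j, difference))
    return similar
```

A difference of zero points to an exact duplicate. Small non-zero differences are the interesting ones, since they are either typos to normalize or variants that may become one paragraph with applicability.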
Another thing that we did in this specific case study was take the initial legacy data and the converted data, normalize them into something similar to a Word document, and compare the files. And what you find is any kind of missing data, meaning data that was dropped during the conversion, and any kind of extra data. Let's say a warning was moved up for whatever reason, because when you do a fully automated process, there are some rules that are vague and may be misread. In the example below you actually see data that is missing.
And we ran that on all the data, and the only difference was that, depending on what data module you used, the rules for normalizing would have been different, but the end product would have been a comparison of normalized converted files versus normalized legacy data to see any kind of dropped data. Next slide please.
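The normalize-and-compare step can be sketched in Python as follows. Both files are reduced to a plain sequence of lowercase words, with tags removed, and then compared word by word. This is a simplified illustration: it assumes both inputs are well-formed XML, it ignores text that a style sheet would generate, and the function names `normalize` and `compare_text` are hypothetical.

```python
import difflib
import re
import xml.etree.ElementTree as ET


def normalize(xml_text):
    """Strip the tags and collapse whitespace so that only the words are compared."""
    text = " ".join(ET.fromstring(xml_text).itertext())
    return re.sub(r"\s+", " ", text).strip().lower().split()


def compare_text(legacy_xml, converted_xml):
    """Report word runs that exist only in the legacy data or only in the converted data."""
    legacy, converted = normalize(legacy_xml), normalize(converted_xml)
    matcher = difflib.SequenceMatcher(None, legacy, converted, autojunk=False)
    report = {"dropped": [], "extra": []}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            report["dropped"].append(" ".join(legacy[i1:i2]))
        if tag in ("insert", "replace"):
            report["extra"].append(" ".join(converted[j1:j2]))
    return report
```

Text that was moved rather than lost, such as a warning pulled up to the start of a procedure, shows up as one dropped run and one matching extra run, which is a useful signal in its own right.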
So what we did in other projects was take it a step further. What happens a lot of the time in aerospace, take [inaudible 00:35:23], is that you have data that was done in ATA, or ATA-like tagging. And what we did is actually compare. Once we did the reuse analysis and found candidates for paragraphs or even files that are similar, we took files that met the threshold of similarity and compared them side by side.
And you can do it on any XML. We did it on the legacy data and we did it on the converted data. And the benefit of doing it on the legacy data is that you can go through the comparison and determine, "Okay, these files are duplicates, and the only reason why they are separate is because they used wording that is written a little bit differently," and then you give it to the technical writer and you assemble it into one document. You can also say, "Okay, this is a candidate if the legacy data allows you to do profiling," and in their case they can use effectivity. You can take the files, if they are similar, combine them into one file with multiple profiles, and then you deal with one file.
In the converted data it gives you a chance to find the right balance between automated and manual work. You can determine, "I know this happens in 5% of my files. Let it go through the process, create data modules, run the comparison, and have technical writers use applicability to improve my content reuse in the document." Next slide please.
So tagging accuracy, in this specific case, is about the legacy data, because a lot of the time the quality of the converted data depends on the quality of your legacy data. Now if the data comes in as PDF or as Word, even though you have additional steps to prepare the data for conversion, you can manipulate it and make it a little bit cleaner before getting to conversion. When your legacy data is tagged data, it's a little bit more difficult to manipulate the data before conversion.
So it's very critical to check the quality of the source data, which may not be an issue when you publish your legacy data, but it's a problem if you develop automated rules to go to a different standard. And if it's wrong in the legacy data, for sure it's going to be wrong in the converted data. So in this case, take something as simple as tables. Here you get a report that you have a table that is declared with four columns but in reality only three of them are defined. Your spanning is off, and when you publish it your entry will actually appear in the wrong column.
This is very critical if you actually take a table in the legacy data and break it down into a content table, or even something as simple as a parts list table. If you have a six-column table and you do an analysis only on a small percentage of your legacy data and you say, "Okay, column one is my item number. Column two is my part number. Column three, and so on," in this case, and again, this is just a CALS table, but in a parts list table, a part number will appear in the wrong location if this is defined like that, because it's going to think, "Okay, I'm in my first column. My first column is [inaudible 00:39:24]. My second column is center. My third column is on book." And then the fourth column, which may map to some different content element, is empty. So it's appearing in the wrong location. And besides displaying improperly, it's tagged incorrectly. So when you do a global search for your part number, it's not going to be correct. Next slide please.
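The table report described here can be approximated with a short Python sketch over CALS-style markup. It compares the `cols` value declared on each `tgroup` with the number of `colspec` elements, and it flags entries whose `colname`, `namest`, or `nameend` points at a column that was never defined. It is an illustration only: it assumes no namespaces and that a column specification is expected for every column, which a given project's DTD may not require, and the function name `audit_tgroups` is hypothetical.

```python
import xml.etree.ElementTree as ET


def audit_tgroups(xml_text):
    """Return a list of messages for column declarations that do not add up."""
    root = ET.fromstring(xml_text)
    problems = []
    for number, tgroup in enumerate(root.iter("tgroup"), start=1):
        declared = int(tgroup.get("cols", "0"))
        colspecs = tgroup.findall("colspec")
        if len(colspecs) != declared:
            problems.append(
                f"tgroup {number}: cols={declared} but {len(colspecs)} colspec elements"
            )
        names = {colspec.get("colname") for colspec in colspecs}
        for entry in tgroup.iter("entry"):
            for attr in ("colname", "namest", "nameend"):
                ref = entry.get(attr)
                # A span or placement that names an undefined column will land wrongly.
                if ref is not None and ref not in names:
                    problems.append(
                        f"tgroup {number}: entry references undefined column '{ref}'"
                    )
    return problems
```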
The checking of tagging accuracy was done in numerous ways, and this is an example of one of those cases. And again, the PDF snapshot is a PDF of the published HTML. But if your legacy data is PDF, yes, you have the extra step of extracting data, but at some point you bring it into an ASCII format and you can do a comparison between the extracted data and the PDF, between the converted data and the legacy data, or between numerous sets of legacy data.
In this case, in the right corner, in the diff report, you actually see the comparison of the converted data and the legacy data. And in this case you see there are four steps here. There are three steps over here. If you did not have the difference report that is done by automation, you would not see that this is not matching. All the blue data, when you look at the tagged converted data, you see that the text is missing. But in order to see it, you manually have to open the XML in the XML editor and compare it, which, besides being a very difficult task, is almost impossible to do manually when comparing a PDF and the converted XML file. Next please.
And to add to all of that, there is the data enriching aspect, which is not part of the QA. Having the option to take the legacy data and enrich it before you get into converting the legacy data into S1000D gives you data that is a lot richer, a lot more content driven, and a lot easier to sustain than you would have had if you had just done a one-to-one conversion, specifically in S1000D. Next slide. Okay.
So thank you Naveh. We do have a couple of questions that have been submitted, but I think we're running out of time so I'm going to give one to you and then we can follow up with everyone after. Can you tell us why you think there is no commercial off-the-shelf software that will allow a fully automated conversion from legacy to S1000D?
Yes. Well, number one, legacy data is not the same. It's different. Even within one organization, a technical manual, or let's say any manual, can be totally different from another manual in the same organization. So I think I mentioned it before. When you define a rule and when you do a conversion, especially towards S1000D, it cannot be vague. It has to be very specific. And that's why you need to invest so much time upfront to make sure that you cover as many variations as possible. So you can't just define a rule and say, "Okay, every second hierarchy level title will become a descriptive data module." That's not always the case.
Business rules are not the same from one organization to another. Again, the legacy format is different, and if there were to be off-the-shelf software, and some people do claim that they can do a fully automated process, then like in this case study the final product is not going to be consistent, because you try to take legacy data that was not based on S1000D and try to fit it into S1000D. Again, if anybody has questions after the webinar, they're more than welcome to email me or call me. I'll be more than happy to answer.
Great. Thank you. Well, thank you everyone for attending this webinar today. This does conclude today's broadcast. You can access many other webinars related to content structure, XML standard, S1000D and more from the webinar archive section of our website at dataconversionlaboratory.com. We do hope to see you at future webinars and hope you have a great rest of your day. Thank you very much.