On-Demand Webinars
DCL Learning Series
Fluid & Powerful: Adopting NISO STS at the American Water Works Association
The American Water Works Association (AWWA) was one of the first Standards Associations to convert content to NISO STS XML. AWWA had no prior experience with XML and the conversion involved PDF to NISO STS. In this webinar Daniel Berger, Publishing Operations and Digital Product Management Leader at AWWA, will discuss:
- Content structure: tips and best practices that ensure the free flow of standards content. Discussion around XML as well as workflow processes that enforce the NISO STS XML standard.
- Content conversion: critical components of analysis, specification, conversion, QA/QC, timeline, and budget. The content included some challenging situations that required some non-perfect decisions in terms of MathML, images, tables, and footnotes.
- Content interchange: how the new format helps with indexing and searching and supports improved discoverability of AWWA’s content.
We offer this video in bite-sized pieces so that you may zero into a specific section OR watch the entire recording.
Webinar Transcript
Marianne Calilhanna
Hello, and welcome to today's webinar. Today's webinar is titled "Fluid & Powerful: Adopting NISO STS at the American Water Works Association." My name is Marianne Calilhanna, and I'm the VP of Marketing here at Data Conversion Laboratory, or DCL, as we are also called. This webinar is part of the DCL Learning Series, and our mission here is to structure the world's content to better support our customers and facilitate worldwide communication of scientific content. Before we begin, I'd like everyone to know that this webinar is being recorded, and it will be available in the on-demand webinars section on our website at dataconversionlaboratory.com. Also, we invite you to submit questions at any time today. We're very fortunate to have today's speaker, who is an expert on NISO STS and publishing standards. So feel free to submit questions at any time. We'll save some time at the end.
And speaking of our speaker, I am delighted to introduce Daniel Berger, Senior Manager of Product Development at the American Water Works Association. Dan provides strategic leadership and technical expertise to drive digital transformation and product innovation across various content areas such as academic research, technical books, multimedia, consumer news, and more. Dan, thanks for joining us. I'm going to turn it over to you.
Daniel Berger
Great, thank you, Marianne, and thanks for the nice introduction. So, our talk today will be about an experience that American Water Works went through a couple of years ago with regard to publishing our standards. And I just wanted to let everybody know that the talk I'm gonna give today is a version of a talk I gave a few months ago at this year's JATS-Con, and that talk was based on a paper I co-wrote with Mark Gross, who's the President of DCL. Next slide, please.
So here's what we're gonna go through today. We'll talk about AWWA, and what an SDO is. We'll talk a little bit about how AWWA once made standards, and then why we decided to make a move to build out a digital workflow and new products. I will talk about our plan, you know, how we thought it might go, our hypothesis, and our experience while going through the project, the messy middle. And then what happened when we finished the project. And we'll finish up with some future questions for thought, and then we'll open it up to your questions. Next slide, please.
Great. So AWWA is a standards developing organization. So, what is a standard? I like this definition that I have here on the slide. A standard is a documented instance of rules, guidelines, or characteristics for activities and/or their results, established by consensus, and approved by a recognized body, aimed at the achievement of the optimum degree of order in a given context. That's a bit of a mouthful, but essentially a standard tells you how to do something in an accepted and acceptable way. And it tells you how to do it in a manner agreed upon by a committee. So standards are traditionally not written by a single person, but they're written by a committee and approved by a committee. So they are a true consensus document.
4:09
AWWA has been publishing standards about the water industry and the water sector for more than a hundred years. Our standards cover everything from pipes and valves to water meters to the executive management and leadership of a utility. We also cover other issues like cybersecurity and risk and resilience. AWWA has 180 active standards and more than 1,200 historic standards. So the active standards are the version of the standard that is active and accepted now. And every five or so years, standards need to be revised and updated as the world and technology, regulations, those sorts of things, change. So, that's why we have historic standards. The historic standards are still valuable, because they tell you how things might have been done in the past. So, for instance, if you're looking to upgrade a water main that was installed in 1984, you wanna look at the standard that was used at that time, not necessarily the standard that is in use now. Next slide, please.
AWWA for a long time created those standards as paper documents. Those paper documents were assembled into a standard and those standards were assembled and put together into a series of eight binders and then you would purchase, an organization would purchase, all eight binders. You could purchase a subscription, meaning that if a new standard came out, we would mail you the new paper version, you would have to go to the binder, find the old one, pull it out, put it in there, or hope that somebody was nice enough to go ahead and do that when the new standards came in to the office.
Eventually, we made the standards available as a PDF download, which you could go to our website and purchase the download of them, and then they would be available for you to download, and hopefully you wouldn't, you wouldn't lose them in your inbox or on your hard drive, or, or lose them when your hard drive crashed or you got a new computer. So what that means is that we had no, no XML. We had no structured content for these. We had PDF documents and paper documents. We had no standard workflow, because it was very, there are many different ways to make a paper document. So we often changed our workflow, or depending on who made that particular document, they may have their own workflow. And they weren't available on any, any sort of a platform. You know, you, you would get the binder, you would get the binders or the updates in the mail and the PDF would be a download from our website. So, next slide, please.
And this worked for a long time, but, but as the world started to change and people's demands changed, we decided that we were going to have to change as well. So, we started to think about how we might be able to accomplish a better product while keeping the standards and the content in the standards the same. We were looking for things like better search and easier access to the standards. We wanted faster production. How do we make these things and, and turn them around and increase the, the go-to-market timeframe? We wanted to make them available on more channels. Not just paper in a, in a large binder, or a PDF download that you have to have a login to our website to get. Next slide, please.
8:01
So, we came up with a hypothesis, and that hypothesis was that we were going to convert the legacy content first. This allowed us to take our time and not interrupt the flow of new standards being produced. We knew that we had a lot of content, which I'll go through in a little bit, and, and that we're more than likely going to encounter problems that we did not expect to encounter. And our thought was that we would use the lessons learned from this experience to provide us the experience and the ideas to create a new workflow going forward, so that we could create the best workflow and be able to seamlessly transition from, from the old paper-based workflow into a digital-first workflow, taking with us all the lessons learned from the experience of converting our legacy content. Next slide please.
So we decided to use the NISO STS, and the STS is the Standards Tag Suite, and it is a standard for standards. It's an XML standard. It's a derivative of the JATS standard, which was developed for journals and journal article content. It's also based on the ISO STS version 1.1, which was developed as an XML format for standards some years ago. And it was released in October 2017 and AWWA was one of the first organizations to formally adopt it. So some of the examples of STS-specific elements you can see here on the screen, and if not, the slides, I believe, will be available. They really focus on the structure and some of the more unique elements of a standard that are different than a journal article because of the historic nature of standards and because standards have preceding versions, where you've got an active standard, and you want to make sure that there's a digital link to the historic standards. So, there are elements in here that are specific to standards, but again, this is a derivative of JATS. So, there are some similarities between those standards, which make the adoption a little bit easier and faster. Next slide, please.
So, our plan was, first, we needed to find a partner to help us do this transformation work. It's not something that, that most societies are able to handle on their own. We needed to work with that vendor to implement a conversion plan. And then we also needed to define and develop our future workflows. So we decided as stated in the last slide that we would use the NISO STS for the standards and we would use the BITS XML for our manuals of practice. And our manuals of practice are also consensus documents. They're not quite the same as a standard, but they're similar in that they're consensus document, documents. And the BITS is the, is the standard XML for books. Next slide, please.
So as stated earlier, we knew to expect the unexpected. We were converting more than 25,000 pages of content. The content was going back more than 100 years. We had metadata inconsistencies and metadata gaps. The way we captured metadata was different through the years. It was in different formats and in different systems. In some cases, you would go 20 years with, with no metadata. So we had a lot of metadata inconsistencies and gaps that we knew that we were going to have to address.
12:08
We also knew that we were going to be converting content that was never intended to be converted to XML because a lot of it came out before there was even such a thing as XML. So, we knew that, that we were, we were going back in time to a world before, you know, there was any idea of the notion of structured content. Next slide.
Thanks. So, what I'll do now is I'll go through some of the challenges and some of the solutions that we came up with as part of the process where we went through converting the historical content. The first one we came up with was equations. You know, how do we handle equations? As with many standards and technical content, our standards have a lot of equations. There is quite a bit of math in those, there's a lot of engineering and a lot of chemistry. So, there's, there's a need to be able to properly convert equations.
One of the ways that sometimes this is handled is by simply taking a picture of the equation and publishing that as an image. And we knew that was a good fallback method, but we also knew that that was not an ideal method, because what we want is we actually want that equation to be in live type, so that people could actually use the equation and replicate it. So, the challenge was, how do we handle these equations, and we decided to use MathML, which is another XML standard that can be embedded into other XML; it's a way of representing math and equations in an XML format. The problem is, not all browsers support MathML, although that is changing, and so many of the major browsers do now support MathML.
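To make the idea of MathML embedded in other XML concrete, here is a minimal sketch in Python. The equation (Q = A v), the element ids, and the helper names `mml` and `display_formula` are assumptions chosen for illustration; they are not taken from AWWA's content or from DCL's conversion tooling. The `disp-formula` wrapper and the `mml` namespace prefix follow the usual JATS-family convention.

```python
import xml.etree.ElementTree as ET

# The MathML namespace; "mml" is the prefix conventionally used in JATS-family XML.
MATHML_NS = "http://www.w3.org/1998/Math/MathML"
ET.register_namespace("mml", MATHML_NS)


def mml(tag, text=None):
    """Create one MathML element (hypothetical helper)."""
    element = ET.Element(f"{{{MATHML_NS}}}{tag}")
    element.text = text
    return element


def display_formula(formula_id):
    """Build a display formula holding the illustrative equation Q = A v."""
    formula = ET.Element("disp-formula", {"id": formula_id})
    math = mml("math")
    row = mml("mrow")
    # mi = identifier, mo = operator; the equation itself is a made-up example.
    for tag, text in [("mi", "Q"), ("mo", "="), ("mi", "A"), ("mi", "v")]:
        row.append(mml(tag, text))
    math.append(row)
    formula.append(math)
    return formula


if __name__ == "__main__":
    print(ET.tostring(display_formula("eq1"), encoding="unicode"))
```

Because each symbol is its own element, the equation stays as live, selectable text rather than a picture, which is the property described above.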
And then one of the other challenges we came up with was, was fractions. Fractions are really a little, tiny little mini math equation. There are characters for some of those, you know, more common fractions such as one half or one quarter. But when you get into fractions such as, you know, three thirty-seconds or seven thirty-seconds, there is no, there is no character equivalent for that. So how do you represent that? Do you want to use MathML for such a little equation, and does that create any sorts of problems? So we decided to stick with the MathML, but for fractions, we were going to use a simplified HTML as a way to represent – hold on a second, apologies – as a way to represent the fractions. And so we would then wrap them in an inline formula so that we could easily find equations in the XML documents themselves. Next slide, please.
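The fraction approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual conversion rule: the talk does not show the markup, so the use of `sup` and `sub` as the "simplified HTML", the regular expression, and the function names `wrap_fractions` and `find_formulas` are all hypothetical. The input text is assumed to be XML-escaped already.

```python
import re
import xml.etree.ElementTree as ET

# Matches simple numeric fractions such as 3/32; skips date-like runs such as 1/2/2020.
FRACTION = re.compile(r"(?<![\d/])(\d+)/(\d+)(?![\d/])")


def wrap_fractions(text):
    """Wrap each plain-text fraction in an inline-formula element."""
    return FRACTION.sub(
        r"<inline-formula><sup>\1</sup>/<sub>\2</sub></inline-formula>", text
    )


def find_formulas(xml_fragment):
    """Return the text of every inline formula, showing why the wrapper helps."""
    root = ET.fromstring(f"<p>{xml_fragment}</p>")
    return ["".join(el.itertext()) for el in root.iter("inline-formula")]


if __name__ == "__main__":
    marked_up = wrap_fractions("Leave a 3/32 in. gap and a 7/32 in. margin.")
    print(marked_up)
    print(find_formulas(marked_up))
```

Wrapping even tiny fractions in the same element used for other inline math means one query finds every equation in a document, which is the motivation given in the talk.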
Another of the challenges that we encountered were images and figures. Again, a lot of our content has, has images. They've got figures representing the, you know, cross-section of pipes or, or, you know, physical features. A lot of times that art was old. A lot of times, the art had the, the figure labels and the figure captions embedded as, as text inside the image, so it was not readable. And a lot of times, the art was, was often the exact size in pixels as it was in print, which meant that sometimes the art was extremely small.
16:06
So, again, you know, we used DCL's programmatic tools to extract the art out of the PDF document, so that we could pull the figures out and then save them as images. Some of those images had to get cropped, re-cropped, so that the source and any sort of text in the figures was removed from the actual image itself. And then we had to key that text into the XML, so that we captured it as live text. And that of course took a lot of review. So, next slide, please, Marianne.
Thank you. Some of the other challenges we found were cross-references, links, and validations. A lot of times in content, in standards, you'll see a note within the text that says see figure three, or see table two, and we wanted to, to really add the value to the user of having these be digital, in digital versions, by adding links into the, into the XML and into the text itself so that when you had a note, an inline note that said see figure three, you could, you could click a star or some sort of indicator and be taken right to figure three. Same thing with a table, and the same thing with footnotes so that you had linking within the document, and of course that wasn't there in the PDF at all. We also refer to other documents as well, so often we referred to other standards that may be related or other documents. Some of those standards may be AWWA standards, and some of those standards may be from other organizations. So we really wanted to be able to add those links within the content itself.
So the way we went about doing that was we used some of the software tools that DCL has, to, to read the text itself and determine what was a footnote or a reference to a figure and add the links in to the XML itself. And then we also use some, some natural language processing and some machine learning to identify standards and other documents that were referenced that were outside that document and add the, the references to those documents when, when they were available. Next slide, please.
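A rule-based version of the internal linking step might look like the sketch below. It assumes references are written with digits ("Figure 3", "Table 2") and that target ids follow a `fig3` / `table2` pattern; both are assumptions for illustration, as is the function name `link_references`. DCL's actual tools, and the natural language processing used for references to other standards, are not shown in the talk and are not reproduced here.

```python
import re

# In-text references such as "Figure 3" or "table 2" (digits only in this sketch).
REFERENCE = re.compile(r"\b(figure|table)\s+(\d+)\b", re.IGNORECASE)

# Maps the word used in the text to a JATS-style ref-type value.
REF_TYPES = {"figure": "fig", "table": "table"}


def link_references(text):
    """Wrap each in-text figure or table reference in an xref element."""

    def to_xref(match):
        ref_type = REF_TYPES[match.group(1).lower()]
        target_id = f"{ref_type}{match.group(2)}"  # hypothetical id scheme
        return f'<xref ref-type="{ref_type}" rid="{target_id}">{match.group(0)}</xref>'

    return REFERENCE.sub(to_xref, text)


if __name__ == "__main__":
    print(link_references("For dimensions, see Figure 3 and Table 2."))
```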
Tables also created a large challenge for us. One of the biggest challenges with tables is that we had no set style. And as you can imagine, over, over many, many years, and, and literally tens of thousands of volunteers helping us to develop these standards, there were many, many different ways and approaches to tables. So, trying to extract those into a digital format so, again, we didn't have to, to use images, was a, was a challenge. There was a lot of inconsistencies. There were, there were a lot of just odd table styles that, that we had never, that, that we, we didn't have an easy way to handle. You can see a couple of examples here.
On the left, you'll see table C-2 is actually a table that has two columns: a distance and, and a variable A. But because this was set up for print, that table was broken up and wrapped into three columns. Well, a parser might see this as a table with six columns, but that's actually not a table with six columns. So we had to unwrap that, that table C-2, and we had to look for instances of that and, and do that unwrapping. On the table on the right, you'll see that this table is really a mixed table. Inside the table at the bottom, there's a figure, and also inside the table at the bottom, there's an equation.
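The unwrapping of a print-wrapped table such as C-2 can be sketched as below. The sketch assumes the header row has already been removed, that the wrapped blocks read top to bottom and then left to right, and that padding cells are empty strings; the function name `unwrap_table` and the sample values are hypothetical.

```python
def unwrap_table(rows, logical_cols=2):
    """Turn a table wrapped side by side for print back into its logical columns.

    Each physical row holds several logical rows, for example
    distance | A | distance | A | distance | A.
    """
    if not rows:
        return []
    width = len(rows[0])
    if width % logical_cols != 0:
        raise ValueError("row width is not a multiple of the logical column count")

    unwrapped = []
    # Read the first block of columns top to bottom, then the next block, and so on.
    for start in range(0, width, logical_cols):
        for row in rows:
            cells = row[start:start + logical_cols]
            # Skip padding cells left over when the last block is shorter.
            if any(cell not in (None, "") for cell in cells):
                unwrapped.append(cells)
    return unwrapped


if __name__ == "__main__":
    wrapped = [
        ["0", "1.0", "20", "3.0", "40", "5.0"],
        ["10", "2.0", "30", "4.0", "", ""],
    ]
    for logical_row in unwrap_table(wrapped):
        print(logical_row)
```

Detecting which tables are wrapped in the first place is the harder part and, as the talk notes, took review; this sketch only covers the mechanical step once a wrapped table has been identified.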
20:17
So these are not necessarily straightforward tables. So some of these took some, some, you know, quite a bit of review, but we got better at, at predicting what was going to work and what wasn't, and being able to come up with solutions. Next slide, please.
Quality control, of course, is another issue. You know, AWWA publishes content on water, and tap water, and drinking water, and making sure that these were accurately transformed was really a matter of public health. So we just could not publish without making sure these things were as close to 100% accurate as possible in the conversion. So we really needed to come up with multiple levels of review. And so we came up with a workflow to do that. We used XML validators and parsers to make sure the XML itself was valid. We used automated software to do other sorts of validation, you know, checking for proper reference tags and linking, and making sure that we're referencing the right standards. We had to check that specific metadata was properly applied and put into the right place in the XML.
We had to go back and compare the XML to the PDFs and we actually had to send these back to, to our editorial and technical staff to review. At first, we were doing almost all of them. And then we started to, once we figured out where the problems might be, we were able to become more efficient and do more spot checking. But each of the standards was reviewed line by line, by, by somebody on our, on our QA team. Next, next slide, please.
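Two of the automated checks mentioned above, a well-formedness check and a link check, could be sketched like this. It uses only the Python standard library, so it does not validate against the NISO STS DTD or schema; the function name `check_document` and the sample markup are assumptions for illustration and do not represent AWWA's or DCL's validators.

```python
import xml.etree.ElementTree as ET


def check_document(xml_text):
    """Return a list of problems found in one XML document (empty if none)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        # The parser stops at the first well-formedness error.
        return [f"not well-formed: {error}"]

    # Collect every id in the document, then confirm each xref points at one.
    known_ids = {el.get("id") for el in root.iter() if el.get("id")}
    problems = []
    for xref in root.iter("xref"):
        target = xref.get("rid")
        if target not in known_ids:
            problems.append(f"xref points to missing id: {target}")
    return problems


if __name__ == "__main__":
    sample = (
        "<standard><body>"
        '<p>See <xref ref-type="fig" rid="fig9">Figure 9</xref>.</p>'
        '<fig id="fig3"/>'
        "</body></standard>"
    )
    print(check_document(sample))
```

Checks like these catch structural problems cheaply, which leaves the line-by-line human review described above to focus on whether the content itself was converted accurately.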
So, where did we end up at the end of this project? Was our hypothesis correct, about doing the legacy content first? What did we accomplish? Where did we end up, and what does it mean that we finished converting all of this content into XML? Next slide, please.
So the results of having all of this content now in XML was that we could start to really leverage that XML for other features, such as downstream search and discovery. We could start to build out a better distribution channel in a new platform for the standards manuals. We could start to look at new features that we knew that our, our standard users were asking us for, and we could start to build out new workflows to get this work done. Next slide, please.
So, downstream search and discovery. By having XML, you now have your content in a machine- and human-readable format. You're also able to do some things like semantic enrichment, where you can add additional tags, or additional metadata, to it through automated tools. And you're able to add that content to indexes and other search engines, so search engines and indexing systems can look at the XML, they can pull out the pieces that they need, and they can create their own index and search of it. They can weight the different areas of the XML, you know, heavier or lighter in terms of the importance of that particular area.
24:04
On your right, you see an example of our content, which is now searchable in Elsevier's Engineering Village. The content is not available on Engineering Village. We have our own platform. But we're now able to offer our content to be searched and found in the Engineering Village product. And this is a new thing for us, and it's really been extremely helpful to be able to have our engineering content searched and searchable in more places. Next slide, please.
So as I said, we were also able to build our own platform. Our platform is called envoi. And it offers all of our standards managed and organized by the standard family so that you can find the active version and go back and find the historic versions. And the way the platform is set up, you're always going to the active version, so there's no more concern from engineers, you know, unsure whether they have the latest version of a standard. As soon as that standard is available, it's on the platform and next time that engineer goes to the platform, they'll get that standard. In the past with those binders, if there was an update, we would put that printed update in an envelope and put it in the mail. If it got lost in the mail, or the offices were closed for any number of days, or that envelope arrived at somebody's desk and they were out or they had left the organization, there was no guarantee that the new standard was actually going to make it into the binder and the old one would be removed.
So, we've taken that out of the equation at this point, by always making sure that the, our, our customers have access to the most recent version of the standard. We can, by having it on a platform, we can, we can enact better entitlements and access control. We have much better access, we have much better ability to make sure that the right people can access the content and it's available to those who have provided us the right, the right kind of access to it, and it doesn't become available to those who either have not or do not have access to it. And it becomes an overall better experience for our users. Next slide, please.
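The idea of organizing editions by standard family, always serving the active edition while keeping historic ones findable, can be sketched as below. The edition years and the function names `active_edition` and `edition_in_effect` are hypothetical; the talk does not describe how the platform stores or resolves editions.

```python
from bisect import bisect_right


def active_edition(edition_years):
    """The active edition is taken to be the most recent one in the family."""
    return max(edition_years)


def edition_in_effect(edition_years, year):
    """Return the edition that was current in a given year, or None if none existed."""
    ordered = sorted(edition_years)
    position = bisect_right(ordered, year)
    if position == 0:
        return None
    return ordered[position - 1]


if __name__ == "__main__":
    # Made-up edition years for one standard family.
    family = [1975, 1980, 1989, 2017]
    print(active_edition(family))
    # For a water main installed in 1984, look up the edition in use at that time.
    print(edition_in_effect(family, 1984))
```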
Some of the new features that we're able to add on this platform include redlining. And this is a very popular feature. It allows you to take the two XML documents and put them together in a rendering that shows you everything that has changed. So you'll see here on the, on the image that everything that's in red has been removed, and everything that's green has been added. So people can really see, you know, what, what's new in this new version of a standard. And it really, it's a very effective feature.
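A word-level redline between two versions of a clause can be sketched with Python's standard `difflib` module. This is an assumption about one way to do it, not a description of how the platform's redlining works; the bracket markers stand in for the red and green rendering, and the function name `redline` and sample sentences are made up.

```python
import difflib


def redline(old_text, new_text):
    """Mark removed words as [-...-] and added words as {+...+}."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)

    pieces = []
    for op, old_start, old_end, new_start, new_end in matcher.get_opcodes():
        if op == "equal":
            pieces.extend(old_words[old_start:old_end])
        if op in ("delete", "replace"):
            pieces.append("[-" + " ".join(old_words[old_start:old_end]) + "-]")
        if op in ("insert", "replace"):
            pieces.append("{+" + " ".join(new_words[new_start:new_end]) + "+}")
    return " ".join(pieces)


if __name__ == "__main__":
    print(redline(
        "The test pressure shall be 100 psi.",
        "The test pressure shall be 150 psi for all pipe.",
    ))
```

In practice the text would be pulled section by section from the two XML documents, so that changes are reported against the matching clause rather than across the whole standard.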
We also have features such as annotation on sections, so if you wanted to add a note to a particular section and share that with your colleagues, you could go ahead and do that. We're also able to make the standards available on mobile devices and actually readable on mobile devices. Reading a PDF on your phone is never a fun process. It's a lot of pinching and zooming. But if the content is in a digital format that can flow to the size of your view screen, then you can actually access that content on a mobile device. And we can also offer offline access as well for those people who might be in the field. Next slide, please.
28:04
So this also requires some new workflows. And this requires some new skills for the people who are having to do those workflows. And so I wanted to walk through this a little bit. On the left here, you'll see an XML feed validator. So, you know, when XML comes in from our vendors, we need to validate it. So we need somebody on staff who knows how to run and manage a feed validator and then read the results. And then also somebody who can go to the documentation. And there's really very good online documentation for NISO STS, at NISO, I think it's niso-sts.org. It has excellent documentation. So that somebody who was looking at this validation, these errors, could go back, look at the documentation, and then make the changes in the XML if necessary. So these are new workflows and new skill sets that are required to do this, and you'll also see on the right here, there are a lot more files that need to be managed. In the past, you know, when we had a standard, we had a PDF version of that standard, and that's what we had to save.
But now we've got not just the PDFs, but we've got the XML version, and inside the XML version, we might have images; because our content is in XML, we can generate other formats as well, including EPUB and MOBI. And so all those need to be stored as well. We can also take the XML and regenerate a Word file, and this has been very helpful for our committees who work on the standards, so that after a standard is finished and published, we can provide them a Word file with the latest version of the standard in a format that they can then go and use when it comes time for them to review the next, the standard at the next time. Next slide, please.
So what's next for what we can do now? So some of the things that I've thought about in terms of what we can do now that we have our content in NISO STS XML: can we use this XML earlier in the production process? And by that I mean, you know, right now our committees write the standards in Microsoft Word, and then they're given to us to be edited, and then those Word documents are sent to our vendor to be converted. But, could we actually come up with a way where the committees who are writing this are actually writing and editing and working in a version of XML itself? They may not see the tags, but there may be an interface like a Google Doc or something like that that allows them to work where the XML, the NISO STS, is actually active in the background. And that would really speed up our production processes later on.
The other area, or one of the other areas, is, can components of a standard be more easily transferred between other documents and added to other documents? So could we create an interface where an engineer could take a chunk of a standard or a piece of a standard, maybe it's an equation, or maybe it's a table, and pull that out in a format that can, number one, be added to another document, a spec document or something like that, but also, number two, that retains the information and the original metadata from the standard, so that anybody who's looking at that new document can see this piece of the document came from this other standard.
32:01
So there are lots of other questions like that. XML really provides a wealth of opportunity to leverage your content because it's so well structured. Next slide.
So that's the presentation that I had for today and I would love the opportunity to answer some questions.
Marianne Calilhanna
And we do have some questions. So, thanks so much for telling that story, Dan. I just, I just love hearing what you and AWWA and DCL were able to accomplish. It's, it's a great story. Okay, so, one question. Did you have basic XML metadata before the project, or did you start not having any XML, and go straight to full text XML?
Daniel Berger
So we had some metadata that was stored as XML. But, but again, it was, it was inconsistent, and it was full of gaps and holes, so it was a good starting place, but, but it wasn't, it wasn't complete. But that was, and that was also only for, for metadata. We didn't have any XML of the actual body content.
Marianne Calilhanna
Okay. Another question. Um, do you think we must adopt XML to publish standards?
Daniel Berger
No, you don't have to. You know, I mean, it depends on what your ultimate goal is. Some, some organizations, you know, maybe they don't have a lot of standards. They have a few standards. PDFs may be working well for them and their users. What I would really recommend is, is going to your users and trying to understand what it is that they want from, from a, an end product. You know, XML is not an end product. It's, it's a tool to get you to an end product. It's, it's a tool to be leveraged. It's a data format, it's not an end product, so, so, you know, asking whether you need XML is maybe not the right question. The right question is, what is your end goal? You know, what do your users need? What do you need as an organization? Are you really struggling with the time to, to turn around standards and get them into the marketplace? Are you really struggling with, with making those standards available to your, your customers in a format that the customers want? So asking some of the questions like that might help lead you to, to a better answer than, than simply "Do you need XML?"
Marianne Calilhanna
All right. And I have a question, and it's regarding question two here. Have you answered that? Have you come up with a way to reuse parts, or parts of the standards, in other documents or other standards?
Daniel Berger
You know, we have not done that much. Part of the reason is, again, it goes back to our users. This is not something that they're demanding. This is not something that they're asking for right now. Other organizations in other fields, I know that they are asking for this kind of a feature, so I know that it's on the horizon. The water sector does not tend to be the most innovative and out-front sector when it comes to engineering in terms of documents and specifications. So, I know this is something that's on the horizon, and I know that it's something that other industries are looking at and asking for, but the water sector hasn't gotten there quite yet. So.
36:05
Marianne Calilhanna
Okay. Um, has the Elsevier Engineering Village, did that generate new revenue? Bring new revenue into AWWA?
Daniel Berger
That's a great question, and I would say that it's too new to know the answer to that question, but it has given us more of a way into the marketplace. So what it's done is it's helped people learn about AWWA and the resources that we have available. And it's certainly brought people to AWWA asking for some of our content. But it's hard to say whether it's really made a significant revenue difference. I would imagine at this point, probably not, but that's not always the end goal; sometimes the goal is more searchability and discoverability of your content and letting people know who you are and what you have. You know, there may be some users that already have access to AWWA standards, but using the Elsevier Engineering Village has given them a way that they may be more comfortable to access it or find it...
Marianne Calilhanna
Right.
Daniel Berger
...for other things as well.
Marianne Calilhanna
Interesting. What is your in-house XML validator? Is it something custom? Can you speak a little bit to that?
Daniel Berger
Yeah, we have a couple of tools that we use. They're both custom. I don't know of any commercially available validators, other than using some online tools. The W3C has an XML validator online that's very good. It will just do a basic validation of your XML. But, you know, even though this is an XML standard, every organization is going to implement it slightly differently, and have different expectations about what it is they want and how they want things to be in there. So, we're using a couple of different custom validators. They're not very fancy. They don't have beautiful interfaces but they're very effective and efficient. And you know, most vendors who are working on XML and conversion will probably have something that they can share with you. So, if you're working with a vendor, and you're curious about validation tools, I would go ahead and ask that vendor what they have available to share.
Marianne Calilhanna
All right. And, um, the final question we have here is, I'm just going to flip back. It is, what is going on in the "Expect the Unexpected" slide? And I'm just trying to – there we go.
Daniel Berger
This was a, this was a great picture that I found. These were, these were bugs that had made nests in this tree. And they were, they were so, so prolific. And it was just really meant to, to express this idea of, that's not what I expected when I, you know, when I, when I look at a tree along a riverbank, to see them shrouded in that kind of a thick net.
Marianne Calilhanna
Right, right.
Daniel Berger
Yeah.
39:33
Marianne Calilhanna
Well, Dan, thank you so much for sharing your story. Thank you, everyone who took some time out of their day to join us. Um, the DCL Learning Series comprises webinars such as this. We also have a monthly newsletter and our blog; you can access many other webinars related to content structure, XML, and a whole lot more from the on-demand webinars section on our website at dataconversionlaboratory.com. I hope to see you at future webinars. Thanks, everyone. Have a great day, and this concludes today's broadcast.
Daniel Berger
Great, thank you.