Structured Content: the Foundation for Digital Transformation
from The Content Strategy Experts, a Scriptorium podcast
This informative and entertaining conversation presents a number of projects in which DCL has been involved to help organizations bring new digital products to life as well as projects that streamline complex tasks with automation and content structure.
Alan Pringle: Welcome to The Content Strategy Experts podcast, brought to you by Scriptorium. Since 1997, Scriptorium has helped companies manage, structure, organize and distribute content in an efficient way. In this episode, we talk with guest Amy Williams about how content structure provides the building blocks for innovation. Hey, everybody. I’m Alan Pringle. We have a special guest here today.
Amy Williams: Hi, Alan. This is Amy Williams. I’m here from Data Conversion Laboratory.
AP: Hey there, Amy. First, let’s do some introduction so people know who you are and what your company does. So tell me a little bit about yourself and about DCL.
AW: So I’ll start with DCL. We’ve been in business over 40 years.
AP: Good for you.
AW: And, yeah, I think we have you beat. I think we’re 1981.
AP: Yeah. We’re ’97, so you have.
AW: Right. Essentially, what we do is provide data and content transformation services and solutions. We use different technologies to provide those services, including various AI technologies that you probably hear a lot about, like machine learning and natural language processing, and ultimately we use those to help our customers structure their data and their content so they can use them on different technologies and different platforms. That’s essentially what we do. I’m the Chief Operating Officer at DCL. I’ve been here for 24 years. I come from a management consulting background.
AW: I know, it’s a long time. I was so shocked when I say it myself.
AP: But hey, that means you know what you’re talking about. I think you’ve given us a good springboard with that introduction into what you and I want to talk about today. We’re going to talk about how structured content is the building block, the basis, whatever you want to call it, for doing these digital transformation and innovation projects. Would you give me what your definition of digital transformation is?
AW: So really, I’d say, at its core, digital transformation is using digital technologies to create or modify business processes and your customers’ experiences. And the goal here you’re trying to meet, business needs are changing all the time, you’re trying to meet those changing business needs and the market requirements. But I would say it’s really the re-imagining of your business in a digital age. So, I guess, if you think about it, most companies really started this transformation a long time ago. We used to have analog processes. People started to go digital. So that was sort of the first step in the digital transformation. But if you think about it, we had filing cabinets full of paper and ledgers were built to [inaudible 00:02:47] their books. And then to digitize things, we went to word processors, and spreadsheets, and scanned hard copy.
I guess, when I’m talking about digital transformation, I’m talking about taking that next step and changing the way you’re doing your business, from your internal systems to your customer interactions. If, as a company, you start to think and plan and build processes with digital innovation in mind, you really start to future-proof yourself, because you’re going to become more agile, more flexible. You’re ready to embrace these new technologies. Basically, everyone has to keep up with the times to succeed, so that’s really how I see digital transformation. That’s what it is.
AP: Yeah, and that fits with kind of our, at Scriptorium, we of course, have a very content-specific view of digital transformation. And our shorthand description I think can be summed up as something like using technology to enrich the delivery of information to customers. I think you hit on a lot of good points there, especially in regard to future proofing. But let’s dial it back, go all the way back, and let’s talk about from… You’ve got this big overarching idea, but, at the core of it, you’ve got to make some changes about the way that you handle information, the way that you handle content. And really, that pivot, from my point of view, and well, not just mine, but a lot of people’s point of view, is that structured content is at the core of doing this future proofing so you can do this digital transformation. Do you want to talk about that a little bit?
AW: Yeah. I totally agree. Obviously, we’re in sort of the same business here. To me, the same thing, it’s a key building block for digital transformation, is structured content. I mean, there’s other pieces, obviously, of it. But from my perspective, and we’re both a little biased here, that structured content is that key building block here. I mean, I could talk a little bit about structured content if you, I mean, want me to do that.
AP: Yeah, we might as well. Why don’t we go ahead and just define it again. This digital transformation, people have slightly different definitions, so let’s hear yours.
AW: Okay. From my perspective, I mean, obviously, all companies, organizations, everyone has archives of content, and it’s different across industries. It could be historical documents, photos, industry standards, research. It just depends what industry. The problem is it’s not all in a searchable format. I was just talking a little bit about digitizing as that first step to the transformation. But people think of a PDF this way: I took this, I scanned it, I’ve got a PDF. It’s a digital document. Well, obviously, it is digital, but it’s not really, because it’s not in a truly searchable format. So that’s where the structured content comes in. We have to take that image-based PDF, take it to the next level. So you can run it through an automated OCR engine.
AP: And tell people what that is.
AW: Oh, so OCR is an Optical Character Recognition engine. And when you run it through the OCR, you get text behind that. So it’s not always beautiful text. It can be searched. But sometimes it doesn’t come out exactly right. I’s and ones and L’s might be mixed up. It depends on what the source format is and what the quality is. And so it could be searched. The problem is if you don’t know the structure of that text, because basically you just have a bunch of text behind that image, it’s not going to be a very efficient search. So that’s where the issue comes in. And really most of the content that people are producing now for the most part is not structured. People are using Word and Google Docs and it really produces unstructured content.
And what’s happening here is, when you’re writing these things, authors that are typically writing a Word document or Google Docs or something like that, they’re really concentrating on the way the content looks, instead of what that content actually is. So for example, if you’re writing something and you have an introduction to a journal article and you say, “This introduction is going to be bold,” well in XML or in structured content, you would say, “This is an introduction.” You would actually say what this is. So when we talk about a searchable format, I’m really talking about XML here. That’s what we’re talking about.
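Editor's sketch: the contrast Amy draws between "this is bold" and "this is an introduction" can be shown in a few lines of Python. This is only an illustration under stated assumptions; the element names (`article`, `introduction`, `body`) are hypothetical and do not come from any schema mentioned in the conversation.

```python
import xml.etree.ElementTree as ET

# Presentation-only markup: the text is bold, but nothing says what it is.
presentational = ET.fromstring(
    "<doc><b>Structured content supports reuse.</b><p>Body text.</p></doc>"
)

# Semantic markup: the element name states the role of the text.
# The element names here are hypothetical, chosen for illustration.
semantic = ET.fromstring(
    "<article>"
    "<introduction>Structured content supports reuse.</introduction>"
    "<body>Body text.</body>"
    "</article>"
)


def find_introduction(root):
    """Return the introduction text, or None if nothing is labeled as one."""
    node = root.find("introduction")
    return node.text if node is not None else None


intro_from_semantic = find_introduction(semantic)
intro_from_presentational = find_introduction(presentational)
```

With the semantic version, software can ask for the introduction by name. With the bold-only version it cannot, because the markup records appearance rather than meaning.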
AP: Sure. And like you said, we’re both kind of biased. I would agree with you. XML is really the way to do structured content. And when I say structured content, what I am saying is it is a publishing workflow that lets you define very consistently organized content in your documents, programmatically, so a human being doesn’t have to do it. So it sets up. You’ve got to have an introduction that has these types of elements. You’ve got to have a procedure that has this kind of structure. So all of that is programmatically enforced. And on top of that enforced structure, this is the critical part, and I think you may agree with this too, because again, we’re both biased, that you can add a layer of intelligence on top of that, that is really necessary from this delivery perspective, in particular, from my point of view.
AW: Right. And I’m assuming you’re talking about a metadata layer, right?
AP: Exactly. Exactly. Yes.
AW: Right. So in time that will facilitate an even more efficient search in your content management system or your website. Basically, if you can’t find your content, it’s really not usable, so that’s really the key here.
AP: Exactly. And it goes both for the people who are creating the content, because if you have all of these bits and pieces of structured content inside a content management system, the people who are creating the content need to be able to gather all the bits that they need. And if they can’t find them, they’re going to probably rewrite it, which is what you don’t want to do. Plus on the delivery side, you may need that intelligence to personalize that content so you can send out something that is very specific to a region, or to the audience, or whatever else.
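Editor's sketch: one way to picture the metadata layer discussed here is as attributes on each piece of content that software can filter on, both for authors looking for existing content and for delivery to a specific region or audience. The attribute names (`audience`, `region`) and the sample topics below are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical topics; the "audience" and "region" attributes are assumptions.
SOURCE = """
<topics>
  <topic id="setup-us" audience="customer" region="US">Plug in the unit.</topic>
  <topic id="setup-eu" audience="customer" region="EU">Connect the unit.</topic>
  <topic id="service" audience="technician" region="US">Open the rear panel.</topic>
</topics>
"""


def select_topics(root, **wanted):
    """Return ids of topics whose metadata matches every requested value."""
    return [
        topic.get("id")
        for topic in root.iter("topic")
        if all(topic.get(name) == value for name, value in wanted.items())
    ]


root = ET.fromstring(SOURCE)
us_customer_topics = select_topics(root, audience="customer", region="US")
us_topics = select_topics(root, region="US")
```

The same lookup serves both uses mentioned above: an author checking whether a topic already exists before rewriting it, and a delivery system picking the content that fits one audience.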
AW: Right. You sort of touched on it a little bit there, because to me, one of the biggest benefits of structured content also is content reuse. Structured content facilitates content reuse. So basically, instead of creating and recreating, copying and pasting content, you’re creating that XML once, that instance once. And then other people in your organization can use it or reuse the content, but you’re also going to be able to publish it everywhere. So different apps that need it, or integrated systems that can use that XML and render it for different devices and generate PDFs for distribution, create eBooks, all of those things that can happen once you have that structured content in place. And really, I guess, the opportunities are endless as far as I see it. But it all comes back to that building block of structured content.
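Editor's sketch: "create the XML once, publish it everywhere" can be illustrated with one source and two renderers. The topic structure and the two output formats are assumptions made for this example; a real pipeline would typically use a publishing toolchain rather than hand-written functions.

```python
import xml.etree.ElementTree as ET
from html import escape

# A single hypothetical source topic, written once.
SOURCE = (
    "<topic><title>Replacing the filter</title>"
    "<step>Open the cover.</step><step>Swap the filter.</step></topic>"
)


def to_html(topic):
    """Render the topic for a web page."""
    title = escape(topic.findtext("title"))
    items = "".join(f"<li>{escape(step.text)}</li>" for step in topic.findall("step"))
    return f"<h1>{title}</h1><ol>{items}</ol>"


def to_plain_text(topic):
    """Render the same topic for a plain-text channel."""
    lines = [topic.findtext("title").upper()]
    lines += [
        f"{number}. {step.text}"
        for number, step in enumerate(topic.findall("step"), start=1)
    ]
    return "\n".join(lines)


topic = ET.fromstring(SOURCE)
html_output = to_html(topic)
text_output = to_plain_text(topic)
```

Both outputs come from the same instance, so a correction made in the source appears in every channel the next time it is published.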
AP: Sure. I’m glad you brought up reuse, because when people hear digital transformation, they may think of big, shiny, beautiful marketing things and all the fancy technical ways that you can deliver content. But that reuse angle lets you basically give a very common voice or give clients your customers the same information regardless of where they are, if they’re in the sales cycle, or if they’re using their product, or whatever else. By reusing that content, you are giving consistent messaging. And yeah, it’s not as glamorous as some flashy kind of personalized distribution scheme. But that really, I think, is super important when we’re talking about digital transformation.
AW: Right. I agree. And you hit the nail on the head. It’s not super fancy. They say content is king. It’s true. It is.
AP: Yeah. Absolutely. So your wheelhouse, like ours, is structured content. So why don’t you tell me what you’re seeing out there right now, as far as trends go with structure helping these digital transformation scenarios?
AW: Right. So there’s a few different things that we’re seeing. I wanted to talk a little bit about the pharma industry, because we’re seeing a real big uptick in the use of structured content in that area, in life sciences and pharma. And really, what’s driving that is, you can imagine, there’s a lot of documentation required to bring new drugs to market. So here in the US we have a markup language called SPL. It stands for Structured Product Labeling, which the FDA has mandated. But what we’re seeing now is the pharma companies looking past that and worldwide. I mean, we’re dealing with companies all over the globe right now. And they’re starting to look at where they can implement other tools and technologies that are using that structured content.
And the types of applications we’re being asked to support are really about streamlining the content around product labeling in the pharma industry. And the goal there is, they’re trying to improve the way that the content’s created and managed and delivered. It’s a full end-to-end process. And they’re connecting at the end that product content with the graphic templates. And they’re really putting together a fully-automated workflow around labeling. I mean, it’s really amazing. It’s really a transformation of that whole internal publishing process for pharmas. It’s the same kind of thing that we’ve always done in tech docs. And really, the pharma industry is starting to come around to that end-to-end process and using structured content underneath. So it’s really, really exciting.
And the other trend we’re seeing actually in the pharma industry is they’re also starting to use structured content for a direct end user consumption, like through mobile apps. We just recently worked on a pilot, still around labeling. You know the labels you get when you get a prescription drug and there’s pieces of paper that you fold up? So they’re really going online and digital with those things. And they’re looking at ways for the end user and you as a consumer to go and get the most up-to-date information about those products. So that’s really interesting also.
AP: So this integration you’re talking about, it kind of is an integration in two ways. You’re integrating your processes that really assist with this automated delivery of content. But you’re also integrating things in regard to delivery and making it much easier for people to get information, your consumers, to get information, because it’s no longer just on a piece of paper. Not everybody wants to read a piece of paper in the 21st century anymore.
AW: Right. Right. And the piece of paper may be out of date.
AW: And that’s really important. It’s a liability issue. I think a big reason why they’re being embraced in the pharma industry is liability and minimizing risk.
AP: Yeah. And I’m glad you brought that up. Again, digital transformation is not just about the shiny stuff. It can really help with regulatory compliance, a lot, and give you all of the basically, intelligence you need to keep track of things, the archiving and whatever else, because you’ve got that really nice integrated process in the background managing all that information for you.
AW: Right. And it’s interesting, Alan, the legislation, that was the other area that when you asked me about trends that I wanted to hit on, that we’re seeing now. We’ve worked on a few projects now where we’re harvesting this complex legal and regulatory content from public websites. And we’re seeing this trend in several industries. I’ve seen it in the financial industry. We’re seeing it in insurance and legal and accounting. And what’s going on is there’s all this information that appears only in public websites, this legal and regulatory type information. And their sites are constantly being updated with new content, modified content. It’s just so hard for people to keep track of it, for companies especially to keep track of it. And it’s extremely valuable, but there’s no standard for it or anything. And it’s a real challenge for companies that need that data so they can be in compliance.
And so what we’re seeing now is a bunch of projects where we’re developing applications that are just harvesting that information on a continuous basis and then structuring it, putting it into some form of XML, feeding that XML to their downstream system. So it’s streamlining that compliance process and back to avoiding the risk of non-compliance. I mean, they’re really, really important applications.
AP: Yeah. And that’s certainly better than keeping 1,400 filing cabinets full of musty old paper, isn’t it?
AW: Right. Right. And I don’t think they were really doing that. I mean, they have the information. They’re on websites. The problem is, how efficient is that? If you have up to 150 legislative websites that you need to keep track of and comply with different laws, it’s very difficult. You can have a whole stable of attorneys or legal aides sitting there working on this, but it’s just not efficient, unless it’s in a structured format and a consistent structured format. You can look at one website and it’s one way, and another website’s another way. And we’re talking documents here. It’s a little different than in an Amazon, your product details, that type of thing, but you’re talking about full legal documents.
And then you have to know what changed and what got updated and what got deleted. And you need to know that on an ongoing basis. And you need to follow those. So, I mean, it’s a lot of really valuable information that needs to come out. So we’re seeing a lot of these harvesting projects happening, with structured content being the outcome.
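Editor's sketch: the need to know "what changed and what got updated and what got deleted" between harvests can be approximated by comparing fingerprints of each document across two snapshots. This is not a description of DCL's applications; the snapshot format (a dictionary of document id to text) and the sample rules are invented for illustration.

```python
import hashlib


def fingerprint(text):
    """Hash the text after collapsing whitespace, so layout-only edits are ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def diff_snapshots(previous, current):
    """Compare two {document_id: text} snapshots from successive harvests."""
    added = sorted(set(current) - set(previous))
    deleted = sorted(set(previous) - set(current))
    updated = sorted(
        doc_id
        for doc_id in set(previous) & set(current)
        if fingerprint(previous[doc_id]) != fingerprint(current[doc_id])
    )
    return {"added": added, "updated": updated, "deleted": deleted}


# Hypothetical harvest results from two runs.
last_run = {
    "rule-1": "Reports are due in 30 days.",
    "rule-2": "Keep records for 5 years.",
    "rule-3": "File  annually.",
}
this_run = {
    "rule-1": "Reports are due in 45 days.",
    "rule-3": "File annually.",
    "rule-4": "Notify the regulator.",
}
changes = diff_snapshots(last_run, this_run)
```

Hashing normalized text is a deliberately simple choice. Comparing structured XML element by element would give finer-grained answers, such as which clause of a regulation changed.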
AP: Are there any other projects that really show some, I don’t know if surprising is the right word, but uses that you may not necessarily consider as being a digital transformation project that you want to talk about?
AW: I think mostly everything we get involved in is a digital transformation project. I mean, we have some, I think, some particularly interesting projects. But there’s one that we’ve been doing. We’ve been working for over 10 years now with the US Patent and Trademark Office. And I mean, it’s another good example of digital transformation. So you can imagine the USPTO receives a massive volume of patent application materials on a daily basis. And it’s a lot of different document types. And this is a lot of information. And they did have this process to digitize the incoming material. They had a whole scanning process going on, but they’re scanning to TIFF images. So it’s back to that same thing, you’ve got this information in sort of a static digital-
AP: It’s a picture, essentially. [inaudible 00:17:57], yeah.
AW: It’s a picture. Right, right. So it was taking the patent examiners way too long to go through the material. They had a multi-year backlog, when we started this, of reviews and approvals of patents, which obviously, is not acceptable to anybody. So at DCL, we developed a fully automated system for them that transforms that high volume of scanned images to their XML schema. So they have their own XML schema. And what’s interesting about this… Well, the volume is interesting, because this is a totally lights-out, no human hands are touching this process. And we’re processing about one and a half million pages a month. And the turnaround’s under 10 minutes. So it’s fully lights-out conversion. And even the volumes in some months have gone to two and a half million pages in a month. And it can scale to several times that.
But the really interesting part of this is the way that we were doing the OCR, because we talked a little bit about OCR and how you can scan something. And then what you’re getting behind is not so great. Sometimes the OCR doesn’t work very well with tables and things like that. So the process that we developed now, it uses a computer vision technology. And it automatically detects that content that isn’t suitable for OCR. So things like math and there’s a lot of chemical equations. You can imagine in patent applications, they have a lot of those chemical… I don’t know what they’re called, equations? Or the pictures, the chemical pictures, the formulas, that’s what they are, and tables and things like that. So this process will extract those artifacts before it actually runs that OCR process.
So you’re running the OCR process just on text. So you get a better result. It’s removing those pieces that won’t OCR properly. And then we transform the content to XML, repackage it, the XML, with those artifacts that were removed. And you do that based on those page coordinates. So we did that computer vision to figure them out, we kept the page coordinates. And then you put them back together. And then they get delivered to USPTO.
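Editor's sketch: the workflow described here (set aside regions that will not OCR well, run OCR on the text only, then put everything back using page coordinates) can be outlined as follows. This is not DCL's system and not the USPTO schema. The region labels, element names, and file names are hypothetical, and the detection and OCR steps are assumed to have already run, so each region simply carries its result.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Region:
    kind: str      # "text", "formula", or "table" (hypothetical labels)
    top: int       # page coordinate of the region's upper edge
    left: int      # page coordinate of the region's left edge
    content: str   # OCR text for text regions, image file name for artifacts


def split_regions(regions):
    """Separate OCR-friendly text regions from artifacts to be set aside."""
    text_regions = [r for r in regions if r.kind == "text"]
    artifacts = [r for r in regions if r.kind != "text"]
    return text_regions, artifacts


def reassemble(text_regions, artifacts):
    """Restore reading order from coordinates: top to bottom, then left to right."""
    return sorted(text_regions + artifacts, key=lambda r: (r.top, r.left))


def build_page_xml(regions):
    """Package OCR text and artifact references into one page element."""
    page = ET.Element("page")
    for region in regions:
        if region.kind == "text":
            ET.SubElement(page, "para").text = region.content
        else:
            ET.SubElement(page, "artifact", kind=region.kind, href=region.content)
    return page


# Regions as a detector might report them, in no particular order.
detected = [
    Region("table", top=520, left=50, content="table-001.png"),
    Region("text", top=100, left=50, content="The compound is prepared as follows."),
    Region("formula", top=220, left=50, content="formula-001.png"),
    Region("text", top=400, left=50, content="Yields are listed below."),
]
text_regions, artifacts = split_regions(detected)
merged = reassemble(text_regions, artifacts)
xml_output = ET.tostring(build_page_xml(merged), encoding="unicode")
```

Sorting by coordinates works for a single-column page like this one. Multi-column layouts would need a column-aware ordering step before the merge.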
AP: Okay. I’ve learned something today. That is absolutely fascinating.
AW: This is really interesting.
AP: That is fascinating.
AW: Yeah, it’s very interesting. And the result, it was a great result. It significantly improved the patent examination efficiency and the productivity of the patent examiners. At USPTO, they’re taking the structured content. They have automated analytics that they use. They’re generating these claim trees. They report on different claims. There’s term and phrase identification. There’re all types of things they’re doing with the structured content. And it’s really amazing. I mean, they’ve significantly reduced their backlog. I mean, I don’t think they have this multi-year backlog anymore. It’s been a really successful project.
AP: And if you think about it, this is the kind of thing that you can pandemic proof or help reduce the risk of events like a pandemic. Because if you have these digital automated processes, you’re not as reliant on people getting together and being together and doing this kind of work.
AW: Right. Yeah, it’s a pretty cool project. The other one I thought might be interesting is one for NYPL, the New York Public Library. I’m a New Yorker. Everyone knows what NYPL is, New York Public Library. So they obtained from the US Copyright Office this catalog of copyright entries. And it’s basically this huge, vast collection of digital copyright entries dating back to 1891. And it’s really old material. So what’s in there is just the copyright status of millions of works. And so when you think about what the page would look like, I mean, they have about 450,000 pages of this stuff. But each page, they’re very dense pages. It’s three or four column. And they’re just these little catalog entries, columns of catalog entries. So each one could have a hundred entries on one page.
When NYPL came to us, what they wanted to do was create a database so somebody can quickly get in online and determine the copyright status of a specific piece of work. So they’re trying to benefit the publishing and scholarly communities, so they understand what’s within copyright, what’s not within copyright. So we developed a process there also to extract the text again, using this page coordinate data, which we’re seeing a lot in these systems. So the page coordinate data, in these systems that the end users are using, they want to show the page as it was scanned. So they want to show that image piece. And then they want to show the extracted text that’s fielded. So we use the OCR engines that use page coordinate data to be able to facilitate that type of a display for the end user. It’s interesting.
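Editor's sketch: "fielded" text tied to page coordinates can be pictured as a record that holds both the searchable fields and the location of the entry on the scanned image, so a search hit can be shown next to the original page. The field names and sample entries below are invented and do not reflect the actual NYPL database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    title: str
    claimant: str
    year: int
    page_image: str   # file name of the scanned page (hypothetical)
    box: tuple        # (left, top, right, bottom) of the entry on that page


# Invented sample records for illustration only.
ENTRIES = [
    CatalogEntry("Example Song", "A. Author", 1925, "page-0001.tif", (40, 120, 380, 160)),
    CatalogEntry("Example Novel", "B. Writer", 1925, "page-0001.tif", (40, 170, 380, 215)),
    CatalogEntry("Example Play", "A. Author", 1931, "page-0417.tif", (400, 90, 740, 130)),
]


def search(entries, **fields):
    """Return entries whose fields match every requested value."""
    return [
        entry
        for entry in entries
        if all(getattr(entry, name) == value for name, value in fields.items())
    ]


def highlight_target(entry):
    """Tell a viewer which image to load and which rectangle to highlight."""
    return {"image": entry.page_image, "box": entry.box}


hits = search(ENTRIES, claimant="A. Author", year=1931)
target = highlight_target(hits[0])
```

Keeping the coordinates with each record is what lets the display show the machine-readable text and the original image side by side.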
It’s based on funding for NYPL. So we’ve done three different tranches of this work. And as they get new funding, we do more. But really what’s happening is the users are able to search across these hundreds of thousands of records with a very high degree of confidence now. And they can search by specific fields. They can identify records relevant to their search. Like I said, they can use the machine readable text and the image record. I love this one. Actually, NYPL refers to this project as unlocking of American creativity, which I think is great. But that’s really what it is.
AP: Because if something doesn’t have a copyright, that means someone else can take it and use it as a building block, perhaps.
AW: Right. They can use it. I mean, I think that in the end, eventually if it’s a book that is no longer under copyright, maybe they’ll be able to get an ebook on demand. Or there’s just so many different applications for this. But it is unlocking all that creativity, whether it’s some music, records, books, all the different types of things that people can have free access to, if it’s not under copyright anymore.
AP: And again, this is metadata. At the end of the day, copyright or not, that is a piece of very important metadata.
AW: Yep. So we’re back to structured content and metadata as the key to digital transformation, from our perspective.
AP: Yeah. But those two case studies were really fascinating. And to wrap up, do you have any advice for companies who were wanting to maybe do something a little more innovative and consider structured content?
AW: Right. So, I mean, I think, like we said, the structured content is one building block of that digitization strategy. I always have a hard time with that word, but digitzation. There I go again. Anyway, I mean, my advice would be, I think you need to start with an overarching digitization strategy, that needs to be well thought out before you’re going to take on a structured content project. That’s from my perspective. And I think you need to answer some larger questions here before you say, “Oh, I’m just going to create XML.”
What kind of a content management system are you using? Or are you going to use a component content management system and go to DITA? Or what downstream systems are you going to use for your structured content? What are you doing with the content? How’s it going to be utilized? Who’s going to update it? And who’s going to use it? And how will the content be created and structured? New content, how are you going to create that in a structured format? So those are a few of the questions.
But again, my advice, because again, we’re biased, I would suggest working with consultants and partners, not only just because I’m biased. It’s just because I think it’s a great way to get started on drafting that overarching strategy, because part of the advantage is you’re drawing from the experiences across different clients. Both of us have experiences working with many clients and many projects. And we can draw on those experiences. So first to me would be create that overarching strategy.
And then this one, that’s going to be near and dear to your heart more, Alan, would be, once you decide on a structured content project, you’re going to want to develop a content model first. And that’s you. And you want to make sure it’s supporting a good representative set of content. So if you’re in the pharma industry, you want to make sure that you’re covering different drugs and products and different localities, because they’re global, different document types. With journal content, you want to make sure you’re looking at time spans, because just like we talked about for NYPL, something from the 1800s, it’s going to look very different in the 1900s. And then, I think the content model is key, which is where you guys come in.
And then you’re ready for your actual conversion. Once you have that content model, that detailed content model, I think, then you’re ready to go into a structured format and start with a pilot and some samples. And I would suggest significant testing with downstream systems before you begin a conversion of a full set of data, because you don’t want to have to go back if you have a large volume of content and redo anything. But again, once again, I would suggest working with a company that does it. But again, not only will you be able to draw from the years of experience, which I already said, but like I just talked about in a couple of these examples, we can apply some automation to the conversion process, which is going to produce a higher quality and more consistent data set.
AP: Absolutely. And I will say one thing about conversion, why I think it’s really wise to use a vendor. If you are doing one of these big, innovative digital transformation projects, there’s going to be some change management you need to do to get people moved off the old way into the new systems. The absolute worst way you can introduce a content creator, in particular, to a new system and a new way of doing things, is to have them manually convert from the old system to the new system. You will gain so much hate and so many despondent, unhappy people, that right there is another perfect example of why you need to consider hiring professionals to do your conversion work.
AW: Yeah. A lot of our work nowadays, are we still calling it re-platforming?
AW: That’s really what it is. That’s what we’ve been doing. We take a lot from one platform, your content in one platform and move it to another platform. And sometimes we’re doing conversion from one XML to another XML. But we do a lot of re-platforming. And it’s a big, messy job. This is what we do, I mean, Data Conversion Laboratory, that’s all we do, so yeah.
AP: Exactly, exactly. Amy, this has been a really interesting conversation. I cannot thank you enough.
AW: You are so welcome. Thank you for having me.
AP: You are most welcome. Thank you for listening to The Content Strategy Experts podcast, brought to you by Scriptorium. For more information, visit scriptorium.com or check the show notes for relevant links.