DCL Learning Series

PubMed Central Primer and XML101

Marianne Calilhanna

Hello, everyone, and welcome to today's webinar. We will begin in just one more minute. We're going to allow some folks to continue logging on. Welcome, everyone! Welcome, Colin. Hello, Christina. Hello, Jeff. Hello, Susan. Welcome, Virginia. Happy everyone's taking a little bit of time out of their day. So welcome to the DCL Learning Series. Today we are hosting a PubMed Central primer along with an introduction to XML. My name is Marianne Calilhanna and I am the Vice President of Marketing at Data Conversion Laboratory. I'll be your moderator today and just a couple quick things before we begin. We are recording this webinar and it will be available from the on-demand section of the Data Conversion Laboratory website at If you could hit the next slide, please. 

Before we begin, I'd like to make just a really quick introduction to Data Conversion Laboratory, or DCL, as we are also known. Our mission is to structure the world's content. Content can unlock new opportunities for innovation and monetization when it has a foundation of rich structure and metadata. DCL's services and solutions are all about converting, structuring, and enriching content and data. We are one of the leading providers of XML conversion services, DITA conversion, structured product labeling conversion, and S1000D conversion.

Many people are well aware of our excellent content conversion and transformation services, which is what you see on that first blue box there. But we also do a lot of work in the other areas listed on this slide. Things like entity extraction, third-party validation of previously converted content, semantic and metadata enrichment, data harvesting or website scraping, content re-use analysis, and structured content delivery to industry platforms. So, if you have complex content or data and any challenges around those things, we can help. 

I feel fortunate to work with some great people here at DCL, and David Turner is indeed one of them. David helps our customers, or those who are not yet our customers, with content technology initiatives. He's a great person to reach out to if you have any content structure challenge, and he really understands and demonstrates how XML, metadata, and updated workflows translate to digital transformation and, ultimately, revenue. David, help us make sense of PubMed, PubMed Central, and XML.

David Turner

All right, definitely. We'll do that. Thanks so much. I appreciate the introduction, and hello to everybody. I'm coming to you today from my home in beautiful Sachse, Texas. If anybody that's on this webinar has ever been to Sachse, Texas, let me know. Put something there in the, in the chatbox, et cetera, because I'd be interested to know. But, really, yeah, so I'm coming to you from my home office, so who knows if some kind of craziness will ever happen in the background? If there'll be some noise or some, some loud children or something, that'll come by.


So, in terms of just background on this webinar, where the idea came from on this is, it's really from some questions that we've gotten on a repeated basis. As Marianne alluded to earlier, DCL is in this structured content business. And what that means is that, often, we're doing conversions to XML, and, and we're making updates to metadata and things like that. And a big part of our content structure practice is really around structuring content for government repositories, right? So, for example, I think she had mentioned we do conversions to XML for drug listings that are going to be submitted to the FDA, and it's called SPL, or structured product labeling. 

We do conversion of patent documents, the supporting documents that go with the patent application for the US Patent Office, and we also do a lot of work converting content and structuring it so that it can be loaded onto PubMed Central, or PMC. Because our clients have asked a lot of similar questions about this, we thought we'd put this together, and part one is about the PubMed Central process because that's typically where the first half of the questions come around. But then part two is going to be about just XML from an introductory perspective. For some of our clients, when they come in, and they are going to load it onto PubMed Central for the first time, this is really their first exposure to structured content, or their first exposure to XML. And so we've got some, we've got some explaining that we'll do.

Okay, guys, well, let's just kick in, and let's start here with part one, which is the part about PubMed Central. All right, so here is just a quick screenshot of PubMed Central, this is what it looks like, and I'll actually show you some of this live here, in just a minute. Basically, what PubMed Central is, it's an online library or repository that's part of the US National Library of Medicine. So, as you can imagine, the National Library of Medicine has lots and lots of shelves of books and journals and things like that, but they also have electronic content, and they have this repository, PMC, that is maintained by them. 

It's actually operated by the National Center for Biotechnology Information, and I think we actually have a couple of NCBI people, or at least one, on today. So welcome, thank you for coming. Um, this repository contains, really, all the full-text journal articles from a particular area: biomedical and life sciences journals. All right, so, you know the journals, and it's got all this content, and it's available for free, and I think currently the number is, like, 6.3 million articles, and they go back all the way to, to the 1700s. All right, guys, just a quick look here at what it looks like. This is not it. Supposed to be. Here we go.

All right, so this is what PubMed Central actually looks like. If you want to look up a particular journal, you can go in here to the journal list. If you want to look up a particular topic, you can do that as well. I'm gonna jump in here and, you know, just put in a journal that I know. Um... here we go, The Journal of Health Economics and Outcomes Research. I search for that particular journal, it's going to tell me what content is in there from this site. It's going to tell me, you know, what the last volumes are, et cetera. I'm just going to click on this here. You can see I can jump in and I can actually get into this particular volume. I can get into each of the articles, and open up an article here. 


And you can see here in this article, that it's got the title, the author information, the abstract, and then as you go down further, it's got the entire article. It contains the images, it contains the tables in a very flexible format. You can also output, you know, various different outputs. You can do an EPUB output, you can do a PDF output, so you can read it in a variety of different ways. But that's what, that's what PubMed Central looks like. 

Kind of the next question that we often get about PubMed Central is, what's the difference between PMC and PubMed? Or how is PMC different than MEDLINE? If my journal's already indexed in MEDLINE, isn't it already in PMC? Things like that. So, just to walk you through this, I'm gonna kind of walk through the difference here and for those of you who've been in scholarly publishing a long time, you might already know some of these things, but I'm going to try to explain in layman's terms. So when you think of PubMed Central, I mentioned that the National Library of Medicine has all these shelves of content out there that are in print. Well, electronically, they also have shelves of content. So you can think of PubMed Central as this, as a bookshelf of journals about medical research. So it's the actual full-text articles. It's all of the different pieces together. 

Whereas PubMed, you look at PubMed, it's actually a bibliographic database. It's, it's a card catalog, if you will. If you were at the library, you'd look at the card catalog to find out where those full-text articles are. In this case, you use this to have references, or links to the online articles and books, so it looks like... this here. This is what PubMed looks like. I can look up that same article that I had there before. And I can see just, you know, just see items that go along with it. So maybe I just put in something like this here. Maybe I put an author's name or something like that. 

Now, if I can find this article, yep, there's that article right there. And you can see here that it does include some of the same information, the abstract, the authors' names, et cetera. But it does not have the full record. Instead, what it has is linkage to the article on PMC. So again, back to PubMed, it's intended to be kind of that card catalog. So it has the information that points you to the bookshelf, the full text articles that are out there. It doesn't just point to PubMed Central, I mentioned there are, what, 6.3 million articles. Well, PubMed actually has something like 30 million citations. 

So just by doing the math, or you can see, it does actually also point to other repositories, actually also points to other publisher websites. So there can be lots of content out there where the full text exists someplace else that PubMed will show you the citations for, whereas PubMed Central is just that specific group that the National Library of Medicine is hosting. In terms of MEDLINE, when you think of MEDLINE, MEDLINE is a, is a subset of PubMed, so it's also a bibliographic database. It's the largest subset of PubMed. It represents something like 26 million of those, of those citations of 30, 30 million citations total. Twenty-six million of those are indexed in the MEDLINE database. And typically, one of the things that makes them very useful and special is that they're indexed using MeSH subject headings, but that's another topic for another day.
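The math mentioned here can be written out as a quick sketch. It uses only the rounded figures quoted in the talk (about 30 million PubMed citations, 26 million of them in MEDLINE, and 6.3 million PMC articles). The variable names are invented for illustration, and it assumes for simplicity that every PMC article has a PubMed citation, which the talk only says is generally the case.

```python
# Rounded counts as quoted in the talk (not current official figures).
pubmed_citations = 30_000_000
medline_citations = 26_000_000
pmc_full_text_articles = 6_300_000

# Even if every PMC article had a PubMed citation (an assumption),
# this many citations would have to point to full text hosted elsewhere.
citations_outside_pmc = pubmed_citations - pmc_full_text_articles

# Share of PubMed citations that are indexed in MEDLINE.
medline_share = medline_citations / pubmed_citations

print(f"At least {citations_outside_pmc:,} citations point outside PMC")
print(f"MEDLINE covers about {medline_share:.0%} of PubMed citations")
```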


Moving along here to this next slide, in terms of submission, if a publisher was being asked to submit content to PubMed or MEDLINE versus PubMed Central, it's, it's a little different. The rigor is a lot different; the amount of content is a lot different. So, when you're submitting to PubMed, typically you're submitting XML header-type information only. You're sending over the basic citation data, and you're sending over basic abstract data. I think they can turn these around in something like 24 hours. It's 24 to 48 hours, something like that; it's a really simple process. Now, when they go to the full-text library, when they go to PMC, there's a big difference. Publishers here are going to certainly submit the citation data and certainly going to submit the abstract data. But then they're also going to submit the full text for all the articles, all the tables, all the images, all the references, all the supporting documentation, things like that. That's what's involved. So, from a submission perspective, it's, it's pretty significantly different.

All right, so I'm sure you guys have all got this, you're ready. So, I'm gonna give you a quick pop quiz here, so just, I'm not going to ask you to do this like, as a, you're online or anything like that. Just, you can grade yourself, you can grade yourself, so get ready there. Okay, question number one: can an article be indexed in MEDLINE and not exist in PubMed Central, PMC? Be thinking, be thinking... hope you wrote down your answer. The answer is yes, certainly. So, MEDLINE, again, has 26 million citations. PubMed Central has 6.3 million articles, so you can certainly have an article that's sitting on the publisher side, or in another, in a database that does not exist in PubMed Central.

Question number two. Are articles in PubMed Central generally also indexed in PubMed? Your answer there? No cheating. Yes. Yes, absolutely. Articles in PubMed Central are generally also indexed in PubMed. They tend to work hand in hand. 

And question number three. Do PubMed and PMC have the same rigorous requirements for full-text XML? The answer is no. We just talked about number three, so for PubMed, it's typically a relatively quick process, XML header-type info. Whereas PMC requires a lot more information. A lot longer process. All right. So if you graded yourself, if you got 100, let me know in the chat. If you didn't make 100, also let me know in the chat, and just let me know how you did on this quiz. All right, let's move along here and let's, let's talk a little bit more about: why do publishers put content on PubMed Central? What are the benefits to them?

Basically they do it for four main reasons. First of all, there's the idea of increased discoverability and access. When you're researching, you obviously want your content to be accessed by the largest audience possible, and PubMed Central is certainly the preeminent repository of this kind of information. Another great benefit is that it allows you to archive your articles, really in perpetuity. Journal publishers might come, they might go, but over time, ideally, the library, the US National Library of Medicine, will be there, and so this allows you to archive them separately from, you know, what the publisher's website might have or what some other database might have, in some way that's there for the common good.

Another great one is that you get a lot of increased exposure because you're integrating with the other databases that are part of the National Library of Medicine and NCBI. Um, sometimes it's really just important because maybe your funder is requiring that content to be in PubMed Central. All right, um, from there let's move on and let's talk about the process itself. 


So how do we get a journal into PubMed Central? Well, first of all, I'll tell you, they do actually have this whole process lined up over here. And there's a website here, and we're gonna give you this in, there's, a handout that includes all the links to this kind of thing, but there is a process that's outlined here on their website that walks through a lot of this detail, but I'll hit it from kind of a high level here today. All right? So first of all, you've got the process of just submitting the application. Because you're submitting content here, you have to give them some basic information. 

An important thing to understand here is that when you're submitting content to PubMed Central, it's not books or a journal article; what you're typically getting approved is an actual full journal, right? So what they're going to want to know is, who publishes this journal? Who's the management of this journal? What's the journal title? What's the ISSN? When was it published? How often is it published? You know, what's the website? And then they want to know, you know, links around what your various policies are for editorial or peer review and things like that. So you submit all that information, kind of as the first step, just to make sure, well, to prove that you're a reputable journal, and that your content deserves to go on this site. Once you pass that step, they move into an application screening process.

So, the people at PubMed Central start looking at the journal, and they start asking a lot of different questions. So they start asking, you know, is this peer-reviewed content? Is it the right kind of content? Is this biomedical content, does it actually belong here on this site or does it belong on some other database, um, what are these author affiliations that we've got? Are they, are they appropriate for this kind of content? Are the article types relevant? So it's kind of a high-level application screening there to give the idea of, from a macro perspective, does this content fit?

From there, they move into a more detailed view, which is the scientific quality review. This, this is really intended to be a review of the content itself. It's a content review, not a format review. And if you're already indexed in MEDLINE, you generally don't have to be, won't go through too much more in terms of the scientific quality review. If you're not in MEDLINE already, they're going to start asking questions like What's the scientific rigor here? Did the authors apply the scientific method, did they provide full transparency in providing their supporting data, things like that. They're going to want to know, you know, is this, is there good editorial quality here? Whatever your articles claim, are those claims clear and logical? 

The figures and the tables that they've added, are those well constructed, do they really contribute to that? So they do look at both scientific rigor and editorial quality, and then ultimately, they come back and tell the publishers, you know, if you made it or not. If you don't make it, you are eligible to re-apply in 24 months. And that's pretty consistent across the way here. I will say also there's full criteria online for this scientific quality review. They walk through all of the different questions that they ask. Lots of, lots of sample questions. I wouldn't worry too much about scratching down that website, because we've included it in this handout, but you can certainly, you're encouraged to go and take a look.

All right, so we've submitted an application. They did kind of a high-level screening and they dug in and looked at the content, kind of at this point, at this point, they move in and they want to look at the format. They want to look at the format of the content. You know, does this thing meet our technical requirements for this content? Ultimately, PubMed Central wants to have their content be consistent and follow a similar format. They want to be able to have nice linking, and so to do that, it has to be in the right format, and that's typically where a vendor gets involved.


The publisher is asked to submit an initial package of 25 articles, and those articles should include XML, should include PDF, should include images, should include supplemental data, all in a big package. At this point, they review all of that content. Now, if after three rounds of evaluations, they can't address all the reported errors, the application gets rejected. Um, and so, you've got to make sure that you have clean XML. And we've had publishers come to us a couple of different times, where they have two strikes and they need somebody to help them on this third strike. You're not required to use a vendor. You can certainly do this yourself, but you just need to know that after three evaluations, they will reject you, and my understanding is that the same 24-month period applies, so you'll have to re-apply in another two years.
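A publisher preparing that initial package might sanity-check it before sending it off. The sketch below is only an illustration: the file-naming pattern, the `find_incomplete_articles` helper, and the rule that each article needs exactly one `.xml` and one `.pdf` sharing a base name are assumptions for this example, not PMC's actual packaging requirements.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed rule for this sketch: every article needs an XML file and a PDF
# that share the same base name. Images and supplements are ignored here.
REQUIRED_EXTENSIONS = {".xml", ".pdf"}


def find_incomplete_articles(filenames):
    """Return {article base name: [missing extensions]} for incomplete articles."""
    extensions_by_article = defaultdict(set)
    for name in filenames:
        path = PurePosixPath(name)
        suffix = path.suffix.lower()
        if suffix in REQUIRED_EXTENSIONS:
            extensions_by_article[path.stem].add(suffix)
    return {
        article: sorted(REQUIRED_EXTENSIONS - found)
        for article, found in extensions_by_article.items()
        if not REQUIRED_EXTENSIONS <= found
    }


# Invented file names, for illustration only.
sample_package = [
    "article-001.xml",
    "article-001.pdf",
    "article-001-fig1.tif",
    "article-002.xml",
]
print(find_incomplete_articles(sample_package))
```

A check like this only catches missing files. It says nothing about whether the XML itself is clean, which is what the technical evaluation is really about.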

So after you pass that technical evaluation, they move you into a pre-production process. This pre-production process says All right, your content looks good, format looks pretty good here, let's, let's execute an agreement. Let's, let's get a larger set of article files. So you go in and bring in the rest of the articles that you're going to load, based on, you know, what you told them in terms of publishing frequency, et cetera, in the, in the first step. Then they, they, obviously, they review those as well, and they'll see the correction report, if any data errors are found, and as before, if they continue to find repeated errors, they will, they will reject your application. You will have to wait again to reapply.

All right, so, we've been through all of these processes. At that point you kind of move into a live release phase, and the live release phase, what I'm gonna show here is a workflow. It's typical of what we do as a vendor. This could be different if you do it yourself. Or even if you use another vendor. It just kind of depends on their process. But, generally, the process that we use is, the publisher will send us a PDF and a Word version of their articles. We take those, sometimes they'll send us their images separately; other times, they'll just ask us to extract the images from what they've got on these other documents.

So we take those, we go through a process of converting that content, and we create an XML file, and we submit the XML, the PDF, and the images to PubMed Central for you. We just do that on a regular, ongoing basis, so as your journal comes out, you send us the content, we make the conversion, we do the submission, and then we'll also typically send a copy of the XML back to our client so they can keep it for their records.

That's overall the entire process for getting a journal loaded. If you have questions on that, do put them there in the, in the questions pane, and we'll, we'll hit them at the end. I'm happy to do that, or I can obviously talk to you afterwards as well. 

All right, so still here in this PubMed Central section, let's, let's look at some frequently asked questions. I will say that PubMed does have a really great frequently asked questions page, here's a screenshot of it. Um, and I'm also going to show you the link here. Again, don't worry about copying it down, we'll put it in that set of links that's on the, on the handouts there, but they do, they do a great job of going through a lot of common questions that are out there. And if you don't really see an answer to your question, first of all you can contact me anytime you like. But you can also contact PubMed Central directly at this email address. They're a very pleasant group to work with, I think they get back to publishers quickly, and so, so feel free to reach out as well. And that comes from their, their website.


All right. But I'm going to hit a couple of the frequently asked questions that WE get. Every now and then a publisher will ask us Why, why do we have to go through all this process of getting together the complete articles, rather than just linking to our journal site, we've already got it up on our site. Why can't they just link to it? And the answer really goes back to what we talked about earlier. You know, the link is going to be on PubMed, the "card catalog" that's already established to do links to full text on the publisher sites. The idea here is that you're, you're having this content, this full-text content saved in a different repository, in the National Library of Medicine Repository, that's accessible for all time in the future. So, they require full-text articles and supplemental information.

Another common question, Why does, why do they require this article in XML? I've already got a PDF. You know, maybe I've put it into HTML on my website, why do we give you this XML? Well, you know, there's a number of reasons why and they do list these on their website. First of all, XML is hands down the most effective archival format. It's hardware and software independent. And so, you know, it just really works well to interact with all systems and devices. The example that I think of here, when I was in college, and I need to date myself on this, we used WordPerfect to create all of our documents. So I had all this great stuff that I, by then, you know, my resume ready for job searching, all sorts of different documents I saved in WordPerfect on a three-and-a-half-inch floppy disk, and 10 years later I couldn't access those documents anymore. I didn't own WordPerfect anymore. I'd moved on to Word. I know WordPerfect is still out there and I can dig those things out somehow if I want, but the idea here is that if you put your content into specific formats that require a specific kind of reader, you limit the ability to, to interact with other systems. You limit the future-proof aspect of it. So XML is really good at that. Another reason, XML is fantastic for readily transforming the content into, you know, whatever the best format is for a particular reading device.

I mentioned on the, on your screen, they have a link for, to go to PDF. So if you're on your laptop, you want a PDF, if you want to print something out to read it, you can print out the PDF. But if you've got an eReader, you know, you can get the EPUB version, or if you want some other accessibility features, things like that, it allows, and there may be readers that we don't even know about yet. XML is going to set the stage for those.
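To make the "one source, many outputs" idea concrete, here is a minimal sketch using Python's standard library. The tiny `<article>` fragment and the two rendering functions are invented for illustration; real PMC content uses full JATS XML, and real HTML, PDF, or EPUB outputs come from much more elaborate transforms.

```python
import xml.etree.ElementTree as ET
from html import escape

# Invented, heavily simplified article fragment (not real JATS).
ARTICLE_XML = """
<article>
  <title>Sample Article Title</title>
  <p>First paragraph of the article.</p>
  <p>Second paragraph of the article.</p>
</article>
"""


def to_html(article):
    """Render the article as a small HTML snippet."""
    parts = [f"<h1>{escape(article.findtext('title'))}</h1>"]
    parts += [f"<p>{escape(p.text)}</p>" for p in article.findall("p")]
    return "\n".join(parts)


def to_plain_text(article):
    """Render the same source as plain text, with the title in capitals."""
    lines = [article.findtext("title").upper(), ""]
    lines += [p.text for p in article.findall("p")]
    return "\n".join(lines)


article = ET.fromstring(ARTICLE_XML)
print(to_html(article))
print()
print(to_plain_text(article))
```

Both outputs come from the same tagged source, so supporting a new reading format means writing one more rendering function rather than re-keying the content.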

Other reasons, it's a better search experience than PDF or HTML because there's this tagging around the different sorts of elements. You tell the computer, hey, this is the author's name and it recognizes that's the author's name and it's not just the name David Turner. And so that allows you to really search in a much more granular, practical way. It also enables effective linking of content. Because we have XML set up, you know, we can go into this particular article, and we can come in here, and we can, we can know what the other articles - Oh, I'm back on PubMed, not PubMed Central, aren't I. Anyway, but you can, you can link, and you can see what the other articles that this person has written are, because of that XML that's around it. Then, kind of last here, it does provide for those accessibility features, and we'll talk about that more in a minute. 
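The difference between searching plain text and searching tagged content can be sketched in a few lines. The two records and the element names below are made up for this example (they are loosely modeled on tags such as `surname`, not taken from any real article), and the helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented records: one is written by a Turner, the other only mentions the name.
RECORDS_XML = """
<articles>
  <article id="a1">
    <author><surname>Turner</surname><given-names>David</given-names></author>
    <title>An Example Study</title>
  </article>
  <article id="a2">
    <author><surname>Smith</surname><given-names>Jane</given-names></author>
    <title>A Reply to Turner</title>
  </article>
</articles>
"""

root = ET.fromstring(RECORDS_XML)


def plain_text_hits(word):
    """Find articles where the word appears anywhere in the text."""
    return [a.get("id") for a in root.findall("article")
            if word in "".join(a.itertext())]


def author_hits(surname):
    """Find articles where the word is specifically tagged as an author surname."""
    return [a.get("id") for a in root.findall("article")
            if any(s.text == surname for s in a.findall("author/surname"))]


print(plain_text_hits("Turner"))  # matches both records
print(author_hits("Turner"))      # matches only the record authored by Turner
```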


But accessibility is so crucial, and you've gotta have good structure to your content, to be able to do that. That leads me to my next question, and this one is a lot more common than you might expect, and that's What is XML? We've started this journal, and we've been publishing now for two years, but we've never heard of this XML thing, we kind of heard of it, but we don't really know what it is. It's something computer or whatever. And so that's what kind of leads into this next, this next segment, part two, the intro to XML. Now, when I talk about structuring content, typically I am talking about XML. And the idea behind XML is really this. 

So, um, when you look at, say, a journal article's references, like on the screen here, as a human being, you can pretty easily look at this journal and recognize the different elements, right? You can look at this and know that that's an author name. You can look at that and know that that's the name of a journal, or you can look at this and know that those are our page numbers. But to a computer, even if they can search and they can recognize that, you know, that "Bahner" is a word, it doesn't know that that's an author. It's, it's just, it's meaningless text. So, yes, if you're searching just the right thing, you can search for it, but it doesn't give the article, it doesn't give the computer any real, semantic meaning, if you will. 

The idea behind XML is to provide a standard of structuring content so that the computer can recognize elements like these too. When we convert a journal article into XML, the computer now "knows" this is an author name. It now knows this is a journal name. It now knows these are the different page numbers that are out there. It does, it does this through a series of hidden tags. All right? So there are standard XML tags that are running behind the scenes that you'll see with an opening tag and a closing tag. The closing tag, you see, has a different arrow that has a slash in it, or backslash; I can never remember which one you call it, but, in any case, so the idea here is that you see those tags, like you see "surname" here, that lets, that lets the computer know, Hey, the text that's in between these two sets of brackets? That's the author's surname.

The next one, that's the author's given name. Or if you go down a little further, this is the source of that reference. These are the page numbers that are relevant. So, we create these tags, kind of behind the scenes, so that the computer knows what all these different pieces are. And just to clarify, for the most part, your readers aren't going to see these tags. Typically, your humans are still going to see, you know, the HTML and PDF printout or something like that. What we do is, we set up where the computer can see these tags, so that we can enable a lot of the functionality we talked about before.
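As a rough illustration of what those tags look like and how a program reads them, the sketch below parses one tagged reference. The element names follow JATS conventions for a citation (`surname`, `given-names`, `source`, `fpage`, `lpage`), but the reference itself is invented and is not the one shown on the slide.

```python
import xml.etree.ElementTree as ET

# An invented reference, tagged in the style of a JATS citation.
# Each closing tag repeats the element name with a forward slash: </surname>.
REFERENCE_XML = """
<ref id="ref1">
  <element-citation publication-type="journal">
    <person-group person-group-type="author">
      <name><surname>Doe</surname><given-names>Jane</given-names></name>
    </person-group>
    <article-title>An invented article title</article-title>
    <source>Journal of Examples</source>
    <year>2020</year>
    <volume>12</volume>
    <fpage>101</fpage>
    <lpage>110</lpage>
  </element-citation>
</ref>
"""

citation = ET.fromstring(REFERENCE_XML).find("element-citation")

# Because each piece is tagged, the program can ask for it by name.
surname = citation.findtext("person-group/name/surname")
given_names = citation.findtext("person-group/name/given-names")
journal = citation.findtext("source")
pages = (citation.findtext("fpage"), citation.findtext("lpage"))

print(f"Author: {given_names} {surname}")
print(f"Journal: {journal}")
print(f"Pages: {pages[0]} to {pages[1]}")
```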

Which then leads to the next question: how do we know what tags to use? Do we just, we just make up tags, you know, how do we know what tags to use? Well, the answer to that comes in the idea of XML specification, or specs. And different communities have different specs. The idea here is the XML specs will allow you to create the elements in a standardized way with elements that are common to a particular publication and publishing community. There is a great presentation on this by my mentor, Bill Kasdorf, about XML specs. It's a few years old, but it's still, it's still a fantastic presentation. The slides, I think I put a link to them on there. If you have trouble finding it, let me know, and I can, I can always hit up Bill and get those as well. And he may have more updated ones than this, but this is kind of a transformational one for me a few years back. But in any case, let's just use the example of, you know, the journal article. 


Certain things are going to always be important when you're tagging a journal article. As I mentioned before, you care who are the authors, right? You might have a volume number. You might have certain citations, you might have page numbers. So certain things are always important to a journal article. So for this, there's a community out there that developed a suite of tags for journal articles, that's called the Journal Article Tag Suite. We use JATS, the specification when we're loading content, it has to do with the journals, the scientific journals like this. I have a little link here. You can go here to the JATS website, and you can look up and see what all these different elements are. All these are the standard tags that you're supposed to use when you create your content. Again, don't worry about this. These tags are in the, in the handouts, as well. 

All right. But the idea behind specs, I want to show a different example. I mentioned a particular kind of publishing community. So, if you're trying to create a journal article, JATS is going to make a lot of sense. But let's say you're trying to create a reusable section of technical documentation. I think we have some, some non-scholarly publishers that are on here. Maybe you're trying to create educational content. Maybe you're trying to create a clinical trials protocol in the pharma industry, and you want to be able to reuse this kind of content. Well, you're probably not going to care about the volume number of something. That's, that's an element that doesn't really matter. 

Honestly, because of the fluid nature of modular topics, you're really probably not gonna care very much about page numbers either. So, you know, those elements might not be all that important. What might be important, though, is being able to distinguish between a topic title and a section title, you know, do I want to re-use the whole topic? Do I want to reuse just part of that topic? So there are elements in this community that allow you to have, you know, a distinction between the topic title and the section title. Or maybe you want to do conditional text. One of the things that you see a lot with you know, technical documentation is the idea of being able to reuse the content for maybe a different audience. So you might create technical documentation that is 80% the same, but 10% is different because some of your audience is novice and some of your audience is experts. In DITA, they created an element for that. 

They created attributes, you know, for audience, so that's something, for this particular kind of an element, you might want to use the DITA specification. JATS for one, DITA for another. And I just put together a quick little slide here of some common XML specifications. It's by no means an exhaustive list. Depending on the industry you're in, there might be different communities that use this, but JATS is often used for scholarly journals, it also has a sister spec that's called the Book Interchange Tag Suite, or BITS, that's very similar. And then the next one here, kind of a general purpose, you see this a lot in pharma, you see this a lot in education, you see this a lot in technical documentation. That's the DITA one I just talked about. 
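Here is a small sketch of the conditional-text idea using the DITA `audience` attribute. The topic content is invented, and the filtering function is a simplification written for this example: in practice, filtering is done by a DITA processor driven by a filter file, and an `audience` attribute can hold more than one value.

```python
import xml.etree.ElementTree as ET

# Invented topic: most content is shared, some is marked for one audience.
TOPIC_XML = """
<topic id="install">
  <title>Installing the widget</title>
  <body>
    <p>Unpack the widget.</p>
    <p audience="novice">Ask a colleague to check your work.</p>
    <p audience="expert">Skip the checks if you have done this before.</p>
  </body>
</topic>
"""


def paragraphs_for(topic, audience):
    """Keep paragraphs with no audience attribute, or one that matches exactly."""
    return [p.text for p in topic.iter("p")
            if p.get("audience") in (None, audience)]


topic = ET.fromstring(TOPIC_XML)
print(paragraphs_for(topic, "novice"))
print(paragraphs_for(topic, "expert"))
```

The shared paragraph is written once and appears in both versions, which is the reuse benefit described above.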

If you're in, like, the humanities, you're transcribing, you know, The Thomas Jefferson Papers or something like that, TEI is, it's a really common XML spec for that. If you're in health care, there's this HL7 spec that you may have heard about: billing records, patient tracking, things like that. If you publish standards, there's a whole standard that's for standards organizations. There actually was a standard that was called ISO STS, and that was replaced by a new one called NISO STS a few years ago. 


In any case, so the idea here is that in your different communities, there are these different sort of standard suites of tags, and you can, you can find websites on these and you can find all this information about it, and I'll be glad to help with any of that.
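To make the DITA audience idea above a little more concrete, here is a minimal sketch in Python. It assumes a tiny, made-up topic in which some paragraphs carry the `audience` attribute, and it keeps only the untagged content plus the content for one audience. The sample topic and the `filter_for_audience` helper are illustrative assumptions; real DITA publishing would normally do this filtering with dedicated tooling rather than a hand-written script.

```python
import xml.etree.ElementTree as ET

# Hypothetical DITA-style topic: one shared paragraph, one per audience.
TOPIC = """\
<topic id="install">
  <title>Installing the widget</title>
  <body>
    <p>Unpack the widget and connect the power cable.</p>
    <p audience="novice">The power socket is on the back panel.</p>
    <p audience="expert">Override the default voltage in the service menu.</p>
  </body>
</topic>
"""


def filter_for_audience(topic_xml: str, audience: str) -> ET.Element:
    """Keep untagged elements and elements whose audience includes `audience`."""
    root = ET.fromstring(topic_xml)
    # Snapshot the elements first so removing children does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            tagged = child.get("audience")
            if tagged is not None and audience not in tagged.split():
                parent.remove(child)
    return root


novice_topic = filter_for_audience(TOPIC, "novice")
for paragraph in novice_topic.iter("p"):
    print(paragraph.text)
```

The same source topic can be filtered for "expert" instead, which is the point of tagging the audience once rather than maintaining two documents.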

All right, so we talked about what XML is. We talked about this idea of a spec. How do we create this XML? What's the idea? Well, you've really got two options. First of all, you can create XML yourself and you can submit to PubMed Central yourself, as long as you have the right tools, which include, like, some sort of an XML editor or XML transform type of a tool. Um, you can really, you can do it for free if you use Notepad, Notepad++. You gotta have some technical knowledge to be able to code like that; it's not exactly the prettiest thing. You can spend a little money, and I'm not endorsing any particular editor, but I just happen to know that Oxygen's got this, this setup online where, you know, you can practice here. So here's, here's an example of an XML editor so you can make changes in this content however you like, you know, ABC 1, 2, 3. You can add comments and attributes and all that kind of thing over here. You can see the XML itself. You can see here, I just added ABC 1 2 3 without ever actually going into that, into that code. 

Certainly that, and there are several great authoring tools out there. Um, but you've got to have something like that to be able to create XML yourself. Another thing that, that really, you probably should have for this is some sort of an XML-enabled CMS or an XML-aware CMS; there are also a lot of great tools like that out there. I've also seen that there are some publishers that are trying the quote/unquote "free" option, which is using a Git repository. I will say, "free" is kind of a funny word because, while you might be able to get started for free, typically, there's a lot of setup that's involved and you're gonna have to pay somebody to do that, whether it's somebody on your own team or not. 

And then lastly, there are some technologies that are out there that will automate the creation of XML. Some will even automate the submission to PubMed Central. I would caution you a little bit about those. Our experience has been that, yes, there are some great tools that will automate a lot of this, but if you need it to be exact, automation may not be the only choice. You may need to have another step in there. Because, remember, three strikes, and you're out on this. And we've seen this more than once, where a publisher has come to us and said We bought this technology, it's supposed to be doing this. But, you know, during our scientific review, we kept getting errors and we can't figure them out. So, anyway, you just need to be careful when you're going into that process and you can do it, and you can do it yourself. 
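For the do-it-yourself route described above, here is a rough sketch of what "creating XML yourself" can look like in Python's standard library. The element names (`article`, `front`, `article-meta`, `title-group`, `article-title`, `volume`, `abstract`, `body`, `p`) are JATS element names, but the `build_article` helper and the sample values are assumptions made for illustration. The output is deliberately incomplete (no journal metadata, authors, or dates) and should not be expected to pass validation or the PMC style checker as is.

```python
import xml.etree.ElementTree as ET


def build_article(title: str, abstract: str, volume: str, paragraphs: list) -> ET.Element:
    """Build a small, incomplete JATS-like article tree (illustration only)."""
    article = ET.Element("article", {"article-type": "research-article"})
    front = ET.SubElement(article, "front")
    meta = ET.SubElement(front, "article-meta")

    title_group = ET.SubElement(meta, "title-group")
    ET.SubElement(title_group, "article-title").text = title
    ET.SubElement(meta, "volume").text = volume

    abstract_el = ET.SubElement(meta, "abstract")
    ET.SubElement(abstract_el, "p").text = abstract

    body = ET.SubElement(article, "body")
    for text in paragraphs:
        ET.SubElement(body, "p").text = text
    return article


sample = build_article(
    title="A sample article title",
    abstract="A one-sentence sample abstract.",
    volume="12",
    paragraphs=["First paragraph.", "Second paragraph."],
)
ET.indent(sample)  # pretty-print; available in Python 3.9 and later
print(ET.tostring(sample, encoding="unicode"))
```

Writing the tags is the easy part; the gap between a tree like this and a file that passes every check is where the errors mentioned above tend to come from.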

Obviously, you can also use a data conversion specialist, some sort of a vendor to do this, somebody like DCL. (It doesn't have to be DCL, but we HOPE it would be DCL.) So, you can use that. But again, you need to remember this. You have a lot of options when it comes to vendors, and quality does matter. Quality matters for getting the content into PubMed Central. Quality matters for how that content's going to be used, after it's in PubMed Central. And to demonstrate this, to talk about how quality matters, I do have a fun little lesson from the classic '80s movie Moonstruck. Let me know in the chat if you, if you love this movie, or you hate this movie, or if you've never heard of this movie, that's fine. But with that, Marianne, get ready to turn this on. I do need to tell everybody, if you're listening to this using your phone for the audio, you may not hear the video. You're going to have to turn up the volume on your computer, or just read the captioning, something like that. So, with that, Marianne, why don't you kick it off. 


[Plumber scene from Moonstruck]

"Well, Mr. Castorini, what do you think?"

"Ten thousand, eight hundred dollars."

"That seems like a lot!"

"Look. There are three kinds of pipe. There's the kind of pipe you have, which is garbage. And you can see where that's gotten you. Then there's bronze, which is pretty good, unless something goes wrong. And something always goes wrong. And then, there's copper, which is the only pipe I use. It costs money. It costs money because it saves money."

"I think we should follow Mr. Castorini's advice, Hart."

David Turner

There we go.

Marianne Calilhanna

Okay, back to you, David.

David Turner

All right, so I hope you enjoyed that. You know, obviously, out here, we don't sell any copper pipe, but we do, we do want to impress upon you that quality does matter. And, the idea here is that the message you can take away from this is that I think we should follow Mr. Castorini's advice. Quality XML costs money, because it saves money. And we've got countless examples of this where publishers have taken a less expensive route to try to get the XML, and they've come to us later and said, we need to fix this, and there's a lot bigger pain with the backing up.

All right, I'm just going to finish up a few things here. We might hit on kind of the why of XML, some of the great benefits. We did hit on this a little bit during the PubMed Central part of the program, but I thought I would hit on it a little bit more right here. Some benefits here. You've got this kind of XML structure behind the scenes. No matter what industry you're in, um, you're really going to see a lot of benefits. First of all, the idea of interoperability, simplified data sharing. So the XML makes the text readable by a computer, which makes it actually easy to share with other computers. And because it's software and hardware independent, it could easily adapt between systems. So that's one really great benefit, why XML makes sense as opposed to having a lot of PDFs laying around.

Second of all, the whole, I guess, search. XML tagging makes it easier to automatically parse the content of an article, right? To parse the different pieces, which really helps when you try to get into more detailed, more focused, and more filtered type of searching. If that content is in Word or PDF, you just don't have that same kind of ability to find things fast. Another great benefit: the idea of multi-channel publishing from, from a single, single source here. So with XML, instead of creating your content in a lot of different ways, you can create that in XML and then you can automatically, from there, get your outputs through various transforms as needed. If you need a PDF, if you need a print-on-demand file, if you need HTML for the website, if you need an EPUB, if you need whatever. You've got it, because it's in, it's in that XML, and you don't have to go and recreate it each time. A simple transform, and you've got the new format.
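The search and single-source points above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the sample article, the `extract_fields`, `to_html`, and `to_text` helpers, and the choice of Python are all hypothetical, and production transforms are more commonly written in XSLT or handled by a publishing platform. The idea is simply that the same tagged source yields both targeted fields for search and more than one output format.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical JATS-like source; one file feeds every output below.
ARTICLE = """\
<article>
  <front>
    <article-meta>
      <title-group><article-title>Fluids &amp; electrolytes in practice</article-title></title-group>
      <volume>12</volume>
      <abstract><p>Structured content is easier to reuse.</p></abstract>
    </article-meta>
  </front>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</article>
"""


def extract_fields(root: ET.Element) -> dict:
    """Pull out the pieces a filtered search would index."""
    meta = "front/article-meta"
    return {
        "title": root.findtext(meta + "/title-group/article-title"),
        "volume": root.findtext(meta + "/volume"),
        "abstract": " ".join(p.text or "" for p in root.iterfind(meta + "/abstract/p")),
    }


def to_html(root: ET.Element) -> str:
    """Render a bare-bones HTML fragment from the same source."""
    fields = extract_fields(root)
    parts = ["<h1>" + html.escape(fields["title"]) + "</h1>"]
    parts += ["<p>" + html.escape(p.text or "") + "</p>" for p in root.iterfind("body/p")]
    return "\n".join(parts)


def to_text(root: ET.Element) -> str:
    """Render a plain-text version from the same source."""
    fields = extract_fields(root)
    lines = [fields["title"], ""] + [p.text or "" for p in root.iterfind("body/p")]
    return "\n".join(lines)


root = ET.fromstring(ARTICLE)
print(extract_fields(root))
print(to_html(root))
print(to_text(root))
```

Nothing in the source had to be retyped to get the second format, which is the benefit being described.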

Accessibility. I talked about accessibility a little bit. The structure behind XML, its ability to adapt to various environments, really makes it a key component when it comes to accessibility. It also lets the computer know what's important to read. If you've ever used a screen reader on content that isn't optimized for it, it just reads everything, and it doesn't really know where to stop or start, et cetera. But when you put in this structured tagging, you can give a computer instructions about what's important to read, how it should be read, where it should be read, and where it should go next.


It also does a lot in terms of things like voice assist applications. It lets you know, it can put those things together that can be used in voice-on-demand. We hit on this a little earlier as well, this idea of future-proofing. It's not tied to a particular software or hardware. The thing is that XML is going to still be able to be deciphered and used several generations from now, regardless of changes in technology. Let's see, after future-proofing, we've got enforcing consistency. XML really does help to enforce consistency. After all, you don't want to have a journal that doesn't have an abstract, for example. Right? 

Or if you're a pharmaceutical company, you don't want the objectives and endpoints created for this study to be wildly different from the objectives and endpoints created for other studies. If you're trying to comply with different regulatory requirements that are out there, you need to have that kind of consistency, and XML is perfect for creating and enforcing that kind of consistency. And last I'm going to mention one that's not really so much for the PubMed Central part, but when you create content in XML, you do have the ability to, to reuse and chunk that content. Right, so typically in scholarly publishing, scholarly journals, we don't do a lot of reuse; that's called plagiarism. But, you know, in some of the other industries that are out there, as you move to XML, you can do this kind of content reuse and chunking.
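As a small illustration of the consistency point above (an article that is missing its abstract), here is a hedged Python sketch. In practice this kind of rule is usually enforced by validating against a DTD or schema, or by a house style checker; the `REQUIRED` table and `missing_parts` helper below are a hand-rolled stand-in invented for this example, not how any particular tool works.

```python
import xml.etree.ElementTree as ET

# Hypothetical house rules: label -> path that must exist and contain text.
REQUIRED = {
    "article title": "front/article-meta/title-group/article-title",
    "abstract": "front/article-meta/abstract",
}


def missing_parts(article_xml: str) -> list:
    """Return the labels of required parts that are absent or empty."""
    root = ET.fromstring(article_xml)
    missing = []
    for label, path in REQUIRED.items():
        node = root.find(path)
        if node is None or not "".join(node.itertext()).strip():
            missing.append(label)
    return missing


COMPLETE = (
    "<article><front><article-meta>"
    "<title-group><article-title>A title</article-title></title-group>"
    "<abstract><p>A short abstract.</p></abstract>"
    "</article-meta></front></article>"
)
NO_ABSTRACT = (
    "<article><front><article-meta>"
    "<title-group><article-title>A title</article-title></title-group>"
    "</article-meta></front></article>"
)

print(missing_parts(COMPLETE))
print(missing_parts(NO_ABSTRACT))
```

Because the structure is explicit, a check like this can run on every article before it goes anywhere, which is much harder to do reliably against Word or PDF files.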

All right, so that's the great benefits of XML, and really concludes part two of this presentation. So, Marianne, if it's okay, I'm gonna give a quick summary here in the next couple of minutes, and then we'll open it up to any questions that you have.

Marianne Calilhanna

Sounds good.

David Turner

All right, so, in summary, structured content, XML, is really useful in a lot of different industries. There's STM publishing, like we've talked about for PubMed Central. Educational publishing really makes a lot of sense, especially if you've got a lot of content to reuse, that can be really useful. Technical documentation is the same way. If you're in pharma, the move toward structured documentation has been really really picking up speed over the last couple of years. And that's something that, we're helping some pharma companies out with. Healthcare. XML makes a lot of sense for healthcare and healthcare records. Financial services. We're seeing more and more, you know, standards in and around how financial information is reported, how it's shared, et cetera. So one thing I want you to take away from this is that XML is useful in all these different ways. Another takeaway is that these are a lot of benefits, and we just covered all of these, so I'm not going to go through them individually, but you know, there are a lot of, a lot of positives. 

Another key point I want you to get is that, you know, using XML, writing the structured content, it doesn't have to be difficult. If you don't wash your hands of it and think, Oh, that's impossible, it's something you could do. But it would really help to have a partner, this kind of thing. Again, it doesn't have to be DCL, but DCL will be glad to be that partner. And, last here, the PubMed Central, the PubMed Central submission process doesn't have to be difficult either. And again, it can help to have a partner, and we'd be glad to talk to anybody about that. In any case, I'm going to just show you here that I do have a picture of resources. All of these links are in the handout that we've provided. If you want to try to write down a couple things, you certainly can, but they are here. And then, Marianne, we can move over and answer whatever, whatever questions we might have received.


Marianne Calilhanna

Well, thank you, David. And just to clarify, everyone, there is a section on your GoToWebinar control panel that is called handouts, and if you click on that arrow, you should be able to access a PDF that we put together with all of David's great consolidation of the PMC links and other content that might be useful. So we do have a couple questions. And, David, the first question, I think it's interesting, because, you know, the terms "PubMed Central" and "PubMed" are confusing. I'm going to read the question. But, if I understand it right, I think the two terms should be swapped. So it illustrates why this landscape is a little bit, you know, just, just kind of confusing, the terminology. So the question is, Is there a way to have an article in PubMed before it is indexed in PMC? So I think that should be flipped. Is it possible to have an article in PMC, PubMed Central, before it is indexed in PubMed?

David Turner

Typically, it is possible to have an article cited in PubMed, and not have that article exist in PubMed Central yet. That is entirely possible. You won't find a case where, you typically won't find a case, where an article has been, has been loaded and submitted and the journal has been approved for PubMed Central, and not be, also in PubMed, because, typically, when they do the PubMed Central process, they also index it for, with other major databases: MEDLINE, PubMed, et cetera.

Marianne Calilhanna

Okay, and so, related to that: If a journal is already indexed in PubMed Central and interested in applying for MEDLINE, will there be any additional technical requirements for data submission once accepted? Or do the XML files sent to PubMed Central handle both indices? 

David Turner

Yeah. That's a good question. And, you know, in general, when, when you put a, you submit the content to PubMed Central, they're also going to take the step of making sure that it's indexed in the appropriate place in PubMed, and into MEDLINE. You should not have to do a separate application. Typically, when we work, we get it the other way around where something's been in MEDLINE, or PubMed Central says, Hey, get it indexed in MEDLINE first, and then we'll do PubMed Central after that. But typically, you don't have to do a separate application. And if you do, the application for the MEDLINE and for, for PubMed, are really, really simple. And if you already have the information from PubMed Central, literally, it's, you click a, click a link, and you upload content. So. But in general, you shouldn't have to do that. 

Marianne Calilhanna

Thank you. So another question: This publisher is submitting Word and image files for all submissions to PubMed Central. Could they just send an XML package to an FTP site instead? 

David Turner

Okay, so they're, they're working, they're loading content to PubMed Central themselves? I'm not sure...

Marianne Calilhanna

That's how I understand it: We, a publisher, submit Word and image files for all submissions. Could we just send XML packages to an FTP site for submissions instead? 


David Turner

Well, you've got to load to the PubMed Central platform, and typically, as part of your package, you have the XML and the images. I'm not sure I understand the question exactly. Ordinarily, if you're loading to PubMed Central, you wouldn't just load Word documents and images. You have to load a package that includes XML, and the PDF of the article, and all the image files, and we, we typically load those to the PubMed platform. It's not a simple FTP process. Now, if they want to work with us, you know, they can load it to our FTP and we can help to convert that, and submit it to PubMed Central for them. I'm sorry, whoever asked that question, I hope, I hope that makes sense. I didn't, I didn't fully understand it, but if you want to contact me afterwards, we can certainly walk through it in more detail.
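The package described above (XML, the PDF of the article, and all the image files) lends itself to a quick pre-flight check. The sketch below is only an assumption-laden illustration: the `check_package` helper and the file names are invented, it assumes each `graphic` element's `xlink:href` matches an image file name exactly, and it says nothing about PubMed Central's actual naming or packaging rules, which should be taken from their own documentation.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Attribute key for xlink:href as ElementTree exposes it.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical article that references one figure image.
SAMPLE_XML = (
    '<article xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<body><fig><graphic xlink:href="fig1.tif"/></fig></body>'
    "</article>"
)


def check_package(filenames: list, article_xml: str) -> list:
    """Return a list of problems found in a would-be submission package."""
    problems = []
    suffixes = {PurePosixPath(name).suffix.lower() for name in filenames}
    if ".xml" not in suffixes:
        problems.append("no XML file")
    if ".pdf" not in suffixes:
        problems.append("no PDF file")

    # Every image referenced in the XML should be present in the package.
    present = set(filenames)
    for graphic in ET.fromstring(article_xml).iter("graphic"):
        href = graphic.get(XLINK_HREF)
        if href and href not in present:
            problems.append("missing image: " + href)
    return problems


print(check_package(["article.xml", "article.pdf", "fig1.tif"], SAMPLE_XML))
print(check_package(["article.xml"], SAMPLE_XML))
```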

Marianne Calilhanna

Here's another question. How quickly can DCL generate XML for a journal submission to PubMed Central? 

David Turner

That's a, that's another good question. I mean, normally, normally we don't put a specific time frame on it, but it's something that can be turned around, well, normally I think we turn it around in just a few days. It's something that if, if the need is there, we can turn around, really, faster than that as well. I don't want to put our production people on the spot, so I'm not going to give any exact guarantees. But, yeah, it's, it's usually a period of days, not a period of several weeks or months or anything like that.

Marianne Calilhanna

Yeah, and I do remember when the pandemic just started, there was a period where we had some really rush Covid-related articles that I think our team was working over the weekend to continuously get those submitted to PMC. 

David Turner

And you really want to do whatever it takes to get it, get it done in time. And usually that's one of the first questions that we'll ask when a publisher comes: What timeframe are you looking at here? Where are you in this PubMed Central process? You know, how can we help you in the best way? And a lot of times, there's no real time constraint. Other times, it's been, well, Hey, we've submitted our application and we've had two strikes. We need something. We've gotta give something to them by Friday. Because there is a, there is a bit of a timeframe from when you get your scientific application done to when you get your technical requirement done. There is a period of time that can expire and your application can be rejected because you didn't get things in timely enough. But we'll make sure that you do get those in on time.

Marianne Calilhanna

Thank you. So does PubMed Central require a specific flavor of JATS XML? 

David Turner

No, it's really, it's just NISO JATS XML. They actually provide a link to that. That same page that I showed on the slide. There is a vocabulary that's, that's involved with that. Showing it here... I had it here somewhere. That's where that might be. Yeah. So, you use a NISO JATS general publishing tag set, and you use that to conform to the PMC style. They do have, as it says here on the screen, they do have a style checker that you can run to make sure that it does validate. 

Marianne Calilhanna

And I think this is our last question. Does PubMed Central only accept journal articles? Or does it accept books, monographs, and chapters as well?


David Turner

My understanding is that it only accepts journal articles at this time. Now, the National Library of Medicine accepts book chapters and other pieces as well. But, PubMed Central, my understanding is that it is, in fact, journal articles, unless that has changed recently. I don't think so. 

Marianne Calilhanna

I seem to be having trouble with my, ah, there we go. I was just having trouble with my... my audio. Forgive me for a moment. And now I think there's a strange echo. I don't know if anyone else hears it. 

David Turner

I'm not hearing it. 

Marianne Calilhanna

Okay. Well, thank you so much for all this information. I want to thank everyone who's joined us today. If there was a comment submitted and we didn't get to address it here live, we will be in touch after today's webinar, so please hang tight. The DCL Learning Series comprises webinars, a monthly newsletter, blogs. And I invite any of you to visit to sign up for those things and keep abreast of what we put out to help, try to help our community and our industry grow together. Thank you very much for your time. And this concludes today's webinar. Thank you, David.

David Turner

Thank you.
