DCL Learning Series

PubMed Central Primer and XML101

Marianne Calilhanna

Hello, everyone, and welcome to today's webinar. We will begin in just one more minute. We're going to allow some folks to continue logging on. Welcome, everyone! Welcome, Colin. Hello, Christina. Hello, Jeff. Hello, Susan. Welcome, Virginia. Happy everyone's taking a little bit of time out of their day. So welcome to the DCL Learning Series. Today we are hosting a PubMed Central primer along with an introduction to XML. My name is Marianne Calilhanna and I am the Vice President of Marketing at Data Conversion Laboratory. I'll be your moderator today and just a couple quick things before we begin. We are recording this webinar and it will be available from the on-demand section of the Data Conversion Laboratory website at dataconversionlaboratory.com. If you could hit the next slide, please.

Before we begin, I'd like to make just a really quick introduction to Data Conversion Laboratory, or DCL, as we are also known. Our mission is to structure the world's content. Content can unlock new opportunities for innovation and monetization when it has a foundation of rich structure and metadata. DCL's services and solutions are all about converting, structuring, and enriching content and data. We are one of the leading providers of XML conversion services, DITA conversion, structured product labeling conversion, and S1000D conversion.

Many people are well aware of our excellent content conversion and transformation services, which is what you see on that first blue box there. But we also do a lot of work in the other areas listed on this slide. Things like entity extraction, third-party validation of previously converted content, semantic and metadata enrichment, data harvesting or website scraping, content re-use analysis, and structured content delivery to industry platforms. So, if you have complex content or data and any challenges around those things, we can help.

I feel fortunate to work with some great people here at DCL, and David Turner is indeed one of them. David helps our customers, or those who are not yet our customers, with content technology initiatives. He's a great person to reach out to if you have any content structure challenge, and he really understands and demonstrates how XML, metadata, and updated workflows translate to digital transformation and, ultimately, revenue. David, help us make sense of PubMed, PubMed Central, and XML.

David Turner

All right, definitely. We'll do that. Thanks so much. I appreciate the introduction, and hello to everybody. I'm coming to you today from my home in beautiful Sachse, Texas. If anybody that's on this webinar has ever been to Sachse, Texas, let me know. Put something there in the chatbox, because I'd be interested to know. But, really, yeah, so I'm coming to you from my home office, so who knows if some kind of craziness will happen in the background? There may be some noise or some loud children or something that'll come by.

4:09

So, in terms of just background on this webinar, where the idea came from on this is, it's really from some questions that we've gotten on a repeated basis. As Marianne alluded to earlier, DCL is in this structured content business. And what that means is that, often, we're doing conversions to XML, and, and we're making updates to metadata and things like that. And a big part of our content structure practice is really around structuring content for government repositories, right? So, for example, I think she had mentioned we do conversions to XML for drug listings that are going to be submitted to the FDA, and it's called SPL, or structured product labeling.

We do conversion of patent documents, the supporting documents that go with the patent application for the US Patent Office, and we also do a lot of work converting content and structuring it so that it can be loaded onto PubMed Central, or PMC. Because our clients have asked a lot of similar questions about this, we thought we'd put this together, and part one is about the PubMed Central process because that's typically where the first half of the questions come around. But then part two is going to be about just XML from an introductory perspective. For some of our clients, when they come in, and they are going to load it onto PubMed Central for the first time, this is really their first exposure to structured content, or their first exposure to XML. And so we've got some, we've got some explaining that we'll do.

Okay, guys, well, let's just kick in, and let's start here with part one, which is the part about PubMed Central. All right, so here is just a quick screenshot of PubMed Central, this is what it looks like, and I'll actually show you some of this live here, in just a minute. Basically, what PubMed Central is, it's an online library or repository that's part of the US National Library of Medicine. So, as you can imagine, the National Library of Medicine has lots and lots of shelves of books and journals and things like that, but they also have electronic content, and they have this repository, PMC, that is maintained by them.

It's actually operated by the National Center for Biotechnology Information, and I think we actually have a couple of NCBI people, or at least one, on today. So welcome, thank you for coming. Um, this repository contains, really, all the full-text journal articles from a particular area: biomedical and life sciences journals. All right, so, you know the journals, and it's got all this content, and it's available for free, and I think currently the number is, like, 6.3 million articles, and they go back all the way to, to the 1700s. All right, guys, just a quick look here at what it looks like. This is not it. Supposed to be. Here we go.

All right, so this is what PubMed Central actually looks like. If you want to look up a particular journal, you can go in here to the journal list. If you want to look up a particular topic, you can do that as well. I'm gonna jump in here and, you know, just put in a journal that I know. Um... here we go, The Journal of Health Economics and Outcomes Research. I search for that particular journal, it's going to tell me what content is in there from this site. It's going to tell me, you know, what the last volumes are, et cetera. I'm just going to click on this here. You can see I can jump in and I can actually get into this particular volume. I can get into each of the articles, and open up an article here.

8:08

And you can see here in this article, that it's got the title, the author information, the abstract, and then as you go down further, it's got the entire article. It contains the images, it contains the tables in a very flexible format. You can also output, you know, various different outputs. You can do an EPUB output, you can do a PDF output, so you can read it in a variety of different ways. But that's what, that's what PubMed Central looks like.

Kind of the next question that we often get about PubMed Central is, what's the difference between PMC and PubMed? Or how is PMC different than MEDLINE? If my journal's already indexed in MEDLINE, isn't it already in PMC? Things like that. So, just to walk you through this, I'm gonna kind of walk through the difference here and for those of you who've been in scholarly publishing a long time, you might already know some of these things, but I'm going to try to explain in layman's terms. So when you think of PubMed Central, I mentioned that the National Library of Medicine has all these shelves of content out there that are in print. Well, electronically, they also have shelves of content. So you can think of PubMed Central as this, as a bookshelf of journals about medical research. So it's the actual full-text articles. It's all of the different pieces together.

Whereas PubMed, you look at PubMed, it's actually a bibliographic database. It's, it's a card catalog, if you will. If you were at the library, you'd look at the card catalog to find out where those full-text articles are. In this case, you use this to have references, or links to the online articles and books, so it looks like... this here. This is what PubMed looks like. I can look up that same article that I had there before. And I can see just, you know, just see items that go along with it. So maybe I just put in something like this here. Maybe I put an author's name or something like that.

Now, if I can find this article, yep, there's that article right there. And you can see here that it does include some of the same information, the abstract, the authors' names, et cetera. But it does not have the full record. Instead, what it has is linkage to the article on PMC. So again, back to PubMed, it's intended to be kind of that card catalog. So it has the information that points you to the bookshelf, the full text articles that are out there. It doesn't just point to PubMed Central, I mentioned there are, what, 6.3 million articles. Well, PubMed actually has something like 30 million citations.

So just by doing the math, you can see it does actually also point to other repositories; it also points to other publisher websites. So there can be lots of content out there where the full text exists someplace else that PubMed will show you the citations for, whereas PubMed Central is just that specific group that the National Library of Medicine is hosting. In terms of MEDLINE, when you think of MEDLINE, MEDLINE is a subset of PubMed, so it's also a bibliographic database. It's the largest subset of PubMed. It represents something like 26 million of those 30 million citations total. Twenty-six million of those are indexed in the MEDLINE database. And typically, one of the things that makes them very useful and special is that they're indexed using MeSH subject headings, but that's another topic for another day.
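To make "doing the math" concrete, here is a minimal sketch in Python using only the approximate counts quoted in the talk. The constant and variable names are made up for this illustration, and the figures are the speaker's rough numbers, not live statistics.

```python
# Approximate counts quoted in the talk (rough figures, not live statistics).
PUBMED_CITATIONS = 30_000_000
MEDLINE_CITATIONS = 26_000_000
PMC_FULL_TEXT_ARTICLES = 6_300_000

# MEDLINE is described as the largest subset of PubMed.
medline_share = MEDLINE_CITATIONS / PUBMED_CITATIONS

# PubMed citations that are not part of the MEDLINE subset.
non_medline_citations = PUBMED_CITATIONS - MEDLINE_CITATIONS

# Even if every PMC article had a PubMed citation, at least this many
# PubMed citations must refer to content whose full text is not in PMC.
min_citations_outside_pmc = PUBMED_CITATIONS - PMC_FULL_TEXT_ARTICLES

print(f"MEDLINE share of PubMed: {medline_share:.0%}")
print(f"PubMed citations outside MEDLINE: {non_medline_citations:,}")
print(f"PubMed citations not backed by PMC full text (at least): {min_citations_outside_pmc:,}")
```

The last figure is the point of the comparison: with roughly 30 million citations and 6.3 million hosted articles, most PubMed records necessarily point somewhere other than PMC.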

11:58

Moving along here to this next slide, in terms of submission, if a publisher was being asked to submit content to PubMed or MEDLINE versus PubMed Central, it's a little different. The rigor is a lot different; the amount of content is a lot different. So, when you're submitting to PubMed, typically you're submitting XML header-type information only. You're sending over the basic citation data, and you're sending over basic abstract data. I think they can turn these around in something like 24 hours. It's 24 to 48 hours, something like that; it's a really simple process. Now, when they go to the full-text library, when they go to PMC, there's a big difference. Publishers here are going to certainly submit the citation data and certainly going to submit the abstract data. But then they're also going to submit the full text for all the articles, all the tables, all the images, all the references, all the supporting documentation, things like that. That's what's involved. So, from a submission perspective, it's pretty significantly different.
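The contrast between a header-only record and a full-text record can be sketched with a few lines of Python. This is only an illustration: the element names below (article, front, article-meta, body, back, ref-list) are assumptions loosely following common journal-article XML conventions, and the function names are hypothetical. The actual formats that PubMed and PMC accept are defined in their own documentation.

```python
import xml.etree.ElementTree as ET


def build_article(full_text: bool) -> ET.Element:
    """Build a toy article record; element names are illustrative assumptions."""
    article = ET.Element("article")

    # Header-type information: citation data and abstract.
    front = ET.SubElement(article, "front")
    meta = ET.SubElement(front, "article-meta")
    title_group = ET.SubElement(meta, "title-group")
    ET.SubElement(title_group, "article-title").text = "Example article title"
    abstract = ET.SubElement(meta, "abstract")
    ET.SubElement(abstract, "p").text = "Example abstract text."

    if full_text:
        # Full-text additions: the article body plus back matter such as references.
        body = ET.SubElement(article, "body")
        section = ET.SubElement(body, "sec")
        ET.SubElement(section, "title").text = "Introduction"
        ET.SubElement(section, "p").text = "Example body paragraph."
        back = ET.SubElement(article, "back")
        ET.SubElement(back, "ref-list")

    return article


def top_level_sections(article: ET.Element) -> list:
    """List the tags of the record's top-level parts."""
    return [child.tag for child in article]


header_only = build_article(full_text=False)
full_record = build_article(full_text=True)

print(top_level_sections(header_only))
print(top_level_sections(full_record))
print(ET.tostring(full_record, encoding="unicode"))
```

The header-only record stops at the citation and abstract, while the full-text record also carries the body and back matter, which is where the extra rigor and volume come from.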

All right, so I'm sure you guys have all got this, you're ready. So, I'm gonna give you a quick pop quiz here, so just, I'm not going to ask you to do this like, as a, you're online or anything like that. Just, you can grade yourself, you can grade yourself, so get ready there. Okay, question number one: can an article be indexed in MEDLINE and not exist in PubMed Central, PMC? Be thinking, be thinking... hope you wrote down your answer. The answer is yes, certainly. So, MEDLINE, again, has 26 million citations. PubMed Central has 6.3 million articles, so you can certainly have an article that's sitting on the publisher side, or in another, in a database that does not exist in PubMed Central.

Question number two. Are articles in PubMed Central generally also indexed in PubMed? Your answer there? No cheating. Yes. Yes, absolutely. Articles in PubMed Central are generally also indexed in PubMed. They tend to work hand in hand.

And question number three. Do PubMed and PMC have the same rigorous requirements for full-text XML? The answer is no. We just talked about number three, so for PubMed, it's typically a relatively quick process, XML header-type info. Whereas PMC requires a lot more information. A lot longer process. All right. So if you graded yourself, if you got 100, let me know in the chat. If you didn't make 100, also let me know in the chat, and just let me know how you did on this quiz. All right, let's move along here and let's, let's talk a little bit more about: why do publishers put content on PubMed Central? What are the benefits to them?

Basically they do it for four main reasons. First of all, there's the idea of increased discoverability and access. When you're researching, you obviously want your content to be accessed by the largest audience possible, and PubMed Central is certainly the preeminent repository of this kind of information. Another great benefit is that it allows you to archive your articles, really in perpetuity. Journal publishers might come, they might go, but over time, ideally, the library, the US National Library of Medicine, will be there, and so this allows you to archive them separately from, you know, what the publisher's website might have or what some other database might have, in some way that's there for the common good.

Another great one is that you get a lot of increased exposure because you're integrating with the other databases that are part of the National Library of Medicine and NCBI. Um, sometimes it's really just important because maybe your funder is requiring that content to be in PubMed Central. All right, um, from there let's move on and let's talk about the process itself.

16:04

So how do we get a journal into PubMed Central? Well, first of all, I'll tell you, they do actually have this whole process lined up over here. And there's a website here, and we're gonna give you this in, there's, a handout that includes all the links to this kind of thing, but there is a process that's outlined here on their website that walks through a lot of this detail, but I'll hit it from kind of a high level here today. All right? So first of all, you've got the process of just submitting the application. Because you're submitting content here, you have to give them some basic information.

An important thing to understand here is that when you're submitting content to PubMed Central, it's not books or a journal article; what you're typically getting approved is an actual full journal, right? So what they're going to want to know is, who publishes this journal? Who's the management of this journal? What's the journal title? What's the ISSN? When was it published? How often is it published? You know, what's the website? And then they want to know, you know, links around what your various policies are for editorial or peer review and things like that. So you submit all that information, kind of as the first step, just to make sure, well, to prove that you're a reputable journal, and that your content deserves to go on this site. Once you pass that step, they move into an application screening process.

So, the people at PubMed Central start looking at the journal, and they start asking a lot of different questions. So they start asking, you know, is this peer-reviewed content? Is it the right kind of content? Is this biomedical content, does it actually belong here on this site or does it belong on some other database? What are these author affiliations that we've got? Are they appropriate for this kind of content? Are the article types relevant? So it's kind of a high-level application screening there to give the idea of, from a macro perspective, does this content fit?

From there, they move into a more detailed view, which is the scientific quality review. This is really intended to be a review of the content itself. It's a content review, not a format review. And if you're already indexed in MEDLINE, you generally won't go through too much more in terms of the scientific quality review. If you're not in MEDLINE already, they're going to start asking questions like: What's the scientific rigor here? Did the authors apply the scientific method, did they provide full transparency in providing their supporting data, things like that. They're going to want to know, you know, is there good editorial quality here? Whatever your articles claim, are those claims clear and logical?

The figures and the tables that they've added, are those well constructed, do they really contribute to that? So they do look at both scientific rigor and editorial quality, and then ultimately, they come back and tell the publishers, you know, if you made it or not. If you don't make it, you are eligible to re-apply in 24 months. And that's pretty consistent across the way here. I will say also there's full criteria online for this scientific quality review. They walk through all of the different questions that they ask. Lots of sample questions. I wouldn't worry too much about scratching down that website, because we've included it in this handout, but you're certainly encouraged to go and take a look.

All right, so we've submitted an application. They did kind of a high-level screening and they dug in and looked at the content. At this point, they move in and they want to look at the format. They want to look at the format of the content. You know, does this thing meet our technical requirements for this content? Ultimately, PubMed Central wants their content to be consistent and to follow a similar format. They want to be able to have nice linking, and so to do that, it has to be in the right format, and that's typically where a vendor gets involved.
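One small part of "the right format" that anyone can check locally is whether an XML file is well-formed at all. The sketch below uses Python's standard xml.etree.ElementTree parser; the function name and sample strings are made up for illustration. It checks well-formedness only and says nothing about whether a file meets PubMed Central's actual technical requirements.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text: str):
    """Return None if the text parses as XML, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


good_sample = "<article><front/><body/></article>"
bad_sample = "<article><front></article>"  # <front> is never closed

for label, sample in (("good", good_sample), ("bad", bad_sample)):
    error = well_formedness_error(sample)
    print(label, "->", "well-formed" if error is None else f"not well-formed: {error}")
```

A file that fails this check will fail any stricter evaluation too, so it is a cheap first gate before the content goes anywhere.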

20:19

The publisher is asked to submit an initial package of 25 articles, and those articles should include XML, should include PDF, should include images, should include supplemental data, all in a big package. At this point, they review all of that content. Now, if after three rounds of evaluations, the publisher can't address all the reported errors, the application gets rejected. And so, you've got to make sure that you have clean XML. And we've had publishers come to us a couple of different times, where they have two strikes and they need somebody to help them on this third strike. You're not required to use a vendor. You can certainly do this yourself, but you just need to know that after three evaluations they will reject you, and my understanding is that the same 24-month period applies, so you'll have to re-apply in another two years.
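As a rough sketch of a pre-submission sanity check on that initial package, the Python below groups a list of file names by article and reports anything obviously missing. The folder layout (one folder per article), the file names, and the rule that every article needs both an XML and a PDF file are assumptions made for this illustration; only the figure of 25 articles comes from the talk.

```python
from collections import defaultdict
from pathlib import PurePosixPath

EXPECTED_ARTICLES = 25                  # initial package size mentioned in the talk
REQUIRED_EXTENSIONS = {".xml", ".pdf"}  # assumption: every article needs both


def group_by_article(filenames):
    """Group file extensions by a hypothetical '<article-id>/<file>' layout."""
    groups = defaultdict(set)
    for name in filenames:
        path = PurePosixPath(name)
        groups[path.parts[0]].add(path.suffix.lower())
    return groups


def find_problems(filenames):
    """Return human-readable problems found in the package listing."""
    groups = group_by_article(filenames)
    problems = []
    if len(groups) != EXPECTED_ARTICLES:
        problems.append(f"expected {EXPECTED_ARTICLES} articles, found {len(groups)}")
    for article_id in sorted(groups):
        for ext in sorted(REQUIRED_EXTENSIONS - groups[article_id]):
            problems.append(f"{article_id}: missing {ext} file")
    return problems


# Build a sample listing of 25 articles, then remove one PDF to simulate an error.
sample_files = []
for number in range(1, 26):
    article_id = f"article-{number:02d}"
    sample_files += [
        f"{article_id}/{article_id}.xml",
        f"{article_id}/{article_id}.pdf",
        f"{article_id}/fig1.tif",
    ]
sample_files.remove("article-03/article-03.pdf")

print(find_problems(sample_files))
```

With only three evaluation rounds available, catching a missing file before submission is cheaper than spending one of those rounds on it.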

So after you pass that technical evaluation, they move you into a pre-production process. This pre-production process says, All right, your content looks good, format looks pretty good here, let's execute an agreement. Let's get a larger set of article files. So you go in and bring in the rest of the articles that you're going to load, based on, you know, what you told them in terms of publishing frequency, et cetera, in the first step. Then they, obviously, review those as well, and they'll see the correction report, if any data errors are found, and as before, if they continue to find repeated errors, they will reject your application. You will have to wait again to reapply.

All right, so, we've been through all of these processes. At that point you kind of move into a live release phase, and the live release phase, what I'm gonna show here is a workflow. It's typical of what we do as a vendor. This could be different if you do it yourself. Or even if you use another vendor. It just kind of depends on their process. But, generally, the process that we use is, the publisher will send us a PDF and a Word version of their articles. We take those, sometimes they'll send us their images separately; other times, they'll just ask us to extract the images from what they've got on these other documents.

So we take those, we go through a process of converting that content, and we create an XML file, and we submit the XML, the PDF, and the images to PubMed Central for you. We just do that on a regular, ongoing basis, so as your journal comes out, you send us the content, we make the conversion, we do the submission, and then we'll also typically send a copy of the XML back to our client so they can keep it for their records.

That's overall the entire process for getting a journal loaded. If you have questions on that, do put them there in the, in the questions pane, and we'll, we'll hit them at the end. I'm happy to do that, or I can obviously talk to you afterwards as well.

All right, so still here in this PubMed Central section, let's look at some frequently asked questions. I will say that PubMed does have a really great frequently asked questions page, here's a screenshot of it. And I'm also going to show you the link here. Again, don't worry about copying it down, we'll put it in that set of links that's on the handouts there, but they do a great job of going through a lot of common questions that are out there. And if you don't really see an answer to your question, first of all you can contact me anytime you like. But you can also contact PubMed Central directly at this email address. They're a very pleasant group to work with, I think they get back to publishers quickly, and so feel free to reach out as well. And that comes from their website.