DCL Learning Series
Deciphering the Tower of Babel
Hello, and welcome to the DCL Learning Series. Today's webinar is titled "Deciphering the Tower of Babel: How the Language Metadata Table Helps Manage Media Assets." My name is Marianne Calilhanna. I'm the Vice-President of marketing at Data Conversion Laboratory, and I will be your moderator today.
Before we get started, I want to let you know, we will allow time at the end of the webinar for Q&A, so please feel free to write any questions that come to mind in the question box. If we don't have time to answer all the questions, we will get in touch with you personally after the webinar. Next slide, please. If you could hit next again, sure.
Before we begin, I'd like to provide a brief overview on Data Conversion Laboratory or DCL as we are also known. We'd like to say our mission is to structure the world's content. Content can unlock new opportunities for innovation and monetization if it has a foundation of rich structure and metadata. For content to be easily discovered and used via modern platforms, it should be converted into a multidimensional XML format from which machines can extract pertinent information. Next slide.
At DCL, we offer a number of service categories. I'm not going to read through each of these, but I will say DCL is best known for its onshore project management, excellent customer service, as well as working with complex content and data challenges. Next slide.
We work across many industries, and here are just a few of our representative customers in the areas of defense, government, pharma, financial and legal, technical documentation, libraries, and publishing. Next slide.
I'm delighted to introduce today's presenter: Yonah Levenson. She is the Manager of Metadata Strategy at WarnerMedia. Yonah never anticipated that when she answered an ad in college to typeset for the student newspaper that it would launch her career in metadata standards and markup. She's been instrumental in implementing change wherever she works. At WarnerMedia, she's focusing on metadata and taxonomy terminology strategy and governance, leading efforts to extend terminology consistency while adhering to industry standards.
Previously, Yonah was at Pearson Education, where she developed and implemented global metadata standards, working side by side with data architects, markup and workflow specialists, and content creators to come up with flexible solutions to meet today's educational needs. Yonah was active with multiple standards' development efforts including BISG committees, as well as contributing to the IDPF Markup Standard. She was a key contributor to Pearson's rights metadata standard and system implementation effort.
Prior to Pearson, Yonah was a project manager here at Data Conversion Laboratory. She received her master's in information science from Rutgers University and is the co-founder and co-director of the digital asset management certificate program at Rutgers. She teaches metadata for DAM, as well as content integration for DAM. Yonah is the chair of the Language Metadata Standards Committee, or LMT. Yonah, thanks so much for joining us today, and I'll let you take it from here.
Yonah Levenson (4:36):
Hi Marianne, and thank you for that warm introduction and welcome to this webinar on the Language Metadata Table. I'm really pleased to be here, back with some of my DCL family.
As Marianne said, I'm the Manager of Metadata Strategy at WarnerMedia. I'm currently part of an analytics group. Data is where it's at, and there's a lot to do. If you're not familiar with WarnerMedia, these are just some of the shows and some of the talent that we have. You may have heard of Game of Thrones or Anderson Cooper and all kinds of really cool stuff. John Oliver... WarnerMedia consists of HBO (HBO Max is actually launching next Tuesday), Turner, Warner Brothers, and CNN. There's a lot of metadata out there.
The Language Metadata Table, I started that back... I'll give you some more information there, but it's been around for a couple of years. These are just some of the committee members and contributors that we work with. You may recognize some of the other broadcast and entertainment companies, but there's also interest from other areas as well, which is why we thought this would be a good webinar for the DCL community.
So metadata. There's lots of languages everywhere, and I don't know about all of your particular worlds, but in our world, we're not just dealing with English. We deal with Spanish, French, Italian. If you saw My Brilliant Friend, there's both Italian and Neapolitan in there, and there's all kinds of languages everywhere, and you want to make sure that you're using the right terminology to hit your particular audience.
When we started going down this path at my office, I was sitting in meetings and there were a bunch of people in there, and they were like, "Okay, well what's the language code for Latin American Spanish?" We went down this path, and I found out that there were different codes in different systems, all standing for Latin American Spanish. So within my own organization there was no single, unified standard. We started looking around and found that there were some different standards, but they didn't quite fit our needs, and that's what started the Language Metadata Table.
Last year, I was fortunate enough... the LMT was nominated for the Taxonomy Boot Camp Best Practice Award. This was for Taxonomy Boot Camp London, and on my way over, I know this is a little bit hard to read, I saw there were all these codes on the screen, because I finally had time to sit down, watch a movie. It's a little bit small on here, but you can see there's IT, JP, LS, PO. Now mind you, I had already been working with language codes. I've been working with metadata terminology for years. I'm like, "What the heck is PO? Is it Polish? Is it Portuguese?" When I dug down a little bit further, I did find out that PO did in fact stand for Portuguese because there was no Polish in the offering.
I just thought this is so indicative of... Here's a great use case of where there needs to be standardization of language codes. Also, from my perspective, having worked in IT groups for really all of my career, I could see plenty of real estate in here where they could spell out the information so that I wasn't looking at something and guessing, "What does PO stand for," and then, again, finding out that it's Portuguese. Here's this great shot of... When you're sitting on a plane, if you remember back to those days a few months ago.
So the scope for the LMT is we really just focused on the languages. It's the notation of the various scripts and the writing included because you need to make sure that you have the right codes for the written language as well as the audio language. We also are including the endonyms, which is the language name in the country's own language, and also there can be exonyms available as well, the language name as used in other countries. So this one would have French, Français, and... I cannot pronounce the last one.
How do you do it?
Thank you. I'm not sure what country that's from either, maybe Germany.
It is German, okay. So there's all these different ways to call the languages, and I needed a centralized table where this information could be available for the different audiences that needed it within my organization.
So what's our mission statement? We wanted, again, to have this unified source of reference. Particularly, we were focused on the media and entertainment industries, but what we've also found is that it can be applicable for other industries as well. I don't like to reinvent the wheel. If there's something that's already there, let's take advantage of existing standards, and when we poked around, we found out that there was already a way to get better language codes across, and that was IETF BCP 47, also known as RFC 5646.
I'll go into that a little bit more, including spelling out of the acronym and what it stands for, because those of you who may work with ISO standards like 639, it just didn't have enough. It wasn't granular enough for our purposes or, in some cases, broad enough.
We want to make sure that there are consistent practices and usage throughout best practices. Also, as with any of you who have ever participated on any kind of standard, to make sure that there's approval, that we vet all the definitions, and make sure that there is agreement and consensus and sign off with the thought leaders who focus on the information and the data from the business and professional institutions throughout the exchange of knowledge and collaboration.
So as I said, this began back in 2017. I worked for HBO, which is now WarnerMedia. We were sitting around a room, like I said, and found out that there were all these different flavors of Latin American Spanish. There was LAS, SPA-LA, LAT AM SPAN, and in some cases just SP with the... "But we always used Latin American Spanish." But at times, you really need to be more detailed with regards to your use of language. Should it be Spanish as spoken in Spain or in Mexico? We all know here, particularly if you're in the US, there could be French, but if you're speaking French in Louisiana, it could be Creole. If you're speaking French in Canada, it could be Québécois or it could be French as spoken in France or Paris. We needed to differentiate all these different languages.
So in 2017, we started with 128 languages that were in use at HBO. We presented that in 2018. MESAlliance is a group called the Media and Entertainment Services Alliance. It's a support organization. It has several hundred members across the broadcast media industries. They hold regular meetings and conferences, and they said, "Hey. What do you got?" And I said, "We've been working on this cool language thing." They were like, "Let's present it." I'm like, "Okay." Then they said, "How about we make this an industry standard," and that's what happened.
So we formed a working group with those people from that earlier slide that I showed from all those different companies, and we published our first formal version of it, LMT 1.0.
At this point, it's now May 2020, hard to believe. Beginning of the year, we published LMT 3.0 with over 200 languages. We have, actually, a few more that have been requested, and we're getting ready to publish LMT 3.1 with some corrections and some additional language requests.
The way that we do this is by making sure that there is a use case to show that there is a true need. As a result of having worked with the languages for so many years, I remember reading an article not too long ago in the New York Times about... There's some island off South Carolina where they speak some other kind of a dialect. I don't think it's in common use, it is not our job to track every language that may be in existence, but really, is there a use case where you need to define what languages are being used and that somebody may end up searching for, etc, within your business and organization?
There are a number of different reasons why we need a common Language Metadata Table, a standard, a table of codes, particularly in media and entertainment, as well as publishing and other content-centric organizations. We need it in broadcast media for audio and [inaudible 00:15:17] for content. We need visual or written languages. I noticed that today is Global Accessibility Awareness Day, so if you think about people who have accessibility requirements, they may need to know... What's the audio language? What's the written language? If they have a hearing issue, they want to make sure that they can read it in the language that they expect.
There's a lot of talk out there now with regards to localization, meaning that somebody may want to say, "What's all the stuff that you have on Mexico or what's all your content for this particular company, etc." One of our early participants was from Hasbro Toys, and they were particularly interested because when they make their games, they have to make sure that they have the right language included in the instructions as well as the packaging. Broadcast media, again, those of us of a certain age, we have been known to buy DVDs and Blu-ray and stuff like that, and we wanted to know what might be the alternate languages besides English that could be on there.
From localization, you want to know what's out there, but also from a rights and licensing perspective, what languages is one allowed to use or apply or what is one buying? The distribution territory... So I mentioned Latin American Spanish, and that was one reason we couldn't use the ISO 639 codes for languages because they were by country. We combined those with 3166. You could do it by country, but there wasn't for a geographic area of Latin America or North America or Europe or Asia, so we needed something that would encompass all of that as well as being able to go more granularly.
As I mentioned here, the accessibility. So the subtitles for the deaf and hard of hearing is SDH, and like I said, for screen readers and stuff like that, knowing what the language is can also be very helpful.
Advantages of adopting LMT. By having standardized codes that show what the distinctions are between the written and the spoken languages is really helpful. It provides flexibility, because this way you can compare. Do I have it in Latin American Spanish? Do I also have it in Argentinian Spanish or Chilean Spanish or Portuguese as spoken in Portugal versus Portuguese as spoken in Brazil, etc.
With all the different work that I have done in my career, and Marianne mentioned that I had participated in the BISG, B-I-S-G Committees. I've also given this presentation to that group as well. There's a lot of data that needs to remain within one's own organization as well as being exported out, possibly in some kind of data feed so it can get pulled into... If you're in the book world, it can get pulled into Amazon or Barnes and Noble, Books-A-Million, etc., and you want to make sure that everybody's presenting the information in a consistent way so that they don't have to interpret or have some kind of unique feed or mapping table that knows, "When I see SPA from this organization, then I know it's Latin American Spanish, but if I see SPA from another one, I know that it's Spanish as spoken in Chile."
So by having the standard set of codes and everybody agrees, "Here's the display value, here's the audio code, here's the written code," it just takes all the guess work away.
One of the big things that I... In order to get a language added to the LMT, we want to make sure that there is a use case, meaning not just... As I said before, there was this island in South Carolina, maybe somebody wrote one story on it, is it enough to be in LMT? I'm in the broadcast and media industry, we're distributing content all over the world, and we need to make sure that those who are receiving our content know what language it's in. So this is just an example.
For use case, what we've been going with is that we need eight or more uses of a language in order for it to be considered for inclusion. Then we work it through, and I'll go into more details, through the working committee to make sure that everybody has approval because we keep finding that company A may use this particular flavor of a code based on legacy and somebody else may use something different, and can we come to a consensus on this? We're also working to get other standards organizations to also agree that the LMT is a good thing.
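As a rough illustration of the "eight or more uses" rule of thumb described above, the following Python sketch counts how often each language tag appears in a list and keeps those that meet the threshold. The function name, the sample tags, and what counts as a "use" are assumptions made for illustration; the talk does not define how uses are counted.

```python
from collections import Counter

# Threshold mentioned in the talk: eight or more uses.
MIN_USES = 8

def inclusion_candidates(tags_in_use, minimum=MIN_USES):
    """Return the tags that appear at least `minimum` times, sorted.

    `tags_in_use` is a hypothetical list with one entry per asset or
    request that uses a given language tag.
    """
    counts = Counter(tags_in_use)
    return sorted(tag for tag, n in counts.items() if n >= minimum)

# Hypothetical usage data: one tag reaches the threshold, one does not.
sample = ["fr-CA"] * 8 + ["pt-BR"] * 7
candidates = inclusion_candidates(sample)
```

A count like this would only be a starting point; per the talk, the working committee still has to review each request and reach consensus.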
We have all these different audiences we need to please: The licensing, distribution of content, accessibility, definitely, is up there, and then also make sure from a localization perspective that we're providing what's needed.
So here are some of the different examples, and I've mentioned them. We need the audio, what language it's in. If somebody may want to say, "Well it's in English, but perhaps, it's with a Scottish accent, English as spoken in Scotland or in Great Britain or English as spoken in Spain or English as spoken in the US or generic English." There may be different accents associated with the audio, so it may be difficult for somebody to hear and they may just want a particular dialect or they may want to see are there also closed captions or forced narratives burned in? Are those also available?
In the shot on the right, you can see there's probably audio, but then, in here, there's the closed caption so that people could be able to understand, and all the different examples that I really just mentioned now.
It is really important to mention, with regards to the written versus the audio, that there are a number of languages out there where the written language is different from the audio language. Chinese is an example, and also Serbian has two different alphabets. It's got a Latin alphabet as well as a Cyrillic alphabet. I have some examples of those.
Also, with the burned in or forced narratives, if somebody has a sign, a stop sign or... I always think of some of the shots in Star Wars where they're establishing what planet they're on, what is that language that the location is written in?
I'm going to move on now. Here's the meat. Here's IETF BCP 47. IETF stands for the Internet Engineering Task Force. BCP is the Best Current Practice, and number 47 is the number of the practice. Down here at the fourth bullet, this is how it's constructed. It is a combination of existing standards. I've mentioned a couple of times ISO 639, from the International Organization for Standardization. They have a bunch of different language tables. There are two and three character language codes. ISO 3166 has the two character country code, but now, when I'm starting to get to Latin America and other territories, 3166 is purely country codes. So by extending out and using another standard, which is not maintained by us, but it's maintained by the UN, now I have a place to have my territory defined. So Spanish as spoken in Latin America is now ES-419, and I'll have examples of that as well.
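The construction described above can be sketched in a few lines of Python: a language subtag from ISO 639 is joined with either an ISO 3166 country code or a UN numeric region code. The helper name and the tiny sample lookup sets are assumptions for illustration only, not part of the LMT or of any library. Tags are case-insensitive; the sketch emits the conventional casing (es-419), which is equivalent to the ES-419 form used in the talk.

```python
# Small sample lookups, for illustration only (not complete code lists).
ISO_639_LANGUAGES = {"es", "en", "fr", "pt"}        # two-letter language codes
ISO_3166_COUNTRIES = {"ES", "MX", "US", "BR", "PT"}  # two-letter country codes
UN_REGIONS = {"419": "Latin America and the Caribbean", "150": "Europe"}

def make_tag(language: str, region: str) -> str:
    """Join a language subtag and a region subtag into one tag.

    `region` may be a two-letter country code or a three-digit UN region code.
    """
    lang = language.lower()
    if lang not in ISO_639_LANGUAGES:
        raise ValueError(f"unknown language subtag: {language!r}")
    if region.isdigit():
        if region not in UN_REGIONS:
            raise ValueError(f"unknown UN region code: {region!r}")
        return f"{lang}-{region}"
    country = region.upper()
    if country not in ISO_3166_COUNTRIES:
        raise ValueError(f"unknown country code: {region!r}")
    return f"{lang}-{country}"

latin_american_spanish = make_tag("es", "419")  # territory-level tag
mexican_spanish = make_tag("ES", "mx")          # country-level tag
```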
If you take all those different standards out there, you can combine and come up with more than 40,000 different combinations, and that's a lot. So people can say, and I've run into this, again, in my own company, where they're like, "Whoa, we're adhering to IETF BCP 47." Okay, but what code are you using for Spanish? Are you just using ES or are you using SP? ES is the preferred term for Español, and how are you designating the country? "Oh, we just know it's Latin America."
By being able to be more granular, and those of you who have worked in data conversion, sometimes we may have to start very broadly, but in order to narrow things down, that's not always so easy. It's always easy to go from granular up to broad, but going broad back down to granular, not so easy. So let's give the codes that we need, mark up the content accordingly, and then you have enough leeway to play around with it.
One of the other nice things about adhering to IETF BCP 47 is that the codes are under regular review to make sure that the lists are kept current. So as per the example, Greenlandic was updated to Kalaallisut to reflect cultural norms. We've had a lot of conversations about Norwegian lately, and it's also supported by the W3C.
So what does it look like? Here's what the syntax is. It's got the... In the green there... Language, script, region, variant, extension, private use. So now I can have Mongolian as written in Cyrillic as used in Mongolia. Again, we have the 40,000 combinations. And here's the real charm: The third bullet of the LMT is that you have these pre-constructed codes supported by the use cases, so you don't have to think about it. Just say, "What code do I need," look it up, there it is.
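To make the subtag order concrete, here is a simplified Python sketch that splits a tag into language, script, region, and variants by the length and shape of each subtag. It is an assumption-laden illustration, not a full BCP 47 parser: it does no registry validation, it does not handle extended language subtags, and anything after the variants (extensions, private use) is kept unparsed in `rest`. The `Tag` type and `parse_tag` name are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple

class Tag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]
    variants: Tuple[str, ...]
    rest: Tuple[str, ...]  # extension and private-use subtags, left unparsed

def parse_tag(tag: str) -> Tag:
    """Split a tag such as 'mn-Cyrl-MN' into its leading subtags."""
    parts = tag.split("-")
    language = parts[0].lower()
    if not (language.isalpha() and 2 <= len(language) <= 3):
        raise ValueError(f"unsupported language subtag: {parts[0]!r}")
    i = 1
    script = region = None
    # Script: four letters, conventionally title-cased (e.g. Cyrl, Latn).
    if i < len(parts) and len(parts[i]) == 4 and parts[i].isalpha():
        script = parts[i].title()
        i += 1
    # Region: two letters (country) or three digits (UN region code).
    if i < len(parts) and (
        (len(parts[i]) == 2 and parts[i].isalpha())
        or (len(parts[i]) == 3 and parts[i].isdigit())
    ):
        region = parts[i].upper()
        i += 1
    # Variants: 5 to 8 alphanumerics, or 4 characters starting with a digit.
    variants = []
    while i < len(parts) and parts[i].isalnum() and (
        5 <= len(parts[i]) <= 8
        or (len(parts[i]) == 4 and parts[i][0].isdigit())
    ):
        variants.append(parts[i].lower())
        i += 1
    rest = tuple(p.lower() for p in parts[i:])
    return Tag(language, script, region, tuple(variants), rest)

mongolian = parse_tag("mn-Cyrl-MN")  # Mongolian, Cyrillic script, Mongolia
```

With a pre-constructed table such as the LMT, most users would look a tag up rather than parse or build one; a parser like this is mainly useful for checking incoming data.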
We also define the language grouping, so if you can see what the options are for Spanish. It's a little bit difficult for Chinese, but there are use cases for them. Then we make sure that we have the group name, the tag, the code, make sure we have values for audio and the visual and descriptions.
So here's an example of what a table looks like in here. In the first column, we have English, so that's pretty straightforward. On the left hand side, these are all the different sub-elements that we capture in our taxonomy tool, so we have all these different values in there, and there's a place for everything and everything in its place. Spanish, as I said, ES-419 for Spanish as spoken in Latin America, and if you needed to have it... Our default is in English, but if it's in a language other than English, then we have what it looks like as an endonym. Then also the visual language.
Serbian was a challenge because it could be... Over here, it's got Serbian SR, with the Latin script, as spoken in, RS, Republic of Serbia, but also in Cyrillic. What's really interesting, it kind of blew our minds, is that for Serbian, it's 50/50. There is no dominant written language there, so we had to make sure that we had a place to hold the visual language tag one and two. Orally, it's just one language, but written it could be other ways.
In Chinese, there's all kinds of interesting things that go down in your written languages, so visual. We did visual, also, because of sign language, because it's something that you see with your eyes. In Chinese, you can have either simplified or traditional Chinese for the written, but then you have lots and lots of different audio tags. Could be Mandarin, could be Cantonese. Chinese is spoken in Hong Kong and in Singapore, etc. So this list is actually quite interesting.
Armenian. Again, there's eastern and western, so that was... Somebody else on the committee who runs a translation service. They clued us in on that one. So there's Armenian as spoken by people in the Armenian diaspora. Then last but not least, the American Sign Language, and you'll notice that there's no audio language because it's a sign language. It only has a visual language. Again, so here's the ASE for American Sign Language.
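The examples above (one audio language with two written forms for Serbian, and a sign language with no audio tag at all) suggest a record shape along the following lines. This is a hypothetical Python sketch: the field names and the exact tag values are assumptions for illustration and are not copied from the published LMT table.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class LanguageEntry:
    name: str
    audio_tag: Optional[str]      # None when there is no spoken form
    visual_tags: Tuple[str, ...]  # written or signed forms, one or more

# Illustrative entries based on the examples in the talk.
ENTRIES = [
    LanguageEntry("Spanish (Latin America)", "es-419", ("es-419",)),
    LanguageEntry("Serbian", "sr-RS", ("sr-Latn-RS", "sr-Cyrl-RS")),
    LanguageEntry("American Sign Language", None, ("ase",)),
]

def has_audio(entry: LanguageEntry) -> bool:
    """True when the entry has a spoken (audio) form."""
    return entry.audio_tag is not None

visual_only = [e.name for e in ENTRIES if not has_audio(e)]
```

Keeping `visual_tags` as a tuple leaves room for the "visual language tag one and two" case without a separate field for each script.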
Here's what's been going on. MESAlliance, which again, I encourage you to take a look mesalliance.org. They have been a sponsor because this was going beyond WarnerMedia, and we needed support from an organization that was supported by the industry, and MESA gives us... I'm the chair of the LMT committee, and we've been posting information on the MESA website. We have the XML, we have a spreadsheet, we have documentation, we have presentations out there, and again, they've given us a home, and they've provided resources to help us with those kinds of efforts including the running of the working committee meetings.
However, SMPTE, which is the Society of Motion Picture and Television Engineers, is also very well known. It's one of the main standard bodies in the broadcast and media space, and they have a very robust setup for standards adoptions, and they also have a lot of tools and other resources and the infrastructure to help support the LMT. So this way, it would be easier for others to work with the standard and be able to pick up. So we're in the process of moving over from MESA as the physical home of the LMT to SMPTE, where it will be easier for people to find and access. It'll hit a broader audience.
In addition, we have partners that we're working with: EIDR, which is the Entertainment ID Registry. Many people may not have heard of EIDR, but they are the... It's like the ISBN for broadcast and media content. So you'll find movies and television content, all different kinds of stuff in there.
ISDCF is the Inter-Society Digital Cinema Forum. They have maintained for many years their own language table which was in desperate need of being updated. So they're interested in adopting LMT so that they don't have to do the same work all over again.
MovieLabs is an organization that also works on distribution standards for broadcast and media content. In many ways, they would be equivalent to [inaudible 00:33:14] in the publishing world. Again, you have to have a target. What are the fields that everybody is going to need? What are the codes that are allowed? Who can take it and do what?
So I just wanted to reiterate that there is... The LMT for publishing. I think it's really useful. This way, you know exactly what your audio content code is going to be. I know, coming from the educational world, that there would often be audio clips within an ebook, so it would be good to know what language or languages those clips are in, and again, for accessibility.
As I mentioned earlier, Hasbro and other organizations, including my own, need information about the languages to include on packaging, and it's also useful for manufacturing because you need to know... What language does this contain? As I said earlier, there's over 200 languages so far, and there's more coming.
Moving forward, we probably have our next working group meeting in mid-June. We're working on the 3.1 release. There's some technical details we're working out. MESAlliance continues to be the sponsor and to lead LMT with me as the co-chair. We'll move to SMPTE for the technical home. There will be SMPTE URNs. We have some in the table now that aren't quite what they should be, and we're also working on a common language code register and therefore updated documentation.
For resources and links, general inquiries you can send to these email addresses: firstname.lastname@example.org, and then to get to me, the direct contact lmtchairs. If you go to the mesalliance.org site, there is a working committees link, and then, if you scroll down, you'll find the Language Metadata Table and there's a bunch of different postings in there.
One more thing I thought would be helpful for you all are these validation sources. These are different places where we go to check to see what code is already being used, and again, lots of folks are already adhering to IETF BCP 47, so we just jumped on the bandwagon. We just made a table of the codes. We're not reinventing the wheel. That's what I have to say. Marianne, should I go to the next slide?
I think you can keep it here in case people want to jot down some of these URLs. I will also share the slides in a format on the Data Conversion Laboratory website so that people can click through and find the URLs a little easier than writing them down, just in case. I'll alert everyone on today's webinar about that. Thank you, Yonah. We do have some questions that have been submitted. So if I could toss one of these to you...
Is the LMT fully compatible with IETF BCP 47, and is LMT an extension of BCP 47 or an implementation?
The LMT is an implementation of IETF BCP 47. So it's fully compliant from that perspective. What we did, because you can take those various codes... If you take all the ISO codes and the UN codes, there's 9,000 codes all together, and you can make 40,000 combinations. Instead, we just made it look a lot simpler by reducing it back down to this and say, "Here's your value, look up whatever you need and go for it." If you need something else, you can go ahead and make your own, but maybe consider submitting a suggestion for inclusion to the LMT committee.
All right. Once you establish the LMT, can you talk about how you go about applying it to content? Do you just do day forward, do you go into your back file? Do you focus on a particular content type? Any suggestions for best practices for someone getting started?
My recommendation is to figure out how it works best with what your workflow is. I've had very interesting, interesting in air quotes, experiences where updating metadata could have taken... This was a few years ago, could have taken several weeks to really sweep through the whole database and then apply it and then make sure it got distributed out. So you have to really decide what's best for your own organization.
One could maintain a mapping table under the covers that has your current codes and then your new LMT codes so this way, if you have a lot of legacy data, you could just include a transform on the way out the door. I recommend that you take my Metadata for DAM class at Rutgers for more information on best practices from that way.
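The mapping-table approach described above might look like the following Python sketch: legacy codes stay in the database, and a transform swaps in the LMT code as records are exported. The legacy codes are the ones quoted earlier in the talk for Latin American Spanish; the function names and the record layout are hypothetical.

```python
# Legacy in-house codes mapped to one LMT tag (codes quoted in the talk).
LEGACY_TO_LMT = {
    "LAS": "es-419",
    "SPA-LA": "es-419",
    "LAT AM SPAN": "es-419",
    "SP": "es-419",  # assumes SP always meant Latin American Spanish in-house
}

def to_lmt(code: str) -> str:
    """Translate a legacy code, ignoring case and extra whitespace."""
    key = " ".join(code.upper().split())
    if key not in LEGACY_TO_LMT:
        raise KeyError(f"no LMT mapping for legacy code {code!r}")
    return LEGACY_TO_LMT[key]

def export_records(records):
    """Return copies of the records with the language field converted."""
    return [{**r, "language": to_lmt(r["language"])} for r in records]

legacy = [
    {"title": "Episode 1", "language": "LAS"},
    {"title": "Episode 2", "language": "lat am  span"},
]
exported = export_records(legacy)
```

Raising on an unmapped code, rather than passing it through, is a design choice in this sketch: it surfaces legacy values that still need a decision before they reach a partner feed.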
Okay. In light of today being Global Accessibility Awareness Day, can you speak a bit more about how this can help, maybe provide some details on subtitles, captions, some interesting stories that you've experienced?
Sure. Along the lines from an accessibility standpoint... One of the stories... This is accessibility, but it's also context. There was the movie Roma that came out last year, and [inaudible 00:39:56], when he saw the subtitles that were being used in Spain, had the movie yanked out of the theaters because it had been turned into Castilian and the movie is set in Mexico. So he felt that the Castilian subtitles changed the whole context of the movie because they used "madre" for mother instead of "mama," and they just felt that it didn't reflect properly.
So knowing what language your closed captions and your subtitles are in can really impact the experience the viewer has, and also, from an accessibility standpoint, if you're from the US and maybe people don't know that C-O-L-O-U-R is the same as C-O-L-O-R. So C-O-L-O-U-R is the UK or the British spelling for color and then C-O-L-O-R.
So from an accessibility standpoint... And again, the colloquialisms can cause problems for those who may have different kinds of disabilities. So knowing exactly what language their content is, whether it's audio or visual, can really make a difference for the experience.
Interesting. Well, thank you so much again for your time, Yonah. Thank you to everyone who registered and attended. This is the end of our program today. In light of so many webinars, we at the DCL Learning Series are experimenting with somewhat shorter webinars so that we can provide you with more information over the coming months. So for this webinar, we did try to keep it to 45 minutes for your benefit. This concludes today's broadcast. You can access many other webinars related to content structure, XML standards, and more from the On Demand webinar section of our website at dataconversionlaboratory.com. We hope to see you at future webinars. Have a great day.