
DCL Learning Series

Legal Implications in the Brave New AI World: Copyright Infringement and Training Sets

Marianne Calilhanna

Hello, and welcome to the DCL Learning Series. Today's webinar is titled Legal Implications in the Brave New AI World: Copyright Infringement and Training Sets. My name is Marianne Calilhanna, and I'm the VP of marketing here at Data Conversion Laboratory. I'm so thrilled with the reception that this webinar has received, so thank you for taking time out of your day. Before we begin, this webinar is being recorded and it will be available in the on-demand section of our website at dataconversionlaboratory.com. My colleague Leigh Anne is working behind the scenes, and if you have any issues with this platform you can just send a chat in the control panel and she can help you out. We have some good news for the attorneys in the house today. Attending this webinar [live] qualifies you for one CLE credit. There are two forms that can be completed and returned with the CLE code printed on them. Instructions for how to return the documents are within the documents themselves, and you will receive the CLE code at the end of this program. You can access those documents via the handout section of the control panel.


Finally, we'll save some time at the end to answer any questions you have, but in the meantime you can feel free to submit anything that comes to mind via the questions box on the control panel. We have some new audience members with us today who might not be familiar with Data Conversion Laboratory, or DCL as we are also known. I want to take a moment to introduce you to our services. DCL has been converting, structuring and enriching content and data since 1981. We are the leading provider of XML conversion services, DITA conversion services, S1000D conversion, and SPL conversion. While conversion is our middle name, we offer many other services and solutions, as you can see from this slide. And as the age of AI is in full force these days, we've seen an increase in the support of dataset development, and structure and content for training large language models. If you have any sort of data or content challenge, we can help. It's a true joy to introduce you to my boss, Mark Gross, president of Data Conversion Laboratory, and his son Daniel Gross, partner at Myers Wolin. Welcome, gentlemen. I'm going to turn it over to you two now. 


Mark Gross

Okay, I'm on. And Daniel's on. 


Daniel Gross

Hey. 


Mark Gross

Hello. Okay, so yes, it's a real thrill to welcome a special guest today, Daniel Gross, my son, who's a partner at Myers Wolin, a law firm specializing in intellectual property, patents, copyrights, trademarks. And what's happened over the last six months or so is almost every conference I've been at, the question of large language models and collections of data and ChatGPT comes up in almost every session I sit in on at one level or another. And I thought one of the issues on all that is who owns the copyrights on that?


4:02

What is that data? Who does it belong to? There's been lawsuits that people talk about, and I thought it would be really good to have an expert with us to talk about those issues. And based on the registration of this conference, a lot of people are interested. And I hope this will be an enlightening hour that we spend together. So welcome, Daniel. Would you like to take a minute and tell us about yourself and Myers Wolin and what you guys do? 


Daniel Gross

Sure. Thank you for having me. I'm very excited to be here. It is a fascinating topic. I'm a partner at Myers Wolin, where I've been for I guess 11 years now. Like you said, we're an intellectual property law firm. We specialize in patents, trademarks, copyrights. My contact information should be in one of the handouts that's being distributed, so I'm happy to discuss any of these topics further. The copyright angle on ChatGPT and on model training, things like that, it's fascinating. I'm really excited to be here and have this conversation with you, so thank you. 


Mark Gross

Okay, well let's get started. First, to level the field a little bit, if you could just give us a little overview of what is copyright content, and what does it mean to license it, and how long do copyrights last?


Daniel Gross

Sure. So, I know we have a pretty wide spectrum of people viewing this, including lawyers, non-lawyers, publishers, et cetera. So, I'll try to lay out our terms. Copyright law protects writings of authors. So, the Constitution provides the ability to secure for a limited time to authors and inventors exclusive rights to their respective writings and inventions. That's Article I, Section 8 of the Constitution. The protection of writings for authors is inherently for a limited time. The requirements to secure copyright are originality and fixation: that you create something original and you fix it in some medium. And while the term has changed over the years, the current term is, for an individual, life of the author plus 70 years, or 95 years from publication or 120 years from creation for a corporate work. That was most recently extended in 1998, and that's the current term. So, basically if something is created now and put on the internet, it is currently under copyright. Older works that got uploaded may not be.


Mark Gross

Okay. So, you don't necessarily have to file it though for it to be copyrighted. If you write something original and put it on the internet, does that mean you have a copyright on it?

 

Daniel Gross

So, copyright vests automatically. Litigating it later on may require a recordation, but if you create something new and put it on the internet, you will be entitled to some copyright protection in it. 


Mark Gross

Okay, so - 


Daniel Gross

And you asked about licensing. Licensing in this context is really just a grant to somebody of the right to leverage that work, and basically saying that you won't sue them for copyright infringement in that context.


Mark Gross

Right. So, it's very possible that everything on the internet, except older works, is at some level copyrighted, right?


8:01

I mean, really it's all original material that's put out there. So, then the big question is –


Daniel Gross

There are exceptions to that, largely facts, listings of information and things like that, that may or may not be protected. But generally we could assume that if you're, as we're going to talk about later, if you're ingesting a huge amount of information that exists on the internet, there will be copyrighted information swept up in there. 


Mark Gross

Okay. So, to what extent are you allowed to use unlicensed copyrighted information that you find on the internet? That's probably the big, the 700-pound gorilla here. 


Daniel Gross

Right. So, the main exception we're going to be talking about here is fair use. So, something that would otherwise be an infringing use of copyright may be fair if it satisfies the fair use test, which is represented in four main factors. That's the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used in relation to the copyrighted work as a whole, and the effect of the use upon the potential market for or value of the copyrighted work.


The first and last factors there are the ones we're going to talk about most, and those are the ones that are discussed the most in the case law. So, that would be the purpose and character of the use, whether it's commercial or non-commercial, whether it's what the courts have called transformative, that would be if it adds something new or changes the underlying purpose of the work. And the other factor we're going to talk about is the effect of the use on the potential market. So, if you're selling a book as a book, then that would certainly impact the market for book sales. But if you are providing it such that it can be viewed by some computer system, then that may not be the use that the work would otherwise be sold for. So, does it supplant effectively or substitute for the original use that the author intended, or does it undermine a traditional market for derivative works?


Mark Gross

So, people usually think of fair use as using a small amount of a work, but you've really expanded on that with lots of other things. I think of it as using a paragraph out of a work and that would be fair use, but it sounds like there are a lot more factors there.


Daniel Gross

So yes, there are other factors, but that also has to be balanced against substantiality. How important is that one paragraph? So, if you have a long essay that you've protected and only one of those paragraphs is really important, the rest of it is leading up to it or leading away from it. And the only reason anybody would buy that essay is to get that one paragraph and you quote that paragraph, then you're still undermining the work. And that was a famous case years ago that the Supreme Court took where a single section of a longer work was excerpted and that undermined the sales of the underlying work, and that would still be an infringement, not necessarily fair. So, that means pulling one paragraph doesn't automatically give you a free pass. It may in many instances mean that the use is not infringing, but it's not necessarily the case. 


Mark Gross

Okay. And if something is infringing, depending on what it is, you typically would have to sue to get somebody to stop it? I mean, it's –


Daniel Gross

Yes. So well, you'd start with an angry letter and build up from there, but ultimately the way you force the issue is to sue. Yes. 


Mark Gross

Okay. So part of this goes into, the term that's always used is "derivative works." So, let's talk about that. 


12:04

Daniel Gross

Right. So, when you publish a book you're not entitled to protection only in that book as it's originally provided, but there are some standard uses that are part of the commercial halo that surrounds that underlying work. So for example, if I write a book I have some right in licensing a movie based on that work. So, the copyright law allows for a derivative work. So, what's that? It's a work based on or derived from an existing work. So musical arrangements, translations, movies based on a work, abridgments, reproductions, et cetera. So sequels, for example, would be a derivative work in most contexts.


Mark Gross

So, derivative works are sometimes protected, sometimes they're not protected. I think you told me about that if somebody was doing a parody, that could be a derivative work but that would be allowable usage? 


Daniel Gross

Right. So, the case law consistently balances these factors against each other, and a derivative work you'd have to balance the purpose of the work, and that's the question of whether it's transformative for example, or it's non-commercial, and you'd balance that against whether it undermines the original market for the work. So, even if it were a commercial work that undermined the market, it may still be fair if it's sufficiently transformative. So the classic case on that, which all the lawyers here are familiar with, is Campbell v. Acuff-Rose. That was in '94 and that was a rap, let's say parody of Roy Orbison's "Pretty Woman" that was determined to be fair use by virtue of the fact that it was, in the view of the court, commenting on in some way as a parody the original underlying work. 


Mark Gross

Okay. So, I think that gets into the heart maybe of a large language model, or ChatGPT, or whatever we're talking about: it's taking pieces from other places and maybe taking them in small pieces, but is that a transformative work? Is that a derivative work? Is that a really hard question to answer these days?


Daniel Gross

So, it is a really hard question to answer. The starting point is, I don't know how much of your audience is technical and how much of your audience knows how these models really work, so there are different stages at which you'd want to evaluate it. So, first of all there's building the model and designing the model, there's training the model, and then there's generating output from that model. So, training and designing are going to be the most technical aspects: designing, structuring how that data is going to be considered and evaluated within the context of the model. But the training process is really about ingesting some set of content and then deciding how you're going to structure that. So, machine learning will start by ingesting as much data as possible. As I said, the dataset will ideally be as big as you can get. Data's pulled from, say, the open internet, from book corpuses, anywhere else you can find it. And then in terms of what it does with that, first it'll typically make an internal copy.


16:00

And then what ChatGPT in particular does, so GPT stands for "generative pre-trained transformer." So, pre-trained means that it takes all that content first, it'll have access to that content, and then it'll be reorganized using what they call a transformer architecture. And that means it's tokenized, it's broken down into component parts. Those individual component parts are then utilized such that the model can understand the content. So, you'll have a first stage where all the content comes in and exists somewhere in the system, and then it's broken down prior to use such that it's most likely no longer recognizable as substantial chunks of the copyrighted works, if that makes sense. Then the question is what are you left with after this? It's tokenized, and that tokenization is used to organize it into outputs responsive to specific inputs, and I don't think we want to get any more technical than that at this point. We've probably already gone past where we should have, but –
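Editor's illustration (not part of the spoken remarks): a minimal sketch of what "tokenized" can mean in practice. It assumes a toy scheme that splits text into lowercase words and punctuation and numbers each distinct token; the function names `tokenize`, `build_vocab`, and `encode` are hypothetical. Production models such as GPT use learned subword vocabularies instead, which this sketch does not attempt to reproduce.

```python
import re

def tokenize(text):
    """Split text into lowercase word and punctuation tokens (toy scheme, not a real model's tokenizer)."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())

def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab):
    """Map text to a list of token IDs; tokens missing from the vocabulary become -1."""
    return [vocab.get(tok, -1) for tok in tokenize(text)]

# Invented sample text, used only for illustration.
sample = "Copyright protects writings. Copyright vests automatically."
tokens = tokenize(sample)
vocab = build_vocab(tokens)
ids = encode(sample, vocab)
print(tokens)
print(ids)
```

The point of the sketch is only that, once text is reduced to a sequence of IDs like this, the system is working with component parts rather than with stored sentences.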


Mark Gross

No, I think it's a reasonable level, but I think you're taking it in, so you're taking in these works which are likely to be copyrighted. Is merely ingesting this information some kind of violation? Because that's not the intended use of it. I think some court cases have been about that. I mean, it wasn't intended for you to suck it into a computer to build a database.


Daniel Gross

Right. So, there have been a bunch of lawsuits filed against OpenAI and a few other companies that assert that it is. From the case law it's not entirely clear, but the most relevant cases imply that that in itself is not a traditional copyrighted use. So for example, there's the Google Books cases where they basically took tons of books out of libraries, scanned them, and plugged them into this model so they could be found. And in that case it wasn't really AI, it was recording those books. And then you could search those books, but the modernized version of Google Books wouldn't allow you to access the book as a whole, it would only allow you to access certain parts of it. And ultimately that resulted in settlement, but the intake part of that appears to have been fair use.


Similarly, there's older case law from the '90s that I included in the handout related to internal copying. Basically if there's an earlier stage of copying a work, but that copying is not ultimately the output you generate. So for example, copying video game codes such that you could reverse engineer some part of it, in that case for interoperability, but not using any more of that code than you need specifically for the purpose of interoperability would not be an infringing use, but the initial copying necessary to allow you to do that was considered fair use. So, the intake itself in my view is most likely fair use, but there are several cases on the issue that assert otherwise, and therefore it is very much an open issue at this point. 


Mark Gross

And those cases have not been decided yet, I guess, the ones where it's asserted that it's not fair use, that it's not considered fair?


Daniel Gross

So, they're in litigation now and litigation here will take at least some time, and there are multiple cases going on each with slightly different fact patterns. So, I think it'll be a while before we get a clear picture, but hopefully sooner than later we'll figure out which parts of it are clearly allowed. 


Mark Gross

Right. So, it's likely that just scraping the internet for information is probably okay, or a low-risk activity.


20:06

You can collect information and have it. So, now the question is what do you do with it? And that's I think where the question of how big a piece you use and all those other factors that you mentioned come in. So, you showed me, you asked one of the AI engines to describe Data Conversion Laboratory. Can we look at that slide?


Daniel Gross

Yeah. So, I asked ChatGPT what Data Conversion Laboratory is, and what ChatGPT does, as we talked about, is it reads the internet, in this case most likely your website and any other additional content about you that's out there, but then that information becomes tokenized. It's not maintained in sentence form, for example. So, it gets broken down and then if you ask it a question like what is Data Conversion Laboratory? You end up with a sequence of sentences that, I mean, the model is trained and tuned pretty well, so it makes sense and hopefully is accurate, but these sentences do not exist on your website. So, it was reconfigured. Data Conversion Laboratory, parenthetical "DCL," that little block exists there. And if they were mis-describing you or if they were using that to describe something else, maybe that would be a trademark issue but it's not a large enough block to be a copyright issue.


The rest of it, if you search for those sentences, those word sequences on your website, they don't exist. So, this would be more analogous to me reading your website, and then in my mind reformulating this in a way that makes sense to myself, the information that I've understood, and then outputting that to you. Now, recognizing there's no internal understanding in this system, it is some algorithmic connection, an algorithmic link between the input and the output, but still it's broken down to a point where the recreation is not as clearly a copyright infringement as it would be if you just take that familiar paragraph and output it. 
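Editor's illustration (not part of the spoken remarks): the check described here, searching a source for the word sequences that appear in a model's output, can be sketched as finding the longest run of consecutive words the two texts share. This is a minimal sketch under stated assumptions: the helper names `words` and `longest_shared_run` are hypothetical, the two strings are invented rather than taken from any website, and a number like this says nothing by itself about whether a use is infringing or fair.

```python
import re

def words(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def longest_shared_run(source, output):
    """Return the length, in words, of the longest word sequence found in both texts."""
    a, b = words(source), words(output)
    best = 0
    prev = [0] * (len(b) + 1)  # run lengths ending at the previous source word
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous word in both texts.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

# Invented strings, used only for illustration.
source_text = "The quick brown fox jumps over the lazy dog"
output_text = "A quick brown fox leaps over the lazy cat"
shared = longest_shared_run(source_text, output_text)
print(shared)
```

A short shared run, such as a company name, corresponds to the "little block" mentioned above; a long shared run would suggest that a passage was reproduced rather than reformulated.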


Mark Gross

Right. And I thought this was fascinating. It's pretty good. And as you say, none of this appears as a full sentence any place that we were able to find on our website, so –


Daniel Gross

People are using ChatGPT to generate marketing copy, but you don't necessarily know the quality of the input and therefore you can't really trust the output necessarily. 


Mark Gross

Right. And we've had several famous cases that were talked about, about it generating all kinds of stuff that didn't exist, hallucinations and stuff like that. That's not really our topic today. So, this comes to the point of who's the author of something like this? If somebody, let's say it's a more substantial two paragraphs, who's the author? ChatGPT constructed something, a page or two, or three pages in the style of something based on information it has.


Daniel Gross

Right. So, we've jumped from the input to the output and that gives us a picture of the whole model because what you have here is somebody is building a model, somebody is feeding that model a bunch of information, the training content, and then somebody is prompting the model to output something. So, when you put that all together, what happens if I personally design a model, and then train that model on a very limited set of data, and then I generate the output? Let's say I design a model, give it all the Stephen King books, and then tell it to output a novel in the style of Stephen King and then I put that for sale.


24:00

It's not a Stephen King novel, I don't put it under his name necessarily, but is that a copyright infringement? And if you prompt it to generate a Stephen King novel from Stephen King-supporting content, then you may end up with a derivative work, kind of inherently. So then the question is, what happens if you break that up? What happens if one person designs the model, a second person trains that model, and a third person then prompts it to output something? Is there still an infringement there? I would say probably yes, but who's responsible for it? 


Mark Gross

You've split the responsibility across multiple people, but is there an infringement? I mean, I guess that's the question though. It's only Stephen King novels and now you've produced something in the style of Stephen King. It's none of those novels though, so it's an original work at that point but it's sort of derivative.


Daniel Gross

Well, we could simplify it and tell it "Now that you've read these 40 books, give me a sequel of one of those books." If you say "Give me a sequel of one of those books," then, well tell me, I guess the equivalent would be let's say I sit down and read 40 books, spend a year reading only those books, that would be quite a year. And then I decide to write a sequel specifically of one of those books in the style that I have just ingested, let's say. That would be a derivative work, that would be a sequel of something that exists written in that style. But a sequel doesn't even necessarily need to be written in that style to be derivative. And similarly if I tell it "Create a movie out of one of those," that would certainly be derivative. 


Mark Gross

Okay, and is that infringing?

 

Daniel Gross

So, most likely as a derivative work that would be infringing. The question then is how far can you stretch the model in order to prevent it from being infringing? So for example, let's say I start in the same place and feed this model that I've created and designed, feed it 40 Stephen King books, and then I tell it to output a novel based on those Stephen King books in the style of Danielle Steel and see what the model comes out with. So, I included an example of what happens when you do that in the handout, but that would seem to be more likely to be transformative, less likely to be a straight derivative work of the underlying works that it was fed.


Mark Gross

Okay. So, the fact that it's transformative now would mean that it may not be infringing. 


Daniel Gross

And whether it's transformative, whether it's the same commercial use is an open question there. But the farther you get from a traditional derivative work, the better off you are. So, the most recent Supreme Court case on the fair use issue, which I included in a handout alongside Campbell v. Acuff-Rose, was Warhol versus Goldsmith, which just came out pretty recently. And that was a Warhol-style painting, maybe I should have included that, that was a Warhol work based on a Goldsmith photograph of Prince that was originally licensed for use but was then later reused without a license going back to Goldsmith, without going back to the original photographer.


27:55

So, the opinion for the court was written by Justice Sotomayor, who said that because it was being licensed for the same commercial purpose, that it's basically a substitute so long as it is a photo or an image used by a magazine and licensed for that purpose.


As opposed to Elena Kagan, who wrote a very aggressive dissent, saying that once you're changing the style it's no longer a substitute work, that somebody who was licensing the Goldsmith photo would not be interested in the Warhol artwork and vice versa, that they're not direct substitutes. So, the question then becomes are we talking about, in the scenarios I just laid out, are we really talking about the equivalent of an Instagram filter, basically applying some filter to the underlying work and thereby saying it's different? And if so, is that enough of a change to differentiate it from the perspective of fair use? Is it transformative? Maybe, maybe not. It's all a sliding scale. So, is it transformative, is it commercial? So, you end up having to evaluate each work independently, which is obviously not the best outcome for a lot of our audience, but –


Mark Gross

Well, there are areas here, and some of it is a sliding scale, so it's a sliding scale on the four factors you mentioned before. So, what if this particular work was not for commercial use? I think you mentioned the idea of fan fiction, which is generally not commercial, right?


Daniel Gross

Warhol is art, but for commercial purposes. Fan fiction, I think, is an interesting model here. So, there are large communities online where people write unofficial sequels to popular works. So, they publish it on these forums, they share it with friends. And the unwritten rules there are that these fanfictions, you could publicize them, but as long as you're not commercializing them, as long as you're not selling them an author's not going to come after you. And some consider it actual fair use, some say that it's so non-commercial that it's not really going to be pursued. But either way it's effectively fair use. 


Now, what happens if one of those works becomes really popular and you decide you want to commercialize it? So, the most famous recent case was Fifty Shades of Grey, which started out as fanfiction of Twilight. And when it became popular, the author, E. L. James, wanted to publish it. And in doing so, she basically stripped names, scenes that were taken from the original book and rewrote it. So, that's referred to as filing off the serial numbers. So, you basically take this work and you turn it into something that you can commercialize.


So, there's arguably a link, and I've seen people make the argument that there's a link from the original work to the fanfiction, which would be a derivative work if it's infringing, and then there's a link from the fanfiction to the final work, which would make that another derivative work and maybe would give some rights to the original author. But as far as I know, that hasn't been directly litigated. And generally it's considered acceptable, even if it may not be if it came to the courts. And then what we see is that we end up with this output from ChatGPT that may be the equivalent of fanfiction. 


32:00

And in that case maybe it's ultimately okay so long as you're not commercializing it, because once you commercialize it then you have to look a lot more closely at the other fair use factors in terms of how transformative it really is. 


Mark Gross

Okay, so let me see. Let me unpack what you just said and see if I'm understanding that. So, using a large language model, and we're using ChatGPT as our code name for that, I guess, it's got a collection of information that is copyrighted, it's pulling it together, it's creating something that is a combination of things, and now it prints it for your own use. So, it's for your own use, but does that become an infringing product, I mean, if you're not selling it? Or if ChatGPT is charging you for it, has that become a commercial use that ChatGPT is not allowed to produce?


Daniel Gross

So, ChatGPT's use is commercial, there's no question about that. ChatGPT's internal use is commercial, their building of this model. The question is what about the output? So, I guess that comes back to the question of what the potential infringing use is. The potentially infringing use, or sorry, the potentially infringing text, the thing we're talking about. And there I think we're talking about the output of the model, and the question is does the output of the model infringe copyright? Because we separately talked about the training data and the intake. If we believe that the intake itself isn't infringing, then we go back onto the output. So, there are scenarios in which the output itself may be an infringing use, and then you'd look at those same factors. Does it displace a commercial use?


So for example, if I tell it to give me a made-up work, like I said before, even if it's based on these existing works, that may not be a commercial use. But if I tell it give a summary of some classic so that I don't have to buy a CliffsNotes or something like that, give me the CliffsNotes of something, then maybe that would be a commercial use that people sometimes otherwise pay for. And I think we end up with a lot of gray area here that will be hashed out as we move forward. So, the reason we started having this conversation, as I recall, is there were a number of lawsuits. The first one may have been the Sarah Silverman lawsuit against OpenAI and a few others, where she separately asserted that the intake was illegal inherently, that the documents that were used for intake were acquired illegally, which is a separate question we haven't really discussed here. And she separately asserted that the output was infringing, was illegal.


So, then the question is does the output actually displace her work? And one thing I thought was interesting about that output, which was included in the lawsuit and is included in the documentation that's being distributed, is that at the end of it, basically the question the lawyer asked in that case was give me a summary of Sarah Silverman's book, in that case The Bedwetter, and it basically cataloged these are the events that happened. And then the last paragraph is that the entirety of this is –


35:58

...let me see if I have the text in my notes. Let's see. So, the final sentence of this summary, so I don't have it written down exactly, but it was, and the entirety of the work in the book is told in the characteristic style and wit of Sarah Silverman, that it's basically saying explicitly, and this is not what you'd get if you read the book, buy the book if you want that type of content. So, it seems to almost be making an argument for itself as not a substitute for the original work. And I thought that was interesting because that was included, and that would seem to push in the opposite direction. Now, if it actually output the exact same stories in the same exact style, but maybe changed some words because it was effectively recreated, that might be a stronger argument for infringement as a derivative work, in my view. 


Mark Gross

So, the ingestion, from what we've said so far, was probably legal for her to do, because you've said – 


Daniel Gross

So, she separately asserted that the books were illegally acquired, and that raises a separate issue that we haven't really touched on. 


Mark Gross

Yeah, I'm not sure what that meant. I don't know if everybody knows the case. I don't know all the pieces of the case, but why would it have been illegally acquired? I mean, they didn't buy them?

 

Daniel Gross

There were a few fairly well-known book corpuses floating around the internet. Some of them were acquired legally or generated legally, were scanned from libraries or something like that. Others were pirated, and she's asserting that they were acquired through one of those pirated ones. And this actually is an important difference because there's an Authors Guild letter that was published, that was signed onto by a very large number, tens of thousands of authors, that asserts that we need to stop this. And it asserts that it's clearly illegal if the books were illegally acquired, and if the books are legally acquired, then my read of it is that it appeals to fairness in the sense that it's wrong to not give authors the benefit of this much larger use, even if the books were originally legally acquired.


And that was what came up in the Google Books case, where Google was scanning books out of libraries but using them much more broadly than the original library use. So for example, I can't legally acquire a book and then give the world access to those books. And Google doesn't do that now, they originally were doing something like that, but that's why it's limited excerpts now, because presenting those legally acquired books as a whole would clearly be illegal. But if you're breaking them down this way, and this is what we were talking about at the intake of these models, doing the intake, then an internal use of a legally acquired work is most likely not a copyright infringement. But once we illegally acquire it, things change.


Mark Gross

So, to that extent, then, is the protection, of limiting, if the model is taught not to use too large a segment in terms of what it's building, is that a protection at that point? Is that going to be a –


Daniel Gross

So yeah, and I see we're actually running low on time.


40:01

Mark Gross

We have a little time.


Daniel Gross

Yeah, we have a little time left. Okay. So, if we take the position that the intake is generally okay, so long as works are legally acquired, separate issue, and the model itself isn't necessarily doing something illegal but the output may ultimately be a problem if commercialized, then the question is can we somehow handicap that output? Can we set rules? Can we tell it don't output more than two sentences at a time? Can we tell it to attribute things properly? Well Google Books, part of their settlement was we'll attribute it and we'll also point you to a place where you could buy the book. So, that type of thing can be negotiated in settlement, but you could certainly enhance the case for fair use if you limit usage, if you tell it not to give certain types of answers, and if you maybe limit the types of queries that you are willing to accept. 
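
[Illustration added to the transcript, not part of the spoken remarks.] The kind of output rule described above, capping an excerpt at two sentences and attaching attribution, might be sketched as follows. This is a minimal sketch under stated assumptions: the function name `limit_excerpt`, the regex-based sentence splitting, and the bracketed attribution format are all hypothetical choices for illustration, not how any particular model or the Google Books settlement actually works.

```python
import re


def limit_excerpt(text: str, max_sentences: int = 2, source: str = "") -> str:
    """Return at most `max_sentences` sentences of `text`, with optional attribution.

    Sentence splitting here is a naive assumption: a sentence ends at
    '.', '!' or '?' followed by whitespace. Real text needs a better splitter.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    excerpt = " ".join(sentences[:max_sentences])
    if source:
        # Hypothetical attribution format, chosen only for this sketch.
        excerpt = f"{excerpt} [Source: {source}]"
    return excerpt


if __name__ == "__main__":
    passage = "First sentence. Second sentence! Third sentence? Fourth sentence."
    print(limit_excerpt(passage, max_sentences=2, source="Example Book"))
```

A cap like this addresses only the length of what is output. Whether a given excerpt qualifies as fair use remains the legal question discussed in the conversation.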


Mark Gross

Okay. So, that's the kind of protections that could be put in there. Let's just shift gears a little bit because I think some of our audience is interested in really, because the whole ChatGPT and that whole thing is scanning the whole internet, but that includes a lot of information that's not relevant, not useful in particular areas. A lot of the publishers that are on here have very large collections of their own, and those would allow for specialized large language models. Well, if they owned a copyright on everything that they're putting on, there wouldn't be any issues in any of these? Or are there issues there also?

 

Daniel Gross

This certainly would not be an issue. There's still a question of if somebody uses that model to generate output and then puts it on sale, are they infringing your copyright, you as the model owner? This private corpus of information that's being used for training has other advantages as well. Because you said you've already done webinars on the hallucinations and the data issues involved. So, if you can control the corpus of information, and there are already companies building sets of data, datasets, huge datasets that are being marketed for training, then if you can control that, first of all, you can control licensing and you can control what people can do with the output. But also you can probably charge more for it if it's vetted and if it's valuable. So, well-structured content and well vetted data has inherent value. 


Mark Gross

Right. And while structuring of course is what DCL does, and to the extent that information can be structured, it certainly provides better answers. I guess the question, even if you own the data, if it's just the quality of data putting together will – but that's not the legal question. The legal question is how good is it going to be, and do you really get rid of the hallucinations? It's also something that's come up, and we don't know where hallucinations come from, right? I mean, is it because there's bad data out there? Or it's because the machine has decided to do something on its own? 


Daniel Gross

Yeah. One of the joys of large language models is that the people who program them don't actually know how they work once they're trained. So, you don't necessarily know what's causing this model to connect pieces of data together, and therefore you can't really, I mean, you could tune it but you can't directly prevent those hallucinations. But I think that's a topic for a different day. 


Mark Gross

I think that's a whole separate topic for some later time.


44:01

Yeah, we are getting close to the end of the time we have allocated. Any specific things you'd like that we left out? There's a lot of stuff we left out, I'm sure, but anything you want to – 


Daniel Gross

Maybe we can – there's a lot more to say about the legal aspects of this, the question is where are we going? And I think one of the places we're going is commercial models that can be marketed for accuracy, for rights and things like that. And if you're the source of the data, you can control any licenses associated with the output of that model. So, if I'm designing a model for you to use for marketing purposes, I can give you rights in that copy. I can make it your copy. Separately you can make less powerful models. Like I said, there are ways to hamstring the models such that the output is more likely to be safe, and that may be a less commercial version of this model.


Meanwhile, one thing we didn't talk about is the Napster and Grokster cases where basically similar things happened, well not similar things necessarily, but music was being served to users over the internet. And there were lawsuits based on the provision of that underlying music to the users. And the end result there was new sales models that allowed these rights to be licensed. So, you end up with modern platforms like Napster – not like Napster, like Pandora or other advertising or subscription based services that can legalize this by basically funneling those fees back to the original content owners as licensing fees. So, there is space for market innovations that will ultimately lead to legal funneling of copyright protected works to end users, or ways to use things more freely. But I think that's different and that's something that can be controlled ultimately by contract and by licenses. But that's different than what we're talking about in the context of traditional copyright law and whether you have an infringement there. 


Mark Gross

Right. Okay, Marianne's on, do you have some questions for us? 


Marianne Calilhanna

We do have some questions. So, is it your understanding that any content that is created by a large language model-based tool is not copyrightable? Is there an amount of editing that can enable the output from an LLM to be copyrightable? 


Daniel Gross

So, I understand it's actually a major problem in the use of models for special effects in movies, for AI models, because the Copyright Office has kind of moved the standard a bit recently in terms of how much human input you have to show, the author input, authorship you need to show in order to register a copyright. Whether you can copyright your own work will likely depend on how much input you have. So if you tell it "Give me a story in the style of Stephen King," like I was saying before, then most likely you don't have authorship. But if you put together a sequence of 200 prompts that you use to kind of sculpt a story that ultimately is output, then you're more likely to have some authorship.


47:59

Similarly, I think there was a recent case related to a comic book that was put together by a model, that was output by a model, and there was authorship in the speech bubbles only. The images for the comic book were output and then speech was added by the author. And the copyright that was granted recognized limited authorship, specifically in that speech that's written. Now, there are ways to manipulate things so that you can get some copyright protection. And the classic problem for protecting something that's not copyrightable is cookbooks. So, a recipe is not copyrightable because it's fact, and even if you've created it, you're telling somebody this is the way to make this food product. So, cookbooks tend to have a lot of structure. They try to protect the formatting, but also they have a lot of images, a lot of photos of food, and that's to make sure that you have some sort of protection in that content, even if it gets ingested by some model and output as a straight recipe independent of that. Because a recipe, a phone book, something like that, the phone numbers in a phone book are not protectable, they're facts. 


Marianne Calilhanna

Okay. Another question, if someone writes a musical based on a script, is that transformative? 


Daniel Gross

Yeah, so that would typically be a standard derivative work. The same way a movie version of something would be a derivative work, a musical either based on that movie or based on the original work would still be derivative. 


Marianne Calilhanna

All right, thank you. What about commercializing gated website copy? For example, a user would need login credentials to access that content. Can ChatGPT harvest and reuse that data without permission from the creator? 


Daniel Gross

So, one of the end results of a few of the Google lawsuits with crawling, and I think actually I mentioned the Perfect 10 lawsuit in the notes as well, was to provide some sort of opt-out. So, there can be additional rules that apply to this. And there's currently a request for comment put out by the Copyright Office itself asking, first of all, what the rules currently are, what the rules should be, and what changes should be made to the rules to get them where you want. So, we may end up in a regime in which an opt-out is necessary. So, there's code you could put in your website to prevent Google from crawling it for search purposes, for example. In the same way, some of the models are already creating ways for authors, for image providers to say "Don't look at my website."


Google Images does not look at websites if you tell them not to. So, if you password-protect your content and you make access available only by signing some license agreement, some EULA, an end-user license agreement, a clickthrough thing, then whether the usage of that, to break it down and output it in a different form, is copyright infringement is one question, and that's what we've talked about here. But you also have a contract violation of that licensing agreement that allowed you access to it in the first place. So, if you're protecting your content in one of those other ways, that gives you another recourse. 
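
[Illustration added to the transcript, not part of the spoken remarks.] The "code you could put in your website" mentioned above is commonly a robots.txt file. The sketch below shows how a well-behaved crawler might check such a file before fetching a page, using Python's standard `urllib.robotparser`. The sample rules, the `example.com` URLs, the `/members/` path, and the helper name `is_allowed` are assumptions made up for this example; `GPTBot` and `Googlebot` are used only as sample user-agent names. Note that robots.txt is a voluntary convention that crawlers choose to honor, not a technical barrier.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: one named crawler is opted out of the
# whole site, and every other crawler is kept out of a gated area.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /members/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Googlebot"):
        for path in ("/articles/1", "/members/report"):
            url = f"https://example.com{path}"
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

Password protection and a clickthrough license, as described above, operate separately from this file and add the contract-based recourse.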


Marianne Calilhanna

Thank you. Is crawling a website a legal means of acquiring copyrighted content? 


52:00

Daniel Gross

So, Google's been doing that for years and it was addressed tangentially in the Perfect 10 case where they're crawling all this content, and that's that intermediate copying we talked about before, that you're reviewing it and intaking but not actually using it as is. You're breaking it down into a model. And that has generally not been considered to be a copyright infringement with the caveat that settlements that followed those cases resulted in the opt-out regime that we just talked about. 


Marianne Calilhanna

Okay. If the owner of a set of IP uses AI to create derivative works of their own IP, do they still own the rights to the new content even though it was created by AI and not a human? 


Daniel Gross

So, that's actually a very interesting question, and I think there are two different approaches you can take. One is, is it a derivative work? In which case there may be some rights involved in it that stem from the original copyrighted work. And the second question is how much input have you put into it? So, if it's not a derivative work, if it is something fundamentally new that you've used the model to create, did you personally modify the model to get you the result you want? And did you personally tune your inputs to enhance that output? And if so, maybe you have some new copyright in that. Although, there's a separate question of how limited that is.


So, it might be limited to the things that you have tuned. And this is the opposite question of what we've been dealing with, which is, is there infringement? Now, if there's infringement then there is some sort of use of the underlying work. So, if it's infringement because it's a derivative work, then the original author, if they own the rights, may have some rights in that derivative work inherently by virtue of that original ownership. But if it's fair use because it's not a derivative work, then it's a separate question of whether you've actually acquired any rights in it. And that has to do with that same question that's been raised a couple of times now of whether you have any authorship. 


Marianne Calilhanna

Back to the website crawling: can crawled content be used to train a model? Training does not generate output.


Daniel Gross

So, that's where we – 


Marianne Calilhanna

Or does it? 


Daniel Gross

Well, it depends to an extent on the model, but it won't directly. 


Marianne Calilhanna

Yeah. 


Daniel Gross

So, if you train a model and it's never tokenized internally, and you could tell it, "Output the entirety of War and Peace" – well, War and Peace is not copyrighted, but give me the entirety of some work and it outputs that, that would most likely be, well, that would certainly be an infringement of a copyrighted work. But if you've broken it down and you're not actually using it to output an infringing work, then maybe not. And that was the first half of this discussion, that intermediate copying is most likely not an infringement. Crawling a website is most likely not inherently an infringement. So, if that's what you're doing, then so long as you're not later outputting something that is an infringement, then most likely you're not infringing it.

 

Marianne Calilhanna

Okay. We're going to end on one last question. We still have more.


55:58

So for those questions that we have not been able to answer, we will do our best after this to follow up. And Daniel, I'm also going to ask you to share the CLE code. 


Daniel Gross

I was going to say before we –


Marianne Calilhanna

Why don't you go ahead and share that right now? 


Daniel Gross

share the CLE code, which is 2023 – 2023IIPS1031. So 2023IIPS1031. So if you want CLE for that*, plug that into the document that was distributed earlier, send it back to the email address that's in the chat, that's Harris Wolin at Myers Wolin dot com, and get the CLE credit.


*[Please note this referred to attending the live webinar.]


Marianne Calilhanna

Okay. Let's see if we can answer this last question in the next two minutes. Many students are required to use turnitin.com to check for plagiarism in their papers. Has Turnitin been challenged legally, or has Turnitin obtained licenses from every publisher or author? 


Daniel Gross

So, I'm not familiar directly with turnitin.com. I do know that there are a few websites out there that are now designed to check whether something is AI generated, and for images, for example, they tend to give inconsistent results. So, it's not clear how accurate it is. I do know that when I was in college already there were websites and there were engines that were sold to professors or to schools that you could run essays through and see if it plagiarized some significant portion. That was typically looking for a plagiarized string of words over a certain length. So for example, the Data Conversion Laboratory description that ChatGPT gave me probably would not trigger those types of checks. Although, if you have access you could certainly run it through and see. 


Mark Gross

Right. Yeah, Turnitin does actually go against a large library of them, but I really don't know what the arrangements of it there are. That's a good question. I'll try to find out a little more about that. 


Daniel Gross

Well, typically, going back, I guess 20 years, those systems tended to look for strings of words over a certain length because that would be something the professor could easily show as plagiarism. The interest there wasn't necessarily copyright infringement, though; it was plagiarism and finding the original underlying work. 
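
[Illustration added to the transcript, not part of the spoken remarks.] The older approach described above, flagging a run of identical words over a certain length, might be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the six-word threshold, and the simple lowercase word tokenization are hypothetical choices for illustration, and nothing here reflects how Turnitin or any other product actually works.

```python
import re
from difflib import SequenceMatcher


def words(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def longest_shared_run(essay: str, source: str) -> list[str]:
    """Return the longest run of consecutive words found in both texts."""
    a, b = words(essay), words(source)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def looks_copied(essay: str, source: str, min_words: int = 6) -> bool:
    """Flag the essay if it shares a word-for-word run of `min_words` or more."""
    return len(longest_shared_run(essay, source)) >= min_words


if __name__ == "__main__":
    source = "The quick brown fox jumps over the lazy dog near the river bank."
    copied = "My essay says the quick brown fox jumps over the lazy dog today."
    reworded = "A fast brown fox leaps over a sleepy dog."
    print(looks_copied(copied, source), longest_shared_run(copied, source))
    print(looks_copied(reworded, source), longest_shared_run(reworded, source))
```

A check like this catches only verbatim strings, which is consistent with the point above that reworded or model-generated text probably would not trigger it.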


Mark Gross

Right. Yeah, I don't know what's got the information from the – oh, yeah, that's a whole other –


Marianne Calilhanna

That's another webinar. 


Mark Gross

Another webinar. 


Marianne Calilhanna

Well, we are just about at the top of the hour. Daniel, thank you so much. It's been lovely getting to know you as we've been planning this event. Thank you, everyone, for taking a little bit of time out of your day to attend this webinar. The DCL Learning Series comprises webinars such as this, a monthly newsletter, and of course our blog. You can access many other webinars related to content structure, AI, XML standards, and more from the on-demand webinar section of our website at dataconversionlaboratory.com. We do hope to see you at future webinars and have a great rest of your day. This concludes today's broadcast. 


Mark Gross

Okay, bye.


