
DCL Learning Series
Fair Use and Generative AI: Navigating Legal Frontiers
Marianne Calilhanna
Hello and welcome to the DCL Learning Series. Today's webinar is titled "Fair Use and Generative AI: Navigating Legal Frontiers." My name is Marianne Calilhanna, and I'm the Vice President of Marketing here at Data Conversion Laboratory. I'm really happy you're here, and I'm very excited for this conversation. But before we begin, let me just go over a few housekeeping items. This webinar is being recorded and it will be available in the on-demand section of our website at dataconversionlaboratory.com. My colleague Leigh Anne is working behind the scenes, and if you have any issue with this GoToWebinar platform, just send a chat, and she can help you out. We have good news for the attorneys in the house today: attending this webinar qualifies you for one CLE credit.*
*[Please note this referred to attending the live webinar.]
At the end of the conversation, we will distribute documents that need to be completed and returned to Harris Wolin with the CLE code on them. We'll provide the instructions and direct you to get those documents at the end. Finally, we have built-in time to answer any questions you have. You can feel free to submit those questions via the question dialogue box. Submit them as they come to mind. We will reserve time at the end to answer them. Now, I am really happy to introduce today's speakers: Mark Gross, President of Data Conversion Laboratory, or DCL, as we are also known, and Daniel Gross, Partner at Myers Wolin. Welcome, gentlemen. I'm going to turn it over to you.
Mark Gross
Okay. Thank you, Marianne. First, thank you to Daniel for joining this webinar. This is an area of tremendous discussion and commercial interest throughout the publishing industry, the book distribution industry, and libraries everywhere. And it's really good to have an expert on who can speak to it, because there's a lot of misunderstanding in the industry. Also, welcome to all the lawyers who are on; this qualifies for CLE credit, which is really incredible. It's interesting that copyright was not something on the Wall Street Journal front page three years ago. This has gotten mainstreamed now with LLMs and all this, and become completely different. Even in today's Wall Street Journal there's a front-page article about an OpenAI video tool testing copyright guardrails.
So this is moving very fast, and there's a lot going on here. This is, in a way, a redux of a webinar we did two years ago, the last time Daniel was on. We went through a lot of similar topics then, but to my question of "So what does it mean?", the answer a lot of the time was, "We don't know yet. There's no case law." Well, in the last two years there's been a lot of case law, and new case law is developing every day. So I think it's going to be very interesting to review where we are now and where things are going, because the impact is huge.
4:00
So, Daniel, thank you very much. Before we start: our audience consists of people in a wide variety of industries, including publishing, libraries, and people who handle information in many different ways. They're all interested in what they can do, where they can get into trouble, and all those kinds of things. A lot of this is based on basic copyright law, which has been around since, I think, 1790. And the major concept all of this depends on is something called fair use, with its four factors. So I think if you can first give a quick background on the four factors and what they mean in the context of fair use, that would be a great way to start.
Daniel Gross
Sure. And yeah, thank you for having me on. There's not just a lot of interest in this in the publishing world; there's also a lot in the legal world, which was not in your list of groups that care about this. There's a lot of movement right now, and we're all very curious to see how this shakes out. Thank you to Marianne and Mark for having me on, and thank you to Harris for turning this into an IIPS event so that we could all get CLE credit. In terms of fair use, I want to start by giving some context. The models we're talking about, mainstream AI implementations, depend on huge amounts of training data. They ingest on the order of terabytes, 45 terabytes for the GPT-4 model, for example. So we're talking about a huge amount of data that's acquired by buying books and scanning them, by scraping the internet, or by acquiring pirated data sets, corpuses of books.
For example, a lot of you are familiar with the Google Books cases from a decade or almost two decades ago. In those cases, Google was getting books legally from libraries and making them available to the general public in various ways. All of this depends on fair use. So once a potential infringement or an alleged infringement is identified, one way out is to determine that the particular use you're talking about is fair. In order to evaluate whether it is an acceptable fair use, we look at the four factors that are on screen right now. First, the purpose and character of the use: whether a particular use is commercial or non-commercial, whether it's educational, for example, or whether it's transformative, meaning you're using the work in a fundamentally different way than it was originally intended for. That's the first factor.
The second factor is the nature of the copyrighted work. The works we're talking about are generally going to be written text, images, or video, and those are all traditional copyrighted works. That's true for both input and output. The third factor is the amount and substantiality of the portion used in relation to the copyrighted work as a whole. Are you using a sentence? A paragraph? The entire work? And also, how important is that particular portion to the work as a whole? If you have a large book, but there are three pages or one chapter of it that are really fundamentally new, and that's the reason somebody is going to buy that book, because that is the concept they need, then that portion may have more substantiality even if the amount of the work used is not huge.
Then the fourth factor is the effect of the use on the potential market for, or value of, that particular copyrighted work. We will talk about a few different things that can mean: whether we're talking about the market for the specific work or the market for that category of work, and whether it has to be a specific, known market or one that doesn't yet exist.
8:00
Are we talking about the market for selling a book? Are we talking about the market for model training materials? Existing and developing markets can both be considered for this purpose. Now, the most recent Supreme Court decision on this was a few years ago. We talked about it at length in our last discussion because it had just come down. That was Andy Warhol versus Goldsmith. Lynn Goldsmith took pictures of a young Prince a long time ago and sold them to Vanity Fair. Vanity Fair licensed them to Andy Warhol for one use. He took that picture and modified it, as Andy Warhol tended to do, and it was used that one time in Vanity Fair magazine as an image to accompany an article on Prince.
Now, Andy Warhol held onto that image and turned it into his classic Prince series, which was then displayed as art for a long time. And we'll see that this can end up being used in different ways. So the Supreme Court looked at what happened when Prince died in 2016: Condé Nast wanted a picture of Prince to put alongside a tribute to him. Instead of licensing the photo from Goldsmith, they licensed the Warhol work from the Andy Warhol Foundation. Goldsmith sued, saying that this is still her original image. And the Supreme Court ultimately held, in a 7-to-2 decision, with Sonia Sotomayor writing for the majority, that the works shared a commercial purpose.
So this transformation by Andy Warhol is fundamentally the same as the original image. That's looking at the first factor we talked about, the purpose and character of the underlying use, because it's the same image in some fundamental way. She also talked about the fourth factor, that this implicated the marketplace, because the Goldsmith photo was used by other magazines in their tribute issues while the Warhol image was used in this particular article. So she framed it as parallel markets. Now, we could look at it a different way. In dissent, Elena Kagan focused on the value of transformation in art. She said that this is fundamentally different, and that she wouldn't ignore that.
This is not as simple as, say, applying a generic filter to an image and using the modification. It's artwork, and it has to be considered as such. So she would focus instead almost entirely on the first factor and say that this is a fundamentally different use. We see this tension throughout. Most cases will look at the first and fourth factors, and they may reach different decisions as to which is the most important. But the most important is typically either the first one, purpose and character of the use, whether it's in some fundamental way used for the same reason, or the fourth, the effect of the use on the marketplace. Those are typically considered the two most important factors, and that's what we'll be talking about here.
Mark Gross
One thing that comes out of the way the factors are handled is that there's a lot of subjectivity in how they can be dealt with. Whether something is a work of art or not becomes a subjective decision. I can see where case law becomes very, very important as this moves along.
12:03
Daniel Gross
Well, that's a question the Supreme Court has been struggling with for a long time as to whether something is art, whether it's a fundamental change. And there have been a lot of cases in the last few decades and longer discussing that.
Mark Gross
So moving on to technology. You wouldn't think we'd have to burrow deeply into technology for this topic, but we do, because a lot depends on exactly how these AI models are using this data and what happens as it goes along. I think the questions are: what does infringement really depend on? Is it the scraping? Is it the model structure? Where in the process is the infringement? Maybe it's all of the above. This could be a three-hour talk, but can you get it down to 10 minutes or so on how this comes together?
Daniel Gross
Sure. And for context here, the copyright code does make allowances for things that are changed in significant ways and yet still constitute copyright infringement. We have a concept of derivative works: if something is based on or derived from copyrighted material, then in a lot of cases the original party retains rights in it. A translation of a work, for example, is a derivative work, and the original copyright holder still has rights in it. For a sequel to a movie, whoever created the original movie would still have some rights there as well. And that determination is often based on whether the new work that is allegedly derivative includes characters, settings, or other copyrightable aspects of the original work. So to an extent, we're going to be looking at how significant the changes that models make to the underlying work actually are, and what that output ends up being.
So to take a step back, how does model training work? I know the audience has varying levels of technical understanding of this. In general, there are a lot of different model designs, and they work in different ways, but all of them rely on initially ingesting some huge data set, ideally as large as possible: the entire open internet, book corpuses of thousands or hundreds of thousands of books, whether legally acquired or pirated, purchased books, borrowed books, the Google Books scans, where they were famously taking books out of libraries and scanning them.
So, initially, however you acquire that material, it's copied, usually multiple times, as part of the ingestion process. You have some interim library that's then used for training. Once you have these works cleaned up, maybe lemmatized, deduplicated, et cetera, you'll break them down, or at least the modern LLM designs will break them down, into small components, tokens; that's tokenizing, part of the training process. We can generally assume that those individual tokens are small enough so as to not implicate copyright issues. Then some training process is applied. Different models will guide that training in different ways; it basically creates relationships between those underlying tokens. Once a model is partially trained, it can then be further tuned using traditional AI reward functions, where you have the model run some process, tell it whether it did a good job, and use that feedback to modify the underlying model.
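The ingestion steps just described, copy, clean, deduplicate, tokenize, can be sketched in miniature. This is purely an illustrative toy, not any real training pipeline: production systems use learned subword tokenizers (such as byte-pair encoding) and far more elaborate cleanup, and every function name and document here is invented for the example.

```python
# Toy sketch of the ingest -> clean -> deduplicate -> tokenize stages.
# All names and data are illustrative; real pipelines are far more complex.

def clean(text: str) -> str:
    """Stand-in for cleanup: collapse whitespace and lowercase."""
    return " ".join(text.split()).lower()

def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates, keeping first occurrences."""
    seen, unique = set(), []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique

def tokenize(text: str) -> list[str]:
    """Whitespace splitting stands in for subword tokenization."""
    return text.split()

# A tiny "corpus"; note the second document duplicates the first
# once extra whitespace is cleaned up.
corpus = ["The quick brown fox.", "The  quick brown fox.", "A second, different book."]
cleaned = deduplicate([clean(d) for d in corpus])
tokens = [tokenize(d) for d in cleaned]

print(len(cleaned))  # 2 documents survive deduplication
print(tokens[0])     # ['the', 'quick', 'brown', 'fox.']
```

The point of the sketch is the one Daniel makes next: by the time training operates on the data, the work has been copied several times along the way and reduced to small token units.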
16:05
So this will get refined over a lot of time, and then you can use that model to generate output by running data through it in various ways. So what is the model, then? The model is some complex set of internal instructions, a map of relationships among the tokens. Those instructions will then guide output by way of those relationships between tokens. Okay, so tokenization. Once the original training materials are tokenized, is that inherently transformative? Is it something fundamentally different? Is it used for something fundamentally different? That's, I think, the leftmost bullet point on the slide. What if those tokens could somehow be reassembled into their original form? If it was previously transformative, is it no longer transformative if the output looks too much like the input did?
So those are things we'll see in the cases. Next, interim copying and legal precedents. Interim copying was a big thing in the '80s and '90s; there were a few court cases on it then. A lot of those court cases got preempted by what's called the DMCA, the Digital Millennium Copyright Act. But in the '80s and '90s, this was an issue for video games. Sega, for example, had their Sega Genesis, and they wanted proprietary rights to be the only ones to make games for it. So they put in some software that would stop third-party games from playing there. Now, Accolade bought some Sega Genesis cartridges, pulled the code out of them, and tested different parts of that code to see which of it would let them get past Sega's control. So they bypassed access controls, used that to figure out which snippets of code were required to allow their third-party video games to play, and they succeeded.
So Sega sued, because in developing that, Accolade had to copy the underlying code over and over again and modify it in various ways to test it out. And the courts found that this didn't implicate a legitimate copyright interest: the output was not the copyrighted materials, so there was no real infringement there. That interim copying was really for a technical purpose, not a copyright-relevant purpose, and therefore there was no infringement. There are other similar cases in the video game world from that time. Then in the '90s the DMCA was passed, which basically created a separate cause of action for bypassing access controls, so this became moot. But we have all that case law saying that interim copying wasn't really an issue back then. So to what extent does the interim copying we have here matter? We ingest a lot of material, then we do various things with it, but while doing those various things during model training, we have to copy it over and over again in various ways. That includes the preliminary cleanup and all that.
So does that initial copying implicate copyright issues? And derivative works: we just talked about derivative works, so how does that relate to output similarity? Separate from whether the model itself can infringe, and separate from whether the training materials can infringe, there's a question of whether the output of the model can infringe. That can be by recreating the initial materials that were used for training, by stringing these tokens back together the way they were originally broken down.
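To make that concrete, one simple way to test whether output strings tokens back together verbatim is to look for long shared word sequences (n-grams) between a training text and a candidate output. This is a hedged illustration only: the method, the five-word threshold, and all the names are invented for this example and are not drawn from any actual model's output filter or from the litigation.

```python
# Toy verbatim-overlap check: does the output share any long n-gram
# with the training text? Threshold and approach are illustrative only.

def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-word sequences in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(training_text: str, output_text: str, n: int = 5) -> bool:
    """True if the output shares at least one n-word run with the training text."""
    return bool(ngrams(training_text.split(), n) & ngrams(output_text.split(), n))

source = "it was the best of times it was the worst of times"
paraphrase = "those years were simultaneously wonderful and terrible"
copy = "she wrote that it was the best of times indeed"

print(verbatim_overlap(source, paraphrase))  # False: no shared 5-word run
print(verbatim_overlap(source, copy))        # True: shares "it was the best of"
```

The legal question Daniel raises maps onto exactly this kind of comparison: does infringement turn on whether the output reassembles the input closely enough to be detectable this way, or does it not matter at all?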
19:57
Or it can be by taking a bunch of that content, taking the settings and characters from those initial works, and putting these tokens together in ways that bring those copyrightable elements back to life as part of some derivative work. So does infringement then depend on how similar the output is to the underlying data? Or maybe it doesn't matter. Maybe the output has such a huge impact on the resulting marketplace for similar materials that whether the output is similar to the stuff you started with doesn't matter. Maybe it's a derivative work no matter what, and the output can cause problems because it destroys the marketplace for competing works. So between the ingestion and the output, maybe the marketplace analysis is really what matters. Finally, the importance of technical understanding. We're talking about two different things from that perspective. One is the importance for the courts to understand how these different models work, because the way these models work will lead to different results in the legal analysis.
And second, it's important for publishers to understand how these models work so they can identify and avoid vectors for potential liability, based on how the courts treat these models in their analysis. So what this will come down to, and we'll see this in the case law, I think we're between slides there, is: what is the training and output analogous to? Is it the same as me sitting down and reading hundreds of thousands of books and then writing something based on that? If I'm a human and I sit down and read five books and write something similar based on them, we don't necessarily call that a derivative work. Maybe it's similar to writing fan fiction.
And if I read hundreds or thousands of books, that's just what people do; anything you write is ultimately tied back to stuff you read in the past. So is that what models are doing? Some courts say yes, some courts say no. Or is this more similar to me taking a pre-existing book and applying a tool, like a filter? That was what the court thought in the Warhol decision: you're taking a filter and applying it to an image, or in this case a book, and the output is really just a modified version of the input. Then what matters is whether the output can work as a substitute for the original work, whether it's similar enough that it replaces it. Anyway, all of these things are discussed in different cases, which are all faced with different fact patterns.
Mark Gross
Right. What's remarkable to me, and you've touched on this a few times in the last few minutes, is that these are all analogies to existing law that's been around for a long time. This concept of interim works has been around for a long time, and you're now applying it to building these models, which has never been done before, using those same concepts. And I guess that's the way you have to explain it to the court, and the way these cases are going to be decided as we go along, because they do have analogies to what we think of as a book and how it works, and knowledge and how it works.
Daniel Gross
So a few months ago, the Copyright Office put out their primer on this, which was probably about a hundred pages, or maybe longer.
24:02
And the fundamental question that that primer keeps going back to is, are the existing laws sufficient to handle this, or are we dealing with something fundamentally new that needs to be directly addressed by the Copyright Office and by Congress? So we could either let the existing rules apply to this and let the outcome come down where it may, or we can lobby Congress and say that we want a specific outcome. We either want this to be allowed or not allowed, or we want there to be specific controls required.
And then you could have Congress create that law and implement it, which is what happened in the '90s: there were access controls being used to stop people from getting to stuff online, and there was no copyright problem with bypassing them; maybe it could be controlled by contract. But Congress stepped in and created the DMCA, which created rules around whether you can bypass access controls. So a lot of this can be controlled if there's a specific outcome you want; you still need to figure out what outcome that is. But really, where we are now is figuring out how the existing rules can and should be applied to this.
Mark Gross
Is that the position of the Copyright Office that existing rules will be able to apply, or do they feel that there's a need for new legislation?
Daniel Gross
I think it's ambiguous. There may have been more recent press releases since then. There, they were requesting comment, trying to get a feel for how attorneys feel about it. My read was that the Copyright Office wants clear guidelines, and as we'll see, the courts are reaching different positions, so it's tough to apply. But really, the Copyright Office isn't the one that has to make these decisions, because the Copyright Office doesn't police infringement. They police registration, and they have to worry about whether the output of a model is copyrightable. So they're concerned with that, but the Copyright Office isn't what determines whether a particular use generates liability.
Mark Gross
Okay. I'm sure there'll be a lot of activity there, but this brings us to case law: the recent case law that's out there, some major cases, some cases people don't talk about, and probably cases that came out since we put these slides together. Let's stick to the major ones. This is very exciting.
Daniel Gross
Yeah, so we're going to talk about two specific cases, because these are the ones where we got substantive, effectively final decisions, although either could be appealed, and likely will be where a settlement is not reached. There are dozens, I think over 50, different lawsuits now against the various AI companies, including some that were consolidated. And that's just in the US; a lot of these have parallels internationally as well, and it will be interesting to see how different jurisdictions handle the same issues when they're faced with them. Very few have led to substantive, or at least final, decisions. So that's why we're talking about these two.
I'm going to start with Bartz because we actually have some news on that from last week, where a final settlement was approved after being bounced back by the judge a few times. Both of these cases are in the Northern District of California, so it's interesting to see the different language between them. Okay, so Bartz versus Anthropic. Anthropic is the Claude model. Anthropic compiled a large library of books, some pirated, some purchased. They started by pirating everything, and there's some evidence that they knew that was a problem.
28:02
And that they were discussing it as a problem. Then they started buying books, ripping off the bindings, scanning them, and then processing them internally: deduplicating, lemmatizing, and then using that as part of their dataset. First they copied it all to a central library to create a training set. Then they cleaned it up, removing repeating text, headers, footers, page numbers, repeated books, et cetera. Then they tokenized it, which included stemming and lemmatizing individual words.
Then they compressed all that into a model. Once the model was formed, the library was maintained. They kept all the works they had either pirated or scanned in some readable text format. They decided they were going to keep those books forever, because they wanted to make a library of human knowledge, regardless of whether the books were going to be used in training. Often, those books could be recreated through the model, except that the model implementation prevented that from happening: they controlled the output so as to limit how much of the text used for training could come back out. So in June, we got an order in that case that started by saying that the use to train models was exceedingly transformative and was clearly a fair use. The judge explicitly analogized this to reading and writing.
So authors cannot rightly exclude anyone from using their works for training or learning as such; everyone reads texts, too, and then writes new texts. The judge thought this was directly analogous to you reading something and then writing based on it. Digitization of purchased books was fair use, because it replaced legally owned copies. Storage of pirated texts was not fair use. The result could have been different with respect to the models if the authors had alleged infringing output. They did not, and the judge called that out: he said explicitly that if some of the output could be shown to infringe the training materials, that would be a separate thing that could implicate copyright issues. There are some conflicting statements about the use of the pirated copies for model training, but the judge explicitly said that anything that was pirated and then used for library purposes was an infringement. He said it's hard to see how something initially pirated could be cured and thereby become usable in a model.
But he also seemed to say that use for training a model was so fundamentally transformative that it couldn't really create an infringement. He also basically said, "Thankfully, I don't have to decide this; I'm happy to just say that there's an infringement. If these pirated works were only used for model training, I would have a tough decision in front of me, but I don't have to decide that, because they were also used for storage, and that storage is clearly a copyright issue." So that was a decision which basically said that model training isn't an infringement, but that there is an infringement implicated anyway because of the storage of all these books. That led to a settlement agreement, first reported in late August, which the parties have been presenting to the judge. Even though the model training itself was not considered a copyright infringement, it led to the largest copyright settlement in history: one and a half billion dollars to cover infringement of about half a million works, 480-something thousand works. That works out to about $3,000 per work infringed.
31:58
Now, one of the reasons we have such a huge settlement here is statutory damages. The copyright statute basically says that you can either show real-world damages and collect on that basis, or you can say that because there's a copyright infringement, there is some automatic amount of damages, and collect per work. The copyright code explicitly says that this is per work, regardless of how many infringements there are, and that those statutory damages are between $750 and $30,000 per work. So this $3,000-per-work settlement recognizes that whether or not the model training implicated copyright issues ended up being irrelevant: as long as there's any infringement associated with each of those individual works, it's the same. Another 30 uses of those works would not necessarily enhance those damages further. So basically, Anthropic won on the questionable issue, but lost in the sense of paying effectively the same damages it otherwise would have.
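The back-of-the-envelope arithmetic behind those figures can be written out explicitly. The settlement and work-count numbers are the approximate ones from the talk, rounded for illustration, not the precise settlement terms.

```python
# Rough arithmetic for the per-work figure discussed above.
# Dollar amounts are the approximate figures from the talk.

settlement = 1_500_000_000   # reported settlement of about $1.5 billion
works = 500_000              # roughly half a million works covered

per_work = settlement / works
print(f"${per_work:,.0f} per work")  # $3,000 per work

# Statutory damages (17 U.S.C. section 504(c)) run per work, not per copy,
# so the per-work figure sits comfortably inside the statutory range:
statutory_min, statutory_max = 750, 30_000
print(statutory_min <= per_work <= statutory_max)  # True
```

This is why, as Daniel notes, additional uses of the same work would not necessarily raise the number: the statute keys damages to the work, not to the count of infringing copies.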
Mark Gross
Before you leave that: are we saying that if they had done everything they did, except not kept any of that material in the library after digitizing it, and instead thrown it out, they would not have been infringing?
Daniel Gross
Yeah, so the judge didn't want to say that with respect to the pirated works. In some places in the decision, he implies that if the pirated works aren't kept in a readable form somewhere, then there wouldn't really be any use that could implicate copyright issues. He also says that the market impact of the model output is not relevant unless the output itself infringes in some way. Even if it destroys the market for the underlying works, the model output would not be a problem, because the relevant copyrights are the ones at the input stage, not at the output stage. So going back to the Supreme Court decision we talked about, the focus here is clearly on that first factor. It basically says that this is so transformative that the impact on the market doesn't matter.
Now, the Kadrey decision, Kadrey versus Meta, technically reaches the same result but implies something entirely different about these models. The judge in that case, Judge Chhabria, instead basically says that there would be so much market harm here, but the plaintiffs never pleaded it. Stepping back: this is Meta, so this is the LLaMA model. The case raised questions similar to what we just talked about in Bartz related to model training, but it also included questions about whether the model itself could infringe, and it asserted that the output was infringing. In November of 2023, the judge dismissed most of the lawsuit, leaving only the allegations that the model training itself was infringing. So in the more recent order, we get the same question as in Bartz. He said that the assertion that the model itself would be an infringing derivative work is nonsensical, because there's no way to view the model itself as some recasting or adaptation of the underlying work. It doesn't have any of those materials inside of it.
35:57
He also said that asserting the output as infringing derivative works requires that you actually point to some infringing output, some output that incorporates in some form a portion of the underlying books. So that leaves us with the model training itself, the same question Bartz was dealing with. First, he agrees with the Bartz court that this is clearly transformative, but he doesn't think that's the end of the analysis. He applies a lot more weight to market harm. He disagrees with Judge Alsup from the Bartz case. He disagrees with the analogy to schoolchildren; he mentions it explicitly and says, "I don't agree with that." He says that this is more similar to a tool that can then be used by a third party.
So that sounds a lot like the filter analogy the Supreme Court was thinking along the lines of, that this is applying a filter to some underlying work. He ruled in favor of Meta, in favor of the model training, but he explicitly said that's only because the plaintiffs failed to show market harm. He implied that he thought they would have been able to if they had crafted their lawsuit a little more carefully in that direction. He indicates that the initial training would be infringing if market harm related to the output were shown. This appears to take a really broad view of market harm, because it would be harm to an abstract market, not the market for a specific book. It basically says that because this output could destroy a market, reducing demand for human-developed works by making that market more competitive or blocking it in some way, that could itself lead to infringement.
Which is actually very similar to one of the things we talked about in our discussion two years ago: a letter the Authors Guild put forward that basically said, regardless of the rest of the analysis, we really think the courts should consider the market harm that comes from this, where you're forcing authors to effectively supply the training materials to undermine their own professions. And this judge seems to endorse that view to an extent: because this output would destroy the market for the initial work, even if the output is not itself an infringing work, that could lead to infringement. So in short, the Bartz decision implies that AI training is inherently a fair use in view of how transformative it is. That's the first factor, and that's what Justice Kagan was saying in the Supreme Court. The Kadrey decision implies that AI training is not necessarily fair, and that it wouldn't be if market harm resulted from it. So he would say that the last fair use factor dominates, that this is a market harm question, and that mirrors Justice Sotomayor's opinion for the majority in the Supreme Court.
Mark Gross
This concept of market harm –
Daniel Gross
Sorry?
Mark Gross
This concept of market harm being such a determinant seems to me at odds with what you do in other areas of corporate endeavor. If you invent something that will destroy an industry, say digital photography, which displaced film, that's a market harm, but nobody would say that you're not allowed to present that as an invention and develop it.
Daniel Gross
Yeah. So I have a lot of trouble with this tension between market harm and market innovation.
39:58
If something is innovated out of existence, that's different from something being, let's say, cannibalized out of existence. So yeah, I agree that there's a lot of tension here between progress and market harm. Which –
Mark Gross
– transformative and market harm.
Daniel Gross
Yeah, right. Well, that's part of the fair use analysis. The reason that plays out differently is because, in the cases you're talking about, where some technology renders some earlier thing obsolete, the starting point isn't an infringement. The copyright code assumes that if you're talking about fair use, you're talking about there being an infringement somewhere. Fair use is an affirmative defense to that, and in order to say that that affirmative defense should be applied, you can use market harm as one of the factors in that analysis. But we're already talking about fairness, which is, like you said at the beginning, inherently subjective.
Mark Gross
Yeah. So I know there's a lot more case law that you've put together, and there's just so much here, so I think we need to move to the next step of going through the various questions again and summarizing where things stand, which I think is where this slide is going.
Daniel Gross
Yeah. So I want to mention some of the other cases. There's more about them in the materials we provided with this presentation, but the extra three cases I included there relate to how we evaluate the output if the output is really similar. And that becomes a major issue in the context of these image generators. We're talking mostly about text, but in the context of images, some of the models can be shown to – you can kind of force them to recreate the underlying images they were trained on. And that's partially because of the different mechanics of how those models are trained. So it'll be interesting to see how the courts handle those, whether they handle them from a similar perspective or as fundamentally different. Now, fair use. Is the use of data to train inherently fair use? That's a conflict between the two decisions we have so far. Although the Bartz court may think otherwise if the training data exists in some real way in the model. And that's the question with respect to the diffusion-based image models.
So that'll be interesting. Now, pirated works and liability. The Bartz court addressed this head-on and then kind of backed off a little. The use of pirated works is, according to that court, the only thing that can create liability in this context. But it's unclear whether the fact that the underlying training works were pirated is necessarily going to lead to an infringement, or whether the use for training a model is so transformative that it can avoid infringement anyway. What we do know is that if works are pirated and then maintained as a library, accessible as text, that will be a problem. If you're training a model, regardless of what you do, you do not want to be maintaining pirated works in any readable form. But really, the guidance from all of these cases is that you should be trying to acquire these books in some legal way, and that the downloading and storing of these corpuses of pirated works may be a problem.
44:00
Now, subjectivity of fair use. This is a real issue that we're seeing; the courts are making different decisions because of this subjectivity question. So how transformative is fundamentally transformative, and at what point does that overwhelm other factors? What is meant by market harm? Is market harm evaluated broadly or narrowly? Is it in terms of the specific book that we're alleging infringement for? Are we saying that if the output replaces some specific underlying book, that's market harm? Or are we saying that if the overall market for printed material is suppressed by the overabundance of available AI-generated text, that is itself market harm? Now, publisher concerns. The real question we all want answers to is: what can publishers do? And two years ago, we basically said we don't know, other than protecting your own works by making them not freely available, so that any unauthorized copies are pirated works to the extent that you can manage it, such that there is still some way to assert infringement. That could be by not making digital versions, or by keeping DMCA-compatible access controls that would require bypassing.
In the academic world, a lot of things can be behind paywalls. And to get past those paywalls, you need to contract and basically say, "I'm not going to use it this way." Now, if a company building an AI model bypasses some access control, logs in, downloads tens of thousands or hundreds of thousands of academic papers, and uses that to train a model, that may or may not end up being a copyright issue. But it would at least implicate contract issues, so there would be a cause of action there. You can create other causes of action in that sense. One of the cases we didn't talk about was the Getty case, where they showed that image outputs displayed their logo because so much of the underlying training material had the Getty logo on it. That implicated a trademark issue as well. So you can create other avenues for liability beyond the direct copyright issues. But there are other ways to protect works, and I know we talked about this last time. You can maintain things in siloed databases such that you can only get to them through a paywall.
For academic works, you could also contract directly with these companies developing AI models to license your works to them. We didn't really talk about the market for the underlying works from the perspective of licensing corpuses, but if you could show that the same corpus of work being pirated is also available for licensing for that same purpose, so if I'm selling my collection of copyrighted data as a training set, for example, then it would be much harder to show that taking that underlying printed material and using it for training does not implicate that particular market. It may not implicate the market for written works, but if there is an existing market for training materials, then you could implicate that market separately, if that makes sense.
So generally, a focus could be on managing your ability to contract for access to your materials. But it'll be very interesting to see what direction the remaining courts go, and whether we get some final answer on the pirated materials question and on the market harm question.
48:03
And we still have a large wave of cases from the New York courts that are going to come down. And if they end up taking a different approach than the California courts did, then it'll be interesting to see how the Supreme Court ultimately handles this or how the higher courts take it.
Mark Gross
So just on your last point, these are California cases, and there will be New York cases coming down. If they're in conflict, does that automatically become a Supreme Court case, or does somebody have to bring it to the Supreme Court? How does that –
Daniel Gross
Well, somebody would have to bring it; it would be a basis for appeal. But based on how much money is riding on these individual cases and how many third-party organizations are interested in the result, there will likely be some funding to take this to the Supreme Court, unless, as the Copyright Office may want, Congress acts to make it a little more clear what the rule here should be.
Mark Gross
So with the Anthropic case and a billion and a half dollars in fines, we're talking about real money pretty soon. And this is not the only case out there. So I can see those driving a need for clarification, even at the Supreme Court level.
Daniel Gross
Yeah, and that's really the first precedent we get as a basis for a settlement in these cases. So that will just more clearly show to all of the other parties involved in this how much money is involved, because each of these models is trained on hundreds of thousands of works, millions of works or more. So there will be the funding to appeal these cases if there's hope for appeal, if there's hope for a different decision. And that will certainly be the case if New York goes in a different direction than California did.
Mark Gross
But ultimately, in terms of what publishers can do: if you're publishing a book, you're putting it out in public, and for publishers who have databases behind paywalls, it's the same as it's always been. Protect your data and make it difficult to get to. And then if somebody's stealing it, they're very open to liability when they get caught.
Daniel Gross
Right. Well, the problem is, as a small publisher – so if you're talking about a half million books, then you're talking about real money, because in this case the settlement was $3,000 per work. But the maximum you can get in statutory damages is $75,000. So if you're only talking about a few works, if you're a small publisher, you may not have the ability or the will to pursue that in a meaningful way. Which is why a lot of these cases turned into class actions, or there were petitions to turn them into class actions. And that's why a lot of these cases also got combined, because of similar issues, and you want to consolidate.
But yeah, make your works more difficult to get to and keep an eye on – know if your works are actually included in these corpuses. It turns out, I didn't know this until a few years ago, but apparently, the contents and the existence of these corpuses of pirated books are pretty well known. The publishers were able to identify which corpuses were used in training the models. Even though OpenAI was very secretive about it, the publishers were able to figure out which of those corpuses were used during training by virtue of what the models knew about the underlying works. So even if the output isn't an infringing work, you may be able to use it to probe which works were used for training purposes.
52:00
Mark Gross
Right. Marianne, do we have questions that we want to put in? We've gone a little overtime here.
Marianne Calilhanna
Thank you for this conversation. This is great. I could spend another hour listening to some of these details, but we don't have that time. So, as Daniel said, the CLE code is there. If you look in the GoToWebinar interface, you should see a little icon in the upper right corner of the screen with a document. You can click on that and download the documents you'll be required to complete and put that code in to get your credits.* Leigh Anne is going to push instructions to the chat on how to complete the documentation and where it should go.
*[Please note this referred to attending the live webinar.]
If you do have any questions regarding any of this, you can see this info@dclab.com email address. Shoot us an email and we can help you out. We have some questions here. We're not going to get to all of them, so we'll follow up individually with folks who've submitted a question. But Daniel, one interesting question regarding the Bartz case: when you talked about deleting the data after training, did the courts require any model data derived from the infringing storage to be deleted?
Daniel Gross
No. So first of all, it's unclear what the outcome of the analysis would have been if the only use of the pirated works were for training the model. But the court said that the use for training purposes was so transformative that it did not implicate a copyright infringement at all. The underlying library itself was the issue. So once those works are tokenized, the implication of the case was that the tokenization transforms the works such that they are no longer the original works, and the use at that point will be fair.
Marianne Calilhanna
All right. I did love your comment, Daniel, that publishers should create other avenues of liability. I think that's very sage, non-legal advice. Okay, another question. Might there be a market for a service for publishers that monitors LLMs for infringement, or alternatively, that certifies LLMs as fair use conformant? Interesting.
Mark Gross
I love your question. Always a business opportunity.
Daniel Gross
Yeah, that's actually a great question and an interesting idea. But in terms of monitoring LLMs for infringement, even if you could identify that a work was ingested during the training process, would that be an infringement? Now, if you're talking about monitoring the output, a lot of these models now have controls on the output that prevent reproduction of the original works. But this is coming up in, I believe, the New York Times case; I included an image from the original complaint in the notes where, I think it was the New York Times, they showed that the materials mapped so closely that the output itself should be infringing.
In that case, OpenAI is asserting that the New York Times, whoever they hired, the engineers that they hired to put that together, were manipulating the prompts to force an infringement, such that they were modifying their prompts over and over again until the output got closer and closer to the original article.
56:03
So if you have a company like that that monitors LLMs for infringement, how do you distinguish between that and forcing infringement through prompt engineering or something like that? So it's an interesting idea, but the fact that you can force a model to generate some output like that is not necessarily the same as the model providing that output in the abstract, if that makes sense.
Marianne Calilhanna
Right. I'm thinking back to the old days when I started in publishing, in reference books and cartography. In encyclopedias, we would put in hidden words, and cartographers put hidden streets on maps.
Daniel Gross
So in the case we talked about during this presentation, the Sega Genesis case, when Sega realized that that interim copying wouldn't necessarily implicate copyright infringement, what they did was modify the code to include the Sega logo. Part of what you would then need to do to force your game to boot up on the Sega Genesis system was to display the Sega logo design as a startup screen. They made that part of the compliance requirement and then asserted that anybody who did that would infringe their trademarks. And the court basically said, you're manipulating the issue. If there is a trademark issue in that context, one that leads to consumers being confused as to the source of a product, which is the whole point of trademark, then that is Sega effectively infringing their own trademark by creating consumer confusion. So you could force that.
The dictionary issue was – I know the companies used to seed words in there to show that it was copied. So first of all, dictionary definitions: if you copy them line by line, that would be a copyright issue. But phone book companies used to also include fake names and numbers to show who was copying the database. And the courts in those cases found that you can't copyright information itself. So as long as it's reformatted at the output, copying somebody's database, unless you're violating some other rule, would not implicate copyright infringement, because that is not copyrightable material to begin with.
Marianne Calilhanna
Well, we still have questions coming in, but we do have to end here. I want to thank everyone for taking time to join us for this conversation. Thank you so much, Daniel, for joining Mark and myself.
Daniel Gross
Sure, thank you.
Marianne Calilhanna
And very quickly, I just want to make sure everyone knows that the DCL Learning Series comprises webinars like this, a monthly newsletter, and our blog. You can access many other webinars related to content structure, XML standards, AI, LLMs, and more from the On-Demand Webinars section of our website at dataconversionlaboratory.com. Stay tuned for future webinars. Thank you so much, and have a great afternoon.
Mark Gross
Thank you, Marianne, and Leigh Anne behind the scenes. Thank you, Daniel. This is great.
Daniel Gross
All right, thank you.
Marianne Calilhanna
This concludes today's program.
