There’s a lot of handwringing at the moment around generative AI and copyright. Well, a lot more than handwringing. Some suggest that 2024 will be the year the AI bubble is burst by copyright lawsuits and by moves to force the industry to protect the rights (and livelihood) of creators. The issues are real and complex, but most are in fact not as new or unsolvable as they may at first appear.
Let’s look at the three key legal questions that have surfaced in relation to the inputs and outputs of generative AI1:
Does training AI models on copyrighted data break the law?
Should AI-produced content itself be copyrightable?
Does AI-generated output infringe on the copyright of content it resembles?
These questions are much easier to resolve than the underlying socio-economic challenge of how AI impacts the creative industries and what new financial mechanisms should be invented to deliver value to human creation. But I think that separating these two categories of problem will be helpful in shifting attention to the bigger economic question.
Ultimately I predict that in 2024-25 the legal issues will go away, leaving the social and economic issue of how AI impacts the creative industries, which is essentially the same as the broader question of what jobs AI will replace.
Let’s get into it.
Fair Use
The question of training LLMs on copyrighted input is mostly being litigated via lawsuits that hinge on the definition of fair use under US law. This is not new. Copyright protects original, human-created work from being copied (the clue is in the name). The fair use doctrine creates exceptions (to promote free expression and the dissemination of knowledge2) by permitting use of copyrighted content in certain circumstances. Exactly what fair use means has been challenged (and litigated) by every major shift in technology—from the Betamax case (that allowed home video recording) to Google Books (which greenlit the company’s efforts to digitise the world’s books) to various search engine cases (legitimising the caching of websites, the display of thumbnails, and the excerpting of news articles).
The courts have been notoriously unpredictable (compare the Ed Sheeran and Blurred Lines verdicts, as Nilay Patel likes to point out) when it comes to adjudicating fair use. This is because the test for it is highly subjective and includes questions like: What is the purpose of use (ie commercial vs non-profit)? How much of the content is used? And what is the effect on the potential market for the original work? (The last one is important—we’ll come back to that).
But it’s clear that copyright has moved well beyond restricting the technical creation of copies in our digital world. So the remaining question facing judges is whether the way an LLM is trained on content constitutes something more than technical copying for fair use purposes. One argument is that the model is simply reading, just as a human would, in order to learn. We don’t sue people for their content output in life on the basis that they’re ripping off everything they’ve ever read and learned. The potential chink in that argument is that the way an LLM learns is of course super-human: it creates a statistical representation of everything fed into it and forgets nothing (even if it doesn’t retain exact copies of all the content ingested). But its autocomplete engine is so good that it can statistically predict the most likely next word or the best-fit pixel, and that often just happens to be exactly what the original source looked like.
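To make that intuition concrete, here is a toy sketch in Python (and emphatically not how production LLMs work): even a crude bigram "autocomplete" that stores nothing but word-to-word statistics will reconstruct its training sentence verbatim, because at every step the most likely next word is simply the one it has already seen.

```python
# Toy illustration only: a bigram "autocomplete" that stores nothing but
# word-to-word counts, yet reproduces its training text verbatim when it
# always picks the most likely next word. Production LLMs are neural networks
# trained on billions of documents, but the intuition is similar.
from collections import Counter, defaultdict

training_text = "generative models learn statistical patterns from their training data"
words = training_text.split()

# Count how often each word follows each other word.
next_word_counts = defaultdict(Counter)
for current, following in zip(words, words[1:]):
    next_word_counts[current][following] += 1

def generate(seed: str, max_words: int = 20) -> str:
    """Greedily emit the most likely next word at each step."""
    output = [seed]
    for _ in range(max_words):
        candidates = next_word_counts.get(output[-1])
        if not candidates:
            break
        output.append(candidates.most_common(1)[0][0])
    return " ".join(output)

# The "prediction" is an exact reconstruction of the training sentence.
print(generate("generative"))
# -> generative models learn statistical patterns from their training data
```

Scale that up by a few billion parameters and the memorisation becomes partial and probabilistic, but the underlying mechanism is the same.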
So while there is some debate on whether training inputs to AI breach copyright, that question too will ultimately depend on what the output looks like. And that is in fact a lot easier for the AI foundation model vendors to fix. The more prominent live court cases provide useful examples:
Silverman et al vs OpenAI and Meta: the authors say the LLMs were trained on illegally obtained copies of their books, and demonstrate this by showing how the models can be queried and will return plausible summaries which—in their argument—should be considered derivative works requiring a licence (more on this below). Their broader claims met with fairly swift pushback from judges.
Getty vs Stability AI: when asked to generate new images, Stability AI sometimes spits out a photo that includes a Getty watermark. Getty’s claim focuses specifically on the fact that it has not licensed its images to Stability (although it has done so for other AI art generation tools).3
NYT vs OpenAI: the New York Times showed that ChatGPT is capable of replicating whole paragraphs from its articles when given a prompt of the first few lines. While the NYT’s approach to coaxing the infringement out of ChatGPT is being actively challenged (and can no longer be replicated, as OpenAI seems to have added filters that prevent it), the strength of the NYT’s argument (going back to the third fair use question above) is the company’s claim that ChatGPT’s output harms its market potential by (a) allowing readers to bypass its paywall, and (b) damaging its reputation by inserting hallucinations and errors into its content.
Each of these cases zeroes in on a different element of the fair use test, which will make them interesting to watch (even if they don’t set a hard precedent that makes fair use rulings less unpredictable). But ultimately, the models can take the wind out of future litigants’ sails through fixes to their output. For this reason, I expect the training question to be resolved fairly quickly. Vendors will make it harder for their models to spit out exact replicas of training content, while the overall training approach will probably (?) end up being more like indexing the web under fair use (in the US4).5
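What might such an output fix look like? Here is a minimal sketch of one simplistic approach, a verbatim-overlap filter that flags completions containing long word-for-word runs from a corpus of protected text. The function names and thresholds are my own illustration, not any vendor’s actual implementation.

```python
# Minimal sketch of an output-side guardrail (function names and thresholds are
# illustrative, not any vendor's real implementation): flag completions that
# share long verbatim word runs with a corpus of protected text. Real systems
# would use hashed n-gram indexes and fuzzier matching at far larger scale.
def ngrams(text: str, n: int) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_like_verbatim_copy(completion: str, protected_docs: list, n: int = 6) -> bool:
    """True if the completion shares any n-word run with a protected document."""
    completion_grams = ngrams(completion, n)
    return any(completion_grams & ngrams(doc, n) for doc in protected_docs)

article = ("The committee voted on Tuesday to approve the controversial "
           "measure after months of heated debate in the chamber.")
completion = ("According to reports, the committee voted on Tuesday to approve "
              "the controversial measure, surprising many observers.")

# A vendor could refuse, summarise, or regenerate instead of returning the text.
print(looks_like_verbatim_copy(completion, [article]))  # True
```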
Note that if copyright does not stop AI training, it doesn’t mean that we can’t or shouldn’t compensate creators for the use of their content. With every change in content distribution technology, copyright is used as the first line of defence (see Napster), but ultimately there remains a moral and social imperative to value content and its authors, which can create a space for a financial arrangement (see Spotify and the music industry, Google and newspapers, and the current negotiations between OpenAI and media).
SIDEBAR: CoPilot's copyright conundrum
Perhaps the earliest lawsuit-triggering question of the modern AI era relates to the use of GitHub's incredibly powerful CoPilot feature to assist software developers. CoPilot supercharges developer productivity by suggesting code snippets, helping refine existing code, identifying bugs and vulnerabilities, and generally acting as a knowledgeable pair programmer.
The problem for companies whose engineers are using CoPilot (and if you think yours aren't, you're probably wrong) is that it has been trained on a wide range of code repositories with every conceivable type of licence (including open-source). If developers use output that is identical to existing code, it could expose the company to licence or copyright infringement claims. This was crystallised in a lawsuit against GitHub, its owner Microsoft and OpenAI, which is still pending. But, like Adobe, Microsoft is not waiting for the outcome of that case, and has instead provided its customers with a broad indemnity in case they are sued for infringing copyright with CoPilot output.
A thornier question, which I don't believe is addressed by the indemnity, is whether mixing in CoPilot-generated code compromises a company's own ability to protect its software IP via copyright. The CEO of a very successful SaaS business recently shared how this question paralysed their plans to introduce CoPilot, and how they finally unblocked it. They reviewed all their software IP and triaged it into red-amber-green buckets, with red being the most valuable, unique special sauce of the company; green being generic code; and amber somewhere in between. They then put in place a process allowing the use of CoPilot for all green code, for amber code only with a human checkpoint to confirm the output, and not at all for red code (a rough sketch of how such a policy could be encoded appears below). This not only ensures that the company's critical IP can demonstrably be said to be 100% human-generated (and hence copyrightable in all jurisdictions), but also sidesteps the risk of any claims of open-source code contamination. Finally, a side benefit: it forced the company to really think about which bits of its code are the most valuable, giving it a better understanding of its competitive position.
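For illustration, here is a hypothetical sketch of how such a red-amber-green policy might be encoded in a pre-merge check. The path patterns, tier assignments and messages are invented for the example, not the company's actual rules.

```python
# Hypothetical sketch of a red-amber-green policy: map paths in the codebase to
# a sensitivity tier and decide whether AI-assisted code is allowed, allowed
# with human review, or blocked. Patterns and tiers are invented for illustration.
from enum import Enum
from fnmatch import fnmatch

class Tier(Enum):
    RED = "red"      # core, differentiating IP: no AI-generated code
    AMBER = "amber"  # in between: AI suggestions allowed with human sign-off
    GREEN = "green"  # generic code: AI suggestions allowed

# Hypothetical mapping of repository paths to tiers (first match wins).
TIER_RULES = [
    ("src/pricing_engine/*", Tier.RED),
    ("src/recommendations/*", Tier.RED),
    ("src/api/*", Tier.AMBER),
    ("*", Tier.GREEN),  # everything else defaults to green
]

def tier_for(path: str) -> Tier:
    for pattern, tier in TIER_RULES:
        if fnmatch(path, pattern):
            return tier
    return Tier.GREEN

def copilot_policy(path: str) -> str:
    tier = tier_for(path)
    if tier is Tier.RED:
        return "blocked: keep this code 100% human-written"
    if tier is Tier.AMBER:
        return "allowed with mandatory human review before merge"
    return "allowed"

print(copilot_policy("src/pricing_engine/optimiser.py"))  # blocked: ...
print(copilot_policy("tests/test_utils.py"))              # allowed
```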
This highlights an interesting meta-effect of introducing AI into our knowledge work. It will force us to really REALLY think about where true value lies, what makes up our most competitive intellectual property. For too long we’ve simply tried to throw the IP steel blanket over everything, via broad IP ownership clauses and non-competes in employment agreements; a proliferation of NDAs no one actually reads; aggressive patenting and trademark strategies. I can imagine that over time, we’ll think about this very differently - with a small number of people considered true IP creators, and a small proportion of the company’s special sauce being truly critical. Perhaps it will lead to a new wave of AI-sourcing that is analogous to the outsourcing trend of the 1990s… something to explore further.
Copyright protections for AI-produced content
At first blush this question seems to have a simple answer, most vividly articulated in the Monkey selfie ruling 10 years ago and confirmed by the US Copyright Office in 20236—that human authorship and creativity are required for content to be copyrightable. Content produced solely by AI—even if deemed ‘original’ in some weird machine-artsy way—cannot be copyrighted. Where AI- and human-produced content are combined, only the human elements can be protected.7 Both the EU and Germany also require that a work originate from a human to qualify for copyright. The outlier jurisdictions may be the UK and countries that inherited its outdated copyright laws (Ireland, New Zealand, South Africa), where computer-generated works can be copyrighted in the name of the person who enabled the system to do so. This has yet to be tested in the new AI era.8
A highly contrarian view, however, comes from perhaps the most authoritative source on this topic, Creative Commons founder Lawrence Lessig, who believes that AI-created work ought to be copyrightable “if and only if the AI system itself registers the work and includes in the registration provenance so that I know exactly who created it and when.”9 His argument is ultimately economic. He cites a particularly important parable from the earliest days of copyright protection:
You’ll remember this, I hope, from copyright law: the birth of America, foreign authors got no American copyright, and all the Americans thought, “This is great. We’re protecting the Americans against the foreigners.” Of course, what that meant is that all the English books were much cheaper than the American books. The American authors were at a disadvantage because the English authors weren’t getting copyright. The American authors began to push, “Give everybody copyright so that there’s no un-level playing field.”10
He argues that if we don’t allow AI-generated content to be protected in the same way as human-generated content, the price of that content will drop to zero and will swamp the market, putting human creators out of business. Allowing AI copyrights would effectively create a level playing field for competition and survival.
However, making such a change in the US or EU would require more political capital than anyone can feasibly muster these days, so I would not hold my breath. Surprisingly, China is more willing to experiment with this kind of thinking: in a recent court ruling granting copyright protection to an image created with Stable Diffusion, the judge echoed the concern that creative industries could be negatively impacted if AI-generated content is not recognised as artwork.
Infringing output of generative AI
The final copyright conundrum is whether output from generative AI, which is by definition an assembly of pixels or words that the model has predicted will satisfy a user prompt, can ever be said to infringe on existing copyright. Even if we assume for the moment that the LLMs will be fixed so they are less likely to spit out exact replicas of training data (see above), we are still left with the genuine social and economic problem of AIs being able to produce uncannily good works “in the style of” existing artists.
Styles are explicitly not protectable under copyright. As a human you can become uncannily good at mimicking someone’s work, but you can’t easily be sued for infringing it unless you replicate actual elements of it (the unpredictability of fair use rulings notwithstanding). This is perhaps the knottiest, almost entirely new, issue with generative AI, and the one that creates the biggest real economic problem to solve. There is a long history of legal wrangling around the concept of ‘derivative’ or ‘transformative’ work—most prominently during the emergence of DJ-led music remixing in the 1980s—which ultimately established that most such creations should get permission from the original copyright holders. While the line on whether a new work is sufficiently ‘transformed’ to be covered under fair use remains blurry, the recent Supreme Court ruling in Warhol v Goldsmith appears to have narrowed the scope of artistic reuse of existing copyrighted material.
But style per se is neither derivative nor transformative, so copyright law is not likely to help creators in this fight. They now face direct competition from anyone with a keyboard and a few minutes of prompting time, threatening their livelihood. A new approach is needed. The one company that has taken the most thoughtful, purist approach to the risks of copyright in AI is Adobe. Not only has the company decided to train its AI entirely on content it has a legal licence to use (just in case the copyright debate goes against the AIs),11 it has also taken the lead in devising new protections for creators’ unique styles. In December, Adobe proposed that Congress establish a new Federal Anti-Impersonation Right (the FAIR Act), which would allow creators to sue anyone who uses an AI tool to intentionally impersonate their work for commercial gain. These are IMHO exactly the kinds of initiatives needed to plug the gaps—focusing narrowly on areas that can’t be addressed with existing rules or case law.
The FTC, the EU and the UK
Copyright and AI is seeping into all regulatory corners. The FTC is looking at copyright through the lens of consumer fairness, which it oversees under the FTC Act. It is focusing on whether consumers are deceived when AI authorship is undisclosed, and on how liability will work when the AI black box obfuscates the source of errors—who is liable? The user? The model? Can model makers dodge liability simply by disclosing how the model was trained?
Meanwhile, the draft EU AI Act has triggered plenty of anxiety and debate. An early version included a seemingly unworkable obligation to identify all the copyrighted sources of data used in training. The final compromise text adopted in December softened this by dropping the requirement to separately itemise copyrighted inputs, instead requiring providers to “draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model, according to a template provided by the AI Office.” This still sounds vague, but a last-minute recital clarified that the summary can be a narrative description of the categories of datasets accessed for training, which seems feasible (and is clearly a good idea overall, given the providers’ well-documented reluctance to be transparent about their training data sources).
The Act also restates that model makers must respect EU copyright law and—more important—specifically shoehorns AI model training in under the text and data mining (TDM) exception in Article 4 of the CDSM Directive, which allows the use of copyrighted data unless its owner has explicitly opted out of such use (in which case the operator has to seek permission).12
The UK situation is far messier. It did not transpose the European Copyright Directive into its own laws, but has its own TDM exception from 2014, leaving it with perhaps the most restrictive framework: copies are allowed only for computational analysis in a non-commercial research context. In 2022, the Intellectual Property Office (IPO) published its AI and Copyright consultation, which proposed allowing the extraction of facts and data from lawfully accessed copyrighted works (seemingly endorsing the training of AIs on such content). But that was decades ago in AI time…
More recently, the government has been working with the IPO on a code of practice on copyright and AI, originally due in late 2023 but still outstanding. So far it has dropped just two tantalising hints as to where it is going: (a) confirming that “reproduction of copyright-protected works by AI will infringe copyright, unless permitted under licence or an exception,” and (b) that it has abandoned plans to enshrine a version of Europe’s TDM exception in UK copyright law.13 This feels a bit like Brexit-make-work. In order to balance innovation with protections (as the government repeats ad nauseam it intends), it will have to articulate a new exception that—I suspect—will end up looking a lot like the TDM.
Meanwhile, the Information Commissioner’s Office (ICO) has helpfully outlined in its Generative AI Call for Evidence how model operators should think about the UK GDPR when considering their legal basis for training the AI. In short, the only viable legal basis for collecting personal data for training is legitimate interest, so operators will have to be very diligent in performing an appropriate balancing test (Is there a valid interest? Is the data necessary? Do individuals’ rights override the interest?). Whilst this guidance does not directly help in the copyright debate, it at least provides a roadmap for thinking about privacy on the input side.
So, where does that leave us in relation to the key questions we started with?
Does training AI models on copyrighted data break the law?
Should AI-produced content itself be copyrightable?
Does AI-generated output infringe on the copyright of content it resembles?
I would say: (1) no (subject to 3); (2) no (Lessig notwithstanding); (3) a tale of two halves: spitting out replicas is a gnarly technical problem to be fixed; protecting creators whose style can be reproduced on the cheap is the remaining big challenge. This job-replacement debate is hardly new—it accompanies every major technology shift. One optimistic perspective is that a flood of AI-generated content will dramatically raise the value of verifiably human-penned work, which is appealing if you’re in the top 10% of human creators, and perhaps terrifying if you’re in the bottom half of your industry.
The core infrastructure we need to give this problem a chance to shake out in a positive way is a standard for content authentication and provenance, like the Content Authenticity Initiative. While its primary purpose is to authenticate content in order to protect trust in the media and combat deepfakes, a useful secondary benefit will be to enable identification of authors so they can get credit, potentially enforce their copyright, and—ultimately most important of all—have an ownership position from which to negotiate.14
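To make the idea concrete, here is an illustrative sketch of the basic mechanism behind any provenance scheme: hash the content, record authorship metadata, and sign the result so tampering is detectable. This is not the actual C2PA/Content Authenticity Initiative specification, just the underlying idea, and it assumes the third-party Python cryptography package; the author name and metadata fields are hypothetical.

```python
# Illustrative sketch only: NOT the C2PA/Content Authenticity Initiative spec,
# just the basic idea of a signed provenance manifest -- hash the content,
# record who made it and how, and sign the record so tampering is detectable.
# Assumes the third-party `cryptography` package (pip install cryptography).
import hashlib, json
from datetime import datetime, timezone
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

content = b"...the bytes of an image or article..."

manifest = {
    "sha256": hashlib.sha256(content).hexdigest(),
    "author": "Jane Example",                   # hypothetical creator identity
    "tool": "human + AI-assisted (disclosed)",  # provenance statement
    "created": datetime.now(timezone.utc).isoformat(),
}
manifest_bytes = json.dumps(manifest, sort_keys=True).encode()

creator_key = Ed25519PrivateKey.generate()
signature = creator_key.sign(manifest_bytes)

# Anyone holding the creator's public key can check that neither the content
# hash nor the metadata has been altered; verify() raises InvalidSignature otherwise.
creator_key.public_key().verify(signature, manifest_bytes)
print("provenance manifest verified")
```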
For simplicity, I’ll use the terms AI, generative AI and LLMs interchangeably in this article, though I know they’re not the same thing. In all cases I mean generative AI that outputs text, images, videos, etc.
Official examples include criticism, comment, news reporting, teaching, scholarship, and research.
The irony is that the reason the model spits out images with the Getty imprimatur is that the watermark is so ubiquitous on high-quality images that the AI effectively assumes including it is likely to satisfy the user request. Getty should be flattered.
Note that the position in other countries will vary. The European equivalent of fair use is the text and data mining (TDM) exception, which allows broad use of copyrighted material, but subject to copyright holders being able to opt out (though how this can be exercised in practice remains unclear).
Fixing the outputs may be harder than it seems. IEEE Spectrum has done amazing investigative work on how easy it is to coax copyrighted material out of AI image generators, and how hit-and-miss it can be to try to implement guardrails on user prompts to prevent it. That said, I’m not sure the model makers need to fix every possible infringing output extracted under duress—ie concerted efforts to circumvent guardrails—as the long-tail incidents can probably be covered through the terms of use. You can use any number of internet tools to create infringing content, but you (the user) will then be breaking the law.
There may yet be quasi-legislative action. The White House’s recent Executive Order on AI included a provision requiring the Copyright Office to make a recommendation addressing both the scope of protections for works that include AI-generated content and the treatment of copyrighted material in AI training.
From the US Copyright Office’s paper on Artificial Intelligence and Copyright (30 Aug 2023): “A second registration application, submitted in 2022, involved a work containing both human authorship and generative AI material. The work was a graphic novel with text written by the human applicant and illustrations created through the use of Midjourney, a generative AI system. After soliciting information from the applicant about the process of the work’s creation, the Office determined that copyright protected both the human authored text and human selection and arrangement of the text and images, but not the AI-generated images themselves. The Office explained that where a human author lacks sufficient creative control over the AI-generated components of a work, the human is not the “author” of those components for copyright purposes.”
Credit to—and for more on this see—the European Association of Communications Agencies (EACA) report on AI and copyright: Unveiling the legal challenges.
Lawrence Lessig on copyright, generative AI and the right to train, Walled Culture, 2 Nov 2023.
Decoder: Harvard professor Lawrence Lessig on why AI and social media are causing a free speech crisis for the internet, The Verge, 24 Oct 2023.
Adobe has in fact turned this approach into a major competitive advantage. By knowing definitively that its training content is covered by licence, it is able to give customers using Adobe products a comprehensive IP indemnity to protect them from potential copyright lawsuits. Brilliant. It’s worth listening to General Counsel Dana Rao on the story of how they got there.
In the recitals the Act also clarifies that where the training takes place is independent of the applicability of EU copyright law. If the model is deployed in the EU, it must comply irrespective of where the training takes place. An excellent summary of the copyright provisions can be found here: A first look at the copyright relevant parts in the final AI Act compromise, Kluwer Copyright Blog, 11 Dec 2023.
UK Government commits to developing AI/Copyright Code of Practice, Mishcon de Reya, 12 Jan 2024.
A useful summary of how this works is provided by jamkhayyam: Truth or Deception? The Double-Edged Sword of Digital Authenticity, 13 Dec 2023.