Olivia Dhordain is a specialist in intellectual property law and now heads OUTBOXING IP Intellectual Property Counsel. In this article, she and Ciarán Bryce discuss the current state of play regarding IP and software created using AI platforms.
Introduction
The ability of Generative AI (GenAI) platforms to create text, audio and video is being harnessed by all business lines in organizations today. More pointedly, GenAI is being heavily used for software development. A recent survey by GitHub concluded that 92% of US developers are already using GenAI to help them create code. A report from Gartner suggests that by 2027, 15% of new applications will be generated by AI without a human in the loop.
As a manager or a software developer, you might be asking yourself these practical questions:
- Who owns software created by a GenAI platform? Does the platform user have the right to claim copyright on the software?
- What rights does the GenAI platform owner retain on code created by the platform?
- A GenAI platform is able to create content by having been trained from very large volumes of existing Internet content. In the case of software creation, do the creators of that training content have any rights on the software created by the GenAI platform?
Our goal in this article is to try to answer these questions, and to leave software development teams with practical takeaways.
As we know, software benefits from copyright protection. The question of software ownership and licensing is not new and has a long history of sometimes bitter dispute between proponents of proprietary software and proponents of free and open-source software licenses. The differences between these licensing models are quite important in the discussion on software ownership in AI-assisted programming, so we take the time to revisit licenses in Part I. But the arrival of GenAI adds a new layer of intricacy to this debate, with uncertainties as to the legality of software resulting from AI trained on prior copyrightable code, and with ownership of AI-generated output also in flux. We discuss these issues in Part II.
Table of Contents
I. Software Ownership, Copyright and Copyleft
II.1 Unresolved IP fundamentals
I. Software Ownership, Copyright and Copyleft
A software developer can claim copyright on his or her software [1]. The copyright covers the software in binary and source forms, and gives software creators the exclusive right to copy, distribute, modify or exploit the work in any way. A developer generally cannot claim copyright on work that he or she has been commissioned to do by an employer. For this reason, we tend to speak of the copyright holder rather than the author.
One effect of copyright is to prevent a third party from distributing the software without the explicit consent of the copyright holder, including derived works where the new software is demonstrably the same or partially copied from the original software. A second effect is that the copyright holder can assign or license the use of the software to others.
Concretely, software developers can assert copyright, and their conditions of usage, by distributing their software along with a license. From a legal standpoint, the license is a unilateral contract in which the copyright holder defines the terms and conditions for use of the software. A license will indicate, for example, whether the user of the software has the right to transfer a copy of the software to a third party. On a software level, a license is a text file that accompanies the program files and contains the conditions for users of the software to read. In the case of the Windows 11 operating system for instance, which is only distributed to end-users in binary form, the Microsoft license states that the user is not permitted to "publish, copy, rent, lease, or lend the software; transfer the software ... or work around any technical restrictions or limitations in the software". The license states that the user forfeits permission to use the software if these terms are violated.
The Microsoft license is an example of a proprietary software license. Its aim is to ensure that the copyright holder can claim revenue for each copy of the software that is distributed. We also say that the license is closed source, since the source code of the software cannot be obtained. The world of computer science has always had a conflictual relationship with copyright as applied to code and software in general. Various licensing models were designed to counter the purely proprietary model.
Warning: short history lesson! The movement began with the creation of the Unix operating system around 1970 at Bell Labs. Before 1970, software was written specifically for some computer machine architecture, so software was not seen as a commodity independent of hardware. The goal of the Unix project was to develop a portable OS, that is, an OS that could run on any machine architecture. Unix helped to cement the schism between the hardware and software industries. Software developers could thereafter create software without knowing, or caring about, the machine architecture on which the software would run.
In the early 1970s, building on this "liberation" of software from hardware, Bell Labs gave a copy of the source code to the University of California, Berkeley for use by its researchers and students. Over the next few years, researchers developed new applications for Unix. Berkeley started distributing these applications with Unix, source code included, to other universities. This is how the idea began of sharing source code to improve understanding and quality of software [2].
But by the end of the 1970s and the early 1980s, proprietary software had become the norm, even around the Unix operating system. Licenses rarely gave developers access to source code, which was contrary to the spirit fostered at Berkeley. To counter this, Richard Stallman founded the GNU project in 1984. GNU is a recursive acronym for "GNU's Not Unix", and the project's aim was to develop a Unix-like operating system whose source code is freely accessible to anyone.
Stallman founded the Free Software Foundation in 1985 – a creative midway approach combining free software and a mechanism allowing for some degree of commercial use. The foundation defines free software as software that anyone is free to copy, run, study, modify, improve, and distribute. This means that Free Software might not be free of charge. It is understood that a distributor is allowed to charge for the cost of distribution. Thus, a paying client pays for the CD and postal delivery rather than the actual software. This fine distinction has allowed for the development of business models around Free Software based on packaging, customization, and distribution.
The Free Software Foundation designed the GNU General Public License (GPL) to help enforce its ideals. This license stipulates that if GPL-licensed software is modified and then distributed, it must be redistributed under the GPL. Thus, derived works are subject to the same licensing terms as the original software (i.e., source code must be freely accessible and cannot be commercialized as proprietary software by a software developer). This approach came to be known as copyleft [3]. The free software movement has been incredibly successful. Examples of well-known software distributed under the GPL are the Linux kernel, the Java OpenJDK libraries, the Telegram instant messenger, and the WordPress content management system.
Many companies felt the copyleft set-up was too constraining from an economic point of view, and sought more commercially friendly licensing schemes. This led to the foundation in 1998 of the Open Source Initiative. Under its Open Source Definition, an open-source license must 1) permit the licensee to distribute or sell the software, 2) require that the distribution include all source code, and 3) allow the licensee to create derived works. The crucial difference from free software licenses is that nothing is specified as to the conditions under which the derived work can be licensed [4]. Well-known examples of open-source software include OpenOffice (the free alternative to Microsoft Office), which has an Apache license, and the Firefox Web browser, which has a Mozilla Public License.
II. Enter Generative AI
The arrival of GenAI modifies the creative experience with the presence of another actor. For instance, if ChatGPT is the GenAI platform used, then a platform provider – OpenAI – is actively contributing to the creation of the work. Several GenAI platforms tailored to software creation have appeared, e.g., AWS’s CodeWhisperer, GitHub’s Copilot, Mistral’s Codestral and StarCoder.
From a technical standpoint, a GenAI platform is a software system that implements a mathematical model. The model is composed of elements like weights and parameters. We do not need to go into technical details; suffice it to say that the model uses these elements to create content in response to the prompt entered by the user. The weights and parameters are "formed" during the training process, from the vast volumes of data used in training. Weights and parameters are unique to each GenAI platform and explain why GenAI platforms are perceived as performing differently by users.
Users can retrain GenAI models with new content with the aim of modifying weights and parameters. This process is known as fine-tuning. The purpose of fine-tuning is to have a model that produces better content in a specific context. For instance, software developers would rarely use ChatGPT to create code, as this is a "general public" GenAI platform. Developers use GenAI platforms like CodeWhisperer, Codestral and Copilot, whose models have been fine-tuned from other models using large volumes of program code samples. The legality of fine-tuning depends on the license of the GenAI model, defined by the platform owner, which brings us right back to the discussion of Part I.
Many questions relating to IP as applied to AI-generated works have yet to be answered. We look at these questions in Section II.1. Answers to these questions have the potential to seriously impact IP on AI-Gen software, the contours of which are already blurred, as we see in Section II.2. Finally, in Section II.3, we briefly look at three popular AI-Gen software platforms and see how they address IP issues differently.
II.1 Unresolved IP fundamentals
Let’s consider two practical and frequently asked questions.
Question 1: The training process is based on large volumes of existing content. Can AI-generated contents be safely used or are they inherently infringing copyright of the existing works?
Content generated by AI is not a spontaneous creation born of auto-genesis. It is the aggregated result of the preexisting content on which the AI has been trained, rearranged in answer to a prompt. Although this preexisting content is "publicly available", not all of it has fallen into the public domain [5], and many consider that using copyrighted works to train an AI without remunerating their authors amounts to copyright infringement. Copyright owners have brought multiple class actions against OpenAI, asking the courts to recognize their copyright infringement claims. Meanwhile, OpenAI argues that the use of works to train its platform is fair use [6]. These cases are still pending.
The outcome of these lawsuits will be decisive in shaping the AI landscape of the future and the balance between content creators on the one hand and content-dependent AI platforms on the other.
Question 2: What of the outputs generated by AI platforms? Are they copyrightable? And if so, who owns that copyright?
First and foremost, the AI model itself cannot be attributed copyright over the output, an AI not being vested with a legal personality and therefore not being able to hold any rights per se. By default, the only eligible party to claim ownership on the output is the user.
Again, court cases have multiplied across the US, China and more recently Europe: creators of AI-generated works have claimed authorship of such works, and some decisions have been rendered. Although these cases do not involve software companies, they allow us to observe a few emerging tendencies:
- The Court of Prague recently ruled that generic themes generated by an AI on a simple prompt cannot be copyrighted.
- The US Copyright Office has rejected protection of images, videos, texts or audio content generated by AI. However, its decisions indirectly suggest that if the user can prove a "de minimis human intervention" and that he or she has had "some control" over the output, then there may be room for the user to claim copyright protection.
- Following this logic, the Copyright Office has readily recognized the copyrightability of works composed of a selection and arrangement of different AI-generated contents, and their ownership by the individual who carried out this selection and arrangement [7].
- In China, a court recently acknowledged copyright protection for an AI-generated image. Although this acknowledgement may seem at odds with the other cases, the Chinese court accepted that the image was the result of the user's "sweat of the brow", many prompts having been input to create the image. This position is in keeping with the criteria of the US Copyright Office.
These trends suggest that the sophistication and originality of the prompts may become the criterion for determining the necessary "de minimis" contribution, and proof of those interventions will be critical when applying for copyright protection.
II.2 Gen-AI software – not for the faint of heart
Violation of prior rights and the impact of copyleft … AI-generated code is evolving fast in uncharted territory.
Aspect 1: On copyright and copyleft of training data ...
Software produced by GenAI builds on preexisting code. Developers must therefore pay careful attention to the license conditions and restrictions that may apply to code they generate with the intention of distributing it. When an AI generates code, it does not disclose its sources, so an end user may well find himself or herself with infringing code, in breach of license terms or prior rights. On this very question, a software-specific case was brought against GitHub and OpenAI over GitHub Copilot, their AI-based coding product, which is accused of using free and open-source software in its training data. The case seeks to clarify whether free and open-source licenses prohibit the code from being used to train AI platforms, or whether training is an instance of fair use.
Aspect 2: On claiming copyright of AI-generated code ...
As if that were not enough, any code generated by a user is subject to the terms and conditions of the AI platform itself. These define what you as a user are allowed to do with the platform and with content you create with it. Some platforms assign all rights to the user but promptly impose a royalty-free, worldwide, sublicensable license on his or her content. In the case of ChatGPT at the present moment (June 2024), OpenAI will not claim copyright on works created by its users. However, the terms and conditions are set by the service provider, and they can change. OpenAI has updated its terms and conditions ten times since 2021.
Aspect 3: On fine-tuning the AI model ...
In some cases, the model owner allows users to modify the model with the aim of fine-tuning it. The “openness” of the GenAI model is the subject of intense debate. Just as the debate around openness of software (meaning access to source code) has raged since the 1970s, there is much debate today as to whether the internals, or models, of GenAI should be open. The model refers to AI details such as “weights” and “parameters”, as well as the software that implements the model. One of the arguments for openness is that, given the potential for GenAI to hallucinate or generate harmful content, its internals should be open to scrutiny by researchers in an effort to guardrail its behavior. OpenAI's ChatGPT for instance is closed-source, so we do not know exactly how it works inside. On the other hand, Stable Diffusion, the image-generating platform, is open-source. The platform model is distributed under the CreativeML Open RAIL-M license. This license grants the user the right to distribute the model along with any modifications made. The license states however that the user loses these rights if the model, or a derived model, is used for unlawful purposes.
II.3 Concrete examples
Let's look at three examples of how AI providers are navigating this myriad of intellectual property issues.
GitHub Copilot is a code-writing tool with a GenAI engine. The AI model is developed by GitHub, OpenAI, and Microsoft. The tool is trained on code that appears in public code repositories. There are an estimated 28 million public code repositories on GitHub itself. The terms and conditions of Copilot give the user complete ownership of the code generated. On the other hand, no reference is made to the intellectual property of the software used to train the model.
StarCoder and StarCoderBase are AI platforms for code generation, trained on permissively licensed data from GitHub and Jupyter notebooks, with code samples in more than 80 programming languages. The license strives for both the open and responsible use of the accompanying model. On the issue of copyright, it states that the "Licensor claims no rights in the Output. You agree not to contravene any provision as stated in the License Agreement with your Use of the Output". Among the conditions of the license is that the code generated not be used in a way that creates harm to others, such as by creating malware. On the question of copyright of training data code, the model was trained using a dataset that consists only of permissively licensed code. In addition, there is an opt-out process so that code from specific code contributors can be removed from the training set. In an effort to be compliant with personal data protection regulations, the creators also explain that Personally Identifiable Information such as names, passwords, and email addresses was removed from the training data.
Codestral is an AI tool for code creation from Mistral, the French AI start-up. In the terms and conditions, the company claims "no ownership rights in and to the outputs". What is interesting about Codestral is its AI Non-Production license, which prohibits the use of Codestral and its generated code in commercial software. The license goes on to explicitly ban "any internal usage by employees in the context of the company’s business activities". Some analysts postulate that the motivation for this clause is that Codestral is partly trained on copyrighted content, and that the company wants to avoid a lawsuit.
III. Conclusion and Takeaways
Our exploration of IP questions related to AI-generated software leaves us with several learnings:
Lesson 1. Legal uncertainties in respect of third-party rights with regard to the training of GenAI platforms should lead to caution. Not all GenAI platforms address the intellectual property of training data with the same diligence.
TAKEAWAYS
- So far, StarCoder is the platform that addresses this issue in the safest manner for users because you can query the training data.
- Where you can, try to ascertain that no clearly identifiable prior code is visible in the training data. Many model providers provide access to their training data.
- If the software is for non-commercial use and you have no intention of distributing that software, risks are more limited. Remember, you are under no obligation to distribute software even if it has a copyleft license. For instance, you might use free software to develop the ERP of your organization; since the application is not intended for sharing, there is no copyleft license violation possible.
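As a complement to these takeaways, a development team could run a first-pass check on generated code before it is distributed. The following is a minimal sketch in Python, under stated assumptions: the function name and the list of markers are our own illustrative choices, and the check only flags lines that look like license or copyright notices reproduced in the output. It cannot prove that code is free of third-party rights and does not replace legal review.

```python
import re

# Hypothetical, non-exhaustive markers that often appear in license headers.
LICENSE_MARKERS = [
    r"SPDX-License-Identifier:\s*\S+",
    r"GNU (Lesser )?General Public License",
    r"Mozilla Public License",
    r"Apache License",
    r"Copyright \(c\)",
]


def find_license_markers(generated_code):
    """Return (line_number, matched_text) pairs for lines that look like license notices."""
    hits = []
    for line_number, line in enumerate(generated_code.splitlines(), start=1):
        for pattern in LICENSE_MARKERS:
            match = re.search(pattern, line, flags=re.IGNORECASE)
            if match:
                hits.append((line_number, match.group(0)))
                break  # one hit per line is enough to flag it
    return hits
```

An empty result only means that none of the listed markers was found. Any hit is a signal to trace where the snippet came from and to check which license applies before distributing it.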
Lesson 2. Legal uncertainty also touches upon existence and ownership of exclusive rights on output generated by users.
TAKEAWAYS
- Companies should avoid building critical aspects of their business on AI generated code until questions of ownership become clearer.
Lesson 3. The terms and conditions of a GenAI platform determine whether your claim on software copyright can be exclusive or not. The current tendency is for GenAI platform providers not to dispute ownership of content by their users, but terms and conditions can change.
TAKEAWAYS
- Check the terms and conditions at every visit.
- Capture the date of your use to ensure you can prove under which terms and conditions you were generating output.
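One way to put these two takeaways into practice is to keep an append-only log of each generation session. The sketch below, in Python, is an illustration under assumptions rather than a prescribed method: the function names, the field names and the idea of fingerprinting a saved copy of the terms with SHA-256 are our own choices. It records the date of use, a fingerprint of the terms and conditions as you saved them at that moment, and the prompt, which may also help to document the human contribution discussed in Section II.1.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_usage_record(platform, terms_text, prompt, output):
    """Build one evidence record for a single code-generation request."""
    return {
        # UTC timestamp of the moment the record is built
        "used_at": datetime.now(timezone.utc).isoformat(),
        "platform": platform,
        # SHA-256 fingerprint of the terms text you saved at the time of use
        "terms_sha256": hashlib.sha256(terms_text.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        # Fingerprint of the generated output, so the log stays small
        "output_sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
    }


def append_record(log_path, record):
    """Append the record as one JSON line, so earlier entries are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record, sort_keys=True) + "\n")
```

The fingerprint is only useful if the full text of the terms is archived alongside the log. Whether such a log is accepted as proof depends on the jurisdiction and on how the records are stored.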
Authors: Olivia Dhordain and Ciarán Bryce
Notes
[1] Although copyright protection does not require registration, in some countries (the US and China, for example) registering a copyright with the public copyright office strengthens the enforcement of these rights.
[2] An example of this sharing of Unix is a research project that integrated the TCP/IP networking stack – the motor of the current Internet – into Unix. This greatly contributed to the spread of the Internet in its current form. This code is used in many proprietary operating systems today.
[3] A French court recently fined Orange (formerly France Telecom) 900’000 EUR for using GPL licensed code in its Internet portal.
[4] Commonly-found examples of open-source licenses include the Gnu Lesser General Public License (LGPL), the BSD licenses, the Mozilla Public License (MPL) and the Apache License.
[5] Copyright generally lasts for a fixed period, e.g., the length of an author’s life plus 70 years. Once it has expired, copyrighted material enters the public domain, where it can be distributed without restriction.
[6] Copyright law must allow for fair use – limited use of copyrighted material without requiring permission from the rights holders. An example of fair use in the copyrighted software world is the right for teachers to talk about and illustrate the software in their classes.
[7] A copyright registration granted to the Zarya of the Dawn comic book was partially canceled because it included "non-human authorship". The book contains pictures created by feeding text prompts to Midjourney. The US Copyright Office decided that Kashtanova "is the author of the Work’s text as well as the selection, coordination, and arrangement of the Work’s written and visual elements". The images themselves, however, "are not the product of human authorship" and the registration originally granted for them was canceled. In a second case, Elisa Shupe used OpenAI's ChatGPT extensively while writing the book AI Machinations: Tangled Webs and Typed Words. The Copyright Office does not recognize her as author of the whole text, as would usually be the case for written works, but she is considered author of the "selection, coordination, and arrangement of text generated by artificial intelligence". Her lawyers provided an exhaustive log of how Shupe prompted ChatGPT, showing the custom commands she created and the edits she made.