Copyright Challenges in the Age of AI 

Meet The Authors

Olli Pitkänen

CLO

Dr. Olli Pitkänen is a proficient expert with extensive experience in ICT and law. He leads multidisciplinary projects and provides expertise on the legal aspects of ICT, IPRs, privacy, and data as the founder of an IT law firm and an advisor to companies and the Finnish government.

Sami Jokela

CTO

Dr. Sami Jokela is a seasoned leader with 20+ years of experience in data, technology, and strategy, including roles at Nokia, co-founding startups, and leading Accenture’s technology and information insight practices.

Waltter Roslin

Lawyer

Waltter is a lawyer focusing on questions concerning data sharing, governance, privacy and technology. He is also a PhD researcher at the University of Helsinki where his research focuses on the Finnish pharmaceutical reimbursement scheme.

PART I - Can a copyright holder’s exclusive right to make copies prevent AI developers from using copyrighted works in training data?

Introduction

Artificial Intelligence (AI) presents new challenges in many legal areas. One of those areas is the copyright system, which was developed for a quite different world and era. Companies and other actors developing or applying AI systems face difficulties when trying to comply with copyright law, especially on three questions: 

  • Can a copyright holder’s exclusive right to make copies prevent AI developers from using copyrighted works in training data,  
  • Is the output of a generative AI system copyrightable and who is the author if AI is employed in the areas that have traditionally required human creativity, and 
  • Are AI models copyrightable?  

In this first part of the three-part posting, we analyse the right of copyright holders to prevent AI developers from using copyrighted works in training data, in particular from the perspective of EU law. 

A scale with one side featuring copyright symbols and the other side featuring symbols representing AI algorithms, with a question mark in the center.
Caption: “Balancing Copyright Protection and AI Development: A Legal Dilemma.”

Exclusive rights to prevent training AI 

Creative works are protected by copyright, which is governed by national laws, EU directives, and international treaties. Anything that is original and expressed is protected. The work does not need to be registered or carry a copyright notice (e.g. the © mark), nor does it need to be artistic. The original subject matter must be the author’s own intellectual creation, and only the elements that express that creation are copyrighted. The author must have made creative choices while making the work.1  

For example, writing a longer text such as a novel typically involves creative choices, as the author chooses which words to use and in which order to put them. On the other hand, no single word of that work is copyrightable on its own. Therefore, larger texts and even longer extracts often contain enough originality to be copyrighted, but a single word or a few words taken out of the text are not.  

What does this mean from the AI perspective? In machine learning, statistical models are trained using large amounts of data, e.g. text or images. The model then contains information on the probabilities of collocations of different words or elements of an image. To be more exact, especially in relation to Large Language Models (LLMs), the original training text is replaced with tokens (a unique numeric representation of each word), after which the model is trained to predict the most likely next token. When the model is used, a prompt text is given as an initial context, which is then used similarly to predict the following sequence of tokens. Finally, those tokens are converted into words and sentences. Using such a model, a generative AI system can, for example, produce texts or images that resemble those created by human authors.  
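
As a rough illustration of what tokenisation and next-token prediction mean in practice, the minimal Python sketch below builds a toy vocabulary and a simple bigram count model. The corpus, vocabulary, and “model” are purely illustrative stand-ins and bear no resemblance to how a production LLM is actually implemented.

```python
from collections import Counter, defaultdict

# Toy "training corpus" -- purely illustrative, not a real dataset.
corpus = "the cat sat on the mat . the cat ate ."

# 1. Tokenisation: map each unique word to a numeric token id.
words = corpus.split()
vocab = {w: i for i, w in enumerate(dict.fromkeys(words))}
inv_vocab = {i: w for w, i in vocab.items()}
tokens = [vocab[w] for w in words]

# 2. "Training": count which token tends to follow which (a bigram model,
#    standing in for the probability estimates a real LLM learns).
counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def predict_next(token_id: int) -> int:
    """Return the most frequent next token id seen in training."""
    return counts[token_id].most_common(1)[0][0]

# 3. "Generation": start from a prompt and repeatedly predict the next token.
prompt = ["the", "cat"]
context = [vocab[w] for w in prompt]
for _ in range(3):
    context.append(predict_next(context[-1]))

print(" ".join(inv_vocab[t] for t in context))  # e.g. "the cat sat on the"
```

Even in this toy setting, the generated continuation simply reflects statistical regularities learned from the training text, which is the core idea behind much larger language models.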

From the point of view of copyright, the first question is whether anything copyright-relevant happens in the process. Merely reading text or looking at pictures does not infringe copyright. Similarly, copying individual words or their tokens does not infringe copyright because, as noted above, individual words are not copyrightable. Copying larger passages of text or a whole image, however, may infringe copyright. Thus, training a model may or may not infringe copyright, depending on the training algorithm: whether the training involves copying the author’s creative choices or merely analysing the distances between individual words. A typical, slightly simplified machine learning process consists of reading the text, stripping out non-important characters, and converting the result into a series of tokens. After this, the results are typically stored as token vectors for the learning process, which is then repeated multiple times. Alternatively, the material is first stored as is and converted on the fly during learning, but this is a much less efficient approach than the former. It is likely that the token vectors also capture the results of the creative choices that the original author has made. Therefore, still depending on the algorithm, it is plausible that a machine learning process makes copies of original works and is therefore relevant from the copyright perspective. 
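
The point about token vectors preserving the author’s expression can be illustrated with a small, hypothetical sketch: if preprocessed training examples are stored as token sequences, they can often be decoded back to (nearly) the original wording. The cleaning step and vocabulary below are simplified assumptions, not a description of any particular vendor’s pipeline.

```python
import re

def preprocess(text: str, vocab: dict) -> list:
    """Simplified cleaning + tokenisation: lowercase, drop punctuation,
    map each word to a numeric token id (extending the vocabulary as needed)."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return [vocab.setdefault(word, len(vocab)) for word in cleaned.split()]

# A sentence standing in for an original, creative expression.
original = "The quick-witted fox slipped, almost silently, past the sleeping hounds."

vocab = {}
token_vector = preprocess(original, vocab)   # what would be stored for training

# Decoding the stored token vector recovers the wording almost verbatim,
# which is why storing such vectors can amount to making a copy.
inv_vocab = {i: w for w, i in vocab.items()}
print(" ".join(inv_vocab[t] for t in token_vector))
# -> "the quickwitted fox slipped almost silently past the sleeping hounds"
```

The point is not that every pipeline stores data exactly this way, but that whenever stored token sequences can be decoded back to the original wording, the author’s expression has in effect been reproduced.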

At the time of writing, The New York Times has just sued OpenAI and Microsoft for copyright infringement. In one example of how AI systems use The Times’s material, the media house claimed that ChatGPT reproduced almost verbatim results from Wirecutter, The Times’s product review site.2 OpenAI, on the other hand, denies this. The company says it has measures in place to limit inadvertent memorization and prevent regurgitation in model outputs.3 We do not yet know how the dispute will end, but if The Times is right, OpenAI’s ChatGPT likely infringes copyright: it would be difficult to understand how the software’s output could contain “almost verbatim” copies of the training data if they were not copied into the model first. On the other hand, if OpenAI is right, it is much harder to tell whether anything copyright-relevant is happening in the process. 
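
Whether an output is “almost verbatim” can be probed with simple heuristics, for instance by measuring the longest run of consecutive words that the output shares with a source text. The snippet below is only an illustrative measure with invented example texts; it is not the method used by either party in the dispute.

```python
def longest_shared_run(source: str, output: str) -> int:
    """Length (in words) of the longest sequence of consecutive words
    that appears in both texts -- a rough memorisation indicator."""
    src, out = source.lower().split(), output.lower().split()
    best = 0
    for i in range(len(src)):
        for j in range(len(out)):
            k = 0
            while i + k < len(src) and j + k < len(out) and src[i + k] == out[j + k]:
                k += 1
            best = max(best, k)
    return best

# Invented texts, for illustration only -- not actual Wirecutter content.
article = "the best budget umbrella we tested kept us dry in heavy wind and rain"
model_output = "our pick the best budget umbrella we tested kept us dry in heavy rain"

print(longest_shared_run(article, model_output))  # -> 11 shared words in a row
```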

A snippet of binary code forming a copyrighted work (like a book or image).
Caption: “Decoding Copyright: Can AI Translate Protected Works into Innovation?”

Exceptions to allow training 

The second question is: if training a model is copyright-relevant, is there an exception or limitation in copyright law that would still allow the training? 

The strong exclusive rights that copyright law provides to authors, e.g. the rights to copy, modify, sell, and display the work, have been balanced by exceptions and limitations. These vary from country to country. Often they are enumerated in a copyright statute, but in the USA, for example, they are covered by the fair use doctrine, an open limitation on copyright. Typically, the exceptions cover acts of reproduction by libraries, educational establishments, museums or archives; ephemeral recordings made by broadcasting organizations; illustration for teaching or research purposes; uses for the benefit of handicapped persons; making current events available to the public; and citation or caricature. In particular, in many countries it is legal to make copies of copyrighted works for private use. Recently, in Art. 4 of the DSM directive4, the EU has required that the member states provide for an exception or limitation to copyright for reproductions and extractions of lawfully accessible works for the purposes of text and data mining, unless the use of the works has been expressly reserved by their rightsholders in an appropriate manner. Text and data mining by research organisations and cultural heritage institutions cannot be limited with such a reservation (Art. 3). 
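
The Art. 4 opt-out must be expressed “in an appropriate manner”, which for online content is generally understood to mean machine-readable means. As a purely illustrative sketch (the Directive does not prescribe any particular mechanism or file location), a crawler collecting training material might at least check a site’s robots.txt before fetching works for text and data mining; the crawler name and URLs below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def tdm_allowed(site_robots_url: str, page_url: str, crawler_name: str) -> bool:
    """Check whether robots.txt permits our crawler to fetch a page.
    This is only one possible, illustrative way a rights reservation
    might be expressed; it is not mandated by the DSM Directive."""
    parser = RobotFileParser()
    parser.set_url(site_robots_url)
    parser.read()
    return parser.can_fetch(crawler_name, page_url)

# Hypothetical crawler name and URLs, for illustration only.
if not tdm_allowed("https://example.com/robots.txt",
                   "https://example.com/articles/some-work.html",
                   "ExampleTDMBot"):
    print("Rights reserved - skip this work in the training data.")
```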

It should be noted that the copyright holder’s exclusive right is the main rule, and exceptions and limitations should be interpreted narrowly. Therefore, the exception or limitation on text and data mining should not be interpreted more broadly than it is explicitly expressed in the Directive. An interesting question is whether text and data mining in this context includes machine learning training processes. Art. 2 defines ‘text and data mining’ as any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations. Most experts seem to agree that this definition also covers machine learning. At the time of writing, we do not yet have the final wording of the AI Act, but based on the current drafts, it appears that the AI Act will include a clarification that the text and data mining exception in the DSM Directive applies to the training of AI. Therefore, although we cannot be sure until the European Court of Justice (ECJ) takes a stand on the issue, we presume that using copyrighted works to train artificial intelligence is allowed in accordance with DSM directive Arts. 3 and 4. 

From that perspective, training a model with data that includes copyrighted works would be lawful unless the use of the works has been expressly reserved by the rightsholders. However, that does not make it legal to develop a generative AI system that generates copies of copyrighted works. Making unauthorized copies does not become legal just by claiming that the copy machine includes AI software! 

Conclusions

To conclude our findings in this Part I: depending on the algorithm, a machine learning process can be relevant from the copyright perspective. If the training involves copying the creative choices made by the original author of a copyrighted work in the training data, it could violate the author’s exclusive rights. On the other hand, if the machine learning process can be considered data mining, it can fall within the limitation or exception defined in the DSM directive and therefore be lawful within the EU. Yet, if the output of a generative AI system includes copies of works in the training data, that cannot be justified by that limitation or exception. 

In the following parts, we will first discuss the authorship of AI-generated content and then complete this three-part posting with ideas on copyright in AI models. 

1001 Lakes’ experts are happy to discuss these topics with you if you have concerns about AI and copyright, or about how to develop and use AI in compliance with copyright law. 

What’s the deal with the AI Act?

Meet The Authors

Emeline Banzuzi

Privacy & Data Governance Counsel

Emeline Banzuzi serves as a legal counsel and researcher specializing in the dynamic field of law, technology and society, with expertise in data protection consulting, risk management, compliance within FinTech, and academic research.

Joel Himanen

Data Scientist

Joel Himanen is a versatile data scientist with a strong emphasis on advanced analytics, machine learning, and artificial intelligence, having prior experience in data-driven sustainability projects in both the private and public sectors.

In the early hours of December 9th, the European Union Parliament and Council finally came out with a provisional agreement on the contents of the Artificial Intelligence Act (AIA). In this blog post, we will summarize the main contents of the AIA and discuss its possible implications and open questions using the development and deployment of Large Language Models (LLM) as an example. 

The short version

The EU’s Artificial Intelligence Act aims to govern the development and deployment of AI systems in the EU while ensuring that these systems are safe and respect the health, safety, and fundamental rights and freedoms of EU citizens. The provisional agreement states that the Act will apply two years after its entry into force (i.e. following its publication in the Official Journal of the EU), shortened to six months for the bans it contains. The Act most notably impacts AI system deployers, who are regulated according to the risk category of their use case. On the generative AI side, foundation model developers face significant requirements for transparency, safeguards, and testing. 

Digging a little deeper

The first draft of the Act was published in April 2021, and its final version is currently undergoing the EU legislative procedure. After the latest agreement, the Act still needs to be confirmed by both the Parliament and the Council, as well as undergo legal-linguistic revisions, before formal adoption. 

The Act defines an “AI system” as a machine-based system that, with varying levels of autonomy and for explicit or implicit objectives, generates outputs such as predictions, recommendations, or decisions that can influence physical or virtual environments. The regulation applies to providers, deployers, and distributors of AI systems as well as “affected persons”, meaning individuals or groups of persons who are subject to or otherwise affected by an AI system.

The AIA establishes varying obligations for developers and deployers of AI systems, depending on which risk classification the system in question falls into. The Act presents four risk categories (sketched, purely for illustration, in the code example after the list below), namely: 

  • Unacceptable risk: AI systems that are a clear threat to the safety, livelihoods, and rights of individuals (e.g. systems used for social scoring and systems that exploit vulnerable groups such as children). The use of these systems is prohibited. 
  • High risk: AI systems that pose significant harm to the health, safety, or fundamental rights of individuals. Examples of high-risk AI systems include those used for the management of critical infrastructure, education, employment, law enforcement, and border control. High-risk systems will be subject to strict obligations before they can be placed on the market: providers and deployers of these systems must, for instance, develop a risk management process for risk identification and mitigation; apply appropriate data governance and management practices to training, validation, and testing data sets; enable human oversight; ensure technical robustness and cybersecurity; as well as draw up documentation that demonstrates AIA compliance. (For a complete list of obligations, see Arts. 9-17 AIA.)  
  • Limited risk: Examples of limited-risk AI systems include systems intended to interact with individuals, e.g. chatbots and deep fakes. The compliance obligations for limited-risk AI focus on transparency: users of these systems must be clearly informed that they are interacting with an AI system. 
  • Minimal risk: Examples of minimal risk AI include spam filters, AI-enabled video games, and inventory management systems. The AIA allows for the free use of minimal risk AI.  
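
To show the shape of this risk-based logic in code form, here is a deliberately simplified, hypothetical mapping of example use cases to categories. Classifying a real system requires a legal assessment against the Act’s provisions and annexes, not a lookup table; the use cases and category descriptions below are illustrative only.

```python
from enum import Enum

class RiskCategory(Enum):
    UNACCEPTABLE = "prohibited"
    HIGH = "strict obligations before market entry"
    LIMITED = "transparency obligations"
    MINIMAL = "free use"

# Hypothetical, greatly simplified examples of how use cases might map to
# the AIA's categories.
EXAMPLE_USE_CASES = {
    "social scoring of citizens": RiskCategory.UNACCEPTABLE,
    "CV screening for hiring": RiskCategory.HIGH,
    "customer service chatbot": RiskCategory.LIMITED,
    "email spam filter": RiskCategory.MINIMAL,
}

def obligations_for(use_case: str) -> str:
    category = EXAMPLE_USE_CASES.get(use_case)
    if category is None:
        return f"{use_case}: needs case-by-case legal assessment"
    return f"{use_case}: {category.name} risk -> {category.value}"

for case in EXAMPLE_USE_CASES:
    print(obligations_for(case))
```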

The risk categories have fluctuated throughout the drafting stages of the AIA.

Implications for model developers and deployers 

AI model and application developers are, of course, quite anxious about the Act, because it has the potential to fundamentally impact development and usage processes. As the AIA proposal phase is being finalized, it is important to consider possible scenarios and think about the impact the Act would have on different groups in the AI field. 

Let’s consider the hottest AI topic of 2023: Large Language Models (LLMs). One way to view the LLM lifespan is to divide it into three phases (upstream to downstream): foundation model (FM) development, fine-tuning, and deployment. What possible implications would the AI Act have for these phases? 

Foundation model developers are the ones doing the “heavy lifting”. They develop the model architecture, scrape together and process the enormous data masses required to pre-train the model, and execute the actual pre-training, during which the model learns most of its capabilities. These are organizations backed by significant resources, since gathering the data and especially the compute-intensive pre-training are expensive activities. Having the most impact on the model itself, an FM developer will, according to the current proposal, be regulated relative to the cumulative amount of compute used for model training. For example, an FM classified as “high-impact” (more than 10^25 floating point operations during training) would also face stricter transparency requirements concerning, for instance, the disclosure of copyrighted training material. This is a huge requirement: the amount of data required for pre-training is so massive that its collection is highly automated, and thus there is only minimal control over the substance itself. An interesting detail is that, according to the latest agreement, open-source models will be subject to lighter regulation. 
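
The 10^25 floating-point-operation threshold can be put in perspective with a common back-of-the-envelope estimate: training compute is often approximated as roughly 6 × (number of parameters) × (number of training tokens). The model sizes and token counts below are illustrative assumptions, not the specifications of any actual model.

```python
# Rule-of-thumb training-compute estimate: FLOPs ~ 6 * parameters * tokens.
HIGH_IMPACT_THRESHOLD = 1e25  # FLOPs, per the provisional AIA agreement

def training_flops(parameters: float, tokens: float) -> float:
    return 6 * parameters * tokens

# Hypothetical model configurations, for illustration only.
candidates = {
    "7B params, 2T tokens": training_flops(7e9, 2e12),      # ~8.4e22
    "70B params, 15T tokens": training_flops(70e9, 15e12),   # ~6.3e24
    "500B params, 20T tokens": training_flops(500e9, 20e12), # ~6.0e25
}

for name, flops in candidates.items():
    status = "high-impact" if flops >= HIGH_IMPACT_THRESHOLD else "below threshold"
    print(f"{name}: {flops:.1e} FLOPs -> {status}")
```

Under this rough heuristic, only the largest training runs would cross the high-impact line, which is exactly why the threshold matters most to the best-resourced FM developers.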

Fine-tuners have a smaller, yet significant, impact on the model. They take a pre-trained FM and continue training it on a smaller, more specialized dataset. In a way, they perform the same manipulations on the model as the FM developer, just on a smaller scale. An interesting question follows: how will the AIA distinguish between them? Will fine-tuners be subject to the same compute-relative transparency requirements as FM developers? In any case, fine-tuners will have it easier in the sense that they have far more control over the content of their datasets. 

Model deployers (considering them separate from fine-tuners) do not affect the LLM itself. Rather, they decide on the final use case (although the fine-tuner might already have trained the model for that case) and control how the model can be used. This means that they will most likely be subject to the bulk of the AIA’s risk-category-based regulation. Deployers also build the software around the FM, affecting how the model can be used, how its inputs and outputs are processed, and how much control the end user is able to exercise over it. Consequently, more “classical” questions of software and information security may well become a critical part of AIA compliance. 
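
As a sketch of the kind of “software around the model” a deployer controls, the wrapper below screens prompts and post-processes outputs before they reach the end user. The function names, filter terms, and dummy model are invented for illustration; a real deployment would also involve logging, monitoring, and broader security controls.

```python
from typing import Callable

BLOCKED_TERMS = {"credit card number", "social security number"}  # illustrative

def guarded_generate(model_call: Callable[[str], str], prompt: str) -> str:
    """Deployer-side wrapper: screen the input, call the model,
    then post-process the output before returning it to the user."""
    lowered = prompt.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "Sorry, this request cannot be processed."
    output = model_call(prompt)
    # Simple output post-processing: label AI-generated content (transparency).
    return f"[AI-generated] {output.strip()}"

# Hypothetical stand-in for a fine-tuned foundation model.
def dummy_model(prompt: str) -> str:
    return f"Echoing your question: {prompt}"

print(guarded_generate(dummy_model, "What does the AI Act require from deployers?"))
```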

What next? 

For now, we must wait for the finalized texts to come out to grasp the details of the Act. Meanwhile, every organization dealing with AI systems will have to ponder the implications of what we know now. Deployers will already have to start giving serious thought to risk categorization and the requirements that follow, while FM developers must brace themselves for the additional work that comes with curating masses of training data and weigh open- vs. closed-source development in a new light. 

Trustworthy data for responsibility and sustainability

Meet The Author

Marko Turpeinen

CEO

Dr. Marko Turpeinen is a visionary leader with 25+ years of experience in digital transformation and innovation, having worked at prestigious institutions like MIT Media Lab and EIT Digital, and initiating the global MyData movement at Aalto University.

Data and AI play a crucial role in proving that companies act responsibly and meet their environmental, social and governance (ESG) targets.

An image representing ethical practices, such as a person holding a data globe with care

The current reality is that ESG data practices are inefficient and inaccurate. ESG data comes from a myriad of sources and is of variable quality. The availability of data is spotty, especially when the scope of data collection and analysis extends beyond a company’s own borders to its supply chain and partners. There is plenty of manual work involved, and every company does the work by itself. This results in vast amounts of duplicate work.

Collaborative Data Sharing in the Era of CSRD

The European Union’s Corporate Sustainability Reporting Directive (CSRD) came into effect in January this year. It modernises and strengthens the rules concerning the ESG information that companies are required to report. Large listed companies are expected to begin reporting in 2025 based on their 2024 data, and other companies will follow suit as the CSRD is gradually rolled out. Companies subject to the CSRD will have to report according to the European Sustainability Reporting Standards (ESRS), provide the reporting in a standardised digital format, and include their business networks (e.g. supply chains) in their environmental impact reporting.

A very large number of companies will be affected by growing regulatory demands regarding ESG reporting. What if companies could collaborate more efficiently to meet these needs? Instead of every company collecting the data for itself, there would be clear benefits in forming data sharing practices to make sustainability data available to all parties in the ecosystem. This would help to minimize duplicate work for ecosystem participants and provide better transparency of the whole value chain for all. In a data ecosystem, sustainability improvements can be driven, and even co-funded, by the whole value chain together.

An image portraying a handshake or a group of people collaborating

The Rulebook Approach for Mitigating Risks and Ensuring Fair Data Use in Ecosystems

Despite its clear benefits, data sharing also raises several thorny issues regarding business risks, data hygiene, disclosure of trade secrets, corporate security policies, and fair data use. How can a company show that its data and methods can be trusted? How can the ecosystem participants trust each other not to misuse the data? Do others gain an unfair advantage from my data?

Trust-building, fair data use, and minimization of risks among the ecosystem participants can be tackled with a rulebook approach. Sitra’s fair data economy rulebook model is one leading example of this approach, taking a holistic view of the governance of data ecosystems. It helps organizations form new data sharing networks and implement policies and rules for them.

The rulebook approach also helps data providers and data users to appropriately assess any requirements imposed by applicable legislation and contracts, in addition to guiding them in adopting practices that promote the use of data and the management of risks. With the aid of the rulebook approach, parties can establish a data network that is based on mutual trust and shares a common mission, vision, and values. This fosters trust and the responsible use of data.


The Imperative of Responsibility and Sustainability in the Industrial Landscape

Responsibility and sustainability have risen as key drivers for creating functioning data ecosystems. This is demonstrated in lighthouse data sharing initiatives, such as Catena-X for the automotive industry. The aim of Catena-X is to grow into a network of more than 200,000 data sharing organizations. Catena-X has picked harmonized and accurate ESG reporting as the most urgent business challenge to be resolved in the ecosystem.

We are headed towards a future where data sharing and collaboration are expected on a massive scale, potentially influencing everyone who has a stake in the industrial ecosystem. As the importance and impact of these initiatives spread and grow, a holistic ESG data governance approach becomes business-critical for building trust in data ecosystems.
