AI developers that obtain data from online sources to train ‘general purpose AI’ (GPAI) models must compile a list of the websites from which they sourced the most data – and publish it on their own websites – to comply with the EU AI Act.

The requirement is specified in a new template that providers of GPAI models must adopt when seeking to meet copyright-related obligations arising under the AI Act, which begin to take effect later this week.

Dr Nils Rauer, who specialises in AI regulation and intellectual property law at Pinsent Masons, said: “Article 53(1)(d) EU AI Act requires all providers of GPAI models to draw up and make publicly available a sufficiently detailed public summary of the content used for the training of the respective model. It was for the AI Office to develop and provide a template providers may use when compiling the required summary. This template has now been published.”

“The legislator’s core intention is the generation of adequate transparency as regards the training of GPAI models which inevitably involves a great amount of data, also referred to as ‘big data’. As stated in recital 107 of the EU AI Act, the summary shall comprise information both on the pre-training and training phase. The main focus is put on content protected by copyright law, but the scope is broader and covers all types of proprietary information being used for training purposes,” he said.

“The ultimate aim is to strike a balance between serving the interests of parties with legitimate interests and promoting increased transparency of the training content. Obviously, this requires taking due account of the providers’ need to protect their trade secrets and confidential business information. It is by no means easy to get the balance right. It is to be hoped that the now published template provides guidance and steers what information ought to be published,” Rauer added.

In publishing the new template, the AI Office made clear that providers do not need to disclose “the details for the specific data and works used to train the model”. This, it considered, would go beyond the legislative requirement to publish a summary. However, it said the information disclosed about the training content “should be comprehensive in its scope and sufficiently detailed to achieve the objective of the summary of providing meaningful public transparency and facilitating parties with legitimate interests to exercise and enforce their rights under Union law”.

The AI Office’s template addresses the possibility that data used for pre-training GPAI models is drawn from multiple sources. Different types of information will need to be disclosed depending on the source of the data, which it said could include datasets compiled by third parties that are publicly available and free to use, unlicensed private datasets, user interactions, synthetic data, and/or data crawled and scraped from online sources, among other possible sources.

Where providers obtain data from online sources they must “provide a list of the most relevant internet domain names … by listing the top 10% of all domain names determined by the size of the content scraped”. SME providers only need to disclose the top 5% of all domain names or 1,000 internet domain names, whichever is lower.
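The threshold arithmetic above can be sketched in code. This is a minimal illustration only, assuming a provider already knows the size of content scraped per domain; the function name, the byte-based size measure, and the rounding-up behaviour are assumptions for the example, not part of the template itself.

```python
import math

def domains_to_list(domain_sizes: dict[str, int], sme: bool = False) -> list[str]:
    """Illustrative only: which domains a provider would list, largest first.

    domain_sizes maps each internet domain name to the size of content
    scraped from it (e.g. in bytes). Non-SME providers list the top 10%
    of all domains by scraped size; SMEs list the top 5% or 1,000 domains,
    whichever is lower. Rounding up is an assumption, so that at least one
    domain is listed whenever any data was scraped.
    """
    ranked = sorted(domain_sizes, key=domain_sizes.get, reverse=True)
    share = 20 if sme else 10              # top 5% for SMEs, top 10% otherwise
    cutoff = math.ceil(len(ranked) / share)
    if sme:
        cutoff = min(cutoff, 1000)         # SME cap: at most 1,000 domains
    return ranked[:cutoff]
```

For example, a non-SME provider that scraped 50 domains would list 5 of them under this reading; an SME with the same 50 domains would list 3, since the 5% cut-off is well below the 1,000-domain cap.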

The AI Office said its template balances the requirement for transparency provided in the law with the protection of intellectual property rights and trade secrets held by providers. To this end it confirmed that “private datasets not commercially licensed by rightsholders and obtained from other third parties have to be listed only if publicly known (or the provider wants to make them publicly known)”. Otherwise, providers need only describe those datasets “in a general manner”.

Providers must ensure the summaries they disclose “cover data used in all stages of the model training, from pre-training to post-training, including model alignment and fine-tuning”, and that they are updated when new datasets are used in training. Model refiners need only disclose the data used for refinement, citing the base model. Similarly, model distillers are required to disclose only a limited amount of data.

The summary must be published both on the provider’s own website and on the channels used to distribute the GPAI model.

The article 53 rules – and other AI Act rules applicable to GPAI models – take effect on 2 August 2025. Providers of GPAI models placed on the EU market before that date have until 2 August 2027 to publish their summaries in line with the new template, subject to a limited exception.

“Where a provider of a model placed on the market before 2 August 2025 cannot, despite their best efforts, provide parts of the information required to prepare the summary because the information is not available or its retrieval would impose a disproportionate burden on the provider, the provider should clearly state and justify the corresponding information gaps in its summary,” the AI Office said.

“Clearly, the pre-training of GPAI models has led to legal disputes,” said Rauer. “Notably in the US, but also in other countries such as the UK and Germany, there are pending lawsuits in which rightsholders are challenging developers of GPAI models, asserting copyright and data privacy infringements. There is a genuine conflict of interests between those in need of huge amounts of training data and those holding proprietary rights to publicly available information.”

“It is no secret that, in the first place, transparency is needed to resolve these conflicts. The concept of a summary of information being made available to the public therefore generally resonates. However, what is essential in this context is the right balance. This is because preparing such a summary inevitably comes with administrative burdens, with additional monetary and human resources being required. Also, as appreciated by the AI Office, trade secrets and confidential business information are at stake at the providers’ end. Thus, the now published template must not be misused to demand overly detailed information,” he said.

Anna-Lena Kempf, who works with Rauer in Frankfurt, said: “Providers that do not comply with the template requirements face fines of up to 3% of their annual total worldwide turnover or €15 million, whichever is higher. The AI Office will begin enforcing the rules for GPAI models from 2 August 2026.”

Publication of the template comes after the AI Office issued its finalised GPAI code of practice earlier this month. Providers that adhere to the code can use it to demonstrate compliance with the underlying AI Act provisions, but adherence to the code is entirely voluntary, and compliance with the legislation can also be demonstrated by other means.
