AI developers that obtain data from online sources to train ‘general purpose AI’ (GPAI) models must compile a list of the websites from which they sourced the most data – and publish it on their own websites – to comply with the EU AI Act.

The requirement is specified in a new template that providers of GPAI models must adopt when seeking to meet copyright-related obligations arising under the AI Act, which begin to take effect later this week.

Dr Nils Rauer, who specialises in AI regulation and intellectual property law at Pinsent Masons, said: “Article 53(1)(d) EU AI Act requires all providers of GPAI models to draw up and make publicly available a sufficiently detailed public summary of the content used for the training of the respective model. It was for the AI Office to develop and provide a template providers may use when compiling the required summary. This template has now been published.”

“The legislator’s core intention is the generation of adequate transparency as regards the training of GPAI models which inevitably involves a great amount of data, also referred to as ‘big data’. As stated in recital 107 of the EU AI Act, the summary shall comprise information both on the pre-training and training phase. The main focus is put on content protected by copyright law, but the scope is broader and covers all types of proprietary information being used for training purposes,” he said.

“The ultimate aim is to strike a balance between serving the interests of parties with legitimate interests and promoting increased transparency of the training content. Obviously, this requires taking due account of the providers’ need to protect their trade secrets and confidential business information. It is by no means easy to get the balance right. It is to be hoped that the now published template provides guidance and steers what information ought to be published,” Rauer added.

In publishing the new template, the AI Office made clear that providers do not need to disclose “the details for the specific data and works used to train the model”. This, it considered, would go beyond the legislative requirement to publish a summary. However, it said the information disclosed about the training content “should be comprehensive in its scope and sufficiently detailed to achieve the objective of the summary of providing meaningful public transparency and facilitating parties with legitimate interests to exercise and enforce their rights under Union law”.

The AI Office’s template addresses the possibility that data used for pre-training GPAI models is drawn from multiple sources. Different types of information will need to be disclosed depending on the source of the data, which it said could include datasets compiled by third parties that are publicly available and free to use, unlicensed private datasets, user interactions, synthetic data, and/or data crawled and scraped from online sources, among other possible sources.

Where providers obtain data from online sources they must “provide a list of the most relevant internet domain names … by listing the top 10% of all domain names determined by the size of the content scraped”. SME providers only need to disclose the top 5% of all domain names or 1,000 internet domain names, whichever is lower.
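The threshold arithmetic above can be sketched in code. This is a minimal illustration only, assuming a provider already knows the size of content scraped per domain; the function name, the byte-based size measure, and the rounding-up behaviour are assumptions for the example, not part of the template itself.

```python
import math

def domains_to_list(domain_sizes: dict[str, int], sme: bool = False) -> list[str]:
    """Illustrative only: which domains a provider would list, largest first.

    domain_sizes maps each internet domain name to the size of content
    scraped from it (e.g. in bytes). Non-SME providers list the top 10%
    of all domains by scraped size; SMEs list the top 5% or 1,000 domains,
    whichever is lower. Rounding up is an assumption, so that at least one
    domain is listed whenever any data was scraped.
    """
    ranked = sorted(domain_sizes, key=domain_sizes.get, reverse=True)
    share = 20 if sme else 10              # top 5% for SMEs, top 10% otherwise
    cutoff = math.ceil(len(ranked) / share)
    if sme:
        cutoff = min(cutoff, 1000)         # SME cap: at most 1,000 domains
    return ranked[:cutoff]
```

For example, a non-SME provider that scraped 50 domains would list 5 of them under this reading; an SME with the same 50 domains would list 3, since the 5% cut-off is well below the 1,000-domain cap.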

The AI Office said its template balances the requirement for transparency provided in the law with the protection of intellectual property rights and trade secrets held by providers. To this end it confirmed that “private datasets not commercially licensed by rightsholders and obtained from other third parties have to be listed only if publicly known (or the provider wants to make them publicly known)”. Otherwise, providers need only describe those datasets “in a general manner”.

Providers must ensure the summaries they disclose “cover data used in all stages of the model training, from pre-training to post-training, including model alignment and fine-tuning”, and that they are updated when new datasets are used in training. Model refiners need only disclose the data used for refinement, citing the base model. Similarly, model distillers are required to disclose only a limited amount of data.

The summary must be published both on the provider’s own website and on the channels used to distribute the GPAI model.

The article 53 rules – and other AI Act rules applicable to GPAI models – take effect on 2 August 2025. Providers of GPAI models placed on the EU market before that date have until 2 August 2027 to publish their summaries in line with the new template, subject to a limited exception.

“Where a provider of a model placed on the market before 2 August 2025 cannot, despite their best efforts, provide parts of the information required to prepare the summary because the information is not available or its retrieval would impose a disproportionate burden on the provider, the provider should clearly state and justify the corresponding information gaps in its summary,” the AI Office said.

“Clearly, the pre-training of GPAI models has led to legal disputes,” said Rauer. “Notably in the US, but also in other countries such as the UK and Germany, there are pending lawsuits in which rightsholders are challenging developers of GPAI models, asserting copyright and data privacy infringements. There is a genuine conflict of interests between those in need of huge amounts of training data and those holding proprietary rights to publicly available information.”

“It is no secret that, in the first place, transparency is needed to resolve these conflicts. The concept of a summary of information being made available to the public therefore generally resonates. However, what is essential in this context is the right balance. This is because preparing such a summary inevitably comes with administrative burdens, with additional monetary and human resources being required. Also, as appreciated by the AI Office, trade secrets and confidential business information are at stake at the providers’ end. Thus, the now published template must not be misused to demand overly detailed information,” he said.

Anna-Lena Kempf, who works with Rauer in Frankfurt, said: “Providers that do not comply with the template requirements face fines of up to 3% of their annual total worldwide turnover or €15 million, whichever is higher. The AI Office will begin enforcing the rules for GPAI models from 2 August 2026.”

Publication of the template comes after the AI Office issued its finalised GPAI code of practice earlier this month. Providers that adhere to the code can use it to demonstrate compliance with the underlying AI Act provisions, but adherence to the code is entirely voluntary, and compliance with the legislation can also be demonstrated by other means.
