Out-Law News 3 min. read

ICO’s generative AI guide underlines data-scraping compliance risks


New guidance issued by the Information Commissioner’s Office (ICO) has highlighted hurdles AI developers will have to overcome to comply with UK data protection laws when scraping personal data from online sources to train their models, an expert has said.

Data protection law specialist Jonathan Kirsop of Pinsent Masons was commenting after the ICO opened a consultation on draft new guidance focused on the lawful processing of “web-scraped personal data” to train generative AI systems.

In its draft guidance, the ICO said that there is likely to be only one lawful basis for processing web-scraped personal data that AI developers can currently rely on under UK data protection law – ‘legitimate interests’.

It is lawful, under the UK General Data Protection Regulation (GDPR), to process personal data if the processing is necessary for the purposes of the legitimate interests pursued by the controller or by a third party – provided the interests cited are not “overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data […]”. Any organisation seeking to rely on legitimate interests must therefore carry out a balancing exercise.

According to the ICO, one of the “valid interest[s]” that AI developers can pursue is “the business interest in developing a model and deploying it for commercial gain”. It added that “most generative AI training is only possible using the volume of data obtained through large-scale scraping” – a view that is relevant to the need for AI developers to ensure their planned processing is necessary for achieving their desired purpose.

AI developers must also ensure that the interests of the individuals whose data they intend to process do not override their own, or a third party’s, interest. On this point, the ICO suggested that developers can go a long way towards meeting this “balancing test” by complying with a further data protection duty that arises when web-scraping personal data – the duty to undertake a data protection impact assessment. Developers who use such an assessment to identify and mitigate the risks of the planned processing, it said, will be better placed to satisfy themselves that the balancing test is met.

“Training generative AI models on web scraped data can be feasible if generative AI developers take their legal obligations seriously and can evidence and demonstrate this in practice,” the ICO said. “Key to this is the effective consideration of the legitimate interest test.”

“Developers using web scraped data to train generative AI models need to be able to: evidence and identify a valid and clear interest; consider the balancing test particularly carefully when they do not or cannot exercise meaningful control over the use of the model; demonstrate how the interest they have identified will be realised, and how the risks to individuals will be meaningfully mitigated, including their access to their information rights,” it said.

Kirsop said: “In practice, the communication of effective and transparent information to data subjects will be the most difficult duty for developers to fulfil.”

“The draft guidance refers to the risk that ‘individuals lose control over their data as they are not informed of its processing’. This undermines the ability to satisfy any legitimate interest balancing test and may render any resulting processing contrary to the fairness principle. I would expect this to be a bar that developers could struggle to clear, and it will be interesting to see whether this is drawn out further in the finalised guidance,” he said.

The ICO’s draft guidance on the lawful basis for web scraping to train generative AI models is the first in a series of new guides that the authority intends to issue on how data protection law applies in the context of generative AI. The draft guidance is open to consultation until 1 March 2024.

The ICO has already developed more general guidance on AI and data protection. It highlighted in its latest draft guidance that AI developers will need to give thought to their compliance with other legal frameworks beyond the data protection regime when intending to web-scrape data to train their generative AI models.

Technology law expert Sarah Cameron said: “AI developers need to be aware, in particular, of the risk of copyright infringement. There are a number of high-profile copyright infringement claims against AI developers before the courts, including the Getty Images v Stability AI case in the UK and the New York Times v OpenAI and Microsoft case in the US. This highlights the increasing willingness of content creators to assert their intellectual property (IP) rights in the context of AI development.”

“The UK government is currently in the process of trying to broker the development of a new AI copyright code of practice to balance the respective interests of AI developers and IP rightsholders, without having to resort to legislating on the matter. It is just one of a number of complex issues that the government, and other policymakers globally, are grappling with as they consider whether, and to what extent, AI as a technology and its use should be regulated. Further developments in respect of the UK’s plans for AI regulation are expected this spring,” she said.
