Generative artificial intelligence (gen AI) tools such as ChatGPT, Claude and Perplexity are built on artificial neural networks trained on huge amounts of data, such as text or images. The large language models (LLMs) that underlie this technology are voracious consumers of training data. Some of that data comes from publicly available datasets such as ImageNet, Common Objects in Context or Common Corpus, or from open datasets provided by governments. However, AI companies have also drawn on social media feeds and published books, and have used web-crawling to scrape content from the internet. Even TV shows, movies and their transcripts can serve as training data, along with subtitles, which are available in datasets like OpenSubtitles.
The use of published material to train LLMs is legally problematic. Copyright law dates back to the Statute of Anne of 1710 in the UK, mirrored by the Copyright Act of 1790 in the US, and was later extended internationally, notably by the Berne Convention of 1886. These statutes protect authors of creative works from unauthorised use of their content, and protection typically continues after an author’s death for the benefit of their estate, usually for 70 years. The use of copyright-protected material by companies developing LLMs is currently a hot topic in the legal profession, and may pose a significant threat to the generative AI industry. The industry argues that training LLMs is “fair use” and, at the time of writing, is having some success with that argument.
In the USA, in a case called Bartz v Anthropic, three authors have sued the AI company Anthropic for allegedly copying their books to train the Claude LLM; in July 2025 a judge ruled that the plaintiffs can represent authors nationwide in a class action, and the case is likely to go to trial in December 2025. In the EU, a Hungarian news publisher called Like Company has brought a case against Google before the Court of Justice of the European Union. A law firm that tracks such litigation lists dozens of pending cases, including actions brought by The Authors Guild against OpenAI, The New York Times against OpenAI and Microsoft, and Disney and Universal against Midjourney. With statutory damages of up to $150,000 per infringed work and a class of potentially seven million authors, the Bartz v Anthropic suit alone represents significant financial peril to the AI industry.
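To put that figure in perspective, a crude upper-bound calculation (assuming, purely for illustration, that the maximum statutory award were applied to every work in the class) runs as follows:

7,000,000 works × $150,000 per work = $1,050,000,000,000, or roughly $1.05 trillion.

Actual awards would almost certainly be far lower, but even a small fraction of that sum would be ruinous for any AI company.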
The legal profession itself has its own issues with generative AI. There are now hundreds of cases in which lawyers have used LLMs to help draft documents submitted to courts that contain “hallucinations”. LLMs are essentially probability engines, producing statistically likely output based on their underlying weightings and training data; they also produce fabricated or nonsensical output (“hallucinations”) at an alarming rate. Some estimates put the rate at around 20% of output, and it appears to be worsening in more recent models. The result is citations of legal precedents that look plausible but are entirely invented by the LLM. In July 2025 alone, over 50 instances of fabricated citations were reported in US courts. The UK High Court has been unimpressed, pointing to a damages claim in which 18 of 45 case-law citations were fabricated, and has demanded that senior lawyers take urgent action to prevent the use of AI in this way.

Sanctions have already been imposed on some lawyers who filed court documents containing AI-generated citations. This does not seem to have discouraged lawyers seeking to save time: a September 2024 survey by LexisNexis of 800 lawyers in the UK and Ireland found that 41% of respondents used AI in their jobs. While there are undoubtedly valid uses of LLMs by lawyers, such as research and due diligence, the hundreds of hallucinated legal documents suggest that law firms will need to monitor and control their use of AI carefully.
Governments around the world have been struggling to keep up with the pace of AI, though regulatory activity is emerging. From Argentina to Japan, and from Australia to Peru, AI legislative initiatives are being introduced, most notably the EU’s AI Act. These laws will add further complexity for AI companies as they seek to navigate a thicket of legislation.
The IT industry is used to operating at a blistering pace, epitomised by the Silicon Valley mantra of “move fast and break things” (Facebook’s motto until 2014). The legal profession and government legislation typically move at a much steadier pace. However, the pending copyright cases against AI companies that may have used copyrighted material to train their LLMs mean that substantial legal threats are starting to rumble down the track towards the AI industry. Whatever the outcomes of the pending cases, this litigation is likely to make many lawyers a great deal of money.