Is It Legal to Train AI on Books? A High School Researcher’s Take on the Anthropic Ruling

As someone who’s been exploring computational linguistics and large language models (LLMs), I’ve always wondered: How legal is it, really, to train AI on books or copyrighted material? This question came up while I was learning about how LLMs are trained using massive datasets, including books, articles, and other written works. It turns out the legal side is just as complex as the technical side.

A major U.S. court case in June 2025 helped answer this question, at least for now. In this post, I’ll break down what happened and what it means for researchers, developers, and creators.


The Big Picture: Copyright, Fair Use, and AI

In the U.S., books and intellectual property (IP) are protected under copyright law. That means you can’t just use someone’s novel or article however you want, especially if it’s for a commercial product.

However, there’s something called fair use, which allows limited use of copyrighted material without permission. Whether something qualifies as fair use depends on four factors:

  1. The purpose of the use (such as commercial vs. educational)
  2. The nature of the original work
  3. The amount used
  4. The effect on the market value of the original

LLM developers often argue that training models is “transformative.” In other words, the model doesn’t copy the books word for word. Instead, it learns patterns from large collections of text and generates new responses based on those patterns.

Until recently, this argument hadn’t been fully tested in court.


What Just Happened: The Anthropic Case (June 24, 2025)

In a landmark decision, U.S. District Judge William Alsup ruled that AI company Anthropic did not violate copyright law when it trained its Claude language model on books. The case was brought by authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, who argued that Anthropic had used their work without permission.

  • Andrea Bartz: The Lost Night: A Novel
  • Charles Graeber: The Good Nurse: A True Story of Medicine, Madness, and Murder
  • Kirk Wallace Johnson: The Fisherman and the Dragon: Fear, Greed, and a Fight for Justice on the Gulf Coast

Judge Alsup ruled that Anthropic’s use of the books qualified as fair use. He called the training process “exceedingly transformative” and explained that the model did not attempt to reproduce the authors’ styles or specific wording. Instead, the model learned patterns and structures in order to generate new language, similar to how a human might read and learn from books before writing something original.

However, the court also found that Anthropic made a serious mistake. The company had copied and stored more than 7 million pirated books in a central data library. Judge Alsup ruled that this was not fair use and was a clear violation of copyright law. A trial is scheduled for December 2025 to determine possible penalties, which could be up to $150,000 per work.


Why This Case Matters

This is the first major U.S. court ruling on whether training generative AI on copyrighted works can qualify as fair use. The result was mixed. On one hand, the training process itself was ruled legal. On the other hand, obtaining the data illegally was not.

This means AI companies can argue that their training methods are transformative, but they still need to be careful about where their data comes from. Using pirated books, even if the outcome is transformative, still violates copyright law.

Other lawsuits are still ongoing. Companies like OpenAI, Meta, and Microsoft are also facing legal challenges from authors and publishers. These cases may be decided differently, depending on how courts interpret fair use.


My Thoughts as a Student Researcher

To be honest, I understand both sides. As someone who is really excited about the possibilities of LLMs and has worked on research projects involving language models, I think it’s important to be able to learn from large and diverse datasets.

At the same time, I respect the work of authors and creators. Writing a book takes a lot of effort, and it’s only fair that their rights are protected. If AI systems are going to benefit from their work, then maybe there should be a system that gives proper credit or compensation.

For student researchers like me, this case is a reminder to be careful and thoughtful about where our data comes from. It also raises big questions about what responsible AI development looks like, not just in terms of what is allowed by law, but also what is fair and ethical.


Wrapping It Up

The Anthropic ruling is a big step toward defining the legal boundaries for training AI on copyrighted material. It confirmed that training can be legal under fair use if it is transformative, but it also made clear that sourcing content from pirated platforms is still a violation of copyright law.

This case does not settle the global debate, but it does provide some clarity for researchers and developers in the U.S. Going forward, the challenge will be finding a balance between supporting innovation and respecting the rights of creators.

— Andrew

Update (September 5, 2025):

AI startup Anthropic will pay at least $1.5 billion to settle a copyright infringement lawsuit over its use of books downloaded from the Internet to train its Claude AI models. The federal case, filed last year in California by several authors, accused Anthropic of illegally scraping millions of works from ebook piracy sites. As part of the settlement, Anthropic has agreed to destroy datasets containing illegally accessed works. (Read the full report)

Leave a comment

Blog at WordPress.com.

Up ↑