Is It Legal to Train AI on Books? A High School Researcher’s Take on the Anthropic Ruling

As someone who’s been exploring computational linguistics and large language models (LLMs), I’ve always wondered: How legal is it, really, to train AI on books or copyrighted material? This question came up while I was learning about how LLMs are trained using massive datasets, including books, articles, and other written works. It turns out the legal side is just as complex as the technical side.

A major U.S. court case in June 2025 helped answer this question, at least for now. In this post, I’ll break down what happened and what it means for researchers, developers, and creators.


The Big Picture: Copyright, Fair Use, and AI

In the U.S., books and intellectual property (IP) are protected under copyright law. That means you can’t just use someone’s novel or article however you want, especially if it’s for a commercial product.

However, there’s something called fair use, which allows limited use of copyrighted material without permission. Whether something qualifies as fair use depends on four factors:

  1. The purpose of the use (such as commercial vs. educational)
  2. The nature of the original work
  3. The amount used
  4. The effect on the market value of the original

LLM developers often argue that training models is “transformative.” In other words, the model doesn’t copy the books word for word. Instead, it learns patterns from large collections of text and generates new responses based on those patterns.
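That "learns patterns instead of copying" idea can be illustrated with a toy example. The sketch below is a tiny bigram model, a drastically simplified stand-in for an LLM (real models are vastly more complex): it counts which words tend to follow which, then generates new sequences from those statistics rather than storing or replaying any passage verbatim.

```python
import random
from collections import defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it and how often."""
    words = text.split()
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, length=8, seed=0):
    """Sample a word sequence from the learned statistics."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:
            break
        choices = list(followers)
        weights = [followers[w] for w in choices]
        out.append(rng.choices(choices, weights=weights)[0])
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug"
model = train_bigrams(corpus)
print(generate(model, "the"))
```

The model here holds only word-pair counts, not the original sentence, which is the intuition behind the "transformative" argument (whether that intuition holds legally is exactly what the case below tested).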

Until recently, this argument hadn’t been fully tested in court.


What Just Happened: The Anthropic Case (June 24, 2025)

In a landmark decision, U.S. District Judge William Alsup ruled that AI company Anthropic did not violate copyright law when it trained its Claude language model on books. The case was brought by authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, who argued that Anthropic had used their work without permission.

  • Andrea Bartz: The Lost Night: A Novel
  • Charles Graeber: The Good Nurse: A True Story of Medicine, Madness, and Murder
  • Kirk Wallace Johnson: The Fisherman and the Dragon: Fear, Greed, and a Fight for Justice on the Gulf Coast

Judge Alsup ruled that Anthropic’s use of the books qualified as fair use. He called the training process “exceedingly transformative” and explained that the model did not attempt to reproduce the authors’ styles or specific wording. Instead, the model learned patterns and structures in order to generate new language, similar to how a human might read and learn from books before writing something original.

However, the court also found that Anthropic made a serious mistake. The company had copied and stored more than 7 million pirated books in a central data library. Judge Alsup ruled that this was not fair use and was a clear violation of copyright law. A trial is scheduled for December 2025 to determine possible penalties, which could be up to $150,000 per work.


Why This Case Matters

This is the first major U.S. court ruling on whether training generative AI on copyrighted works can qualify as fair use. The result was mixed. On one hand, the training process itself was ruled legal. On the other hand, obtaining the data illegally was not.

This means AI companies can argue that their training methods are transformative, but they still need to be careful about where their data comes from. Using pirated books, even if the outcome is transformative, still violates copyright law.

Other lawsuits are still ongoing. Companies like OpenAI, Meta, and Microsoft are also facing legal challenges from authors and publishers. These cases may be decided differently, depending on how courts interpret fair use.


My Thoughts as a Student Researcher

To be honest, I understand both sides. As someone who is really excited about the possibilities of LLMs and has worked on research projects involving language models, I think it’s important to be able to learn from large and diverse datasets.

At the same time, I respect the work of authors and creators. Writing a book takes a lot of effort, and it’s only fair that their rights are protected. If AI systems are going to benefit from their work, then maybe there should be a system that gives proper credit or compensation.

For student researchers like me, this case is a reminder to be careful and thoughtful about where our data comes from. It also raises big questions about what responsible AI development looks like, not just in terms of what is allowed by law, but also what is fair and ethical.


Wrapping It Up

The Anthropic ruling is a big step toward defining the legal boundaries for training AI on copyrighted material. It confirmed that training can be legal under fair use if it is transformative, but it also made clear that sourcing content from piracy sites is still a violation of copyright law.

This case does not settle the global debate, but it does provide some clarity for researchers and developers in the U.S. Going forward, the challenge will be finding a balance between supporting innovation and respecting the rights of creators.

— Andrew

Update (September 5, 2025):

AI startup Anthropic will pay at least $1.5 billion to settle a copyright infringement lawsuit over its use of books downloaded from the Internet to train its Claude AI models. The federal case, filed last year in California by several authors, accused Anthropic of illegally scraping millions of works from ebook piracy sites. As part of the settlement, Anthropic has agreed to destroy datasets containing illegally accessed works.

Exploring the Intersection of AI and Human Creativity: A Review of Deep Thinking by Garry Kasparov

Recently, I had the opportunity to read Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins by Garry Kasparov. While this book doesn’t directly tie into my work in computational linguistics, it resonated with me through its exploration of artificial intelligence (AI), a field closely related to many of my interests. The book combines my passions for chess and technology, and while its primary focus is AI in the realm of chess, it touches on broader themes that align with my curiosity about how AI and human creativity intersect.

In Deep Thinking, the legendary chess grandmaster Garry Kasparov delves into his personal journey with artificial intelligence, particularly focusing on his famous matches against the machine Deep Blue. This book is not just a chronicle of those historic encounters; it’s an exploration of how AI impacts human creativity, decision-making, and the psychological experience of competition.

Kasparov’s narrative offers more than just an inside look at high-level chess; it provides an insightful commentary on the evolving relationship between humans and technology. Deep Thinking is a must-read for those interested in the intersection of AI and human ingenuity, especially for chess enthusiasts who want to understand the psychological and emotional impacts of playing against a machine.

Kasparov’s main argument is clear: While AI has transformed chess, it still cannot replicate the creativity, reasoning, and emotional depth that humans bring to the game. AI can calculate moves and offer solutions, but it lacks the underlying rationale and context that makes human play unique. As Kasparov reflects, even the most advanced chess programs can’t explain why a move is brilliant—they just make it. This inability to reason and articulate is a crucial distinction he highlights throughout the book, particularly in Chapter 4, where he emphasizes that AI lacks the emotional engagement that a human player experiences.

For Kasparov, the real challenge comes not just from the machine’s power but from its lack of emotional depth. In Chapter 5, he shares how the experience of being crushed by an AI, which feels no satisfaction or fear, is difficult to process emotionally. It’s this emotional disconnect that underscores the difference between the human and machine experience, not only in chess but in any form of creative endeavor. The machine may be able to play at the highest level, but it doesn’t feel the game the way humans do.

Kasparov’s exploration of AI in chess is enriched by his experiences with earlier machines like Deep Thought, where he learns that “a machine learning system is only as good as its data.” This idea touches on a broader theme in the book: the idea that AI is limited by the input it receives. The system is as powerful as the information it processes, but it can never go beyond that data to create something entirely new or outside the parameters defined for it.

By the book’s conclusion, Kasparov pivots to a broader, more philosophical discussion: Can AI make us more human? He argues that technology, when used properly, has the potential to free us from mundane tasks, allowing us to be more creative. It is a hopeful perspective, envisioning a future where humans and machines collaborate rather than compete.

However, Deep Thinking does have its weaknesses. The book’s technical nature and reliance on chess-specific terminology may alienate readers unfamiliar with the game or the intricacies of AI. Kasparov makes an effort to explain these concepts, but his heavy use of jargon can make it difficult for casual readers to fully engage with the material. Additionally, while his critique of AI is compelling, it sometimes feels one-sided, focusing mainly on AI’s limitations without fully exploring how it can complement human creativity.

Despite these drawbacks, Deep Thinking remains a fascinating and thought-provoking read for those passionate about chess, AI, and the future of human creativity. Kasparov’s firsthand insights into the psychological toll of competing against a machine and his reflections on the evolving role of AI in both chess and society make this book a significant contribution to the ongoing conversation about technology and humanity.

In conclusion, Deep Thinking is a compelling exploration of AI’s role in chess and human creativity. While it may be a challenging read for those new to the fields of chess or AI, it offers invaluable insights for those looking to explore the intersection of technology and human potential. If you’re a chess enthusiast, an AI aficionado, or simply curious about how machines and humans can co-evolve creatively, Kasparov’s book is a must-read.

Insights from My Ling 234 Summer Class at UW

This summer, I got my first taste of college life—or at least, the online version—through Ling 234 at UW. If you’re imagining grand lecture halls and bustling campus energy, this was not that. Instead, it was me, my laptop, and a series of online modules. But don’t let the format fool you—this class packed a lot of depth.

I took Ling 234 to get a deeper understanding of the linguistic concepts that underpin computational linguistics. As someone interested in the intersection of language and technology, I wanted to explore the “deep end” of linguistics: how societies perceive language, how languages vary and change, and how they influence identity. Understanding topics like language ideologies, multilingualism, and even the sociolinguistics of dialects helps ground the technical aspects of computational linguistics in real-world human language complexities.

One of the most valuable connections I found was how language variation and sociolinguistic factors affect language processing. For example, concepts like dialect variation, multilingualism, and even gendered language use are critical when developing systems that work across diverse language contexts. Computational linguistics relies on handling these nuances, whether in sentiment analysis, machine translation, or speech recognition. The insights I gained from this course are stepping stones to building more inclusive and accurate models in AI.
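To make that concrete, here is a toy illustration of how dialect variation can trip up a naive NLP system. The word lists and the "wicked as intensifier" rule below are made-up placeholders for illustration (the intensifier use is real in some New England English, but no real lexicon or library is being modeled here): a plain word-list sentiment scorer misreads "wicked good" as neutral, while a variation-aware version does not.

```python
# Toy word-list sentiment scorer: +1 for positive words, -1 for negative ones.
# This tiny lexicon is an illustrative placeholder, not a real resource.
LEXICON = {"good": 1, "great": 1, "bad": -1, "wicked": -1}

def naive_score(text):
    """Sum lexicon scores word by word, ignoring context entirely."""
    return sum(LEXICON.get(w, 0) for w in text.lower().split())

def dialect_aware_score(text):
    """Treat 'wicked' as an intensifier when it modifies a positive word,
    as in some New England English ('wicked good')."""
    words = text.lower().split()
    score, i = 0, 0
    while i < len(words):
        w = words[i]
        if w == "wicked" and i + 1 < len(words) and LEXICON.get(words[i + 1], 0) > 0:
            score += 2 * LEXICON[words[i + 1]]  # intensified positive, not negative
            i += 2
            continue
        score += LEXICON.get(w, 0)
        i += 1
    return score

print(naive_score("that movie was wicked good"))          # 0: the two words cancel out
print(dialect_aware_score("that movie was wicked good"))  # 2: intensified positive
```

The gap between the two scores is exactly the kind of nuance the course highlighted: systems built only on "standard" language data quietly mis-handle everyone else.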

If you’re curious, I’ve attached my notes for the class. They’re comprehensive and detail everything from the mechanics of language contact to the challenges of language revitalization. While these notes may not be everyone’s cup of tea, they represent a foundational step in my journey toward understanding language through both linguistic and computational lenses.

Ling 234 may not have had the traditional “college experience” feel, but it exceeded my expectations in laying the groundwork for integrating linguistics into computational approaches. It wasn’t just a class—it was a valuable perspective shift.

Applications of Computational Linguistics

Hello everyone! Apologies for the gap in my posting schedule. Recently, I’ve been engrossed in schoolwork, but I’ve also delved into the potential applications of computational linguistics within my community.

While I’m currently honing my understanding of computational linguistics through Coursera courses, books, Kaggle, and participation in my school’s AI club, I envision multiple applications as my proficiency grows.

Broad Ideas & Applications: As one progresses and develops a deeper grasp of computational linguistics, several impactful applications emerge.

  • Accessibility Tools: A future focus of mine will be on creating tools to aid populations such as the elderly. Voice assistants and text-to-speech or speech-to-text applications can immensely benefit those with hearing or visual challenges. Crafting such tools demands a deep grasp of computational linguistic techniques.
  • Healthcare Assistance: Collaborating with local hospitals could open the door to AI-infused linguistic diagnostic or therapeutic tools. Beyond assisting medical professionals, such tools could offer crucial mental and emotional support in places lacking these resources. Bots like XiaoIce, for instance, serve as emotional anchors, providing solace to those in need.
  • Local Business Support: Veering slightly from pure computational linguistics but still harnessing AI is the concept of tools specifically designed for local businesses. This could manifest as systems that align local employers with suitable job seekers through an AI-facilitated matching process.

Beginner-Friendly Applications: For those just embarking on their computational linguistics journey, consider these simpler initiatives:

  • AI Literacy Programs: This would entail periodic community gatherings introducing AI and computational linguistics fundamentals. The program might also showcase an AI demonstration where participants can interact with chatbots or voice assistants, familiarizing themselves with their operations.
  • Homework Help Chatbots: Imagine a bot designed to answer basic student queries across subjects like English and math. While perfection isn’t the aim, it would be invaluable in steering students towards correct solutions.
  • Reading Assistance: Envision a bot equipped with text-to-speech capabilities, helping children, the elderly, or those with learning disabilities in their reading endeavors. Users could upload texts, which would then be read aloud. Advanced expertise in computational linguistics could morph this basic tool into a sophisticated aid.
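A beginner version of the homework-help bot above doesn’t even need a language model to get started. The sketch below is a minimal keyword-matching loop; every rule and canned answer in it is an invented placeholder, not a real curriculum, and a real bot would need far more coverage and care.

```python
# Minimal rule-based homework helper: match keywords in a question to
# canned guidance. All rules below are illustrative placeholders.
RULES = [
    ({"fraction", "fractions"},
     "To add fractions, rewrite them over a common denominator first."),
    ({"thesis"},
     "A thesis statement should make one arguable claim in a single sentence."),
    ({"slope"},
     "Slope is rise over run: (y2 - y1) / (x2 - x1)."),
]

FALLBACK = "I'm not sure yet -- try asking a teacher, and I'll learn this topic later!"

def reply(question):
    """Return canned guidance if any rule keyword appears in the question."""
    words = set(question.lower().replace("?", "").split())
    for keywords, answer in RULES:
        if words & keywords:  # any keyword present in the question
            return answer
    return FALLBACK

print(reply("How do I add fractions?"))
print(reply("What is quantum gravity?"))
```

Starting from something this simple also makes the feedback loop easy: every question that hits the fallback is a signal about what your community actually needs the bot to cover next.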

It’s crucial to remember that starting modestly allows for a progressive understanding of your community’s needs, enabling your initiatives to evolve in tandem. Consistent feedback is essential, ensuring services resonate with community priorities. In my locale, foundational support in education, commerce, and utilities garners much appreciation. However, for others, areas like language preservation could be of paramount importance.

Delving into AI Hallucinations: A Fascinating Article I Encountered at School

Hey everyone,

During my academic pursuits, I encountered an insightful article titled “Chatbots Sometimes Make Things Up” by Matt O’Brien. I found it to be of great significance and felt compelled to share its key takeaways with you.

The core of O’Brien’s article centers on the intriguing phenomenon of AI hallucinations. He delves deep into the challenges they present, citing various sources that shed light on their implications, especially for businesses that lean heavily on AI. Through expert opinions, the potential current and future challenges are brought to the forefront. Interestingly, the article doesn’t just highlight the pitfalls – it also explores the potential silver linings of AI hallucinations. However, the overarching message seems to be one of caution: while there’s hope for improvement, blind trust in AI-generated information might be premature.

Having digested O’Brien’s thoughts, I’ve formulated some of my own. To me, the pitfalls of hallucinations far outweigh their possible benefits. I was particularly struck by the mention of an Indian hotel management institute that relies on AI for innovative ideas, where such errors could prove costly. As AI continues to evolve and become an integral part of more sectors, the ramifications of such hallucinations could multiply. The article does touch upon the possible benefits of hallucinations in fields like marketing, but I’m skeptical. If unique perspectives generated by hallucinations are indeed valuable, I’d argue for a dedicated AI system for those niches rather than risking widespread misinformation. With the ever-growing role of AI, addressing these hallucination issues sooner rather than later seems paramount.

I encourage everyone to delve into this subject further, as the evolution and influence of AI in our daily lives is only set to increase. Your thoughts and opinions on this matter would be greatly appreciated.
