Report Alleges OpenAI Trained Advanced AI Models Using Unlicensed Books
OpenAI is facing renewed scrutiny over its artificial intelligence (AI) training methods amid persistent allegations that it has used copyrighted material without authorization. A recent paper by an AI watchdog group asserts that the company has increasingly depended on unlicensed, nonpublic books to refine its advanced AI models, raising further questions about data sources and licensing practices in AI development.
AI Model Training and Data Reliance
AI models function as intricate prediction systems. Their capabilities stem from extensive training datasets encompassing books, films, television programs, and other sources, which enable them to identify patterns and extrapolate from simple prompts. When an AI model generates an essay or an image, it is drawing on that vast dataset to produce an approximation rather than a wholly original creation.
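The idea that a model "predicts" by recombining patterns from its training data can be illustrated with a deliberately tiny toy: a bigram word predictor. This is an assumption-laden sketch for intuition only; real large language models are neural networks operating on far richer statistics, not lookup tables.

```python
from collections import defaultdict, Counter

def train_bigram(corpus: str):
    """Toy 'training': for each word, count which words follow it
    in the training text."""
    words = corpus.split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def predict(follows, word: str) -> str:
    """Toy 'generation': return the continuation most often seen in
    training. The model can only recombine what its data contained."""
    if word not in follows:
        return "<unknown>"
    return follows[word].most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict(model, "the"))  # prints "cat" — seen twice after "the"
```

The point of the toy is the failure mode: ask it about a word absent from its corpus and it has nothing to offer, which is why the quantity and breadth of training data matter so much to AI developers.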
While some AI developers, including OpenAI, have started exploring AI-generated data for training due to the depletion of readily available real-world data, particularly from the public web, the reliance on purely synthetic data remains limited. Exclusive use of synthetic data presents challenges and might compromise model performance.
Allegations Stem From AI Disclosures Project Paper
A new paper from the AI Disclosures Project posits that OpenAI likely trained its GPT-4o model using paywalled books from O’Reilly Media, without a licensing agreement. The AI Disclosures Project is a nonprofit organization co-founded in 2024 by media executive Tim O’Reilly and economist Ilan Strauss. GPT-4o is currently the default model in ChatGPT.
The paper’s authors assert that GPT-4o exhibits “strong recognition” of content from paywalled O’Reilly books compared to OpenAI’s earlier GPT-3.5 Turbo model. Conversely, GPT-3.5 Turbo showed better “recognition” of publicly accessible O’Reilly book samples.
Researchers employed the DE-COP method, a form of "membership inference attack," to probe the training data. The technique tests whether a model can distinguish human-written texts from AI-paraphrased versions of the same passages; reliable discrimination suggests the model may have been exposed to the original text during training.
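The multiple-choice setup at the heart of this kind of test can be sketched as follows. This is an illustrative simplification, not the paper's implementation: `model_choose` is a hypothetical stand-in for querying a real LLM with a "which option is the human-written original?" prompt, and the stand-in below just guesses blindly.

```python
import random

def decop_trial(model_choose, original: str, paraphrases: list) -> bool:
    """One multiple-choice trial: shuffle the verbatim passage among
    AI paraphrases and ask the model to pick out the original."""
    options = paraphrases + [original]
    random.shuffle(options)
    answer_idx = options.index(original)
    # model_choose is any callable returning the index of the option
    # it believes is the human-written original.
    return model_choose(options) == answer_idx

def membership_score(model_choose, excerpts) -> float:
    """Fraction of trials where the model identifies the verbatim text.
    Accuracy well above chance hints the excerpts may have appeared
    in the model's training data."""
    hits = sum(decop_trial(model_choose, orig, paras)
               for orig, paras in excerpts)
    return hits / len(excerpts)

# Toy demonstration: a blind "model" that always picks option 0 should
# land near the chance baseline of 1 / (number of options).
excerpts = [("verbatim passage", ["para A", "para B", "para C"])] * 100
chance = 1 / 4
score = membership_score(lambda opts: 0, excerpts)
```

A genuinely naive model hovers around the 25% chance baseline here; the paper's claim is that GPT-4o scored well above that baseline on paywalled O'Reilly excerpts.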
O’Reilly, Strauss, and AI researcher Sruly Rosenblat examined GPT-4o, GPT-3.5 Turbo, and other OpenAI models, using nearly 14,000 paragraph excerpts from 34 O’Reilly books. They aimed to estimate the likelihood of these excerpts being part of each model’s training dataset.
The findings reportedly indicate that GPT-4o "recognized" considerably more paywalled O'Reilly book content than older OpenAI models, even after controlling for factors such as newer models' improved ability to detect human-authored text.
Study Limitations Acknowledged
The paper's co-authors acknowledge that their methodology is not airtight and the results fall short of conclusive proof. They concede that OpenAI might have acquired the paywalled book excerpts from users pasting them into ChatGPT, and the study did not evaluate OpenAI's most recent models, including GPT-4.5.
Industry Context
At the same time, OpenAI has publicly advocated for looser restrictions on the use of copyrighted data in AI model development and has actively pursued higher-quality training data. The company has even hired journalists to refine model outputs and, like much of the AI industry, is engaging experts across various domains to build specialized knowledge into its systems.
OpenAI does license some training data from news publishers, social networks, and stock media libraries, and it offers opt-out options for copyright holders, though these are not without flaws. Despite these measures, the company is currently facing multiple lawsuits over its training-data practices and handling of copyright law, and the AI Disclosures Project paper adds another layer to these ongoing disputes.
OpenAI Response Pending
OpenAI has not yet issued a public statement in response to the allegations.