OpenAI Faces Scrutiny Over Use of Paywalled Books in GPT-4o Training

April 23, 2025

OpenAI is under fire following allegations that its GPT-4o model may have been trained on copyrighted, paywalled content from O’Reilly Media without authorization. The claims stem from a recent study published by the AI Disclosures Project, a watchdog initiative co-founded by Tim O’Reilly, the founder of O’Reilly Media, and economist Ilan Strauss. The research suggests that GPT-4o recognizes O’Reilly’s paywalled book content far more reliably than chance would predict, raising concerns over intellectual property violations.

Researchers applied a method called DE-COP, a membership inference technique that tests whether specific text was part of a model’s training set. Across nearly 14,000 paragraphs from 34 O’Reilly books, the study reports an AUROC (Area Under the Receiver Operating Characteristic Curve) score of 82% for GPT-4o. An AUROC of 50% corresponds to random guessing and 100% to perfect separation, so the researchers read the result as strong evidence that the paywalled content was likely included in training.
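To make the AUROC figure concrete, the short Python sketch below computes the score from scratch. It is an illustration only, not the study's code: the `auroc` helper, the scores, and the labels are all hypothetical, and the numbers are toy values chosen for the example. Each score stands in for how strongly a model appears to recognize a passage, and each label marks whether the passage is assumed to be in the training set (1) or not (0).

```python
def auroc(scores, labels):
    """Chance that a random member passage outscores a random non-member.

    Ties count as half a win. 0.5 means random guessing, 1.0 perfect separation.
    """
    members = [s for s, y in zip(scores, labels) if y == 1]
    non_members = [s for s, y in zip(scores, labels) if y == 0]
    if not members or not non_members:
        raise ValueError("need at least one member and one non-member")

    wins = 0.0
    for m in members:
        for n in non_members:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(members) * len(non_members))


# Toy, made-up values (NOT data from the study).
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.75]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

print(f"AUROC = {auroc(scores, labels):.4f}")  # prints "AUROC = 0.8125"
```

In this toy case the member passages outscore the non-members in 13 of 16 pairings, giving an AUROC of about 81%. A model that could not tell the two groups apart would land near 50%.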

The finding is particularly troubling given that O’Reilly Media confirmed it has no licensing agreement with OpenAI. By comparison, GPT-3.5 Turbo, an earlier OpenAI model, showed significantly less recognition of the same content, suggesting OpenAI may have increased its use of non-public data in its newer models.

The findings arrive at a time of growing industry scrutiny around how large language models are trained. Copyright and data transparency have become flashpoints in the broader debate on ethical AI development. As companies scale AI capabilities, questions remain about consent, fair use, and the rights of content creators.

OpenAI has not yet publicly addressed the allegations. The AI Disclosures Project has called for greater transparency and regulation, urging AI developers to disclose training data sources and secure appropriate licenses.

The controversy underscores the ongoing tension between rapid AI innovation and the foundational need to respect intellectual property laws.
