Meta's AI model Llama has reportedly been trained on thousands of copyrighted books without their authors' permission, according to a new filing in a copyright infringement lawsuit originally brought this summer, Reuters reports. The Facebook owner was allegedly well aware that using the copyrighted books for AI training might not be protected under U.S. copyright law.
The Reuters report states that the newly filed complaint, reportedly received on Monday, includes chat logs in which Meta-affiliated researcher Tim Dettmers discusses with Meta's legal department the legality of using the book files as training data.
Referring to a dataset called 'The Pile,' Dettmers stated in a message written in 2021 that 'The Pile' in its current form was not usable "for legal reasons." Meta's lawyers had told him that if AI models were trained on such data, they could not be used or published.
These chat logs reportedly constitute a potentially important piece of evidence showing that Meta knew its use of the books might not be covered by copyright law in the United States.
Despite this alleged awareness, Meta has previously acknowledged that it trained its internal AI model on 'The Pile,' according to a Mashable report, making it potentially one of many Big Tech firms to have developed its first large-scale AI model on unlawfully distributed content.
'The Pile' and 'Books3'
'The Pile' is reportedly a collection of AI training content that includes the 'Books3' database, which contains roughly 196,000 books in plain-text format for training AI models.
The books were reportedly collected from thousands of novels and nonfiction works published over the previous 20 years. The controversial dataset, first assembled by independent developer Shawn Presser and a group of collaborators, was created to enable any developer to build generative-AI tools.
Lawsuits Against Illegal AI Training
Reuters adds that this year has been particularly difficult for AI training, as content producers have filed many lawsuits against tech companies claiming that the firms stole authors' copyrighted works to develop generative AI models.
Back in July, a class-action complaint was filed accusing Google of engaging in similar behavior. Both Microsoft and Bloomberg have also been sued over similar allegations of unlawful AI training.
Should such cases succeed, they would reportedly dampen the enthusiasm around generative AI by driving up the cost of training data, as AI firms would be forced to pay writers, painters, and other content producers for the use of their creations.
Simultaneously, new provisional regulations governing artificial intelligence in Europe may force corporations to disclose the data used to train these AI models, which could expose them to further legal risk.