OpenAI Faces Ongoing Scrutiny Over AI Training Data Practices

OpenAI's almost unrestricted use of information from the internet to train ChatGPT remains a legal problem for the company, as the number of lawsuits challenging the practice keeps growing.

To train ChatGPT, OpenAI reportedly relies on publicly available data, including online books and papers. Now the owners of that material want to be paid for their work.

Building the AI models that are sweeping the tech industry requires training data. Prominent technology companies such as Microsoft, Google, Anthropic, Meta, and OpenAI are all rushing to find fresh data sources. At one point, Meta even considered purchasing Simon & Schuster, one of the largest publishing houses in the world.

Part of the problem stems from publishers' growing accusations that these businesses are taking protected material without permission. The publishers want to be paid for their work. In responses to the US Copyright Office, Meta and OpenAI contended that copyrighted content posted online is "publicly available" and that using it falls within fair use.

However, that defense will still have to be made in court, because OpenAI is being sued by multiple parties over its use of copyrighted content.

OpenAI vs. CIR

The nonprofit media group Center for Investigative Reporting (CIR), which merged with Mother Jones earlier this year and now produces both Mother Jones and Reveal, filed a complaint in federal court against Microsoft and OpenAI last week.

The lawsuit claims that intellectual property owned by CIR and other content producers worldwide was used to develop OpenAI's products.

CIR's attorneys accused Microsoft and OpenAI of using Mother Jones' copyrighted material to train their GPT and Copilot AI models.

Previous OpenAI Lawsuits

Last April, OpenAI and Microsoft also faced legal action from several prominent newspapers owned by Alden Global Capital, including the New York Daily News and the Chicago Tribune.

According to the lawsuit, both technology companies intentionally violated copyright. The newspapers that filed suit include the Chicago Tribune, Orlando Sentinel, New York Daily News, and San Jose Mercury News.

Alden Global Capital owns all of these newspapers, and it alleges that both firms used the papers' content to train AI models without crediting the papers or obtaining permission.

The complaint includes excerpts from conversations with ChatGPT and Copilot showing that, when asked, these AI models reproduced long passages from particular publications.

This suggests that the articles in question were included in the training datasets without permission from the publications concerned.

The plaintiffs also showed how Copilot can retrieve news stories from the internet in real time and reproduce them in full without crediting the sources. Moreover, the newspapers claim that these chatbots regularly attribute false or fabricated material to their publications.