Microsoft's head of AI, Mustafa Suleyman, has ignited a heated debate by labeling all publicly available information used to train AI models as "freeware."
Suleyman attempted to draw a line between openly accessible web content and copyrighted material explicitly protected by publishers. However, he admitted the complexity surrounding content that publishers specifically guard against scraping.
The Ethical Debate: Should AI Use Online Content for Training?
During an extensive discussion of AI's current state and future implications at the Aspen Ideas Festival, Suleyman focused on the need for responsible AI development and governance.
The conversation with CNBC tackled the contentious issue of whether AI should utilize content published online for training purposes.
Suleyman stressed the importance of distinguishing between open-source and closed-source AI models and advocated for international cooperation, especially with China, rather than an adversarial stance.
Intellectual Property Concerns
Despite Suleyman's remarks, content creators argue that their intellectual property is being exploited without compensation. Many believe that the unauthorized use of their work endangers their livelihoods and the integrity of generative AI.
Suleyman acknowledged the murky legal boundaries surrounding AI model training, a sentiment reflected in ongoing court cases. Shortly after the interview, the Center for Investigative Reporting filed a lawsuit against OpenAI and its major investor, Microsoft, for using the nonprofit's content without permission or compensation.
The organization's CEO, Monika Bauerlein, accused OpenAI and Microsoft of "vacuuming up our stories to make their product more powerful" without seeking permission or offering payment, unlike other organizations that license their material.
Scraping in a Grey Area
Microsoft is under increasing scrutiny over its data handling practices for AI. Although the company offers to protect users of its GenAI tools against copyright claims, Suleyman's comments about the robots.txt file stirred further controversy.
He suggested that mentioning "do not scrape or crawl" on a website might place scraping in a "grey area," yet admitted that respecting this basic protocol is more of a courtesy and not something that needs judicial clarification. Nonetheless, various AI companies, including Anthropic, Perplexity, and OpenAI, often ignore robots.txt.
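For readers unfamiliar with the protocol, the sketch below shows roughly how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the `ExampleAIBot` user agent, the `is_allowed` helper, and the example.com URLs are all hypothetical and chosen for illustration only. Nothing in the protocol forces a crawler to run a check like this, which is why compliance is voluntary.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named AI crawler from the whole site
# and keeps every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://example.com/news/story"
    for agent in ("ExampleAIBot", "OtherBot"):
        print(agent, "may fetch", story, "->", is_allowed(ROBOTS_TXT, agent, story))
```

A crawler that skips this check can still download the page; robots.txt only states the publisher's wishes and does not block access on its own.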
Microsoft's Current Lawsuit Involving Copyrighted Articles
This isn't the first instance of an AI executive making controversial statements. More than a year after ChatGPT's launch, the legal framework for training data and copyright remains unsettled. Microsoft and OpenAI face multiple lawsuits from publishers alleging unauthorized use of copyrighted articles to train their language models. These cases have yet to reach conclusions that might provide more legal clarity.
"We are working collaboratively with the news industry and partnering with global news publishers to display their content in our products like ChatGPT, including summaries, quotes, and attribution, to drive traffic back to the original articles. A component of the partnerships is the ability to leverage publisher content using various machine learning and training techniques to help us optimize the display of that content and make it more useful to users," an OpenAI spokesperson told TechRadar.