Microsoft MAI Training Data Includes Common Crawl, Contradicting Build 2026 Claims

Microsoft told enterprise buyers at its Build 2026 developer conference on June 2 that its new MAI reasoning models were trained exclusively on "enterprise grade, clean and commercially licensed data" — a selling point aimed squarely at legal and procurement teams in regulated industries. Its own published technical paper tells a different story.

The MAI-Thinking-1 preprint, released alongside the model, describes a training data pipeline that includes Common Crawl — the widely used open repository of web-scraped content that carries no licensing guarantees for the material it indexes. After filtering and deduplication, the Common Crawl portion of the training corpus contained 24.2 billion pages, according to the document. That finding was first surfaced by developer Simon Willison, who was present at Build when the models were announced, and was subsequently confirmed in an investigation by technology outlet The Decoder.

Microsoft's Marketing Claim Versus the Preprint

At the Build 2026 keynote, Microsoft AI CEO Mustafa Suleyman described MAI-Thinking-1 as trained "from the ground up on enterprise grade, clean and commercially licensed data, without distillation from third-party models." The company's official blog post used similar language, stating that Microsoft does not rely on "unlicensed or opaque data" and that its datasets are "clean and appropriately licensed."

That framing was deliberate and consequential. Enterprise legal teams have been scrutinizing the training data lineage of popular AI models throughout early 2026, a trend accelerated by the distillation controversy surrounding DeepSeek R1. By positioning MAI-Thinking-1 as trained with verified data provenance, Microsoft was pitching directly to compliance-sensitive buyers in finance, healthcare, and government — industries where the question of whether a vendor's model was trained on unlicensed content can affect procurement decisions.

The preprint, however, is unambiguous. Starting at roughly page 80, it describes the training data pipeline in technical detail. The majority of the web HTML corpus came from a proprietary crawl of the public internet, which Microsoft filtered from 1.2 trillion pages down to 794 billion. The document then states explicitly that Common Crawl was processed through the same pipeline. After deduplication and merging, the Common Crawl portion of the corpus contained 24.2 billion pages.

Willison, who flagged the discrepancy on his blog, initially hoped the models might represent the first commercially useful large language models trained without web scraping. After reading the preprint, he updated his post with a direct correction: the models have the same licensing problems as all of the other major large language models.

What Common Crawl Actually Is

Common Crawl is a California-based nonprofit that has maintained a publicly available archive of web crawl data since 2008. The organization makes no licensing representations about the content it indexes, which is drawn from the open web without agreements from the individual publishers, authors, or rights holders whose work appears in it. Its data has become a near-universal ingredient in large language model pretraining, used by Google DeepMind for Gemini, Meta for its LLaMA model family, and virtually every other major model developer.

What makes the Microsoft situation distinct is not the use of Common Crawl itself — which is standard industry practice — but the explicit marketing claim that the company's models stood apart from that practice. Most AI companies say little or nothing about their training data provenance. Microsoft said a great deal, in specific and exclusionary language, to an audience of enterprise buyers whose procurement decisions can turn on exactly that question. The Decoder's investigation characterized the gap plainly: Microsoft does what every other AI company does, yet marketed its training data as especially clean.

What This Means for Enterprise Buyers

For organizations that evaluated MAI-Thinking-1 specifically because of its stated data provenance guarantees, the preprint's findings represent a direct gap between what was promised and what the documentation shows. The "clean and commercially licensed" representation was not a minor marketing claim — it was central to the product's positioning for regulated industries, and Microsoft's marketing team built an entire narrative around it under the phrase "Capabilities Learned, Not Inherited."

The disclosure also lands in a legally unsettled environment. Like other major AI companies that train on web-scraped content, Microsoft is expected to rely on a fair use defense for its data practices. That legal theory remains contested. The U.S. Copyright Office released a pre-publication report on generative AI training and fair use in May 2025, concluding that the question must be assessed case by case and that some uses of copyrighted works for AI training will qualify as fair use while others will not. No appellate court has yet issued a definitive ruling. The New York Times' copyright lawsuit against Microsoft and OpenAI, filed in December 2023, remains active.

Enterprise legal teams that made or are making procurement decisions based on Microsoft's stated data provenance claims should review the actual preprint rather than relying on the marketing summary. The document is publicly available through Microsoft's own website and describes the training data pipeline in technical detail.

Microsoft Has Not Responded

As of June 5, 2026, Microsoft had not issued a public statement addressing the contradiction between its Build 2026 marketing claims and the training data documentation in the MAI-Thinking-1 technical paper. TechTimes has reached out to Microsoft for comment and will update this article if a response is received.

The episode arrives as Microsoft faces active regulatory scrutiny on multiple fronts. The U.S. Federal Trade Commission has an ongoing antitrust investigation into the company's cloud and AI bundling practices, with civil investigative demands issued to at least six competing companies as of February 2026. The probe centers on whether Microsoft uses its dominance in productivity software to lock customers into Azure. Separately, the UK's Competition and Markets Authority launched a Strategic Market Status investigation into Microsoft's business software ecosystem on May 14, 2026, with a designation decision expected by February 2027. Neither investigation has resulted in enforcement action, but both are ongoing.

Frequently Asked Questions

What did Microsoft claim about MAI training data at Build 2026?

At its Build 2026 developer conference on June 2, Microsoft AI CEO Mustafa Suleyman stated that MAI-Thinking-1 was trained "from the ground up on enterprise grade, clean and commercially licensed data, without distillation from third-party models." The company's official blog post used nearly identical language, stating Microsoft does not rely on "unlicensed or opaque data." Those claims are directly contradicted by Microsoft's own published technical paper, which documents Common Crawl as a component of the training corpus.

What is Common Crawl, and is it commercially licensed?

Common Crawl is a California-based nonprofit that maintains a publicly available archive of content crawled from the open web. It makes no licensing representations about the material it indexes and does not pay rights holders for inclusion. Its data is used by nearly every major large language model developer, including those behind Gemini and LLaMA, but it carries no commercial license from the publishers and authors whose work appears in it.

Does using Common Crawl expose Microsoft or enterprise customers to legal risk?

Potentially. Microsoft, like other AI companies that train on web-scraped data, is expected to rely on a fair use defense. That legal theory remains unsettled in U.S. courts: the Copyright Office's May 2025 report declined to adopt a categorical rule, and no appellate court has yet issued a definitive ruling on whether AI training on web-scraped content qualifies as fair use. Enterprise buyers in regulated industries should assess their own IP indemnification requirements against the documented training data pipeline, rather than relying on vendor marketing language.

What should enterprise procurement teams do with this information?

Enterprise legal and compliance teams evaluating MAI models based on data provenance claims should review the MAI-Thinking-1 technical paper directly. The preprint is publicly available through Microsoft's website and describes the full training data pipeline starting at approximately page 80. Procurement decisions made on the basis of Build 2026 marketing claims may need to be revisited in light of what the technical documentation actually shows.
