Meta's latest flagship AI model, Maverick, made waves after landing second place on LM Arena, a platform where human raters evaluate and rank the quality of AI models' responses.
Controversy struck, though, after AI researchers found that the version of Maverick used in the benchmark was not the one publicly available to developers.
How the Maverick AI Ranking Raises Eyebrows
Maverick's impressive performance on LM Arena at first seemed to confirm Meta's claims of pushing the frontier of conversational AI. However, further digging revealed that the model tested wasn't the general release, according to TechCrunch.
Rather, Meta noted in its own official announcement that the version it deployed on LM Arena was an "experimental chat version" - a point not explicitly highlighted alongside the benchmark scores.
On Meta's own Llama website, a comparison table confirms that the LM Arena test was conducted with "Llama 4 Maverick optimized for conversationality." This variant is said to have special tuning aimed at improving dialogue, which could give it an unfair advantage over the less optimized or "vanilla" models from other AI developers.
Traditionally, LM Arena, imperfect though it may be, has served as an approximation of neutral ground for pitting large language models against each other under human judgment. The great majority of participating AI firms have submitted unmodified versions of their publicly released models or have been transparent when changes were made.
Meta's approach, in contrast, has been criticized as opaque. By benchmarking an optimized model while offering developers a less fine-tuned public one, Meta leaves them with inflated performance expectations and an unclear picture of what Maverick can actually accomplish in real-world settings.
AI Researchers Call Out the Differences
Experts on X reported that the LM Arena version of Maverick behaves noticeably differently from its downloadable counterpart. Some cited its excessive emoji usage, while others noticed lengthy and overly polished answers, behaviors not seen in the default release.
"for some reason, the Llama 4 model in Arena uses a lot more Emojis. on together.ai, it seems better: pic.twitter.com/f74ODX4zTt"
— Tech Dev Notes (@techdevnotes), April 6, 2025
This discrepancy raises an important question in AI benchmarking: Should companies be allowed to fine-tune models specifically for benchmarks while keeping those versions hidden from the public?
Meta and Chatbot Arena Remain Silent for the Moment
As backlash mounts, observers are calling for transparency from both Meta and Chatbot Arena, the organization behind LM Arena. As of writing, neither has publicly responded to the issue.
The episode underscores a broader concern in AI research: the need for standardized, open benchmarks that measure real-world performance rather than cherry-picked results. As AI comes to shape everything from customer support to content generation, honest representation of model capabilities is more important than ever.