Who would have thought that "Pokémon" games would end up in AI benchmarking? Chatbot makers now use the classic titles to test how far their models can progress through the game.
A recent viral post on X claimed Google's Gemini AI outperformed Anthropic's Claude model while playing the original Pokémon game trilogy. Gemini had reportedly advanced to Lavender Town in a Twitch stream, while Claude was still battling through Mount Moon as of February. But there's more to the story.
Google Gemini's Custom Boost Raises Eyebrows
While the viral claim stirred excitement, it conveniently left out a crucial detail: Gemini had a leg up. According to Reddit users, the developer managing Gemini's stream implemented a custom minimap. This addition allowed the chatbot to identify important gameplay elements, like cuttable trees, without relying solely on screenshot analysis.
Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town
— Jush (@Jush21e8) April 10, 2025
119 live views only btw, incredibly underrated stream pic.twitter.com/8AvSovAI4x
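To see why that overlay matters, consider a minimal sketch of the idea in Python: the harness keeps a tile grid of the current area and hands the model a short text summary of nearby landmarks instead of forcing it to infer them from pixels. The grid format, tile labels, and describe_minimap function here are illustrative assumptions, not the stream developer's actual code.

```python
# Hypothetical sketch of a minimap aid: convert a tile grid into
# landmark notes an LLM can read as plain text. Not the streamer's
# real implementation, whose details were only described on Reddit.
TILE_LABELS = {"T": "cuttable tree", "D": "door", "W": "water"}

def describe_minimap(grid, player_pos):
    """Turn a tile grid into landmark notes relative to the player."""
    px, py = player_pos
    notes = []
    for y, row in enumerate(grid):
        for x, tile in enumerate(row):
            if tile in TILE_LABELS:
                notes.append(f"{TILE_LABELS[tile]} at "
                             f"({x - px:+d}, {y - py:+d}) from player")
    return notes

# '#' = wall, '.' = walkable floor; the player stands at column 1, row 2.
grid = [
    "####",
    "#..T",
    "#.D.",
    "....",
]
for note in describe_minimap(grid, player_pos=(1, 2)):
    print(note)
# cuttable tree at (+2, -1) from player
# door at (+1, +0) from player
```

A model fed notes like these can head straight for the cuttable tree; a model working from screenshots alone first has to recognize that a cluster of green pixels is a tree at all.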
Claude, unfortunately, fell behind because it had no comparable assistance. Lacking such aids, Claude had to play entirely by decoding raw screenshots, a far more challenging task.
TechCrunch reports that this divergence points to a growing problem in AI benchmarking: inconsistent testing environments that distort performance comparisons.
Why 'Pokémon' Is Being Used in AI Benchmarks
While "Pokémon" is not a serious benchmark for AI testing, it is a fun — albeit flawed example of demonstrating AI performance and choice-making. Yet, it also indicates how very susceptible benchmarking outcomes are to implementation modification.
For example, Anthropic's Claude 3.7 Sonnet model posted two different scores on SWE-bench Verified, a benchmark that evaluates coding ability. Without extra help, it scored 62.3%. With a bespoke "scaffold" system built by Anthropic, it climbed to 70.3%.
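Anthropic has not published the scaffold's internals, but one well-known pattern it gestures at is simple to illustrate: wrap the model in a harness that samples several candidate patches and submits the first one the project's test suite accepts. The toy Python below shows why such a wrapper lifts scores; generate_patch and tests_pass are stand-ins for a model call and a test run, not real APIs, and the numbers are synthetic.

```python
import random

def generate_patch(issue_id, attempt):
    # Toy stand-in for a model call: each attempt yields a candidate
    # "patch" of random quality. A real scaffold would query the model.
    rng = random.Random(issue_id * 1000 + attempt)
    return {"issue": issue_id, "quality": rng.random()}

def tests_pass(patch):
    # Toy stand-in for applying the patch and running the test suite.
    return patch["quality"] > 0.7

def one_shot(issue_id):
    # Baseline: take the model's first answer, pass or fail.
    return tests_pass(generate_patch(issue_id, attempt=0))

def scaffolded(issue_id, n_samples=8):
    # Scaffold: sample several candidates, keep the first that passes.
    return any(tests_pass(generate_patch(issue_id, a))
               for a in range(n_samples))

issues = range(200)
base = sum(map(one_shot, issues)) / len(issues)
boosted = sum(map(scaffolded, issues)) / len(issues)
print(f"one-shot pass rate:   {base:.0%}")    # roughly 30%
print(f"scaffolded pass rate: {boosted:.0%}")  # far higher
```

The underlying model is identical in both runs; only the harness around it changes. That is exactly why a headline benchmark number can say as much about the scaffolding as about the model.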
"I agree, and the amount of progress being made here goes to show that memory matters. I know most humans wouldn't be able to memorize every pixel of every town/city/route/cave that they're in while playing the game, but humans can usually generally remember the overall layout of the current area they're in once they've explored it. So adding this feature feels like a crucial part of allowing the LLM to have some kind of functional short-term memory," the OP of the Reddit post wrote.
"Yeah, a mapping faculty is 100% a necessary function to get around in the world. Always thought it was the biggest issue DeepMind had with making progress on its biggest boojum: Montezuma's Revenge," another Reddit user agreed.
The Larger Issue: Murky AI Comparisons
Benchmarks are supposed to offer a clear, level playing field for judging AI progress. But as developers add proprietary components or tailor their models to a particular test, true apples-to-apples comparisons become harder to make.
These doctored benchmarks blur the line between genuine model capability and clever optimization. Expect more companies to come under pressure to adopt open, standardized benchmarking methods, or risk misleading consumers, investors, and researchers alike.