The rapid proliferation of AI chatbots has made it difficult to know which models are actually improving and which are falling behind. Traditional academic benchmarks only tell you so much, which has led many to lean on vibes-based analysis from LM Arena. However, a new study claims this popular AI ranking platform is rife with unfair practices, favoring large companies that just so happen to rank near the top of the index. The site’s operators, however, say the study draws the wrong conclusions.
LM Arena was created in 2023 as a research project at UC Berkeley. The pitch is simple: users feed a prompt into two unidentified AI models in the "Chatbot Arena" and vote for the output they prefer. These votes are aggregated into the LM Arena leaderboard, which shows which models people like the most and can help track improvements in AI models.
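Leaderboards built on pairwise votes typically convert head-to-head outcomes into a single score using an Elo-style rating system (LM Arena has used Elo and related Bradley-Terry methods). As a rough illustration of the idea, not LM Arena's actual implementation, here is a minimal Elo update in Python; the starting rating of 1000 and K-factor of 32 are illustrative assumptions:

```python
def expected_score(r_a: float, r_b: float) -> float:
    # Probability that model A beats model B under the Elo model
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32):
    # Adjust both ratings after one head-to-head vote:
    # the winner gains points in proportion to how unexpected the win was
    e_a = expected_score(r_a, r_b)
    score_a = 1.0 if a_won else 0.0
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - e_a))
    return r_a_new, r_b_new

# Two evenly rated models: a single vote moves each rating by k/2
a, b = update_elo(1000, 1000, a_won=True)
print(a, b)  # 1016.0 984.0
```

Because every vote nudges the ratings, a model that appears in more battles, or that was privately pre-screened, has more opportunities to shape its final leaderboard position, which is central to the dispute described below.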
Companies are paying more attention to this ranking as the AI market heats up. Google noted when it released Gemini 2.5 Pro that the model debuted at the top of the LM Arena leaderboard, where it remains to this day. Meanwhile, DeepSeek’s strong performance in the Chatbot Arena earlier this year helped to catapult it to the upper echelons of the LLM race.
The researchers, hailing from Cohere Labs, Princeton, and MIT, believe AI developers may have placed too much stock in LM Arena. The new study, available on the preprint arXiv server, claims the arena rankings are distorted by practices that make it easier for proprietary chatbots to outperform open ones. The authors say LM Arena allows developers of proprietary large language models (LLMs) to privately test multiple versions of their AI on the platform. However, only the highest-performing one is added to the public leaderboard.
Meta tested 27 versions of Llama-4 before releasing the version that appeared on the leaderboard. (Credit: Shivalika Singh et al.)
Some AI developers are taking extreme advantage of the private testing option. The study reports that Meta tested a whopping 27 private variants of Llama-4 before release. Google is also a beneficiary of LM Arena’s private testing system, having tested 10 variants of Gemini and Gemma between January and March 2025.
The study also calls out LM Arena for what appears to be much greater promotion of proprietary models like Gemini, ChatGPT, and Claude. Developers collect data on model interactions through the Chatbot Arena API, but teams working on open models consistently get the short end of the stick.