Did xAI post misleading benchmark results for its Grok 3 models?

Babushkin argues that the ChatGPT maker has published similarly misleading benchmark results in the past

By TCP News Desk Published February 23, 2025 at 08:23 PM GMT+5 Updated one year ago

A few days ago, an OpenAI employee accused xAI, Elon Musk’s AI company, of publishing misleading benchmark results for its recently launched AI model, Grok 3. In response, xAI co-founder Igor Babushkin insisted that the company was in the right.

In a post on its blog, xAI published a graph showing Grok 3’s performance on AIME 2025, a set of difficult questions drawn from the American Invitational Mathematics Examination. Some experts have questioned AIME’s validity as an AI benchmark.

xAI’s graph showed two variants of the newly introduced Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, outperforming OpenAI’s o3-mini-high on AIME 2025.

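However, OpenAI employees quickly pointed out on X (formerly Twitter) that xAI’s graph did not include o3-mini-high’s AIME 2025 score at “cons@64.” Short for “consensus@64,” this setting gives a model 64 tries to answer each question in a benchmark and takes the answer it generates most often as the final answer. Because cons@64 tends to raise a model’s score, leaving it off a graph can make one model appear to beat another when it does not.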
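To make the distinction concrete, here is a minimal Python sketch of how a single-attempt score differs from a consensus score. It is an illustration only, not the evaluation code used by xAI or OpenAI: the function names, the three questions, and the five sampled answers per question are all made up, and the example uses 5 tries rather than 64 to stay readable.

```python
from collections import Counter


def consensus_answer(samples):
    """Return the answer that appears most often among the sampled attempts."""
    return Counter(samples).most_common(1)[0][0]


def score_at_1(attempts, reference):
    """Fraction of questions answered correctly using only the first attempt."""
    correct = sum(samples[0] == reference[q] for q, samples in attempts.items())
    return correct / len(attempts)


def score_cons_at_k(attempts, reference, k):
    """Fraction of questions answered correctly by majority vote over k attempts."""
    correct = sum(
        consensus_answer(samples[:k]) == reference[q]
        for q, samples in attempts.items()
    )
    return correct / len(attempts)


# Hypothetical data: reference answers and five sampled model answers per question.
reference = {"q1": "204", "q2": "025", "q3": "113"}
attempts = {
    "q1": ["204", "204", "210", "204", "204"],
    "q2": ["030", "025", "025", "025", "031"],  # first try wrong, majority right
    "q3": ["113", "117", "113", "113", "120"],
}

single = score_at_1(attempts, reference)
cons = score_cons_at_k(attempts, reference, k=5)
print(f"@1: {single:.2f}")
print(f"cons@5: {cons:.2f}")
```

In this toy data the same model scores higher under majority voting than on its first attempt, which is why a chart that mixes the two settings is not a like-for-like comparison. The consensus score also requires k times as many generated answers per question.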

In a post on X, Babushkin argued that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models.

A more neutral party in the debate put together a more “precise” graph showing almost every model’s performance at cons@64 and stated: “Hilarious how some people see my plot as an attack on OpenAI and others as an attack on Grok while in reality, it’s DeepSeek propaganda. (I believe Grok looks good there, and OpenAI’s TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.)”

AI researcher Nathan Lambert pointed out in a post that perhaps the most important metric remains a mystery: the computational (and monetary) cost each model incurred to achieve its best score. This shows how little most AI benchmarks communicate about models’ limitations and strengths.