Meta’s vanilla Maverick AI model ranks below rivals on a popular chat benchmark

Meta’s Llama 4 Model Underperforms in Benchmark After Controversy

Earlier this week, Meta drew criticism for using an experimental, unreleased version of its Llama 4 Maverick large language model to achieve a high score on LM Arena, a crowdsourced AI benchmark. The episode prompted LM Arena's maintainers to apologize, revise their evaluation policies, and score the unmodified, vanilla Maverick.

The results show the standard model is far less competitive than the experimental version's score suggested.

Unmodified Llama 4 Ranks Lower Than Leading AI Models

As of Friday, the unmodified Maverick, "Llama-4-Maverick-17B-128E-Instruct," ranked below several established models, including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro. Notably, many of these rival models have been available for months.

Experimental Version Optimized for Conversational Abilities

Why the gap? The experimental Maverick, "Llama-4-Maverick-03-26-Experimental," was "optimized for conversationality," Meta explained in a chart published last Saturday. Those optimizations evidently played well on LM Arena, where human raters compare the outputs of competing models and choose which they prefer.

Concerns Regarding Benchmark Reliability

As previously reported, LM Arena has faced questions about its reliability as a measure of AI model performance. Beyond being potentially misleading, tailoring a model to excel on a particular benchmark also makes it harder for developers to predict how the model will perform across diverse real-world applications and contexts.

Meta’s Response to Model Variations and Customization

In a statement provided to TechCrunch, a Meta spokesperson confirmed the company routinely experiments with “all types of custom variants” of their AI models.

“‘Llama-4-Maverick-03-26-Experimental’ represents a chat-optimized version that we tested, which also demonstrated strong performance on LM Arena,” the spokesperson explained. “We have now released our open-source version and anticipate observing how developers adapt and customize Llama 4 for their individual use cases. We are eager to see the innovations they develop and value their ongoing feedback.”

