Meta got caught gaming AI benchmarks


Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash “across a broad range of widely reported benchmarks.”

Maverick quickly secured the number-two spot on LMArena, the AI benchmark site where humans compare outputs from different systems and vote on the best one. In Meta’s press release, the company highlighted Maverick’s ELO score of 1417, which placed it above OpenAI’s 4o and just under Gemini 2.5 Pro. (A higher ELO score means the model wins more often in the arena when going head-to-head with competitors.)

The achievement seemed to position Meta’s open-weight Llama 4 as a serious challenger to the state-of-the-art closed models from OpenAI, Anthropic, and Google. Then AI researchers digging through Meta’s documentation discovered something unusual.

In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn’t the same as what’s available to the public. According to Meta’s own materials, it deployed an “experimental chat version” of Maverick to LMArena that was specifically “optimized for conversationality,” …

Read the full story at The Verge.
