Meta got caught gaming AI benchmarks


Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash “across a broad range of widely reported benchmarks.”

Maverick quickly secured the number-two spot on LMArena, the AI benchmark site where humans compare outputs from different systems and vote on the best one. In Meta’s press release, the company highlighted Maverick’s ELO score of 1417, which placed it above OpenAI’s 4o and just under Gemini 2.5 Pro. (A higher ELO score means the model wins more often in the arena when going head-to-head with competitors.)

The achievement seemed to position Meta’s open-weight Llama 4 as a serious challenger to the state-of-the-art closed models from OpenAI, Anthropic, and Google. Then AI researchers digging through Meta’s documentation discovered something unusual.

In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn’t the same as what’s available to the public. According to Meta’s own materials, it deployed an “experimental chat version” of Maverick to LMArena that was specifically “optimized for conversationality,” …

Read the full story at The Verge.
