
Meta's new model scored very well on the benchmarks. Maybe too well.

We had been waiting a long time for Llama 4, the new family of artificial intelligence models. Last weekend the company finally revealed those models, and everything seemed promising. The problem is that the way they were announced is generating controversy and an uncomfortable conversation: perhaps Meta has cheated on the benchmarks.

Llama 4 looks great. As soon as they appeared on the scene, Meta's new Llama 4 models surprised everyone with their excellent benchmark performance. They ranked second in LMArena, only below Gemini 2.5 Pro Experimental. However, suspicions soon appeared, because the Llama 4 version available to the public was not the same one shown in that ranking.

A doctored version? As Meta indicated in the announcement, that Llama 4 version was an "experimental" one that obtained 1,417 points in LMArena, while Gemini 2.5 Pro Experimental had obtained 1,439 points. Some experts pointed out that this experimental Llama 4 version was cheating: it had been specifically trained on the datasets used in benchmarks in order to score well on them.

"We have not cheated." Ahmad Al-Dahle is the head of Meta's generative AI division and is therefore in charge of the Llama 4 launch. He has flatly denied the rumors suggesting that Meta cheated to get better scores on the benchmarks. Those rumors "are false and we would never do that," he said.

But it was "optimized". As TechCrunch points out, in that official announcement Meta did note that the experimental Llama 4 model that had scored so well was "optimized for conversation." LMArena indicated that Meta should have explained more clearly what kind of model it had submitted for inclusion in the ranking.

Llama 4 itself is not that good. Some experts who analyzed Llama 4's performance with synthetic or conventional tests had already warned that it did not seem as good as Meta claims. The publicly available model showed behavior that did not match the quality suggested by its LMArena score.

Not very consistent. Al-Dahle himself acknowledged that some users were seeing "mixed quality" results from Maverick and Scout, the two available Llama 4 versions, depending on the provider. He said he expected it would take a few days for the public implementations to be properly adjusted, and added that they would keep working to correct possible errors.

A strange launch. It is odd that Meta released this model on a Saturday, but when asked about it, Mark Zuckerberg replied that "that's when it was ready." The fact that the model used in LMArena is not the same one people can actually use is also worrying, and it may lead us to start distrusting benchmarks and the companies that use them to promote their products. It is not the first time this has happened, far from it, and it will not be the last.

In Xataka | OpenAI is burning money as if there were no tomorrow. The question is how long it can keep this up.
