We don't know what AI benchmarks actually measure. So we talked to the Spaniard who created one of the most difficult ones

Gemini 2.5 Pro is the best model in history. The smartest. At least, right now. It's not me saying it: the Chatbot Arena ranking says it, a platform that runs various tests or benchmarks to try to measure the overall capability of modern AI models.

According to these tests, Gemini 2.5 Pro Experimental, launched on March 25, currently has a score of 1,440 points, well above GPT-4o (1,406), Grok 3 (1,404), GPT-4.5 (1,398) and, of course, a DeepSeek R1 that despite its fame sits in seventh place with 1,359 points. The current Chatbot Arena ranking places Gemini 2.5 Pro Experimental as the most capable AI model at the moment. That (probably) won't last long.

Google itself boasted about Gemini 2.5 Pro Experimental's capabilities in the official announcement. As usually happens with these announcements, companies show a table comparing their performance with that of other comparable models across different tests. In almost all of them Google crushed its rivals on well-known tests in this segment: for example Humanity's Last Exam (general knowledge and reasoning), GPQA Diamond (science), AIME 2025 (math), LiveCodeBench v5 and SWE-Bench Verified (programming), or MMMU (visual reasoning).

All these benchmarks try to measure the ability of these models in more or less specific fields, and all of them help show that the models are, indeed, improving. And yet none of them answers the fundamental question: is AI as intelligent as a human being? That is the really hard part, because the definition of intelligence is not entirely clear either. There are in fact different types of intelligence, and measuring them in humans is not simple, or even always possible. And comparing the ability of an AI with human intelligence is usually not easy either.

Some experts wonder whether AI labs are cheating on the benchmarks

There are in fact those who argue that the progress of AI models is misleading. Dean Valentine, from the startup ZeroPath, did so recently. He and his team created an AI system that analyzes large code projects looking for security problems. With Claude 3.5 Sonnet they noticed a great leap, but since then the subsequent versions have seemed much less striking.

In fact, this expert pointed out that today many of the companies launching these models focus too much on looking good in the most popular existing benchmarks and on "sounding intelligent" in conversations with human beings. He wonders whether AI labs are cheating and lying: for him, the evolution shown by the benchmarks does not match the real benefit of using these models.

FrontierMath and the challenge of solving problems that (almost) nobody has solved

But there are attempts to answer that question. One of them comes from the team behind the ARC-AGI 2 project, a set of tests derived from Moravec's paradox: they are relatively easy for a human being, but very difficult for AI models. These tests measure the ability to generalize and to reason abstractly with visual puzzles, and they are undoubtedly an interesting part of the effort to assess how far we have come with AI models at any given moment.

Another of the most striking tests of recent times is FrontierMath. This benchmark, created by the company Epoch AI, consists of about 300 mathematical problems of varying difficulty. They were designed by a team of more than 60 mathematicians, including Fields Medal winner Terence Tao.
Although there are some more approachable problems, 25% of them are rated as especially complex. In fact, only the best experts could solve them, and even they would need days to do so. This set of tests is also special for another reason: the problems are unpublished, and therefore have not been part of the training data of any AI model. To solve them, the machines need to show a special "mathematical intelligence", one that helps precisely with something increasingly difficult: assessing how these models are evolving.

At Xataka we were able to talk to Jaime Sevilla (@Jsevillamol), who is precisely the CEO of Epoch AI and has a very clear, personal view of what tests should look like in order to measure an AI model's ability. To begin with, he points out, "you need to have a way of measuring how AI is advancing. Interacting with it can give you perspective, but you don't get a rigorous impression of how far it will go and in which domains it is most expert." That, he explains, makes it necessary to have standardized test batteries that allow us to form an idea of their skills.

For this expert, the ARC-AGI benchmark is more representative of that other vision: making a benchmark that is easy for humans but difficult for AI. Models are improving at ARC-AGI, but for him that was obvious and bound to happen. With his benchmark the tests are difficult for both humans and machines, and that the models keep advancing and getting better at solving these problems is not so obvious.

Thus, with FrontierMath they wanted "to try to measure whether AI can solve genuinely difficult problems." Until now, the mathematical problems that AI models were subjected to were relatively easy, so the models "saturated the benchmarks"; that is, they soon managed to pass all those tests and reach a 100% score. "It will be a challenge to saturate this benchmark," he stressed. Here he gave the example of OpenAI's o3-mini model, which already solves 10% of FrontierMath. It is not much, but it is impressive, he says, and it has already surpassed expert mathematicians like himself.

However, he says, "that AI passes certain benchmarks does not mean it can operate as a human expert. You have to adjust them because they are tuned to very specific scenarios. We are measuring the limits of that AI, and that will be a continuous process." For Sevilla… Read more
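To make the idea of "saturating" a benchmark concrete, here is a minimal, purely illustrative sketch of how a FrontierMath-style evaluation could be scored: run each unpublished problem past the model, check the final answer automatically, and report the fraction solved. The toy problem set, the `ask_model` stub and the exact-match grading are assumptions for illustration, not Epoch AI's actual harness.

```python
# Illustrative sketch of scoring a FrontierMath-style benchmark.
# Everything here is hypothetical: real harnesses query a model API,
# normalize answers and verify them programmatically.

def ask_model(statement: str) -> str:
    """Stand-in for a call to the model being evaluated."""
    return "42"  # a real harness would extract the model's final answer

def pass_rate(problems: list[dict]) -> float:
    solved = 0
    for p in problems:
        answer = ask_model(p["statement"]).strip()
        # FrontierMath problems are designed to have automatically checkable
        # final answers; plain string equality is a simplification of that.
        if answer == p["expected_answer"]:
            solved += 1
    return solved / len(problems)

problems = [
    {"statement": "Compute 6 * 7.", "expected_answer": "42"},
    {"statement": "A much harder, unpublished research-level problem.",
     "expected_answer": "unknown-to-the-model"},
]

# A benchmark is "saturated" when the best models push this number toward 1.0;
# on FrontierMath, the figure cited for o3-mini is on the order of 0.10.
print(f"pass rate: {pass_rate(problems):.0%}")  # -> 50% in this toy example
```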

Meta's new model got a very good score on the benchmarks. Maybe too good

We had been waiting for the new Llama 4 family of artificial intelligence models for a long time. Last weekend the company finally revealed those models, and everything seemed promising. The problem is that the way they were announced is generating some controversy and an uncomfortable conversation: perhaps Meta cheated on the benchmarks.

Llama 4 looks great. As soon as they appeared on the scene, Meta's new Llama 4 models surprised with their excellent benchmark performance. They came second in the LMArena ranking, only behind Gemini 2.5 Pro Experimental. However, suspicions soon appeared, because the Llama 4 version available to the general public was not the same one shown in that ranking.

A rigged version? As Meta indicated in its announcement, that Llama 4 version was an "experimental" one that obtained 1,417 points in LMArena, while Gemini 2.5 Pro Experimental had obtained 1,439 points. Some experts pointed out that this experimental Llama 4 was a version that cheated: one trained specifically on data sets used in benchmarks so it could score well on them.

"We have not cheated." Ahmad Al-Dahle is the head of Meta's generative AI division and therefore in charge of the Llama 4 launch. He has flatly denied the rumors suggesting that Meta cheated to get better benchmark scores. Those rumors "are false and we would never do that," he said.

But it was "optimized". As TechCrunch notes, in that official announcement Meta did point out that the experimental Llama 4 model that had scored so well was "optimized for conversation." LMArena indicated that Meta should have explained more clearly what kind of model it had submitted for inclusion in the ranking.

The Llama 4 everyone gets is not as good. Some experts who analyzed Llama 4's performance with synthetic or conventional tests had already warned that its performance did not seem as good as Meta claims. The publicly available model showed behavior that did not match the quality its LMArena score suggested.

Not quite consistent. Al-Dahle himself confirmed that some users were seeing "different quality" results from Maverick and Scout, the two available Llama 4 versions, depending on the provider. He said they expect it will take a few days for public implementations to be adjusted, and added that they would keep working to correct possible errors.

A strange release. That Meta launched this model on a Saturday is odd, but when asked about it, Mark Zuckerberg replied that that "is when it was ready." That the model used in LMArena is not the same one people can use is also worrying, and it may start to make us distrust benchmarks and the companies that use them to promote their products. It is not the first time this has happened, far from it, and it will not be the last.

In Xataka | OpenAI is burning money as if there were no tomorrow. The question is how long it can keep this up
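As an aside on what gaps like 1,417 vs 1,439 actually imply: LMArena fits its ratings from pairwise human votes with a Bradley-Terry model, which on the usual Elo-style 400-point scale translates rating differences into head-to-head win probabilities. A rough, approximate sketch of that conversion (not LMArena's exact methodology) follows.

```python
# Back-of-the-envelope conversion from Arena-style rating gaps to win probability,
# using the standard Elo logistic formula with a 400-point scale.

def expected_win_prob(rating_a: float, rating_b: float) -> float:
    """Approximate probability that model A is preferred over model B in a vote."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Gemini 2.5 Pro Experimental vs the "experimental" Llama 4 entry
print(expected_win_prob(1439, 1417))  # ~0.53
# Gemini 2.5 Pro Experimental vs GPT-4o
print(expected_win_prob(1440, 1406))  # ~0.55
```

In other words, a 22-point gap corresponds to the higher-rated model being preferred in roughly 53% of matchups, which is part of why small leaderboard differences are easy to over-interpret.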
