We do not really know what AI benchmarks measure. So we talked to the Spaniard who created one of the most difficult ones.
Gemini 2.5 Pro is the best model in history. The smartest. At least for now. And that is not my claim: it is what the Chatbot Arena leaderboard says, a platform that runs head-to-head tests and benchmarks to try to measure the overall capability of modern AI models. According to those tests, Gemini 2.5 Pro Experimental, launched on March 25, currently scores 1,440 points, well above GPT-4o (1,406), Grok 3 (1,404), GPT-4.5 (1,398) and, of course, a DeepSeek R1 that despite its fame sits in seventh place with 1,359 points. The current Chatbot Arena ranking places Gemini 2.5 Pro Experimental as the most capable AI model of the moment. That (probably) will not last long.

Google itself boasted about the capability of Gemini 2.5 Pro Experimental in the official announcement. As usually happens with these announcements, the company showed a table comparing its performance with that of other comparable models across different tests. In almost all of them Google crushed its rivals on benchmarks well known in this segment, such as Humanity's Last Exam (general knowledge and reasoning), GPQA Diamond (science), AIME 2025 (math), LiveCodeBench v5 and SWE-bench Verified (programming), or MMMU (visual reasoning).

All of these benchmarks try to measure the ability of these models in more or less specific fields, and all of them help demonstrate that the models are indeed improving. And yet none of them answers the fundamental question: is AI as intelligent as a human being? That is where things get really complicated, because the definition of intelligence is not entirely clear either. There are in fact different types of intelligence, and measuring them even in humans is not simple, or always possible. Comparing the ability of an AI with human intelligence is rarely straightforward.

Some experts wonder whether AI labs are gaming the benchmarks. There are in fact those who argue that the progress of AI models is misleading.
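A note on what those Arena points mean: they are Elo-style ratings, so a gap between two scores maps to an expected head-to-head win rate. As a rough illustration (assuming the standard Elo formula with a 400-point scale, which Chatbot Arena's Bradley-Terry-style ratings approximate), the 34-point gap between Gemini 2.5 Pro (1,440) and GPT-4o (1,406) implies only a modest edge in any single matchup:

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# 34-point gap between Gemini 2.5 Pro (1440) and GPT-4o (1406)
p = elo_win_probability(1440, 1406)
print(f"{p:.3f}")  # roughly 0.549: barely better than a coin flip per matchup
```

In other words, a "well above" rating gap still means the lower-ranked model wins a large share of individual comparisons; leaderboard positions summarize many such matchups.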
One of them is Dean Valentine, from the startup ZeroPath, who made this argument recently. He and his team created an AI system that analyzes large code projects in search of security problems. With Claude 3.5 Sonnet they noticed a great leap, but since then the subsequent versions have seemed much less striking. In fact, this expert pointed out that many of the companies launching these models today focus too much on looking good on the most popular existing benchmarks and on "sounding intelligent" in conversations with human beings. He wonders whether the AI labs are cheating and lying: for him, the evolution shown by the benchmarks does not correspond to the real benefits when actually using the models.

FrontierMath and the challenge of solving problems that (almost) nobody has solved

But there are attempts to answer that question. One of them comes from the team developing the ARC-AGI-2 project, a set of tests derived from Moravec's paradox: they are relatively easy for a human being, but very difficult for AI models. These tests measure the ability to generalize and to reason abstractly with visual puzzles, and they are undoubtedly an interesting part of the effort to assess how far AI models have come at any given moment.

Jaime Sevilla, CEO of Epoch AI.

Another of the most striking tests of recent times is FrontierMath. This benchmark, created by the company Epoch AI, consists of about 300 mathematical problems of varying difficulty. They were designed by a team of more than 60 mathematicians, among them Terence Tao, winner of the Fields Medal. Although some of the problems are more approachable, 25% of them are rated as especially complex. In fact, only the best experts could solve them, and even they would take days to do so. This set of tests is also special for another reason: the problems are unpublished, and therefore have not been part of the training sets of any AI model.
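To picture the kind of task ARC-AGI poses, here is a toy sketch (not an actual ARC-AGI problem, and a deliberately simple rule): small colored grids where the solver must infer a transformation, such as a horizontal mirror, from an example pair and then apply it to an unseen grid. Trivial for a human, historically hard for models:

```python
# Toy ARC-style puzzle. Integers stand for colors in a small grid.
Grid = list[list[int]]

def mirror_horizontal(grid: Grid) -> Grid:
    """Candidate rule: the output is the input grid mirrored left-to-right."""
    return [list(reversed(row)) for row in grid]

# One training pair the solver can learn the rule from.
train_input  = [[1, 0, 0],
                [0, 2, 0]]
train_output = [[0, 0, 1],
                [0, 2, 0]]

# The candidate rule must reproduce the training pair...
assert mirror_horizontal(train_input) == train_output

# ...and is then scored on held-out test grids, as ARC-AGI does.
test_input = [[3, 3, 0]]
print(mirror_horizontal(test_input))  # [[0, 3, 3]]
```

The real benchmark uses many such tasks with rules the solver has never seen, which is exactly what makes it a test of generalization rather than memorization.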
To solve them, the machines need to show a special "mathematical intelligence", one that helps precisely with something increasingly difficult: assessing the evolution of these models. At Xataka we were able to talk to Jaime Sevilla (@Jsevillamol), who is precisely the CEO of Epoch AI and has a very clear, personal vision of how tests to measure the capability of an AI model should be designed.

To begin with, he points out, "you need to have a way of measuring how AI is advancing. Interacting with it can give you perspective, but it does not give you a rigorous impression of how far it will go and in which domains it is most expert." That, he explains, makes it necessary to have standardized test batteries that allow us to form an idea of their abilities.

For this expert, the ARC-AGI benchmark is more representative of the other approach: building a benchmark that is easy for humans but difficult for AI. The models are improving on ARC-AGI, but for him that was obvious and bound to happen. With his own tests, the problems are difficult for everyone, so it is far less obvious that the models will keep advancing and get better at solving them.

Thus, with FrontierMath they wanted to "try to measure whether AI can solve genuinely difficult problems." Until now, the mathematical problems posed to AI models were relatively easy, so the models "saturated the benchmarks": they soon managed to pass all those tests and approach a 100% score. "It will be a challenge to saturate this benchmark," he stressed. Here he gave the example of OpenAI's o3-mini model, which already solves 10% of FrontierMath. It is not much, but it is remarkable, he says, and it has already surpassed expert mathematicians like himself. However, he adds, "That an AI passes certain benchmarks does not mean that it can operate as a human expert. You have to take them with caution, because they are adjusted to very specific scenarios.
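Benchmark "saturation" is simply the score approaching its ceiling, at which point the test stops discriminating between models. A minimal sketch of the idea (the counts and the 95% threshold below are illustrative, using the ~300-problem size and ~10% figure mentioned above, not Epoch AI's actual grading code):

```python
def benchmark_score(solved: int, total: int) -> float:
    """Fraction of problems a model answers correctly."""
    return solved / total

def is_saturated(score: float, ceiling: float = 0.95) -> bool:
    """A benchmark is 'saturated' once top models sit near its maximum score."""
    return score >= ceiling

# Illustrative numbers: ~30 of ~300 FrontierMath problems solved (about 10%)
score = benchmark_score(30, 300)
print(f"{score:.0%}, saturated: {is_saturated(score)}")  # 10%, saturated: False
```

A saturated benchmark (everyone near 100%) tells you nothing about which model is stronger, which is why Sevilla's team designed FrontierMath to stay far from that ceiling for as long as possible.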
We are measuring the limits of that AI, and that will be a continuous process." For Sevilla …