Gemini 2.5 Pro is the best model in history. The smartest. At least, right now. That's not me saying it; it's the Chatbot Arena leaderboard, a platform that runs various tests and benchmarks to try to measure the overall capability of modern AI models.
According to those tests, Gemini 2.5 Pro Experimental, launched on March 25, currently has a score of 1,440 points, well above GPT-4o (1,406), Grok 3 (1,404), GPT-4.5 (1,398) and, of course, a DeepSeek R1 that, despite its fame, sits in seventh place with 1,359 points.
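Those Arena scores are Elo-style ratings, so the gaps translate into expected head-to-head preference rates. As a rough illustration (using the standard Elo formula, not Chatbot Arena's exact methodology), here is what the score differences above imply:

```python
# Rough sketch: what a Chatbot Arena score gap implies in head-to-head terms.
# Arena ratings are Elo-style; under the standard Elo formula, the expected
# win rate depends only on the score difference. This is an illustration,
# not Chatbot Arena's exact methodology.

def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

scores = {
    "Gemini 2.5 Pro Experimental": 1440,
    "GPT-4o": 1406,
    "Grok 3": 1404,
    "GPT-4.5": 1398,
    "DeepSeek R1": 1359,
}

gemini = scores["Gemini 2.5 Pro Experimental"]
for name, score in scores.items():
    if name != "Gemini 2.5 Pro Experimental":
        p = expected_win_rate(gemini, score)
        print(f"Gemini 2.5 Pro vs {name}: expected win rate ~{p:.0%}")
```

In other words, a 34-point lead over GPT-4o corresponds to being preferred roughly 55% of the time: a clear but modest edge.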


The current Chatbot Arena ranking places Gemini 2.5 Pro Experimental as the most capable AI model at the moment. That (probably) won't last long.
Google itself boasted about the capabilities of Gemini 2.5 Pro Experimental in the official announcement. As usually happens with these announcements, the company showed a table comparing its performance with that of other comparable models across different tests.
In almost all of them Google crushed its rivals on well-known tests in this segment: Humanity's Last Exam (general knowledge and reasoning), GPQA Diamond (science), AIME 2025 (math), LiveCodeBench v5 and SWE-bench Verified (programming), and MMMU (visual reasoning).
All these benchmarks try to measure the ability of these models in more or less specific fields, and all of them help show that the models are indeed improving. And yet none of them answers the fundamental question:
Is AI as intelligent as a human being?
That is where things get really complicated, because the definition of intelligence is not entirely clear either. There are, in fact, different types of intelligence, and measuring them in humans is not simple, or even always possible. Comparing the capability of an AI with that of human intelligence is usually no easier.
Some experts wonder whether AI labs are gaming the benchmarks
There are, in fact, those who argue that the progress of AI models is misleading. Dean Valentine, from the startup ZeroPath, did so recently. He and his team created an AI system that analyzes large code projects looking for security problems. With Claude 3.5 Sonnet they noticed a great leap, but since then the subsequent versions have seemed far less striking.
In fact, this expert pointed out that many of the companies launching these models today focus too much on looking good on the most popular existing benchmarks and on "sounding intelligent" in conversations with humans. He wonders whether AI labs are cheating and lying: for him, the evolution shown by benchmarks does not match the real benefit of using the models.
FrontierMath and the challenge of solving problems that (almost) nobody has solved
But there are attempts to answer that question. One of them comes from the team behind the ARC-AGI-2 project, a set of tests inspired by Moravec's paradox: they are relatively easy for human beings, but very difficult for AI models.


Jaime Sevilla, CEO of Epoch AI.
These tests measure the ability to generalize and to reason abstractly through visual puzzles, and they are undoubtedly an interesting part of the effort to gauge how far AI models have come at any given moment.
Another of the most striking tests of recent times is FrontierMath. This benchmark, created by the company Epoch AI, consists of about 300 mathematical problems of varying difficulty.
They were designed by a team of more than 60 mathematicians, among them Terence Tao, winner of the Fields Medal. Although some problems are more approachable, 25% of them are rated as especially complex. In fact, only the best experts could solve them, and even they would need days to do so.
This set of tests is also special for another reason: these are unpublished problems and therefore have not been part of the training set of any AI model. To solve them, the machines need to display a special "mathematical intelligence", one that helps with something increasingly difficult: assessing how these models are evolving.
At Xataka we were able to talk to Jaime Sevilla (@Jsevillamol), who is precisely the CEO of Epoch AI and has a very clear, personal vision of what tests should look like in order to measure the capability of an AI model.
To begin with, he points out, "you need to have a way of measuring how AI is advancing. Interacting with it can give you perspective, but it does not give you a rigorous impression of how far it will go and in which domains it is most expert."


That, he explains, makes it necessary to have standardized test batteries that allow us to form an idea of their skills. For this expert, the ARC-AGI benchmark is representative of that other approach: building a benchmark that is easy for humans but difficult for AI.
Models are improving on ARC-AGI, but for him that was obvious and bound to happen. With his own benchmark the tests are difficult for humans and machines alike, and it is far less obvious that models will keep advancing and getting better at solving these problems.
Thus, with FrontierMath they wanted to "try to measure whether AI can solve genuinely difficult problems." Until now, the mathematical problems given to AI models were relatively easy, so the models "saturated the benchmarks": they soon managed to pass all the tests and reach a 100% score. "It will be a challenge to saturate this benchmark," he stressed.
Here he gave the example of OpenAI's o3-mini model, which already solves 10% of FrontierMath. It is not much, but it is impressive, he says, and it already surpasses expert mathematicians like himself. However, he adds:
"That AI beats certain benchmarks does not mean it can operate as a human expert. You have to qualify the results, because benchmarks are tuned to very specific scenarios. We are measuring the limits of that AI, and that will be a continuous process."
For Sevilla there is an especially important area in which to measure that performance: agency, the ability of these systems to do work remotely and autonomously. Here the clearest examples of systems attempting this are Computer Use, from Anthropic, and Operator, from OpenAI.
One especially noteworthy benchmark here is OSWorld. It tries to measure whether these agents can solve tasks, although "for now it is very basic," says Sevilla. That doesn't matter much, because, as he points out, this is the usual evolution of these developments.
"At the beginning of the benchmark cycle nothing gets solved," explains Sevilla. "Then there is a point where something starts to get solved, and there you enter the linear part of the sigmoid; there you see relatively predictable improvements, and as the models scale they keep improving until the benchmark is saturated."
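As a rough illustration of that lifecycle, a logistic curve over a scale axis reproduces the three phases he describes: a flat start, a roughly linear middle, and saturation. The parameters below are invented for the sketch and are not fitted to any real benchmark:

```python
# Illustrative sketch of the benchmark lifecycle Sevilla describes: score as a
# logistic (sigmoid) function of model scale. The midpoint, steepness and axis
# values are made up for illustration only, not fitted to any real benchmark.
import math

def benchmark_score(log_scale: float, midpoint: float = 24.0, steepness: float = 1.5) -> float:
    """Score in [0, 1]: flat at first, roughly linear around the midpoint, then saturating."""
    return 1.0 / (1.0 + math.exp(-steepness * (log_scale - midpoint)))

for log_scale in range(20, 29):  # hypothetical log10(training compute)-style axis
    score = benchmark_score(log_scale)
    bar = "#" * int(score * 40)
    print(f"scale={log_scale:>2}  score={score:6.1%}  {bar}")
```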
We also asked him about the scaling debate and whether it still makes sense right now to devote more money, more GPUs and more data to training AI models. Lately there has been talk of how AI models apparently no longer advance, but for him the scaling strategy still makes a lot of sense.
"We do not have enough evidence to show that scaling trends are dead. If you train with more compute you will get better results."
"We have always assumed that we need to devote a lot of resources to get improvements," he said. He and his team at Epoch AI have observed that the historical relationship between the resources devoted and the improvement obtained was what "we expected", although he does point out that this improvement "has perhaps been a bit disappointing in non-reasoning models", where the advance has not been so clear.
However, he emphasizes, "AlphaGo already used more inference time; it was clear that reasoning works." In his opinion, "we do not have enough evidence to show that scaling trends are dead. If you train with more compute you will get better results," he concludes.
“AI does not think like us”
If one thing is clear to this expert, it is that "AI evidently does not think like us. It runs circles around us in knowledge of medicine or biology, for example, and it is achieving notable advances in areas such as mathematics or programming." However, he explains, "it is not as good at playing Pokémon, for example."


The performance of AI on advanced mathematical problems remains low: o3-mini, the model that does best, only solves 11% of those problems. Source: Epoch AI.
For Sevilla, "what I see is that it is advancing in other things. The comparison with human intelligence is not exact, because the fields in which AI will improve are fields in which the human being has not evolved. I think AI will improve much faster in mathematics or engineering than in robotics or motor control, for example."
Sevilla cited a recent METR study that set out to measure AI capability in terms of the length of the tasks an AI can complete. Its conclusions revealed a clear trend indicating that AI models are improving in a predictable way.


The METR graph shows "the duration of tasks (measured by the time human professionals take) that AI agents can complete with 50% reliability. That duration has doubled approximately every 7 months over the last 6 years."
And as they point out, "even if the absolute measurements are off by a factor of 10, the trend predicts that in less than a decade we will see AI agents capable of independently completing a large share of the software tasks that currently take humans days or weeks."
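That doubling claim lends itself to a quick back-of-the-envelope extrapolation. The sketch below assumes a hypothetical starting point of one-hour tasks (a placeholder, not a METR figure) and simply compounds the 7-month doubling:

```python
# Back-of-the-envelope extrapolation of the METR trend cited above: the length
# of tasks agents can complete (at 50% reliability) doubles roughly every 7 months.
# The starting task length is a placeholder assumption, not a METR figure.

DOUBLING_MONTHS = 7.0
start_task_hours = 1.0   # hypothetical: agents handle ~1-hour tasks today

for years_ahead in (1, 3, 5, 7):
    doublings = (years_ahead * 12) / DOUBLING_MONTHS
    task_hours = start_task_hours * 2 ** doublings
    print(f"+{years_ahead} years: ~{doublings:.1f} doublings "
          f"-> tasks of roughly {task_hours:,.0f} hours (~{task_hours / 40:,.1f} work weeks)")
```

Under those assumptions, hour-scale tasks become week-scale within about three years and multi-week-scale within five, which is the kind of trajectory the study's authors describe.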
"AI does not just regurgitate what it has learned during training; it combines it in novel ways."
There is another debate on which we wanted to get Jaime Sevilla's opinion: the claim, discussed for some time now, that AIs do not generate new knowledge, that they only combine the data they were trained on to "regurgitate" their answers.
Sevilla laughed when this came up and asked us, "What do you think intelligence is?" For him, that is also what human beings do. In fact, he says, FrontierMath shows precisely that AI "does not just regurgitate what it has learned during training, but combines it in novel ways."
His conclusion was also very optimistic about the future of AI. At the pace it is evolving and with the resources being devoted to it, his vision is clear: "Between GPT-2 and GPT-4 there is a difference of 10,000 times more compute," and that translated into an extraordinary improvement between the two models.
We are following that same line of dedicated resources, so according to him, "by the end of the decade we will be seeing a similar leap" between GPT-4 and whatever we have when that period ends. He did not speak specifically of AGI, but he made clear that the advance will be equally spectacular. And there will be benchmarks like FrontierMath to show us that jump.
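As a purely illustrative piece of arithmetic around that framing: if the 10,000x compute jump happened roughly between GPT-2 (2019) and GPT-4 (2023), the implied growth rate is about 10x per year, and compounding it forward shows what "a similar leap by the end of the decade" would require. The dates and the projection are assumptions for the sketch, not figures from the article:

```python
# Quick arithmetic behind Sevilla's framing: GPT-2 -> GPT-4 was ~10,000x more
# training compute. Using the approximate release years (2019 and 2023) as the
# span, that implies a compound growth rate; projecting it forward is purely
# illustrative, not a prediction from the article.

compute_ratio = 10_000
span_years = 2023 - 2019
annual_growth = compute_ratio ** (1 / span_years)
print(f"Implied compute growth: ~{annual_growth:.0f}x per year")

for target_year in (2027, 2030):
    years = target_year - 2023
    ratio = annual_growth ** years
    print(f"By {target_year}: ~{ratio:,.0f}x GPT-4-level training compute at the same pace")
```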