The model challenges benchmarks in a key area

When we think of Xiaomi, it is normal that its mobile phones come to mind or, at most, its foray into electric cars with models like the SU7. What we have seen now, however, points to a much more ambitious move: the company also wants to compete in the artificial intelligence race. It has done so with the launch of MiMo-V2-Pro, a model that, according to the data shared by the company itself, seeks to position itself close to the most advanced systems on the market, but with a very different approach to costs. And that changes the conversation quite a bit.

What Xiaomi proposes. The company presents its model as the "brain" of systems capable of executing complete tasks, not just responding to specific requests, what the sector calls agent-oriented models. According to the official information, it is an architecture that exceeds one trillion total parameters, although it activates only 42 billion on each pass, and it can work with contexts of up to one million tokens. On paper, this allows it to sustain long, complex processes without fragmenting them, something designed for large tasks and more demanding workflows.

Performance against the big names. If we look at the data, Xiaomi does not present its model as the best on the market, but as one that can compete in certain scenarios. In the GDPval-AA benchmark, oriented to real agent-type tasks, it reaches an Elo of 1,426, surpassing Chinese models such as GLM-5 (1,412) and Kimi K2.5 (1,309), although it falls short of proposals such as Claude Sonnet 4.6, which scores 1,633. The external reading comes from Artificial Analysis, which assigns it a score of 49 on its intelligence index, placing it in the group of the most competitive models on the market. The key is that closeness in some benchmarks, not general leadership.

The key is the price. This is where Xiaomi's proposal changes the board.
According to data collected by Artificial Analysis, running its intelligence index with this model costs approximately $348, compared to $2,304 for GPT-5.2 or $2,486 for Claude Opus 4.6. It is not exactly the same comparison as the price per API use, but on both measures Xiaomi comes in clearly below several Western rivals. In its own API, the company sets prices of $1 per million input tokens and $3 per million output tokens in the tier up to 256K of context, a lower rate than models such as Claude Sonnet 4.6 and Claude Opus 4.6 at the same usage level.

Beyond chat. What Xiaomi is proposing with this model is not only better-quality responses, but a change in the type of work it can do. The company insists on moving from conversation to action, with a system capable of using tools, interacting with environments and completing chained tasks. In this context, it presents it as a model optimized for agentic scenarios and links it to frameworks such as OpenClaw, in addition to mentioning collaborations with OpenCode, KiloCode, Blackbox and Cline. On paper, this reinforces the idea of an AI designed to execute workflows, not just answer questions.

Behind the scenes. Xiaomi enters the race with a model that, according to the available data, comes close to the benchmark leaders in some scenarios, though without surpassing them overall. Where there does seem to be a clear bet is on price, and that is where it tries to differentiate itself. The question is whether this balance between cost and performance holds up outside the benchmarks, in real environments. We will have to wait to see whether what the data shows also plays out in the real world.

Images | Xiaomi

In Xataka | China has immediately understood the future of the technology industry: "one-person companies" powered by AI
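To put the rates quoted above in perspective, a quick back-of-the-envelope per-request cost estimate can be sketched like this (a minimal illustration of our own: the function name is hypothetical, and the flat $1/$3-per-million-token rates for the up-to-256K tier come from the figures cited in the article, not from any official Xiaomi SDK):

```python
def mimo_request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the API cost of a single request at the published
    rates for the up-to-256K context tier: $1 per million input
    tokens and $3 per million output tokens."""
    INPUT_RATE = 1.0   # USD per 1M input tokens
    OUTPUT_RATE = 3.0  # USD per 1M output tokens
    return (input_tokens / 1_000_000) * INPUT_RATE \
         + (output_tokens / 1_000_000) * OUTPUT_RATE

# A long agentic task: 200K tokens of context in, 50K tokens generated
print(round(mimo_request_cost_usd(200_000, 50_000), 2))  # 0.35
```

Even a request that fills most of a 256K window stays well under a dollar at these rates, which is the cost argument Xiaomi is leaning on.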

We do not really know what AI benchmarks measure. So we talked to the Spaniard who created one of the hardest ones

Gemini 2.5 Pro is the best model in history. The smartest. At least right now. That is not my claim: it is what the Chatbot Arena leaderboard says, a platform that runs various tests or benchmarks to try to measure the overall capability of modern AI models. According to these tests, at this moment Gemini 2.5 Pro Experimental, launched on March 25, has a score of 1,440 points, well above GPT-4o (1,406), Grok 3 (1,404), GPT-4.5 (1,398) and, of course, a DeepSeek R1 that despite its fame sits in seventh place with 1,359 points.

The current Chatbot Arena ranking places Gemini 2.5 Pro Experimental as the most capable AI model of the moment. That (probably) will not last long.

Google itself boasted about the capability of Gemini 2.5 Pro Experimental in the official announcement. As usually happens with these announcements, companies show a table comparing their performance against other comparable models in different tests. In almost all of them Google crushed its rivals on well-known tests in this segment: for example, Humanity's Last Exam (general knowledge and reasoning), GPQA Diamond (science), AIME 2025 (math), LiveCodeBench v5 and SWE-bench Verified (programming), or MMMU (visual reasoning).

All these benchmarks try to measure the ability of these models in more or less specific fields, and all of them help show that the models are indeed improving. And yet none of them answers the fundamental question: is AI as intelligent as a human being? That is the really complicated part, because the definition of intelligence is not entirely clear either. There are in fact different types of intelligence, and measuring them in humans is not simple, or even always possible. And comparing the capability of an AI with the capability of human intelligence is rarely easy.

Some experts wonder whether AI labs are cheating on the benchmarks. There are in fact those who argue that the progress of AI models is misleading.
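As context for those Arena numbers: Elo-style ratings translate into an expected win probability between two models. Here is a minimal sketch of the standard Elo expected-score formula (our own illustration of how such scores are commonly read, not Chatbot Arena's exact methodology, which uses a related statistical model):

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B in a head-to-head
    comparison under the standard Elo model (logistic curve on a
    400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Gemini 2.5 Pro Experimental (1,440) vs GPT-4o (1,406)
print(round(elo_expected_score(1440, 1406), 3))  # 0.549
```

In other words, a 34-point gap only implies the leader wins about 55% of head-to-head comparisons, which is why the gaps between the top models matter more than the raw ordering.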
One recent example is Dean Valentine, from the startup ZeroPath. He and his team created an AI system that analyzes large code projects in search of security problems. With Claude 3.5 Sonnet they noticed a great leap, but since then the subsequent versions have seemed much less striking. In fact, this expert pointed out that today many of the companies that launch these models focus too much on looking good in the existing, most popular benchmarks and on "sounding intelligent" in conversations with human beings. He wonders whether the AI labs are cheating and lying: for him, the evolution shown by the benchmarks does not correspond to the real benefits when using the models.

FrontierMath and the challenge of solving problems that (almost) nobody has solved

But there are attempts to answer that question. One of them comes from the team developing the ARC-AGI-2 project, a set of tests derived from Moravec's paradox: tasks that are relatively easy for human beings but very difficult for AI models. These tests measure the ability to generalize and to reason abstractly with visual puzzles, and they are undoubtedly an interesting part of the effort to assess how far we have come with AI models at any given moment.

Another of the most striking tests of recent times is FrontierMath. This benchmark, created by the company Epoch AI, consists of about 300 mathematical problems of varying difficulty. They were designed by a team of more than 60 mathematicians, including Fields Medal winner Terence Tao. Although there are some more approachable problems, 25% of them are rated as especially complex. In fact, only the best experts could solve them, and even they would need days to do so. This set of tests is also special for another reason: the problems are unpublished, and therefore have not been part of the training sets of any AI model.
To solve them, the machines need to be able to show a special "mathematical intelligence". One that helps precisely with something increasingly difficult: assessing the evolution of these models.

At Xataka we were able to talk to Jaime Sevilla (@Jsevillamol), who is precisely the CEO of Epoch AI and has a very clear, personal vision of what tests should look like in order to measure the capability of an AI model. To begin with, he points out, "you need to have a way of measuring how AI is advancing. Interacting with it can give you perspective, but it does not give you a rigorous impression of where it will get to and in which domains it is most expert." That, he explains, makes it necessary to have standardized test batteries that allow us to form an idea of their abilities.

For this expert, the ARC-AGI benchmark represents the other vision: making a benchmark easy for humans but difficult for AI. Models are improving at ARC-AGI, but for him that was obvious and bound to happen. With his benchmark, the tests are difficult for both humans and machines, and that the models keep advancing and getting better at solving these problems is not so obvious. Thus, with FrontierMath they wanted to "try to measure whether AI can solve genuinely difficult problems". Until now, the mathematical problems given to AI models were relatively easy, so the models "saturated the benchmarks": they soon managed to beat all these tests and achieve a 100% score. "It will be a challenge to saturate this benchmark," he stressed.

Here he gives the example of OpenAI's o3-mini model, which already solves 10% of FrontierMath. It is not much, but it is remarkable, he says, and it has already surpassed expert mathematicians like himself. However, he adds, "that AI beats certain benchmarks does not mean it can operate as a human expert. You have to qualify them, because they are fitted to very specific scenarios. We are measuring the limits of that AI, and that will be a continuous process." For Sevilla… Read more

The new Meta model got a very good score on the benchmarks. Maybe too good

We had been waiting for the new Llama 4 family of artificial intelligence models for a long time. Last weekend the company finally revealed those models, and everything seemed promising. The problem is that the way they were announced is generating some controversy and an uncomfortable conversation: perhaps Meta cheated on the benchmarks.

Llama 4 looks great. As soon as they appeared on the scene, Meta's new Llama 4 models surprised with their excellent performance on benchmarks. They were second in the LMArena ranking, only below Gemini 2.5 Pro Experimental. However, suspicions soon appeared, because the Llama 4 version available to the public was not the same one shown in that ranking.

A rigged version? As Meta indicated in the announcement, that Llama 4 version was an "experimental" one that obtained 1,417 points on LMArena, while Gemini 2.5 Pro Experimental had obtained 1,439 points. Some experts pointed out that this experimental Llama 4 was a version that cheated: one specifically trained on data sets used in benchmarks in order to score well on them.

"We have not cheated." Ahmad Al-Dahle is the head of Meta's generative AI division, and therefore in charge of the Llama 4 launch. This executive has flatly denied the rumors suggesting that Meta cheated to get better scores on the benchmarks. These rumors "are false and we would never do that," he said.

But it was "optimized". As TechCrunch notes, in that official announcement Meta did point out that the experimental Llama 4 model that had scored so well was "optimized for conversation". LMArena indicated that Meta should have explained better what type of model it had submitted for inclusion in the ranking.

Llama 4 itself is not that good. Some experts who analyzed Llama 4's performance with synthetic or conventional tests were already warning that its performance did not seem as good as Meta claims.
The publicly available model showed behavior that did not match the quality its LMArena score suggested.

Not quite consistent. Al-Dahle himself confirmed that some users were seeing "mixed quality" results from Maverick and Scout, the two available Llama 4 versions, depending on the provider. He said he expected it to take a few days for the public implementations to be dialed in, and added that they would keep working to fix possible errors.

A strange release. That Meta launched this model on a Saturday is odd, but when asked about it Mark Zuckerberg replied that that "is when it was ready". That the model used on LMArena is not the same one people can actually use is also worrying, and it may make us start to distrust benchmarks and the companies that use them to promote their products. It is not the first time this has happened, far from it, and it will not be the last.

In Xataka | OpenAI is burning money as if there were no tomorrow. The question is how long it can keep this up
