AI agents misfire more often than a fairground shotgun. At least, that is what a recent study by researchers from Carnegie Mellon University (CMU) and Duke University reveals. These experts analyzed the behavior of several agents and put them to the test to check whether the hype is a case of "much ado about nothing." For now, it is.
Inspiration. Graham Neubig, a professor at CMU, explained to The Register that the inspiration had been a 2023 OpenAI paper. It discussed which kinds of jobs could be replaced by AI systems but, as he put it, "their methodology was basically asking ChatGPT whether those jobs could be automated." In their study, the researchers wanted to verify this by asking various AI agents to try to complete tasks that professionals in those jobs theoretically carry out.
TheAgentCompany. To carry out their study, the researchers created a fictitious company they called TheAgentCompany and used it to have different agentic models try to complete various tasks. These systems had access to several services such as GitLab, ownCloud and Rocket.Chat to carry out the work, but their performance was disappointing.
70% errors. The researchers used two test environments called OpenHands CodeAct and OWL-Roleplay, and in them they evaluated the most important AI models available today. The best of them all is Claude Sonnet 4, which managed to solve 33.1% of the proposed tasks. Behind it come Claude 3.7 Sonnet (30.9%), Gemini 2.5 Pro (30.3%) and, much further back, a disastrous GPT-4o (8.6%), Llama 3.1 405B (7.4%), Qwen 2.5 72B (5.7%) and Amazon Nova Pro v1.0 (1.7%). In the best case the models complete about 30% of the requested tasks and fail at the remaining 70%. In other words: much ado about nothing, according to these benchmarks.
Incapable agents. During these tests the researchers observed various types of failure in how the agents handled the tasks. There were agents that refused to send a message to colleagues when the task required it, agents unable to manage pop-up windows during browsing sessions, and even agents that cheated. In one case, they highlighted, an agent that had to consult a person on Rocket.Chat (an open source alternative to Slack) could not find them, so it renamed another user to the name of the person it was supposed to contact.
But they are improving. Even with those problems, the performance of these AI agents is evolving positively. Neubig and his team tested a software agent that was able to solve about 24% of tasks involving web browsing, programming and related chores. Six months later they tested a new version, which completed 34% of the tasks.
Imperfect but useful. Not only that: the researchers pointed out that even with such a high failure rate, AI agents can still be useful. In certain contexts, such as programming, a partial code suggestion that solves a specific fragment of a program can end up being the basis of a solution that the developer then builds on.
Careful where you use them. Of course, agents that make so many mistakes can be a problem in scenarios that are more sensitive to such errors. If we task an agent with writing emails and it sends them to the wrong people, the result could be a disaster. There are solutions on the horizon, such as the growing adoption of the Model Context Protocol (MCP), which standardizes the interaction between services and AI models so that communication is much more precise and these errors can be mitigated during the autonomous execution of tasks.
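To make that idea concrete, below is a minimal sketch of how a service can expose a narrowly scoped, typed tool through MCP instead of letting an agent improvise. It assumes the official `mcp` Python SDK and its FastMCP helper; the server name, tool and directory contents are hypothetical, invented for illustration.

```python
# Minimal MCP server sketch (assumes the official `mcp` Python SDK).
# The point: the agent can only obtain an address via a verified lookup,
# rather than guessing one, which mitigates the "email sent to the wrong
# person" failure mode described above.
from mcp.server.fastmcp import FastMCP

server = FastMCP("directory-tools")  # hypothetical server name

# Hypothetical stand-in for a real company directory.
DIRECTORY = {"ana": "ana@example.com", "luis": "luis@example.com"}

@server.tool()
def lookup_recipient(name: str) -> str:
    """Return the verified email address for a known colleague."""
    key = name.strip().lower()
    if key not in DIRECTORY:
        # Failing loudly beats the behavior seen in the study, where an
        # agent renamed a user rather than admit it couldn't find one.
        raise ValueError(f"No verified address for {name!r}")
    return DIRECTORY[key]

if __name__ == "__main__":
    server.run()  # serves the tool over stdio to an MCP-capable client
```

An agent wired to this server can still draft the email, but the recipient field comes from a deterministic lookup that either succeeds or fails explicitly.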
A benchmark that makes AI models look bad. For Neubig, one of the great disappointments is that the companies developing AI models do not seem interested in using this benchmark as a metric to improve their systems. He suspects that "perhaps it is too difficult and makes them look bad." It is similar to what happens with the ARC-AGI-2 benchmark: it is such a difficult test for AIs that today the best of all the models attempting it is o3, which achieves around 3% of completed tasks.
Salesforce agrees. That study is complemented by another one carried out by a group of Salesforce researchers. They created their own benchmark specifically aimed at checking how various AI models would fare when handling typical tasks in a CRM like the ones the firm develops. Their project, called CRMArena-Pro, tests these AI agents in areas such as sales or customer support.
As for replacing workers: not a chance. In their conclusions, the researchers reveal that AI models "achieve overall modest success rates, typically around 58% in single-turn scenarios, but with performance degrading significantly to approximately 35% in multi-turn scenarios." In fact, they explained, "agents generally lack the essential skills for complex tasks." The scenario some experts foresee, with AI having a great impact on various jobs, seems premature.
A complicated future. To these modest results we can add a prediction from the consulting firm Gartner. According to its studies, more than 40% of agentic AI projects will end up being canceled by the end of 2027. The report's lead author, Anushree Verma, indicated that "at present, most agentic AI projects are early-stage experiments or proofs of concept, mainly driven by hype and often misapplied." The message is clear: expectations for AI agents are running too high, and the current state of the technology shows that their application today is problematic and limited.
Image | Sigmund
The news "We have a big problem with AI agents: 70% of the time they get it wrong" was originally published in Xataka by Javier Pastor.