AI agents misfire more often than a fairground shotgun. At least, that is what a recent study by researchers from Carnegie Mellon University (CMU) and Duke University reveals. These experts analyzed the behavior of several agents and put them to the test to check whether the hype is a case of "much ado about nothing." For now, it is.
Inspiration. Graham Neubig, a professor at CMU, explained to The Register that the inspiration had been a 2023 OpenAI paper. It discussed which kinds of jobs could be replaced by AI systems but, as he put it, "their methodology was basically asking ChatGPT whether those jobs could be automated." In their study, the researchers wanted to verify this by asking various AI agents to try to complete tasks that professionals in those jobs theoretically carry out.
TheAgentCompany. To carry out their study, the researchers created a fictitious company they called TheAgentCompany and used it to have different agentic models try to complete various tasks. These systems had access to several services such as GitLab, ownCloud and Rocket.Chat to carry out the work, but their performance was disappointing.
70% errors. The researchers used two test environments called OpenHands CodeAct and OWL-Roleplay, and in them they evaluated the most important AI models available today. The best of them all is Claude Sonnet 4, which managed to solve 33.1% of the proposed tasks. Behind it come Claude 3.7 Sonnet (30.9%), Gemini 2.5 Pro (30.3%) and, much further back, a disastrous GPT-4o (8.6%), Llama 3.1 405B (7.4%), Qwen 2.5 72B (5.7%) and Amazon Nova Pro v1.0 (1.7%). In the best case the models complete about 30% of the requested tasks and fail at the remaining 70%. In other words: much ado about nothing, according to these benchmarks.
Incapable agents. During these tests the researchers observed various types of failure in how the agents handled the tasks. There were agents that refused to send a message to colleagues when the task required it, agents unable to manage pop-up windows during browsing sessions, and even agents that cheated. In one case, they highlighted, an agent that had to consult a person on Rocket.Chat (an open source alternative to Slack) could not find them, so it renamed another user to the name of the person it was supposed to contact.
But they are improving. Even with those problems, the performance of these AI agents is evolving positively. Neubig and his team tested a software agent that was able to solve about 24% of tasks involving web browsing, programming and related chores. Six months later they tested a new version, which completed 34% of the tasks.
Imperfect but useful. Not only that: the researchers pointed out that even with such a high failure rate, AI agents can still be useful. In certain contexts, such as programming, a partial code suggestion that solves a specific fragment of a program can end up being the basis of a solution that the developer then builds on.
Careful where you use them. Of course, agents that make so many mistakes can be a problem in scenarios that are more sensitive to such errors. If we task an agent with writing emails and it sends them to the wrong people, the result could be a disaster. There are solutions on the horizon, such as the growing adoption of the Model Context Protocol (MCP), which standardizes the interaction between services and AI models so that communication is much more precise and these errors can be mitigated during the autonomous execution of tasks.
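To make that idea concrete, below is a minimal sketch of how a service can expose a narrowly scoped, typed tool through MCP instead of letting an agent improvise. It assumes the official `mcp` Python SDK and its FastMCP helper; the server name, tool and directory contents are hypothetical, invented for illustration.

```python
# Minimal MCP server sketch (assumes the official `mcp` Python SDK).
# The point: the agent can only obtain an address via a verified lookup,
# rather than guessing one, which mitigates the "email sent to the wrong
# person" failure mode described above.
from mcp.server.fastmcp import FastMCP

server = FastMCP("directory-tools")  # hypothetical server name

# Hypothetical stand-in for a real company directory.
DIRECTORY = {"ana": "ana@example.com", "luis": "luis@example.com"}

@server.tool()
def lookup_recipient(name: str) -> str:
    """Return the verified email address for a known colleague."""
    key = name.strip().lower()
    if key not in DIRECTORY:
        # Failing loudly beats the behavior seen in the study, where an
        # agent renamed a user rather than admit it couldn't find one.
        raise ValueError(f"No verified address for {name!r}")
    return DIRECTORY[key]

if __name__ == "__main__":
    server.run()  # serves the tool over stdio to an MCP-capable client
```

An agent wired to this server can still draft the email, but the recipient field comes from a deterministic lookup that either succeeds or fails explicitly.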
A benchmark that makes AI models look bad. For Neubig, one of the great disappointments is that the companies developing AI models do not seem interested in using this benchmark as a metric to improve their systems. He suspects that "perhaps it is too difficult and makes them look bad." It is similar to what happens with the ARC-AGI-2 benchmark: it is such a difficult test for AIs that today the best of all the models attempting it is o3, which achieves around 3% of completed tasks.
Salesforce agrees. That study is complemented by another one carried out by a group of Salesforce researchers. They created their own benchmark specifically aimed at checking how various AI models would fare when handling typical tasks in a CRM like the ones the firm develops. Their project, called CRMArena-Pro, tests these AI agents in areas such as sales or customer support.
As for replacing workers: not a chance. In their conclusions, the researchers reveal that AI models "achieve overall modest success rates, typically around 58% in single-turn scenarios, but with performance degrading significantly to approximately 35% in multi-turn scenarios." In fact, they explained, "agents generally lack the essential skills for complex tasks." The scenario some experts foresee, with AI having a great impact on various jobs, seems premature.
A complicated future. To these modest results we can add a prediction from the consulting firm Gartner. According to its studies, more than 40% of agentic AI projects will end up being canceled by the end of 2027. The report's lead author, Anushree Verma, indicated that "at present, most agentic AI projects are early-stage experiments or proofs of concept, mainly driven by hype and often misapplied." The message is clear: expectations for AI agents are running too high, and the current state of the technology shows that their application today is problematic and limited.
Image | Sigmund
The news "We have a big problem with AI agents: 70% of the time they get it wrong" was originally published in Xataka by Javier Pastor.