They put the 21 most popular AI chatbots to the test on differential diagnosis. They fail more than a fairground shotgun

House is a series I love. I couldn’t care less about the subplots, but the process of differential diagnosis, for all its Hollywood embellishment, fascinates me. That ability to rule out diseases that could explain the same symptoms until you arrive at the most probable diagnosis seems like witchcraft to me. Well: researchers have put the 21 most popular AI chatbots to the test on exactly that differential diagnosis, and the result is clear.

They fail more than a fairground shotgun.

In short. Mass General Brigham is not just anybody: it is a non-profit network of American doctors and hospitals that includes two of the most prestigious medical teaching institutions in the country. From January to December 2025, a group of its researchers put 21 AI chatbots, including Claude 4.5 Opus, DeepSeek, Gemini 3.0 Pro, GPT-5 and Grok 4, to work evaluating dozens of clinical cases, with the aim of establishing how well they perform at early diagnosis.

The information provided is extremely basic, but it is also all that professionals have when they first approach a differential diagnosis, and the ultimate intention is to evaluate the clinical reasoning capacity of the latest-generation language models and see whether they can be a clinical ally. The answer is no. While models optimized for reasoning scored much higher than simpler ones like Gemini 1.5 Flash, the bottom line is that LLMs are still limited for this task.

The exam. Each of the models was given 29 clinical cases, adding up to more than 16,200 responses in total. The result is that even the newest versions of the most powerful chatbots could not produce an adequate differential diagnosis in about 80% of cases when they only had basic information about the patient.

The problem is that age, sex and symptoms are very vague information, yes, but they are exactly what human professionals have to ‘play’ with when they first face a differential diagnosis. Little by little, as they run further tests and gather more information, they refine the result, but it is that first round of ruling things out that often makes the difference.

“We want to help separate the hype from the reality of these tools as they are applied to healthcare”

Another movie. And indeed, as the LLMs were given more data, their performance and results became more robust. When a chatbot has more and more information, such as physical examination data, laboratory results and diagnostic images, things change: the AI reaches the final diagnosis in more than 90% of cases.

But of course, to reach that stage they need almost all the clinical data, which only underlines the gap between that performance and their helplessness at the initial filtering.

Don’t trust Google or ChatGPT. The researchers are clear that “these models are very good at identifying a final diagnosis when the data is complete, but they have difficulties at the beginning of an open case”, which leads them to stress that they should not be relied on at home. The AI industry is pushing its products into the medical circuit, but the study points out that “despite continuous improvements, commercial LLMs are not ready for clinical implementation without supervision.”

They state that a human in the loop and “very close supervision” are needed before the use of an LLM can be scaled in the healthcare field. And that always refers to professional use, yet more and more cases are seen of people who used to self-diagnose by trusting Google and who now do so by trusting whatever ChatGPT tells them. The study stresses that “hallucinations remain” in these latest-generation models, and also raises concerns about patient safety and well-being.

Speaking of El Salvador. In any case, it is clear that, in the end, medical AI is just another helper, a tool, and what has been tested here is a general-purpose chatbot that knows a bit of everything but is specialized in nothing. In medicine, as in other industries, AI can help with tasks such as ruling out possibilities or organizing thousands of data points, but a chatbot is not yet a good companion for this differential diagnosis because it simply cannot be trusted.

Those who are going to have to trust AI for all kinds of care, however, are Salvadorans. El Salvador has been a pioneer when it comes to adopting new technologies, and its president, Nayib Bukele, has just embarked on another experiment: $500 million to put healthcare in the hands of Gemini. The population will have access to the Dr.SV app, which will act as a family doctor. As detailed in El País, this AI will take in symptoms and schedule calls with doctors, who will make the diagnosis. The AI will follow up on consultations and chronic diseases, and the goal is for it to take care of cancer patients in the future.

According to Bukele, they are creating the best health system in the world, which is curious considering that more than 7,700 health system employees were laid off during 2025. For the sake of Salvadorans, let’s hope this new experiment does not end up like Bitcoin City.

In Xataka | Privacy has been dying since ChatGPT arrived. Now our obsession is for AI to know us as well as possible
