The CNMV spent ten months testing AI as a stock-market investor. The conclusions are very revealing
In recent months a recurring pitch has spread across social networks, once again selling the "get rich quick" dream: "use AI to invest in the stock market." The interesting part is that the CNMV has published a study that set out to test precisely that premise. Although the regulator warns of the risks of investing with AI, there is another important message in its conclusions: LLMs are not bad investors per se. They are bad at following vague instructions, which is exactly how most people use them.

The CNMV study. Two CNMV researchers, Ricardo Crisóstomo and Diana Mykhalyuk, have published a methodologically serious (though imperfect) and very interesting study: they ran four AI models live for ten months, from April 2025 to January 2026. The models chosen were ChatGPT, Gemini, DeepSeek and Perplexity. The process was simple but demanding: each month they asked each model to identify the five stocks in the Ibex 35 index with the best expected performance (to buy) and the five with the worst (to sell short). At the end of the month the real result was measured, with no cherry-picked historical data: the live market was the only arbiter of the models' performance.

The models evolved. One of the most significant aspects of the study is that its authors acknowledged a methodological problem that was hard to avoid: over those ten months, the versions of the four models were updated several times. The Gemini of April 2025 was not the same as that of January 2026, for example, and that could influence the results. The researchers noted that it was impossible to know with certainty whether an improvement or deterioration in performance was due to the prompt strategy, market conditions in that period, or simply a model change.

The prompt is everything.
Three very different prompt types were also tested, and the conclusions they produced were neither alarmist nor falsely optimistic: they were "it depends." The results showed that everything hinged on the degree of supervision the models received.

If the LLMs were asked generic questions such as "What stocks should I buy?", they failed repeatedly. There were computational errors, misinterpretations and the chatbots' famous hallucinations. Curiously, the only one that made a profit in this mode was ChatGPT. The problem is that people who use AI to invest probably operate in exactly this mode.

But with carefully prepared prompts, iterative reviews and human supervision at each step, Perplexity achieved a 3.5% monthly return on the Ibex 35. Gemini and ChatGPT also improved their behavior when given more precise instructions, while DeepSeek ranked worst overall.

There is another finding: when the models receive official regulatory documentation or company results reports, their predictive accuracy improves significantly. LLMs reason better over concrete, verified facts than when generating analysis from scratch on information they themselves search for on the web.

Financial hallucinations. The CNMV study points out that financial markets are especially demanding for AI models because they require complex processes: retrieving and collecting information dynamically, reasoning in multiple steps, being numerically precise and understanding the market, all in real time. Chatbots are trained to generate "convincing" text, so the incentive is for the investment recommendation to "sound good" even when it is completely wrong. The confidence with which AI models present incorrect financial analysis is proportional to the risk they pose to anyone who uses them without checking whether what they say makes sense. In short: do not trust AI to invest right off the bat.
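The monthly scoring the study describes (buy the model's five "best" picks, short its five "worst", then measure against realized returns) can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not the researchers' code; the function name is invented and the tickers and returns are made-up placeholders, not Ibex 35 data.

```python
# Hypothetical sketch of a CNMV-style monthly evaluation: an LLM names
# stocks to buy and stocks to short, and a month later the picks are
# scored against realized returns. All names and numbers are invented.

def monthly_long_short_return(picks_long, picks_short, realized_returns):
    """Equal-weighted return of a long/short portfolio for one month.

    picks_long / picks_short: lists of tickers the model chose.
    realized_returns: ticker -> realized monthly return (e.g. 0.04 = +4%).
    """
    long_leg = sum(realized_returns[t] for t in picks_long) / len(picks_long)
    short_leg = sum(realized_returns[t] for t in picks_short) / len(picks_short)
    # A short position profits when the stock falls, hence the subtraction.
    return long_leg - short_leg

# Invented example month (placeholder tickers, not real data):
realized = {"AAA": 0.04, "BBB": -0.02, "CCC": 0.01, "DDD": -0.05}
r = monthly_long_short_return(["AAA", "CCC"], ["BBB", "DDD"], realized)
print(round(r, 4))  # 0.06: longs gained 2.5%, shorts fell 3.5%
```

Repeating this every month against live prices, rather than a hand-picked historical window, is what made the market the "only arbiter" in the study's design.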
The Reddit user’s experiment was equally striking, but hardly conclusive. Source: Reddit.

The Reddit experiment. In November 2024, a Reddit user named Blotter-fyi launched a platform called Rallies.ai that gave several AI agents access to real-time financial data and money to trade on the stock market. Four months later, with the S&P index down 7% since the start, five of the models are outperforming that index, although only two have positive returns in absolute terms. The author himself was the first to warn that four months are insufficient to draw a conclusion: it could be luck, the market or simply the prompt.

Nof1’s experiment was fascinating, but it made clear that AI models don’t typically make money investing in crypto. Source: Nof1.

Nof1 and crypto fascination. Another particularly striking experiment was the one the company nof1.ai ran with its Alpha Arena. It pitted six AI models against each other, gave each of them 10,000 real dollars, and let them trade cryptocurrency derivatives for two weeks without human intervention. The most striking result was not who won, but who lost: GPT-5 ended with losses of more than 25% and Gemini close to negative 40%. Meanwhile, the Chinese models Qwen and DeepSeek dominated in performance. They iterated with other models, 32 in total, and only six of them achieved a positive return: the rest lost money. Grok-4.20 was the big winner, ahead of GPT-5.1 and DeepSeek v3.1.

Maybe you shouldn’t just let AI invest for you. The conclusions after these experiments are clear. Four months of a model outperforming the S&P index in a bear market does not prove that AI is a good investor. It only shows that in that specific period, in that specific market, that model made decisions that turned out to be less bad than the index's. Seeing whether this holds up takes years, multiple market conditions, and many instances of the same experiment running in parallel.
The same applies to Nof1's experiment, which was especially short, and to a more serious and methodical process like the CNMV's, which was also surrounded by events whose impact on the final result was uncertain. Faced with so many unknowns, the conclusion seems clear: …