Over the last year, the elite of open models for assisted programming, at least in benchmarks such as SWE-Bench Verified, has spoken with a Chinese accent. Names like DeepSeek, Kimi or Qwen had settled into the top positions in testing and were setting the pace in complex software engineering tasks, while Europe was still searching for its place. The arrival of Devstral 2 alters that distribution. It does not displace those who were already at the top, but it puts Mistral at the same level of performance and turns a European company into a real contender in a field that until now seemed reserved for others.
League change: the technical leap that had been brewing for some time. In recent months, the open models developed in Europe and the United States had improved steadily, although still without the performance needed to compete in the most demanding tests. The progress was evident, but no project had yet consolidated it at a higher level and shown that this path could deliver results comparable to the best in the sector.
Devstral 2 in data: performance, size and licenses. The new Mistral model reaches 123B parameters in a dense architecture and offers an expanded context of 256K tokens, accompanied by a modified MIT license that facilitates its adoption in open environments. Its compact version, Devstral Small 2, reduces the model to 24B parameters under the Apache 2.0 license. In the SWE-Bench Verified figures published by the company, Devstral 2 obtains 72.2%, a mark that places it in the most competitive tier of the open models evaluated and confirms its presence among the most advanced alternatives in the segment.


The chart reflects a field concentrated in the upper part of the benchmark. Among the open models, DeepSeek V3.2 leads the group with 73.1%, followed by Kimi K2 Thinking with 71.3% and by entries such as Qwen 3 Coder Plus and Minimax M2, which are around 69 points. At lower levels GLM 4.6, GPT-OSS-120B, CWM and DeepSWE appear, with more moderate results. Among the closed commercial (proprietary) models, the chart shows higher scores: Gemini 3 Pro reaches 76.2%, GPT 5.1 Codex Max rises to 77.9% and Claude Sonnet 4.5 scores 77.2%, all of them above the best marks recorded for open models.
What SWE-Bench Verified Really Measures and Why It Matters. SWE-Bench Verified is a test designed to evaluate whether a model can solve real programming tasks, not synthetic exercises. Each case presents a bug in an open source repository and requires a patch that makes the previously failing tests pass. The evaluation seeks to measure whether the system understands the structure of the project, identifies the cause of the problem and proposes a coherent solution. It is a useful and demanding metric, although limited to Python repositories and a specific set of situations that do not cover the full breadth of software work.
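To make the scoring rule concrete, here is a minimal sketch of how a percentage like the ones above can be derived from per-case test results. It is not the official evaluation harness: the `Case` record, the function names and the sample data are hypothetical, and the sketch assumes that, besides the previously failing tests, tests that already passed must keep passing after the patch.

```python
from dataclasses import dataclass


@dataclass
class Case:
    # Hypothetical record for one benchmark case; not the real harness schema.
    case_id: str
    fail_to_pass: list[str]  # tests that failed before the patch and must pass after it
    pass_to_pass: list[str]  # tests that already passed and must not regress (assumption)


def is_resolved(case: Case, passed_after_patch: set[str]) -> bool:
    """A case counts as resolved only if every required test passes after the patch."""
    required = case.fail_to_pass + case.pass_to_pass
    return all(test in passed_after_patch for test in required)


def resolution_rate(cases: list[Case], outcomes: dict[str, set[str]]) -> float:
    """Percentage of resolved cases; a case with no recorded outcome counts as unresolved."""
    if not cases:
        return 0.0
    resolved = sum(is_resolved(c, outcomes.get(c.case_id, set())) for c in cases)
    return 100.0 * resolved / len(cases)


# Illustrative data only: two made-up cases and the tests that passed after patching.
cases = [
    Case("a", ["test_bug"], ["test_ok"]),
    Case("b", ["test_crash"], []),
]
outcomes = {
    "a": {"test_bug", "test_ok"},  # the fix works and nothing regressed
    "b": set(),                    # the patch did not fix the failing test
}
rate = resolution_rate(cases, outcomes)
print(f"{rate:.1f}% resolved")
```

Under this rule a partially correct patch scores nothing for its case, which is part of what makes the metric demanding.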
From co-pilots to agents that act on the project. The arrival of Devstral 2 coincides with a broader change in the way of working with programming tools. It is no longer just about receiving suggestions in the editor, but about having agents capable of exploring an entire repository, interpreting its structure and proposing changes consistent with its actual state. This is where Vibe CLI comes in, a tool that allows Devstral to analyze files, modify parts of the code and execute actions directly from the terminal, bringing these capabilities closer to the daily workflow of developers.
Cost and deployment: what each type of user can do with Devstral. The model will be available for free for an initial period and will then cost $0.40 per million input tokens and $2.00 per million output tokens, while the Small 2 version will be priced lower. Deployment also makes a difference: Devstral 2 requires at least four H100-class GPUs and is aimed at data centers, while Devstral Small 2 is intended to run on a single GPU and, according to Mistral's documentation, the Devstral Small family can also run in CPU-only configurations, without a dedicated GPU. This variety allows both companies and individual developers to find a suitable entry point.
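As a rough illustration of what those prices mean in practice, the following sketch estimates the bill for a single session. Only the two per-million prices come from the figures above; the function name and the token counts are hypothetical, chosen to resemble an agent that reads much more code than it writes.

```python
# Devstral 2 prices after the free period, as quoted above (USD per million tokens).
INPUT_PRICE_PER_M = 0.40
OUTPUT_PRICE_PER_M = 2.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for the given token counts."""
    input_cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_M
    output_cost = (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# Hypothetical session: 2M tokens read from the repository, 250K tokens generated.
cost = estimate_cost(2_000_000, 250_000)
print(f"Estimated cost: ${cost:.2f}")
```

Because output tokens cost five times as much as input tokens, the generated patches and explanations can weigh heavily on the total even when the agent reads far more than it writes.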
The appearance of Devstral 2 introduces an unexpected element in a space where Chinese companies set the pace and where not even the United States, despite its leadership in artificial intelligence, had an open model in this high-performance range on SWE-Bench Verified. Mistral does not displace those who were already at the top, but it does broaden the conversation and shows that Europe can compete in a field where it had not featured until now. It is a move that does not alter the general hierarchy, although it does open new room for the evolution of assisted programming tools.
Images | Xataka with Gemini 3
