Everyone is wondering how DeepSeek's AI models have become, overnight, the big story in artificial intelligence. The answer is relatively simple: these models have managed to show that you can do much more with much less.
DeepSeek V3 and DeepSeek-R1 are comparable to OpenAI's GPT-4o and o1 respectively, yet their training is estimated to have been far cheaper, and their inference certainly is: DeepSeek's API prices are up to 35 times lower than OpenAI's. Which makes one wonder how that is even possible.
The answer is within reach because the technical reports for these AI models are publicly available. Studying them is precisely what makes it possible to pin down the techniques this Chinese R&D lab has used to build models this efficient and this capable.
Many techniques, a single objective: efficiency
Several design decisions make DeepSeek's new models especially efficient. Their creators explain them at length in the technical report that is publicly available. These are the most relevant:
- DeepSeekMoE ("Mixture of Experts"): In models such as GPT-3.5, the entire model is activated both during training and during inference (when we use it). However, not every component of the model is needed for every request. The MoE technique, already introduced with DeepSeek V2, divides the model into multiple "experts" and activates only the ones a given request actually needs; GPT-4 is also believed to be an MoE model. But as we said, DeepSeekMoE goes even further, splitting the model into even more specialized experts plus a few more generalist experts that can add value to certain requests. Managing all those specialized and generalist experts benefits not only inference but also the training phase, making it more efficient. The technique is reminiscent of so-called "test-time scaling", which also adjusts how much of a model's capacity is used during inference. (A minimal routing sketch appears right after this list.)
- DeepSeekMLA (Multi-Head Latent Attention): This is another substantial improvement, arguably even bigger than the previous one and also introduced with DeepSeek V2, and it changes how memory is managed in these models. Normally both the model and the entire context window (the one that lets us write prompts and paste in long texts, for example) have to be held in memory. Context windows are especially expensive because each token requires both a key and its corresponding value. What this technique made possible was compressing that store of keys and values, dramatically reducing memory use during inference. (The core idea is sketched after this list.)
- Auxiliary-Loss-Free Load Balancing: If we picture the model as a large orchestra, each musician is an "expert" within the model. To play a complex piece, not every musician is needed all the time. Traditionally, so-called "auxiliary losses" were used to make sure all the musicians played often enough, but those losses could interfere with the interpretation of the piece (the training of the model) and degrade overall performance. With DeepSeek V3 the model balances the workload of each expert dynamically, which makes training simpler, more direct and more efficient by eliminating the auxiliary losses. Removing that interference also lets the model learn better with fewer resources, and get better results. (A bias-based sketch of the idea appears after this list.)
- Multi-Token Prediction Training Objective: Predicting the next word often depends on several previous words or on the wider context. With this technique, instead of predicting only the next word, the model learns to predict several upcoming words at once. That produces more natural, coherent and less ambiguous text, and it also speeds things up, since fewer prediction steps are needed to produce the complete sequence. (A toy version of the objective is sketched after this list.)
- FP8 Mixed Precision Training: Using FP8 numbers significantly reduces memory consumption and speeds up calculations. Some critical parts of the model are still trained in FP32 to guarantee precision, but FP8 brings an additional benefit: the models themselves become smaller. Other models rely instead on techniques such as quantization or parameter pruning. OpenAI publishes no data on GPT-4 in this respect, but the assumption is that it works with BF16, which is more expensive in memory terms. In theory FP8 leads to less precise models, but complementary techniques such as fine-grained quantization reduce the impact of outlier values, making stable training possible. (A toy block-wise quantization sketch appears after this list.)
- Cross-Node All-to-All Communication: During training, information has to be exchanged constantly between all the nodes (machines) connected in the training data centers. That can become a bottleneck, but DeepSeek V3 includes efficient communication protocols, reduced data traffic and efficient synchronization to accelerate training and, once again, lower the cost of the process.
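To make the DeepSeekMoE idea more concrete, here is a minimal, illustrative sketch of top-k expert routing in PyTorch. All the names and sizes (SimpleMoE, the number of experts, the top_k value) are assumptions for the example, not DeepSeek's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """Toy Mixture-of-Experts layer: a router picks top_k experts per token."""
    def __init__(self, dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each "expert" is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])
        # The router scores every expert for every token.
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = F.softmax(self.router(x), dim=-1)       # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep only the best top_k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out                                       # only top_k of the experts ever ran

moe = SimpleMoE(dim=64)
tokens = torch.randn(16, 64)                             # 16 token representations
print(moe(tokens).shape)                                 # torch.Size([16, 64])
```

Only two of the eight experts run for each token, which is where the compute savings come from; DeepSeek V3 adds shared experts and far more fine-grained ones on top of this basic scheme.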
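The memory saving behind DeepSeekMLA can also be illustrated with a small sketch: instead of caching full keys and values for every token, only a compressed latent vector is cached, and keys and values are re-expanded from it. The dimensions below are invented for the example; DeepSeek's real implementation is considerably more involved (it also handles positional encodings separately).

```python
import torch
import torch.nn as nn

dim, latent_dim, n_heads, head_dim = 512, 64, 8, 64

down_kv = nn.Linear(dim, latent_dim, bias=False)               # compress the hidden state
up_k = nn.Linear(latent_dim, n_heads * head_dim, bias=False)   # latent -> keys
up_v = nn.Linear(latent_dim, n_heads * head_dim, bias=False)   # latent -> values

hidden = torch.randn(1, 128, dim)       # (batch, sequence length, hidden size)

# Only the small latent goes into the KV cache...
kv_cache = down_kv(hidden)              # (1, 128, 64)

# ...and keys/values are rebuilt from it whenever attention is computed.
keys = up_k(kv_cache).view(1, 128, n_heads, head_dim)
values = up_v(kv_cache).view(1, 128, n_heads, head_dim)

full_kv = 2 * n_heads * head_dim        # floats per token in a conventional KV cache
print(f"cache per token: {latent_dim} floats instead of {full_kv}")  # 64 instead of 1024
```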
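The auxiliary-loss-free balancing can be pictured as a per-expert bias that is nudged after every batch instead of adding a penalty term to the loss. The sketch below follows that idea in spirit; the step size and the sign-based update are illustrative choices, not the exact rule from the report.

```python
import torch

n_experts, top_k, step = 8, 2, 0.01
bias = torch.zeros(n_experts)                  # routing bias, adjusted outside the loss

def route(scores: torch.Tensor) -> torch.Tensor:
    """Pick top_k experts per token; the bias only influences the selection."""
    _, idx = (scores + bias).topk(top_k, dim=-1)
    return idx

scores = torch.rand(1024, n_experts)           # fake routing scores for 1024 tokens
idx = route(scores)

# After each batch: make overloaded experts less attractive and idle ones more so.
load = torch.bincount(idx.flatten(), minlength=n_experts).float()
bias += step * torch.sign(load.mean() - load)  # no auxiliary loss term anywhere
print(load.tolist())
print(bias.tolist())
```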
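Multi-token prediction can be sketched as an extra prediction head that looks one token further ahead, with its loss added to the usual next-token loss. The toy "model" below is just an embedding table so the example runs quickly; DeepSeek V3 uses additional transformer modules for the extra depth.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 1000, 128
backbone = nn.Embedding(vocab, dim)            # stand-in for the transformer trunk
head_next = nn.Linear(dim, vocab)              # predicts token t+1
head_ahead = nn.Linear(dim, vocab)             # extra head: predicts token t+2

tokens = torch.randint(0, vocab, (4, 32))      # (batch, sequence length)
hidden = backbone(tokens)

# Usual next-token loss plus a second loss one step further ahead.
loss_next = F.cross_entropy(head_next(hidden[:, :-1]).transpose(1, 2), tokens[:, 1:])
loss_ahead = F.cross_entropy(head_ahead(hidden[:, :-2]).transpose(1, 2), tokens[:, 2:])
loss = loss_next + 0.5 * loss_ahead            # each training step teaches the model more
print(float(loss))
```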
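Finally, fine-grained quantization can be sketched as giving every small block of values its own scaling factor, so a single outlier only distorts its own block. The block size, the E4M3 range and the optional FP8 round-trip below are assumptions for the example; real FP8 mixed-precision training depends on hardware support.

```python
import torch

def quantize_blockwise(x: torch.Tensor, block: int = 128, max_abs: float = 448.0):
    """Scale each block of `block` values so its largest entry fits the FP8 E4M3 range."""
    blocks = x.view(-1, block)
    scale = blocks.abs().amax(dim=1, keepdim=True) / max_abs   # one scale per block
    q = blocks / scale
    if hasattr(torch, "float8_e4m3fn"):                        # round-trip through FP8 if available
        q = q.to(torch.float8_e4m3fn).to(torch.float32)
    return q, scale

w = torch.randn(1024, 1024)                    # pretend these are model weights
q, scale = quantize_blockwise(w.flatten())
w_restored = (q * scale).view(w.shape)
print(float((w - w_restored).abs().max()))     # error stays small: outliers only hurt their own block
```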
Reinforcement learning and "distillation" as keys
But beyond all these techniques, the people behind DeepSeek V3 explain how they pretrained it on 14.8 trillion tokens, a process followed by supervised fine-tuning (SFT) and several stages of reinforcement learning (RL). That SFT phase, described in the DeepSeek V3 report, was omitted entirely in the case of DeepSeek-R1.
Reinforcement learning, however, plays a starring role in the development of both models, especially R1. The technique is well known in artificial intelligence: it is as if we trained a dog with treats and reprimands. The model is rewarded when it responds well, and over time it learns to take the actions that maximize long-term reward. In DeepSeek, reinforcement learning is used to teach the model to break complex problems down into smaller steps.
The DeepSeek-R1 technical report also describes how this model applies RL techniques directly to the base model, without the need for supervised training, which saves computing resources. (A toy reward-driven loop is sketched below.)
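As an illustration of the reward-driven idea (not of DeepSeek's actual algorithm, which is a more elaborate scheme called GRPO with rule-based rewards), here is a minimal REINFORCE-style loop on a toy problem: sample an answer, score it, and push up the probability of answers that earn a reward.

```python
import torch

logits = torch.zeros(2, requires_grad=True)    # a toy "policy" over two candidate answers
optimizer = torch.optim.SGD([logits], lr=0.5)
correct = 1                                    # pretend answer 1 is the right one

for _ in range(200):
    probs = torch.softmax(logits, dim=0)
    action = torch.multinomial(probs, 1).item()       # sample an answer
    reward = 1.0 if action == correct else 0.0        # rule-based reward, no human labels
    loss = -reward * torch.log(probs[action])         # reinforce whatever got rewarded
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0).tolist())   # probability mass has shifted to answer 1
```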
Chain-of-thought reasoning, also mentioned in the technical report, comes into play here as well. The term refers to a language model's ability to show the intermediate steps of its reasoning: the model does not just provide an answer, it also explains how it arrived at that answer.
That not only improves transparency (we know "what it is thinking"), it also makes it possible to spot errors and improve accuracy. The combination of both techniques makes DeepSeek's behavior at the inference stage especially remarkable.
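In practice, R1-style models expose that reasoning as text surrounding the final reply (DeepSeek-R1 wraps it in <think> tags). Here is a small sketch of separating the two, using an invented example output:

```python
import re

# Invented example of a chain-of-thought style output; real model outputs are much longer.
raw_output = (
    "<think>The train covers 120 km in 2 hours, so its speed is 120 / 2 = 60 km/h.</think>"
    "The train's average speed is 60 km/h."
)

match = re.search(r"<think>(.*?)</think>\s*(.*)", raw_output, re.DOTALL)
reasoning, answer = match.group(1), match.group(2)

print("Reasoning:", reasoning)   # the intermediate steps the model exposes
print("Answer:", answer)         # the final reply shown to the user
```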
In the case of DeepSeek-R1 there are other techniques that also help make it especially efficient. Among them, model distillation stands out. What does that process involve?
Model distillation is like teaching a smaller "student" model to behave like a larger, more capable "teacher" model. A small model is trained to imitate the capabilities and behavior of a big one, but with far fewer computational resources. The goal is clear: a small model that is faster and cheaper to run, yet just as smart at specific tasks.
DeepSeek-R1's developers highlight how they distilled small models such as Qwen (from 1.5B to 32B) and Llama 3 (8B and 70B-Instruct) using 800,000 samples generated with DeepSeek-R1. For these models they used only supervised fine-tuning, with no reinforcement learning, precisely to demonstrate the effectiveness of the distillation technique. The results can be seen in the benchmarks published in that technical report: even though they are smaller than their competitors, they perform better. (A minimal distillation-by-SFT sketch follows.)
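A minimal sketch of what distillation-by-fine-tuning looks like: a small student is trained, with an ordinary supervised objective, on sequences that the big teacher produced. Everything below (the toy model, the random "teacher samples", the sizes) is invented for illustration; DeepSeek's pipeline works on real reasoning traces at a vastly larger scale.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 1000, 64
student = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))  # tiny "student"

# Pretend these token sequences were generated by the large teacher model
# (in DeepSeek-R1's case, roughly 800,000 reasoning samples).
teacher_samples = torch.randint(0, vocab, (32, 16))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(3):
    logits = student(teacher_samples[:, :-1])                    # predict the next token...
    loss = F.cross_entropy(logits.transpose(1, 2), teacher_samples[:, 1:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                             # ...exactly as in ordinary SFT
print(float(loss))
```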


Several benchmarks seem to make it clear that the performance of the distilled variants of DeepSeek-R1 is superior to that of their competitors.
There are other additional improvements in these models, but those are without doubt the most important in achieving that efficiency, that "doing more with less." DeepSeek's documentation is excellent and will surely help other projects in this space keep evolving and improving, but one thing is already clear: the result of these improvements is spectacular, and DeepSeek's models perform as well as or better than their competitors, as we were able to verify in our extensive comparison.
In Xataka | Sanctions have been key: DeepSeek has had to rely on pure ingenuity, breaking the "more = better" paradigm