Artificial intelligence (AI) models have a problem that more powerful chips cannot solve: they are running out of data. Epoch AI, a nonprofit research organization that studies the scaling of AI models, estimates with 80% confidence that the high-quality text available on the internet will be exhausted sometime between 2026 and 2032.
The reason is simple: AI laboratories have spent years extracting everything the web has to offer, and current models already train on datasets that approach the theoretical limit of the information available. When that gold mine runs dry, scaling models by adding more data will stop working. If that happens, AI development will most likely slow down.
We still do not know what strategy US companies are developing to solve this problem, but we already know what China, their biggest rival, is preparing. Xi Jinping’s government has decided that this shortage is an opportunity. This week China’s National Data Administration published a draft action plan with a clear objective: to build an ecosystem of validated data by 2028 that will fuel the next generation of AI models.
China’s bet is already on the table
The document prepared by the National Data Administration identifies the specific sectors that are priorities for generating and certifying information. Among them are scientific research, manufacturing, agriculture, energy, transportation, finance, healthcare, education and e-commerce. However, its plan does not stop at traditional sectors.
It also plans to cover cutting-edge fields with quality data, such as AI applied to robots, autonomous driving, low-altitude aviation and biomanufacturing. These are precisely the domains whose data is not on the internet, because it comes from sensors, actuators and physical environments. Obtaining it requires industrial infrastructure, and here China has a structural advantage that no Western laboratory can easily replicate.
That is not all. The document explicitly encourages expanding the supply of text, code, images, audio and video needed to train systems capable of complex reasoning, agentic behavior and control of intelligent robots. This is an almost exact description of what the industry calls next-generation models. They are not just systems capable of answering questions; they will also be able to plan, act and operate in the physical world.
The availability of high-quality multimodal data, especially data from real industrial environments, is today one of the least discussed and most decisive bottlenecks in the AI race. In a scenario where access to cutting-edge chips is restricted by US export controls, data becomes a competitive advantage. If China can’t win the hardware race, it can try to win the race for the fuel that hardware needs to be truly useful.
Image | Daoducquan
More information | SCMP
