Video AIs are learning to imitate the world, and everything points to an unprecedented “looting” of YouTube

A square, tourists, a waiter moving between tables, a bike passing in the background, or a journalist on a set. Video AIs can now generate scenes like these in a flash. The result is striking, but it also raises a question that until recently was barely asked: where did all the images that allowed them to learn to imitate the world come from? According to The Atlantic, part of the answer points to millions of videos pulled from platforms like YouTube without clear consent.

The euphoria over generative AI has moved so quickly that many questions have been left behind. In just two years we have gone from curious little experiments to models that produce videos almost indistinguishable from the real thing. And while the spotlight was on the demos, another issue was gaining weight: transparency. OpenAI, for example, has explained that Sora was trained with “publicly available” data, but has not detailed which data.

A massive training effort that points to YouTube

The Atlantic piece offers a clear picture of what was happening behind the scenes: more than 15 million videos collected to train AI models, a huge share of them taken from YouTube without formal authorization. Among the initiatives cited are data sets associated with several companies, designed to improve the performance of video generators. According to the report, this collection was carried out without notifying the creators who originally published the content.

One of the most striking aspects of the discovery is the profile of the affected material. These were not just anonymous videos or home recordings, but news content and professional productions. The Atlantic found that thousands of pieces came from channels belonging to publications such as The New York Times, the BBC, The Guardian, The Washington Post, and Al Jazeera. Taken together, we are talking about a huge volume of journalism that would have ended up feeding AI systems without any prior agreement with its owners.

Runway, one of the companies that has pushed generative video hardest, features prominently in the reviewed data sets. According to the documents cited, its models would have learned from clips organized by type of scene and context: interviews, explainers, pieces with graphics, cooking shots, B-roll footage. The idea is clear: if an AI must reproduce human situations and audiovisual narratives, it needs real references covering everything from gestures to editing rhythms.


Fragments of a video generated with the Runway tool

In addition to Runway, the investigation mentions data sets used by labs at large technology companies such as Meta and ByteDance in the research and development of their models. The dynamic was similar: huge volumes of videos collected from the internet and shared among research teams to improve audiovisual capabilities.

YouTube’s official stance doesn’t leave much room for interpretation. Its terms of service prohibit downloading videos to train models, and its CEO, Neal Mohan, has reiterated this in public. Creators’ expectations, he stressed, are that their content will be used within the rules of the service. The appearance of millions of videos in AI training databases has brought that legal framework to the fore and intensified pressure on the platforms involved in developing generative models.

The media sector’s reaction has followed two paths. On one hand, companies like Vox Media and Prisa have closed deals to license their content to artificial intelligence platforms, seeking a clear framework and economic compensation. On the other, some outlets have chosen to push back: The New York Times has taken OpenAI and Microsoft to court over the unauthorized use of its materials, stressing that it will also protect the video content it distributes.

The legal terrain remains unclear. Current legislation was not written for models that process millions of videos in parallel, and courts are only beginning to draw the lines. For some experts, publishing openly is not equivalent to transferring training rights, while AI companies argue that indexing and using public material are part of technological progress. This tension, still unresolved, keeps media outlets and developers in a constant balancing act.

What we have before us is the start of a conversation that goes far beyond technology. Training AI models with material available on the internet has been widespread practice for years, and now comes the time to decide where the limits lie. Companies promise agreements and transparency, media outlets ask for guarantees, and creators demand control. The next stage will be as political as it is technological: how artificial intelligence is fed will define who benefits from it.

Images | Xataka with Gemini 2.5

In Xataka | All the big AIs have ignored copyright laws. The amazing thing is that there are still no consequences
