Meta emails reveal that it downloaded 81.7 TB of copyrighted books via BitTorrent to train its AI models

In the Kadrey v. Meta lawsuit, Mark Zuckerberg’s company is accused of having used copyrighted works to train its artificial intelligence models. A few weeks ago it was revealed that Zuckerberg had approved the use of pirated books, but now new and compelling evidence of this looting has arrived.

Emails revealed. “Appendix A” of the case includes several email messages from Meta concerning that data collection in October 2022.

“Downloading with torrents from a company laptop doesn’t seem like a good idea.” In April 2023 Nikolay Bashlykov, one of those responsible for carrying out this data collection, joked about it, emojis included, and noted that the company would have to be careful about the IP address from which it downloaded the data.

Meta knew the risks. By September of that year Bashlykov had stopped using emoticons and warned that using torrents would mean acting as a “seeder” so that others could also download the files, and “that might not be legal.” These discussions are proof that Meta knew this type of activity was illegal, according to the authors who have sued the company.

Covering its tracks. In an internal message, Meta researcher Frank Zhang explained how the company avoided using its own servers when downloading this dataset in order to “avoid” “the risk that anyone can trace the seeder” and who downloaded that data.

81.7 TB of data. As Ars Technica points out, the evidence shows that Meta downloaded at least 81.7 terabytes of data from various libraries offering those copyrighted books. A new filing in the case indicates that at least 35.7 TB had been downloaded from sites such as Z-Library or LibGen (which ended up closing last summer).

Meta wants those charges dismissed. Meta has filed a motion to dismiss those accusations, arguing that there is no evidence that any book was downloaded by Meta employees via torrent or that the books were later distributed by Meta. At Xataka we have contacted the company, and we will update this story if we receive comments on the case.

Looting the Internet. These revelations underscore the questionable practices that AI companies are using to train their models. We saw it with Google, and of course also with OpenAI, which used millions of texts to train ChatGPT, many of them copyrighted. Perplexity came under the spotlight after it was discovered that it flouted Internet rules to get around paywalls and feed its AI model.

Internet theft is being normalized. The striking thing about all this is that, with all these companies flouting the rules and violating copyright, the looting of the Internet seems to be becoming normal. There is barely time to be scandalized, and we accept it almost as a fait accompli so that we can get on with our own business.

Is this really “fair use”? All of these companies shelter behind the concept of “fair use.” This concept, developed in Anglo-Saxon law, allows limited use of protected material without needing to ask permission. Copyright violation claims keep arriving in the world of generative AI, but they seem to remain in the background while these giants thrive.

In Xataka | 5,000 “tokens” of my blog are being used to train an AI. I have not given my permission
