usatoday24

The big AIs we use daily, such as GPT, Gemini, Claude, Perplexity and company, exist and can do what they do thanks in large part to the content available on the Internet. Companies such as OpenAI, Google and Anthropic, to name a few, have crawled the web (and still crawl it in real time) in search of content that answers users' questions.

And they do so, unless specific agreements are in place, without offering the creators of that content any compensation beyond a link. The practice has been questioned since this technology was born. Blog articles, Wikipedia, books, user-generated content, even personal data: the crawlers, those automated bots, leave nothing behind. Today Cloudflare has said that this is over.

Starting today, Cloudflare will block AI scrapers by default, a move with more implications than it might seem. Let's start at the beginning.

Web crawlers. This technology is not new and, in fact, it is what made possible the foundation on which the Internet rests: web search. You have surely heard of the "Google spider", the bot that crawls the entire web in search of content to index and offer to users. It is only one of the thousands upon thousands that exist and that generate 30% of all traffic worldwide.

This technology was key to shaping the Internet we know, and its relationship with content creators was symbiotic. The click economy was born: the creator produces content, Google indexes it, the user finds it through Google, Google earns revenue from search advertising, and the creator receives free traffic and earns revenue through advertising, affiliate links, etc.

With AI, the story is quite different.

Data. AI models need information to feed on, to be trained and to be able to answer questions. To get it, the big companies we all know crawled the web, extracted all the content they could and used it to develop technologies such as ChatGPT. The problem? That content could be protected by copyright, which led The New York Times to sue OpenAI for this very reason. Since then, AI companies have had to sign agreements with media outlets to access their content.

Image: Solen Feyissa

Connected AIs. AI kept evolving and, as expected, it ended up connecting to the Internet. It no longer only gave answers based on finite training data; it could go online to look for the answer in media outlets, blogs and web pages in real time (or almost in real time). The user no longer had to click on a link. The AI searched, analyzed and generated the answer, reducing traffic to media outlets and blogs.

What brings this technology to life are AI crawlers. They are the "digievolution" of the bots that shaped the Internet we know. Among them are OpenAI's GPTBot, Meta's Meta-ExternalAgent, Anthropic's ClaudeBot and ByteDance's Bytespider. With them, the symbiotic relationship mentioned above begins to deteriorate, because the user no longer accesses the original content and does not click on links. Instead, they consume a derivative product generated by AI.

The clearest example: the new AI-generated previews that appear on Google whenever you run a search.

Volume of daily requests of the main AI Bots | Image: Cloudflare

Hit the brakes... or not, I'm just a .txt. How can this indiscriminate, uncompensated crawling be stopped? The first proposal was to update the robots.txt file to tell bots that they may not extract a website's content. This file is one of the most widely used resources for managing bot activity, but it has a small problem: compliance is voluntary. AI companies can follow the instructions, or they can ignore them and extract the content anyway.

In addition, we may touch something we should not and make our website disappear from Google. Every website that wants to appear on Google must allow Googlebot, its spider, to crawl it.
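To make the robots.txt approach concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules below are an illustrative assumption, not any real site's file: they ask three of the AI crawlers named above to stay out while leaving every other bot, Googlebot included, free to crawl. The `example.com` URL and the `can_crawl` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask some AI crawlers to stay out,
# keep everything else (including Googlebot) allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Report what the robots.txt rules ask of the given bot for this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("GPTBot", "ClaudeBot", "Bytespider", "Googlebot"):
        print(bot, can_crawl(bot, "https://example.com/article"))
```

Note that the parser only reports what the file requests. Nothing here stops a bot that chooses not to check, which is exactly the voluntary-compliance problem.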

Cloudflare puts its foot down. This brings us to Cloudflare's recent announcement. The platform (on which half the Internet depends) has announced that, starting today, blocking of AI crawlers will be active by default. To do so, Cloudflare offers direct management of robots.txt to avoid problems such as the one described above. The key, of course, is that Cloudflare will take care of keeping the blocks up to date as the AI landscape changes. Although the feature is enabled by default, it is optional and can be turned off completely in the settings.

Paying up. Cloudflare's other proposal is Pay per crawl. Since AI will continue to need access to a website's content, why not give the creator the option to charge for that access? Pay per crawl, currently in beta, lets domain owners set a fixed price per request. If an AI crawler wants to extract content from that domain, it will have to pay for it. On paper, this tool has the potential to change the current landscape, but everything will depend on its reach, its adoption and how the crawler operators respond.
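The fixed-price-per-request idea can be modeled in a few lines. This is only a toy sketch of the logic, not Cloudflare's actual API: the `handle_crawl_request` function, its parameters and the prices are all hypothetical, and the use of the standard HTTP status 402 (Payment Required) to signal "pay first" is an assumption made for illustration.

```python
from decimal import Decimal
from http import HTTPStatus
from typing import Optional, Tuple


def handle_crawl_request(
    price: Decimal, max_price: Optional[Decimal]
) -> Tuple[HTTPStatus, Decimal]:
    """Toy model: serve the page only if the crawler agrees to the price.

    price     -- flat per-request price set by the domain owner (hypothetical)
    max_price -- the most the crawler is willing to pay, or None if it
                 offers nothing (hypothetical)
    Returns (status, amount_charged).
    """
    if max_price is not None and max_price >= price:
        return HTTPStatus.OK, price
    # The crawler offered too little (or nothing): no content, no charge.
    return HTTPStatus.PAYMENT_REQUIRED, Decimal("0")


if __name__ == "__main__":
    owner_price = Decimal("0.01")
    print(handle_crawl_request(owner_price, Decimal("0.05")))
    print(handle_crawl_request(owner_price, None))
```

In this model the owner's price is what gets charged, never the crawler's higher ceiling, which matches the "fixed price per request" described above.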

Cover image | Solen Feyissa
