What can be done to stop AI from taking our content? US-based web services provider Cloudflare says it has devised a way to fight web scraping: an “AI maze” designed to ensnare bots.
More precisely, the labyrinth is designed to detect “AI crawlers”, the automated programs that methodically extract information from web pages, and to get them stuck inside it.
The company said in a blog post published last week that it has seen “a surge of new crawlers employed by artificial intelligence firms to collect data for model training.”
Generative artificial intelligence (GenAI) requires vast datasets for model training. Technology firms such as OpenAI, Meta, and Stability AI have faced accusations of using datasets containing copyrighted material.
To counter this, upon detecting “suspicious bot behavior”, Cloudflare will “link to a sequence of AI-created pages designed to be persuasive enough to lure a web crawler into navigating through them”, making those bots squander their time and resources.
“Our aim was to develop a novel method for neutralizing these pesky bots stealthily,” the firm stated, drawing a parallel with a “honeypot” tactic. The approach also helps it catalog malicious actors.
According to recent estimates, approximately 20 percent of all websites use Cloudflare.
The decoy content is “real and related to scientific facts” but “just not relevant or proprietary to the site being crawled,” the blog post added.
It will also be invisible to human visitors and won’t affect the site’s search rankings, the company said.
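Conceptually, the mechanism is a honeypot wired into the web server. The following is a minimal sketch of that idea in Python, not Cloudflare’s actual implementation: it assumes a hypothetical Flask app, a deliberately naive User-Agent heuristic, and a stub that stands in for pre-generated decoy pages.

```python
# Toy illustration of an "AI maze" honeypot -- NOT Cloudflare's implementation.
from flask import Flask, request

app = Flask(__name__)

# Hypothetical heuristic; real bot detection uses far richer signals.
SUSPICIOUS_AGENTS = ("python-requests", "scrapy", "curl")

def looks_like_suspicious_bot() -> bool:
    agent = request.headers.get("User-Agent", "").lower()
    return any(marker in agent for marker in SUSPICIOUS_AGENTS)

def decoy_page(depth: int) -> str:
    # Stand-in for AI-generated content that is "real and related to
    # scientific facts" but irrelevant to the site being crawled.
    links = "".join(
        f'<a href="/maze/{depth + 1}?branch={i}">related reading</a> '
        for i in range(3)
    )
    return f"<html><body><p>Generic science filler, level {depth}.</p>{links}</body></html>"

@app.route("/")
def home():
    page = "<html><body><h1>Real site content</h1>"
    # The entry link is invisible to humans and marked nofollow so it
    # does not affect search rankings; only crawlers that follow every
    # href will descend into the maze.
    page += '<a href="/maze/0" style="display:none" rel="nofollow">more</a>'
    return page + "</body></html>"

@app.route("/maze/<int:depth>")
def maze(depth: int):
    if not looks_like_suspicious_bot():
        return home()  # humans who somehow land here are bounced back
    return decoy_page(depth)  # bots keep burning requests on filler
```

Each decoy level links to several more, so a crawler that follows every link multiplies its wasted requests; logging hits on the /maze/ paths would also provide the cataloging signal the company describes.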
Rising threat to copyrighted content
An increasing number of voices are calling for stronger measures, including regulations, to protect content from being stolen by AI actors.
Visual artists are now exploring ways to “poison” models by adding a layer of data that acts as a decoy for AI, preserving their artistic style by making it harder for genAI to mimic.
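As a rough illustration of the idea, the sketch below overlays a small, human-imperceptible perturbation on an image before it is published. It is a toy, not the algorithm used by real poisoning tools such as Glaze or Nightshade, which compute their perturbations adversarially against specific models rather than at random; the function name and the epsilon budget are illustrative assumptions.

```python
# Toy sketch of the "decoy layer" idea -- not an actual poisoning algorithm.
import numpy as np
from PIL import Image

def add_decoy_layer(path: str, out_path: str, epsilon: int = 4) -> None:
    """Overlay a small, human-imperceptible perturbation on an image.

    epsilon caps the per-pixel change; real poisoning tools choose the
    perturbation adversarially rather than at random as done here.
    """
    img = np.asarray(Image.open(path).convert("RGB")).astype(np.int16)
    rng = np.random.default_rng(seed=0)
    noise = rng.integers(-epsilon, epsilon + 1, size=img.shape)
    poisoned = np.clip(img + noise, 0, 255).astype(np.uint8)
    Image.fromarray(poisoned).save(out_path)

# add_decoy_layer("artwork.png", "artwork_protected.png")
```

Random noise alone would not mislead a model; the sketch only shows the workflow the artists’ tools follow: a bounded, invisible-to-humans layer applied to the artwork before release.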
Other approaches have also been explored, such as the deals several news publishers have struck with tech companies, allowing AI to train on their content in exchange for undisclosed sums.
Others, like the news agency Reuters and several artists, have decided to take the matter to court over potential infringement of copyright laws.