Cloudflare Unveils Tool to Combat AI Scraping Bots

Cloudflare says it blocks more than 200 billion cyberthreats daily.

Cloudflare offers a free solution to stop bots from scraping websites for AI model training data.

Google, OpenAI, and Apple let website owners opt out of their data-scraping bots through robots.txt directives; however, Cloudflare noted that not all AI scrapers honor those rules.
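
As a rough illustration of that opt-out, a site owner can add directives like the following to robots.txt, using the user-agent tokens these companies have documented for their AI training crawlers (GPTBot for OpenAI, Google-Extended for Google, Applebot-Extended for Apple); the exact tokens can change, so each vendor's documentation remains the authoritative reference:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /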

Cloudflare, a publicly listed cloud services company, said customers do not want AI bots accessing their websites, especially dishonest ones. It added that AI companies intent on bypassing content controls will keep adapting to evade bot detection, per TechCrunch.

To address this, Cloudflare analyzed AI bot and crawler traffic to refine its automatic bot detection models. These models assess whether an AI bot is trying to look and behave like a person using a web browser.

Cloudflare said malicious actors typically rely on tools and frameworks that can be fingerprinted when they crawl websites at scale, and that these signals allow its models to flag evasive AI bot traffic as bots. The company also lets hosts report suspicious AI bots and crawlers, which it will ban manually.

AI Bots are an Increasing Security Risk

With the growth of generative AI, the demand for training data is rising, and so is the number of AI bots gathering it. Fearing AI vendors will use their content without permission or compensation, a growing number of websites block AI scrapers and crawlers. Research shows more than 600 news publishers and 26% of the top 1,000 websites have blocked OpenAI's bot.

[Photo: Cloudflare co-founder and CEO Matthew Prince speaks on stage at TechCrunch Disrupt Berlin 2019 at Arena Berlin on December 12, 2019, in Berlin, Germany. Noam Galai/Getty Images for TechCrunch]

Blocking is not always reliable. According to reports, some AI vendors ignore bot exclusion rules to gain an edge. Perplexity has been accused of impersonating legitimate visitors to scrape content, while OpenAI and Anthropic have reportedly ignored robots.txt rules at times.

Cloudflare's tools could be helpful if they reliably detect covert AI bots. However, they do not solve the larger problem of publishers losing referral traffic from AI tools such as Google's AI Overviews, which exclude sites that block specific AI crawlers.

Cloudflare recently released its 2024 State of Application Security Report. The analysis shows how security teams struggle to manage the risks facing modern applications, which power many popular websites.

The report finds that software supply chain problems, DDoS attacks, and malicious bots are overwhelming professional application security teams.

Web apps and APIs underpin e-commerce transactions, secure healthcare data exchange, and everyday mobile activity. As these apps become more widespread, so do attacks against them.

The rapid rollout of new features, such as generative AI, expands the attack surface. According to the report, unprotected apps can disrupt business, cause financial losses, and cripple critical infrastructure.

Cloudflare co-founder and CEO Matthew Prince noted that online apps used daily for important tasks are "rarely built with security in mind," exposing them to hackers. The company highlighted that it blocks 209 billion cyber threats per day for its clients.

No AI Bots Allowed on Reddit

In a similar move, Reddit recently announced that it will block most automated bots from using its data without a licensing agreement.

Mashable reported that Reddit plans to update its robots.txt file to restrict web crawlers. The move targets AI companies that scrape the web to train their models while disregarding copyright and site terms of service.

In a blog post, Reddit explained that "good faith actors" such as scholars and the Internet Archive can still access its information for non-commercial purposes.
