Software developers are increasingly viewing AI web crawlers as a significant nuisance online, with some even likening these bots to “cockroaches of the internet.” Facing relentless scraping and often disruptive traffic, many developers, especially in the open source community, are adopting ingenious and sometimes humorous defenses against these intrusive automated agents.
Niccolò Venerandi, a developer associated with the Linux desktop environment Plasma and the blog LibreNews, highlights that while any website can be targeted by problematic crawler behavior, leading to potential site disruptions, open source developers are disproportionately affected.
Websites that host Free and Open Source Software (FOSS) projects inherently expose more of their infrastructure publicly. Additionally, they typically possess fewer resources compared to commercial platforms to handle aggressive bot traffic.
A core problem is that numerous AI bots disregard the Robots Exclusion Protocol (robots.txt), a standard originally created for search engine crawlers that tells bots which parts of a website they should not crawl.
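For context, a robots.txt file is just a plain-text set of directives served at the site root. A sketch of what a FOSS project might publish to opt out of AI crawling could look like this (the specific bot names shown are examples; actual crawlers identify themselves under various user-agent strings, and as the article notes, compliance is voluntary):

```
# robots.txt — served at https://example.org/robots.txt
# Ask known AI crawlers to stay away entirely:
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# All other bots: keep out of the Git web interface,
# which is expensive to render.
User-agent: *
Disallow: /cgit/
```

The protocol has no enforcement mechanism, which is precisely why crawlers that ignore it force site operators toward the heavier countermeasures described below.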
In a public appeal posted in January, FOSS developer Xe Iaso detailed how AmazonBot persistently targeted a Git server website, causing Distributed Denial of Service (DDoS)-level outages. Git servers are essential for hosting FOSS projects, enabling code download and contribution.
According to Iaso, this particular bot ignored the robots.txt directives, masked its origin through varied IP addresses, and impersonated legitimate users.
Iaso expressed frustration, stating, “Blocking AI crawler bots is ineffective because they deceive, alter their user agent identification, utilize residential IP addresses as proxies, and employ other evasive techniques.”
“They will relentlessly scrape your site until it becomes overwhelmed, and then continue scraping further. These bots will navigate every link, repeatedly accessing the same pages. Some bots even request the same link multiple times within a single second,” the developer elaborated in their post.
Anubis: A Novel Approach to Bot Mitigation
In response, Iaso conceived Anubis, a resourceful countermeasure.
Anubis functions as a reverse proxy that imposes a proof-of-work challenge: incoming requests must solve a computational puzzle before they reach the Git server. The cost is negligible for a human visiting with a browser but adds up quickly for bots hammering the site at scale, effectively filtering out automated crawlers while letting human users through.
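The core idea behind a proof-of-work gate can be illustrated in a few lines. This is a generic hash-puzzle sketch, not Anubis's actual implementation: the client must find a nonce whose hash meets a difficulty target (expensive), while the server verifies the answer with a single hash (cheap). The function names and the seed format here are illustrative assumptions.

```python
import hashlib
import itertools

def solve_challenge(seed: str, difficulty: int = 4) -> int:
    """Client side: find a nonce so that sha256(seed + nonce) starts
    with `difficulty` hex zeros. Expected cost grows as 16**difficulty
    hashes, which is what makes mass scraping expensive."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{seed}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(seed: str, nonce: int, difficulty: int = 4) -> bool:
    """Server side: checking a submitted proof costs one hash,
    so the proxy stays cheap to run even under heavy load."""
    digest = hashlib.sha256(f"{seed}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry is the point: a browser solves the puzzle once per session in JavaScript and barely notices, while a crawler issuing thousands of requests per second must pay the cost on every one.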
Adding a layer of irony, Anubis is named after the Egyptian deity associated with guiding the deceased to judgment.
Iaso elaborated, “In mythology, Anubis weighed the soul against a feather. A heavier soul faced negative consequences.” Similarly, if a web request successfully completes the challenge, confirming it originates from a human, a whimsical anime illustration of an anthropomorphic Anubis is displayed. Conversely, bot requests are denied access.
This ingeniously named project rapidly gained traction within the FOSS community. Launched on GitHub on March 19th, Anubis quickly garnered significant attention, amassing 2,000 stars, contributions from 20 developers, and 39 forks within days.
Widespread Impact and Defensive Strategies
The swift adoption of Anubis underscores the widespread nature of the problem faced by developers. Venerandi shared accounts from numerous others experiencing similar challenges:
- Drew DeVault, Founder and CEO of SourceHut, reported dedicating a significant portion of his time, “from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale,” and enduring “dozens of brief outages per week.”
- Jonathan Corbet, a prominent FOSS developer and operator of the Linux news website LWN, cautioned that his site’s performance was being degraded by DDoS-level traffic attributed to “AI scraper bots.”
- Kevin Fenzi, system administrator for the extensive Linux Fedora project, revealed that the intensity of AI scraper bots necessitated blocking all internet traffic originating from Brazil.
Venerandi said he knows of numerous other projects contending with identical problems. One project “had to implement a temporary ban on all Chinese IP addresses at one point” to manage the traffic.
Venerandi emphasizes the severity of the situation, noting the drastic measures developers are forced to consider, “even having to resort to banning entire countries” to defend against AI bots that disregard robots.txt protocols.
Beyond the metaphorical “soul-weighing” approach of Anubis, some developers advocate for more aggressive countermeasures.
In a Hacker News discussion, user xyzal humorously suggested populating robots.txt-forbidden pages with misleading content, such as “a bucket load of articles on the benefits of drinking bleach” or “articles about the positive effect of catching measles on performance in bed.”
Xyzal elaborated on this strategy, “We need to aim for the bots to derive _negative_ utility value from visiting our traps, not merely zero value,” aiming to actively degrade the quality of data scraped by these bots.
In January, an anonymous developer known as “Aaron” released Nepenthes, a tool designed for precisely this purpose. Nepenthes entraps crawlers within an endless network of deceptive content. The developer acknowledged to Ars Technica that this approach is intentionally aggressive, if not overtly malicious. The tool’s name, Nepenthes, refers to a genus of carnivorous plants known as pitcher plants.
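The tarpit idea that Nepenthes and similar tools build on can be sketched simply. This is not Nepenthes itself, just a minimal illustration of the technique: every generated page links only to more generated pages, derived from a hash of the current path, so the maze is effectively infinite yet requires no storage. All names here are hypothetical.

```python
import hashlib

def maze_page(path: str, n_links: int = 5) -> str:
    """Return a deterministic junk HTML page for `path` whose links
    lead only to further junk pages. A crawler that ignores
    robots.txt and follows every link never finds an exit."""
    links = []
    for i in range(n_links):
        # Derive child paths from a hash of the current path, so the
        # same URL always yields the same page without any database.
        h = hashlib.sha256(f"{path}/{i}".encode()).hexdigest()[:12]
        links.append(f'<a href="/maze/{h}">{h}</a>')
    return "<html><body>" + " ".join(links) + "</body></html>"
```

A real deployment would serve such pages only under paths that robots.txt forbids, so well-behaved crawlers never see them while misbehaving ones waste their crawl budget on noise.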
Cloudflare, a major provider of website security and bot mitigation tools, recently launched a similar offering named AI Labyrinth.
Cloudflare stated in their blog post that AI Labyrinth is intended to “slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect ‘no crawl’ directives.” It aims to redirect misbehaving AI crawlers to “irrelevant content rather than extracting your legitimate website data.”
DeVault from SourceHut commented, “Nepenthes has a satisfying sense of justice to it, since it feeds nonsense to the crawlers and poisons their data sources, but ultimately Anubis is the solution that worked” effectively for SourceHut’s infrastructure.
However, DeVault also made a fervent public appeal for a more fundamental solution: “Please stop legitimizing LLMs or AI image generators or GitHub Copilot or any of this garbage. I am begging you to stop using them, stop talking about them, stop making new ones, just stop.”
Given the unlikely prospect of such a widespread cessation, developers, particularly those within the FOSS realm, are resorting to ingenuity and humor in their ongoing efforts to combat intrusive web scraping bots.