> The LLM companies are not picking on me in particular, they are pounding every site on the net.
Why is this not a criminal offense? They are hurting businesses for profit (or for higher valuation, as they probably have no profit at all).
Why are corporations allowed to do with impunity what could land even a teenager years in prison? Is there no rule of law anymore?
The five-year and ten-year penalties kick in only when the government can show the offense caused at least $5,000 in losses across all victims during a one-year period. https://legalclarity.org/what-are-the-punishments-for-a-ddos...
Because might makes right and any entity with the power to legally put up a fight is in on the game (or wants to be)
We've already established that computer crime and IP laws apply to normies and not tech companies
Normative vs prerogative state [1]. See US v. Swartz compared to Meta use of LibGen for Llama
[1] https://en.wikipedia.org/wiki/Dual_state_(model)
So, I knew Aaron, and I definitely would not presume to predict what he would have thought, but I'd point out that there is a sizeable space of outcomes in which he should never have been prosecuted, and in which scraping by others, including large commercial companies, should not be prosecutable on the same grounds.
I repeat what Aaron’s friends and lawyers said at the time: we were going to fight that case, and we were going to win.
Is what an offence lol? Bot scraper traffic?
How do you think search engines work?
They work because they offer ways to opt out: they honor crawl delays, let sites set preferred scraping times, support IndexNow, etc.
And they give you real, valuable traffic in return.
Most offer ways to opt out, some don't. Scraping somebody's website might be annoying or problematic traffic-wise, but that's a far (very far) step removed from saying scrapers should be criminalised. The latter statement is outright laughable.
Search engines appear to care more about being good "Netizens". It's not like GoogleBot never crashed a site, but it's rare. Search engine bots check whether they need to back off for a bit, they check ETags, and they notice when a page changes infrequently and slow their crawl frequency accordingly.
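The bookkeeping behind that kind of politeness is simple. A minimal sketch in Python (all names hypothetical, not any particular crawler's code): remember one ETag per URL so the server can answer 304 Not Modified, and one next-allowed timestamp per host, backing off harder when the server signals overload with a 429.

```python
import time

class PoliteCache:
    """Tracks ETags and per-host backoff, the way well-behaved crawlers do."""

    def __init__(self, min_interval=10.0):
        self.etags = {}          # url -> last seen ETag
        self.next_allowed = {}   # host -> earliest next-fetch timestamp
        self.min_interval = min_interval

    def should_fetch(self, host, now=None):
        now = time.time() if now is None else now
        return now >= self.next_allowed.get(host, 0.0)

    def conditional_headers(self, url):
        # Send If-None-Match so an unchanged page costs the server almost nothing
        etag = self.etags.get(url)
        return {"If-None-Match": etag} if etag else {}

    def record(self, host, url, status, etag=None, now=None):
        now = time.time() if now is None else now
        # Back off six times as long when the server says "429 Too Many Requests"
        backoff = self.min_interval * (6 if status == 429 else 1)
        self.next_allowed[host] = now + backoff
        if etag:
            self.etags[url] = etag
```

A crawler built on this only re-downloads a page when the ETag changed, and slows itself down the moment a site pushes back.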
If you train an LLM, you don't keep a copy of every page around, so there's no point in checking whether you need to re-scrape a page: you always do, because you store nothing.
Personally I think people would be pretty indifferent to the new generation of scrapers, AI or other types, if they at least behaved and slowed down when they noticed a site struggling. If they had the slightest bit of respect for others on the web, this wouldn't be an issue.
It's a bit more like a physical business with a "public welcome" policy like a coffee shop going viral and then having tens of thousands of people walking in and taking pictures but not buying coffee. It's disruptive, but not illegal.
Acme.com is welcome to require authentication for all pages but their home page, which would quickly cause the traffic to drop. They don't want to do this: like the coffee shop, they want to be open to the public, and for good reasons.
Sometimes the usage profile changes dramatically in a short time. 15 years ago, Netflix created the video streaming market, and shared bandwidth capacity that had been excessive before wasn't enough. 15 years before that, Google did the same thing when they created search and started driving tremendous traffic to text-based websites, which had previously spread through word of mouth.
Turns out the micro transaction people probably had the right idea.
Depends on the country. In Japan, you could be considered a "public nuisance" and be tossed behind bars for a bit.
adapt or die
waiting on the govt to do something is a path of failure
> waiting on the govt to do something is a path of failure
Keeping the government accountable is a duty of every citizen and the only way to have a functioning society. The failure is to let the government be arbitrary and cater to the powerful instead of following the rule of law and applying it equally at all levels.
Because the law deals with intent. The intent of a 12-year-old skiddie with a DDoS box is to harm someone else's internet. The intent of big scrapers is to collect data. If you want to make the latter illegal, then vote for that instead of loading it with the normative baggage of the former.
It's the same problem as why Occupy Wall Street fell apart: a bunch of losers who don't understand the system screech about the system. Because they don't understand it, they can't offer any meaningful dialogue about how to fix it beyond screeching.
I have added a DB replica server just to keep my website from succumbing to AI bot traffic.
> Why are corporations allowed to do with impunity what could land even a teenager years in prison? Is there no rule of law anymore?
Those laws are intended to protect corporations. If corporations are the ones doing the scraping, it doesn't make sense for the same laws to affect them.
Because they have more money.
I've had to deploy a combination of Cloudflare's bot protection and anubis on over 200 domains across 8 different hosting environments in the last 2 months. I have small business clients that couldn't access their sales and support platforms because their websites that normally see tens of thousands of unique sessions per day are suddenly seeing over a million in an hour.
Anthropic and OpenAI were responsible for over 70% of that traffic.
> Is there no rule of law anymore?
Have you not been paying attention to the news for the past few years?
No, there isn't. If there were, Trump would be in prison, not the Oval Office. And he and the Republican Party have deliberately fostered this environment of corruption and rule-by-wealth so that they can gain more power and even more wealth.
And now they are also backing the AI zealots, and techbros more generally, to ensure that they can do whatever the hell they want, damn the consequences to the rest of the world.
His robots.txt explicitly allows bots including LLM bots to scrape his site
The LLM scraper bots ignore robots.txt
I suspect part of the issue is that people are still using things like `acme.com` and `demo.com` as an example domain in their documentation and tests instead of relying on `example.com` which is reserved exactly for this purpose [0]
[0]: https://www.iana.org/domains/reserved
A small part. On my server AI bots outnumber real visitors 300 to one.
I don't mean that users are following the links to `acme.com` and `demo.com` type domains in documentation; I mean that bots are likely finding and following many links to them because of their widespread use in documentation.
If you search for `site:github.com "acme.com"` in Google, you'll find numerous instances of the domain being used in contrived links in documentation as an example of how URLs might be structured on an arbitrary domain and also in issues to demonstrate a fully qualified URL without giving away the actual domain people were using.
This means that numerous links are pointing to non-existent paths on `acme.com` because of the nature of how people are using them in documentation and examples.
That is very possible.
But it is not necessary to see the results that are being described.
If sites like my tiny little browser game, with roughly 120 weekly unique users, are getting absolutely hammered by the scraper-bots (it was, last year, until I put the Wiki behind a login wall; now I still get a significant amount of bot traffic, it's just no longer enough to actually crash the game), then sites that people actually know and consider important like acme.com are very likely to be getting massive deluges of traffic purely from first-order hits.
The article describes that a lot of the requests are for non-existent URLs. Do you observe the same?
That's such an absolutely ludicrous thing to hear, in a "wtf are these people doing" type of way. I can't imagine a non-social-media site generating new content at a rate that would justify these bots essentially scraping it continuously. It's just gross to me that they're okay with that level of unsophisticated effort, doing the same thing over and over with zero gain.
Next to the massive amounts of energy they are burning in their own datacenters, they are burning up other datacenters as well. Plus all the extra energy used by every router, hub and switch in between.
Where from? And quite frankly, why? There are existing training data sets that are large enough for smaller models. Larger models have been focusing on data quality more than quantity. There's limited utility to further indiscriminate widespread scraping.
Tell that to the idiots doing the scraping.
Small site operators like us know very well that the utility they can get by scraping us is marginal at best. Based on their patterns of behavior, though, my best guess is that they've simply configured their bots to scrape absolutely everything, all the time, forever, as aggressively as possible, and treat any attempt to indicate "hey, this data isn't useful to you" as an adversarial signal that the site operator is trying to hide things from them that are their God-given right.
How are you measuring this? Does your solution rely on user agent or device fingerprinting? Curious to know what tools are available today and how accurate they are.
I'm popular in Europe; there's no reason for people from Singapore, Russia, Brazil, and literally every other country in the world to all start visiting very old articles and comment permalinks en masse.
Having honeypot links is the only thing that helps, but I'm running into massive iptables rule sets, which slow things down.
This is not what I want to do with my time. I can't afford the expensive specialised tools. I'm just a solo entrepreneur on a shoestring budget. I just want to improve the website for my 3k real users and 10k real daily guests, not for bots.
Bot traffic is crazy even for smaller sites, but still manageable. I was getting 2,000 visitors a day on my infrequently updated website, but after I blocked all the bots via Cloudflare it went back to the normal double digit visitor count.
I have 6M pages across 8 domains, and 10 bots per second, each on a unique residential IP, working hard to scrape every single page.
And how often are those 6M pages changing? How often are those bots finding anything new? Why do the bot makers not notice the lack of difference and slow down their requests for what is, to them, essentially stale content?
In March 2025, Drew DeVault wrote a blog post called "Please stop externalizing your costs directly into my face" [1]. I think that is a pretty good guess as to why these bots do not care about frequency of changes: it costs too much.
Every run is basically a fresh run: no state stored, every page just fed into the machine anew. At least that's my theory.
The AI companies need a full copy of your page every time they retrain a model. Now, they could store that in their own datacenters, but that's a full copy of the internet, in a market where storage costs are already pretty high. So instead, they just externalize the storage cost. If you run a website, a public GitLab instance, Forgejo, a wiki, a forum, whatever, you basically function as free offsite storage for the AI companies.
1) https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
On the platform at my work, they scrape the same page multiple times, over and over. They do not care to cache anything. And it's ridiculous to account for: for our properties, everything is news-based, so warming the cache used to be as simple as loading the first X articles to get them into cache. But with AI bots that is not viable, because they scrape as much as possible, including articles from 2018 and 2017. Management doesn't want to block them, though, so it's just suffering through the endless barrage. I was able to do a lot about this, like heavier caching even with pgpool, but it's crazy that this small subset of bots effectively accounts for 60%+ of our spend.
Many are using residential proxies now; it's impossible to block them. Not even Google Analytics succeeds. People are sitting on reports thinking their website is suddenly very popular, but it's all random IPs from random locations across the world, requesting one page at a time, at random times of the day.
One day last week one of my clients' sites was getting about 2k "visitors" per second - I had to block the entire AS45102 to make it stop.
You can block CN, RU, SG, KR, plus level 3 of the "ipsum" blocklist, and the numbers go down a lot.
People might not know about ipset: don't use individual rules in iptables.
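For reference, the ipset pattern looks roughly like this (a sketch; the set name and addresses are made up, and the commands need root). One kernel hash set is matched by a single iptables rule, so lookups stay fast no matter how many addresses you block:

```shell
# One set, one rule, instead of one iptables rule per address
ipset create badbots hash:net hashsize 4096
ipset add badbots 203.0.113.0/24
ipset add badbots 198.51.100.7
iptables -I INPUT -m set --match-set badbots src -j DROP
```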
Nginx can reject easily based on country.
    geoip2 /etc/GeoLite2-Country.mmdb {
        $geoip2_metadata_country_build metadata build_epoch;
        $geoip2_data_country_code default=Unknown source=$remote_addr country iso_code;
    }

    # $allowed_country has to be defined somewhere, e.g. via a map (countries here are placeholders):
    map $geoip2_data_country_code $allowed_country {
        default no;
        DE yes;
        FR yes;
    }

    server {
        ....
        if ($allowed_country = no) {
            return 444;
        }
    }
> I closed port 443
> Now closing https service is obviously just a temporary fix
Probably the best starting point would be to edit the robots.txt file and disallow LLM bots there. Currently the file allows all bots: http://acme.com/robots.txt
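A hedged sketch of what that could look like (the bot names below are the user agents these companies have published for their crawlers; a site operator would want to check the current lists before relying on them):

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```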
The LLM scraper bots ignore robots.txt
If the bots respected robots.txt, I would not have a viable business plan.
> Nearly all of them were for non-existent pages.
Do any webservers have a feature where they keep a list in memory of files/paths that exist?
Also, why are most requests for non-existent pages?
Because they are hunting for vulnerable devices, and the paths they request are unique to an application, like a VoIP appliance, for example.
They usually request something deep like /foo/bar/login.html as part of their reconnaissance.
I'm up to 4 pages of filter rules after the massive IP blacklist.
These assholes are also scanning every address on the IPv4 internet and hoovering up the content.
To answer your first question: no, that's the OS's job. But some clever rules could be set up for filtering invalid requests, depending on your web server.
That's called a WAF (web application firewall): a separate piece of software (or server module) in which the paths of the hosted web applications are defined, and where variables and variable types can often be validated, etc., to prevent the kinds of attacks these scans are probing for.
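As an illustration, the path-allowlist part of a WAF can be sketched as a tiny WSGI middleware (the paths and names here are hypothetical, not any real product's API): scanner probes for paths the site never served get a cheap 404 before they touch the application or database.

```python
# Assumed site layout; a real deployment would generate this from its routes
KNOWN_PATHS = {"/", "/about", "/contact"}
KNOWN_PREFIXES = ("/blog/", "/static/")

def waf_lite(app):
    """Wrap a WSGI app so only allowlisted paths ever reach it."""
    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        if path in KNOWN_PATHS or path.startswith(KNOWN_PREFIXES):
            return app(environ, start_response)
        # Everything else (e.g. /foo/bar/login.html probes) is answered here
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found\n"]
    return middleware
```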
The only real solution is to put Anubis in front. For me, just putting Cloudflare in front suffices: it lets through only a few thousand requests per hour by default, and my homeserver can handle that quite well on its own.
For those who have deployed Cloudflare in front, what are pros and cons? How's the user experience? Do they offer free bot protection?
I opted for Bunny Shield exactly to combat bots, in particular ones that spoof User Agents and rotate millions of IPs. It works great, detecting the vast majority of bots and challenging them. Much more user friendly than Cloudflare too, which typically resorts to challenging everyone (not that CF was ever an option due to various concerns).
I also added various rate limits such as 1 RPS to my expensive SSR pages, after which a visitor gets challenged. Again this blocks bots without harming power users much.
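A 1-RPS-with-burst limit like that is typically a token bucket. A minimal sketch in Python (parameters hypothetical, not Bunny Shield's actual implementation): each visitor gets a bucket that refills at the target rate, and a request is allowed only if a token is available.

```python
import time

class TokenBucket:
    """Per-visitor limiter: roughly `rate` requests/second with a small burst."""

    def __init__(self, rate=1.0, burst=5):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = float(burst), 0.0

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: challenge or reject this visitor
```

A scraper hammering an expensive SSR page drains the bucket immediately and stays blocked, while a human power user who clicks in bursts barely notices.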
Some pros for us, in addition to bot protection.
* global distributed caching of content. This reduces the static load on our servers and bandwidth usage to essentially zero, and since content is served from the endpoint closest to the client, they get lower latency. This includes logged-in user-specific content as well.
* shared precached common libraries (e.g. jQuery) for faster client load times
* automated minification of JS, CSS, and HTML, along with image optimization (serving each image at the size and resolution appropriate to the device the user is viewing it from) to increase speed
* always up mode (even if my server is down for some reason, I can continue to serve static content)
* detailed analytics and reporting on usage / visitors
There are a lot more, but those are a few that come to mind.
This reminds me of a problem we hit at work. Ended up going a different direction but same root issue.
I had to block all traffic except that from my country. As I offer a service that is exclusive to my country, it worked like a charm.
Series of Chinese LLM scrapers kept PortableApps.com running slow and occasionally unresponsive for 2 weeks.
How do you know that they were LLM scrapers? The reason I ask is that user agents can easily be spoofed.
There are plenty of local LLMs out there run by humans that play nice. It's not the LLMs that are the problem. It's the corporations. That's the commonality. Human people aren't doing this. These corporate legal persons are a much more dangerous and capable form of non-human intelligence with non-human motives than LLMs (which are not doing the scraping or even calling the tools which are sending the HTTP requests). And they have lobbied their way to legal immunity to most of their crimes.
> Human people aren't doing this
Who do you think writes these scrapers? Well, I mean aside from the vibe coded ones.
> Someone really ought to do something about it.
What is bro proposing here?
OP here. I don't have a proposal. I do know of two people in similar situations who have been replying to bot requests with zip bombs and other software WMDs.
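For what it's worth, the reason zip bombs are cheap to serve is that highly repetitive data compresses by orders of magnitude: a server can send a few kilobytes of Content-Encoding: gzip that a careless client inflates to many megabytes. A small Python illustration (sizes approximate):

```python
import gzip

# 10 MB of zeros compresses to roughly 10 KB, an expansion of ~1000x
# for any client that naively decompresses the response into memory.
body = gzip.compress(b"\0" * 10_000_000, compresslevel=9)
ratio = 10_000_000 / len(body)
print(f"{len(body)} bytes on the wire, ~{ratio:.0f}x expansion")
```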