I really hope they die soon, this is unbearable…
I’m okay with a few crawlers, but not with what’s effectively a DDoS attack by AI companies that abuse my resources, generate terabytes of traffic, and crash my server while costing me money. I use Anubis now, which sucks from an accessibility standpoint, but at least I’m not dealing with their malicious traffic anymore.
I ended up just pushing everything behind my tailnet and only leaving my game server ports open (which are non-standard ports).
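The firewall side of that is simple; a minimal sketch with ufw, assuming Tailscale’s default interface name, with placeholder ports standing in for my actual game servers:

```sh
# Default-deny from the public internet, but trust the tailnet interface.
ufw default deny incoming
ufw allow in on tailscale0

# Only the (non-standard, placeholder) game server ports stay public.
ufw allow 27115/udp
ufw allow 25575/tcp

ufw enable
```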
It’s best to use either Cloudflare (best IMO) or Anubis.
- If you don’t want any AI bots, then you can set up Anubis (open source; requires JavaScript to be enabled by the end user): https://github.com/TecharoHQ/anubis
- Cloudflare can automatically set up a robots.txt file to block “AI crawlers” (and you can configure it to allow “AI search” for better SEO), roughly like the sketch below. E.g.: https://blog.cloudflare.com/control-content-use-for-ai-training/#putting-up-a-guardrail-with-cloudflares-managed-robots-txt
Cloudflare also has an “AI Labyrinth” option that serves a maze of fake data to AI bots that don’t respect the robots.txt file.
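For reference, the managed robots.txt amounts to entries like the following (an illustrative subset of AI crawler user agents, not Cloudflare’s exact list), and of course it only helps against crawlers that actually honour it:

```
# Illustrative subset of AI crawler user agents, not Cloudflare's exact list
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```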
If you’re relying on Cloudflare are you even self-hosting?
Yes, if it’s tunneled to your self-hosting setup. With CGNAT you have to use services like that if you want to self-host at all.
Pretty sure I’ve repeatedly heard about the crawlers completely ignoring robots.txt, so does Cloudflare really do that much?
Yes, Cloudflare blocks agents completely if they ignore its restrictions. The key is scale: Cloudflare has a bird’s-eye view of traffic patterns across millions of sites and can do statistical analysis to determine who is a bot.
I hate the necessity but it works
Like a lock on a door, it stops the vast majority but can’t do shit about the actual professional bad guys
-
Blocking them locally is one way, but if you’re already using Cloudflare there’s a nice way to do it UPSTREAM so it’s not eating any of your resources.
You can do geofencing/blocking and bot-blocking via Cloudflare:
https://corelab.tech/cloudflarept2/
That’s the kind of shit we pollute our air and water for… It properly seals and drives home the fuckedness of our future and planet.
I totally get you sending them to Nepenthes though.
Yeah, I had the same thing. All of a sudden the load on my server was super high and I thought there was a huge issue. So I looked at the logs and saw an AI crawler absolutely slamming my server. I blocked it so it only got 403 responses, but it kept on slamming. So I blocked the IPs it was coming from in iptables, and that helped a lot. My little server got about 10,000 times its normal traffic.
I sorta get they want to index stuff, but why absolutely slam my server to death? Fucking assholes.
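For anyone wanting to do the same, the blocking itself looked roughly like this; ipset keeps the rule list manageable, and the CIDRs below are placeholders for whatever your access logs actually show:

```sh
# Collect the offending ranges in an ipset (placeholder CIDRs).
ipset create ai_crawlers hash:net
ipset add ai_crawlers 192.0.2.0/24
ipset add ai_crawlers 198.51.100.0/24

# Drop anything from those ranges before it ever reaches the web server.
iptables -I INPUT -m set --match-set ai_crawlers src -j DROP
```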
My best guess is that they don’t just index things, but rather download straight from the internet when they need fresh training data. They can’t really cache the whole internet after all…
The sad thing is that they could cache the whole internet if there were a checksum protocol.
Now that I’m thinking about it, I actually hate the idea that there are several companies out there with graph databases of the entire internet.
Bingo. Modern datasets are a list of URLs with metadata rather than the files themselves. Every new team or individual wanting to work with the dataset becomes another DDoS participant.
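To make that concrete: a “dataset” entry is basically a pointer plus some metadata, so every consumer re-fetches the actual content from the origin servers. A toy sketch (the file name and column names are made up):

```python
# Toy illustration of why URL-list datasets multiply crawler load: the dataset
# ships only pointers, so each consumer re-downloads the pages themselves.
import csv
import urllib.request

with open("dataset_urls.csv", newline="") as f:           # hypothetical URL-list dataset
    for row in csv.DictReader(f):                         # e.g. columns: url, license, lang
        with urllib.request.urlopen(row["url"]) as resp:  # hits the origin server yet again
            page = resp.read()
        # ...feed `page` into this team's training pipeline
```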
It’s already hard enough for self-hosters and small online communities to deal with spam from fleshbags; now we’re being swarmed by clankers too. I have a little MediaWiki to document my deranged maladaptive daydreams (er, worldbuilding and conlanging projects), and the only traffic besides me is likely AI crawlers.
I hate this so much. It’s not enough that huge centralized platforms have the network effect on their side; they have to drown our quiet little corners of the web under a whelming flood of soulless automata.
My traffic was up 10 to 20 percent month over month, and then suddenly up 1000%. It has spiked hard, and they are all data harvesters.
I know I’m going to start blocking them, which is too bad. I put valuable technical information up, with no advertising, because I want to share it. And I don’t even really mind indexers, or even AI learning from it. But I cannot sustain this kind of bullshit traffic, so I will end up taking a heavy hand and blocking everything, and then no one will find it.
Anubis is supposed to filter out and block all those bots from accessing your webpage.
Iocaine, nepenthes, and/or madore’s book of infinity are intended to redirect them into a maze of randomly generated bullshit, which still consumes resources but is intended to poison the bots’ training data.
So pick your poison
Iocaine, nepenthes, and/or madore’s book of infinity are intended to redirect them into a maze of randomly generated bullshit
We’ve officially reached a place where cyberspace is beginning to look like communing with the arcane. Lol
And the AI are demon souls, specifically aspects of gluttony
I was blocking them but decided to shunt their traffic to Nepenthes instead. There’s usually 3-4 different bots thrashing around in there at any given time.
If you have the resources, I highly recommend it.
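If you have nginx in front, the shunting can be a simple user-agent map; the agent list and both upstream addresses below are placeholders for your own setup:

```nginx
# Send known AI crawler user agents to Nepenthes, everyone else to the real app.
map $http_user_agent $backend {
    default                                                  http://127.0.0.1:3000;  # real app (placeholder)
    ~*(GPTBot|ClaudeBot|CCBot|Bytespider|meta-externalagent) http://127.0.0.1:8893;  # Nepenthes (placeholder)
}

server {
    listen 80;
    server_name example.org;   # placeholder

    location / {
        proxy_pass $backend;   # IP literals, so no resolver directive is needed
    }
}
```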
Reference for lazy ones: https://zadzmo.org/code/nepenthes/
ANY SITE THIS SOFTWARE IS APPLIED TO WILL LIKELY DISAPPEAR FROM ALL SEARCH RESULTS.
Success?
This… is fucking amazing
Oh interesting! I’ve done something similar but didn’t put as much effort into it.
For me, I just made an unending webpage that creates a link to another page… which says bullshit. Then that page has another link with more bullshit… etc., etc. And it gets slower as time goes on.
I also set up a fail2ban jail banning IPs that reach a certain number of links down. It works really well: traffic is down 95% and it doesn’t affect any real human users. It’s great :)
I have a robots.txt that should tell them not to look at the sites. But if they don’t want to read it, I don’t want to be nice.
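A minimal sketch of the idea in Flask (not my actual code; the route, delays, and log path are placeholders). Nothing legitimate ever links into /maze, so real visitors never see it, while a fail2ban filter watching the log bans anything that digs too deep:

```python
import logging
import random
import string
import time

from flask import Flask, request

app = Flask(__name__)
logging.basicConfig(filename="tarpit.log", level=logging.INFO)  # placeholder log path

def junk(n_words: int = 80) -> str:
    """Harmless filler text for the crawlers to chew on."""
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 10)))
        for _ in range(n_words)
    )

@app.route("/maze/<int:depth>")
def maze(depth: int):
    # Each level responds a bit slower than the last (capped so workers aren't tied up forever).
    time.sleep(min(depth * 0.5, 30))
    # Log the client IP and depth; a fail2ban filter matches deep hits and bans the IP.
    logging.info("tarpit ip=%s depth=%d", request.remote_addr, depth)
    return (
        f"<html><body><p>{junk()}</p>"
        f'<a href="/maze/{depth + 1}">read more</a></body></html>'
    )

if __name__ == "__main__":
    app.run(port=8080)
```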
Bruh if you had a live stream of this I would subscribe to your only fans.
I… I don’t know how you’d even stream that? A log of pages loaded?
A log of pages loaded?
Keep going I’m almost there…
Requests per second getting higher, and higher, then they level out – but the server is just barely hanging in there, frantically serving as many requests as it possibly can, and then all at once they come crashing down into warm, gentle waves of relaxing human pings.
50% of my traffic is scrapers now. I really want to block them but I also want my content to be indexed and used for LLMs. At the moment there isn’t really an in-between way of doing that. :(
(This is with me knowing that they fuck up the power grids and the memory chip supply; I’m just hoping that gets better soon.)
Why do you want your stuff in the lie machines? 🤔
I work on a project that has a lot of older, less technical and international users who could use some extra help. We’re also not always found by the people that would benefit from our project. https://keeperfx.net/
So that they don’t become lie machines. Propaganda, lies, and fake news from all sorts of sources get spammed all across the internet. If AI picks that up, it will just spread misinformation, especially if all the trustworthy or useful sources block the crawlers.
And what was the reason for blocking them? What is unbearable?
They cause a huge amount of load, deteriorating the service for everyone else. I’m also guessing that the time ranges in the graph where there’s no data are when OP’s server crashed from the load and had to restart.
That kind of shit can easily trigger alerting and will look like a DDoS attack. I would be pissed, too, if I dropped everything to see why my server is going down and it’s not even proper criminals, but rather just some silicon valley cunts.
Thanks for taking the time to explain. I have multiple public-facing services and have never had any issues with load just because of some crawlers. That’s why I always wonder why people get so mad at them.
I’ve been providing hosting for a few FOSS services, at a relatively small scale, for around 7 years now, and I thought the same for most of that time. People were complaining about their servers being hit, but my traffic was alright and the server seemed bulky enough to have plenty of buffer.
Then, like a month or two ago,
the fire nation attacked, er, the bots came crawling. I had sudden traffic spikes of up to 1000x, memory was hogged, and the CPU could barely keep up. The worst was the git forge: public repos with bots just continuously hammering away at diffs between random commits, repeatedly building out history graphs for different branches and so on, all fairly intense operations.
After the server went to its knees multiple times over a couple of days, I had to block public access. Only with proof of work in front could I finally open it again without destroying service uptime. And even weeks later they were still trying to get at the project diffs whose links they had collected earlier; it was honestly crazy.
That’s very interesting, as if only certain types of content get crawled. May I ask what kind of software you used and whether you had a reverse proxy in front of it?
The code forge is gitea/forgejo, and the proxy in front used to be traefik. I tried fail2ban in front for a while as well but the issue was that everything appeared to come from different IPs.
The bots were also hitting my other public services pretty hard, but nowhere near as badly. I think it’s a combination of two things:
- most things I host publicly besides git are smaller or static pages, so they’re served quickly and don’t drain resources as much
- they try to hit every ‘exit node’ (i.e. link) off a page, and on repos with a couple hundred or more commits, with all the individual commits and diffs that are possible to hit, that’s a lot.
A small but interesting observation I made was that they also seemed to ‘focus’ on specific projects. So my guess would be that you get unlucky once by having a large-ish repo targeted for crawling, and then they just get stuck in there and get lost in the maze of possible pages. On the other hand, that may make targeted blocking of certain routes more feasible (roughly along the lines of the sketch below)…
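For example (hypothetical, and in nginx terms even though my own proxy was traefik): flag the known AI user agents and refuse them only on the expensive forge routes, so normal browsing keeps working:

```nginx
# Flag known AI crawler user agents (illustrative list, extend from your own logs).
map $http_user_agent $ai_bot {
    default                                                  0;
    ~*(GPTBot|ClaudeBot|CCBot|Bytespider|meta-externalagent) 1;
}

server {
    listen 80;
    server_name git.example.org;   # placeholder

    # Expensive Forgejo routes: commit diffs, compare views, blame, raw dumps.
    location ~* ^/[^/]+/[^/]+/(commit|compare|blame|raw)/ {
        if ($ai_bot) { return 403; }
        proxy_pass http://127.0.0.1:3000;   # Forgejo's default port, adjust as needed
    }

    location / {
        proxy_pass http://127.0.0.1:3000;
    }
}
```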
I think there’s a lot to be gained here by everybody pooling their knowledge, but on the other hand it’s also an annoying topic and most selfhosting (including mine) is afaik done as a hobby, so most peeps will slap an Anubis-like PoW in front and call it a day.
Those are some very good and helpful insights, thank you very much for sharing. I was also hosting forgejo and used traefik as reverse proxy. However, my forgejo was locked down, which is probably why I had no bot attack.
Some thoughts:
- fail2ban works very well for malicious requests, meaning things that get logged somewhere.
- CrowdSec has an AI Bot Blocklist, which they offer for free if you host a FOSS project.
- I am developing a tool which blocks CIDR ranges by country directly via ufw; a rough sketch of the idea follows below. Maybe blocking countries helps in such a case, but not everyone wants to block whole countries.
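A rough sketch of what that amounts to under the hood (the range file and its format are placeholders; for very large lists an ipset/nftables set scales better than individual ufw rules):

```sh
#!/bin/sh
# Deny every CIDR range listed in a per-country file (one CIDR per line,
# e.g. exported from a GeoIP database; the file name is a placeholder).
while read -r cidr; do
    ufw deny from "$cidr" to any
done < country-ranges.txt
```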














