I really feel like scrapers should have been outlawed or actioned at some point.
But they bring profits to tech billionaires. No action will be taken.
No, the reason no action will be taken is that Huawei is a Chinese company. I work for a major US company dealing with the same problem, and the problematic scrapers are usually from China. US companies like OpenAI rarely cause serious problems because they know we can sue them if they do. There’s nothing we can do legally about Chinese scrapers.
I thought Anthropic was also very abusive with their scraping?
Can you not just block China?
We do, somewhat. We haven’t gone as far as a blanket ban of Chinese CIDR ranges because there’s a lot of risks and bureaucracy associated with a move like that. But it probably makes sense for a small company like Codeberg, since they have higher risk tolerance and can move faster.
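For anyone curious, the filtering itself is the easy part; picking the ranges and owning the fallout is the hard part. A minimal sketch with Python’s stdlib ipaddress module; the networks below are documentation placeholders (TEST-NET-2/3), not real allocations, and a real blocklist would come from GeoIP or RIR data:

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks; swap in
# a real blocklist (GeoIP / RIR allocation data) for actual use.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def is_blocked(remote_addr: str) -> bool:
    """Return True if the client address falls in any blocked CIDR range."""
    try:
        ip = ipaddress.ip_address(remote_addr)
    except ValueError:
        return True  # unparsable address: fail closed
    return any(ip in net for net in BLOCKED_NETWORKS)

# e.g. is_blocked("203.0.113.7") -> True, is_blocked("192.0.2.1") -> False
```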
I use a tool that downloads a website once a day to check for new chapters of a series, then builds an RSS feed from the contents. Would this be considered a harmful scraper?
The problem with AI scrapers and bots is their scale: thousands of requests to pages the origin server cannot handle, which slows the site down for everyone. A single fetch per day is nothing like that.
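If you want to be even gentler on the site, a daily fetcher can use a conditional GET so the server can answer 304 Not Modified instead of resending the whole page. A rough sketch in Python; the URL, cache file, and user agent string are placeholders:

```python
import json
import os
import urllib.error
import urllib.request

URL = "https://example.com/series"  # placeholder target
CACHE = "etag_cache.json"           # validators remembered between runs

def fetch_if_changed(url: str) -> bytes | None:
    """Fetch url with the ETag/Last-Modified validators from the last run.

    Returns the body if it changed, or None if the server answered
    304 Not Modified (no body transferred at all).
    """
    cache = json.load(open(CACHE)) if os.path.exists(CACHE) else {}
    req = urllib.request.Request(url, headers={"User-Agent": "rss-watcher/1.0"})
    if cache.get("etag"):
        req.add_header("If-None-Match", cache["etag"])
    if cache.get("last_modified"):
        req.add_header("If-Modified-Since", cache["last_modified"])
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
            json.dump({"etag": resp.headers.get("ETag"),
                       "last_modified": resp.headers.get("Last-Modified")},
                      open(CACHE, "w"))
            return body
    except urllib.error.HTTPError as err:
        if err.code == 304:  # unchanged since the last fetch
            return None
        raise
```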
Does your tool respect the site’s robots.txt?
Unfortunately, robots.txt cannot express rate limits, so it would be an overly blunt instrument for cases like the one GP describes. HTTP 429 (Too Many Requests) would be a better fit.
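On the client side, a 429 response usually carries a Retry-After header saying how long to back off. A rough sketch of a client that honors it, assuming the delta-seconds form of the header (it can also be an HTTP date, which this skips):

```python
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url: str, max_retries: int = 3) -> bytes:
    """GET url, sleeping for the server's Retry-After on each 429."""
    for attempt in range(max_retries + 1):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_retries:
                raise
            # Retry-After may also be an HTTP date; this sketch only
            # handles the delta-seconds form, defaulting to 60s.
            delay = int(err.headers.get("Retry-After", "60"))
            time.sleep(delay)
    raise RuntimeError("unreachable")
```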
Crawl-delay is just that, a simple directive to add to robots.txt to set the maximum crawl frequency. It used to be widely followed by all but the worst crawlers …
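For what it’s worth, Python’s standard library can still read the directive. A small sketch with urllib.robotparser; the site URL and user agent are placeholders, and crawl_delay() returns None when no directive applies:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()  # fetch and parse the live robots.txt

agent = "my-crawler"  # hypothetical user agent
if rp.can_fetch(agent, "https://example.com/some/page"):
    # Honor Crawl-delay if present; fall back to a polite default.
    delay = rp.crawl_delay(agent) or 10
    print(f"allowed, waiting {delay}s between requests")
else:
    print("disallowed by robots.txt")
```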