I use a tool that downloads a website every day to check for new chapters of the series I follow, then builds an RSS feed from the contents. Would this be considered a harmful scraper?
The problem with AI scrapers and bots is their scale: thousands of requests to pages the origin server cannot keep up with, which slows the site down for everyone else.
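For contrast, a once-a-day checker like the one GP describes can be made nearly free for the server by identifying itself and using HTTP conditional requests, so an unchanged page costs only a 304 response. A minimal sketch, assuming a placeholder URL, state file, and user-agent string (the real tool's details aren't known):

```python
# Minimal sketch of a polite once-a-day checker (hypothetical URL, state
# file and user-agent; error handling and the feed generation itself are
# omitted). Conditional requests let the server answer 304 Not Modified
# with no body when nothing has changed.
import json
import pathlib

import requests

URL = "https://example.com/series"             # placeholder
STATE_FILE = pathlib.Path("fetch_state.json")  # remembers ETag / Last-Modified
HEADERS = {"User-Agent": "my-rss-builder/1.0 (+mailto:me@example.com)"}

def check_for_updates():
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    headers = dict(HEADERS)
    if state.get("etag"):
        headers["If-None-Match"] = state["etag"]
    if state.get("last_modified"):
        headers["If-Modified-Since"] = state["last_modified"]

    resp = requests.get(URL, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None  # unchanged since last run, nothing to rebuild

    resp.raise_for_status()
    STATE_FILE.write_text(json.dumps({
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
    }))
    return resp.text  # the RSS feed would be built from this
```

If the server supports ETag or Last-Modified, most daily runs never transfer the page body at all.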
Does your tool respect the site’s robots.txt?
Unfortunately, robots.txt cannot express rate limits, so it would be an overly blunt instrument for cases like the one GP describes. HTTP 429 would be a better fit.
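On the client side of that: a polite fetcher can treat 429 as a signal to back off, honouring Retry-After when the server sends one. A minimal sketch with illustrative URL, retry count, and fallback delay (Retry-After can also be an HTTP date, which this sketch doesn't parse):

```python
# Minimal sketch of a client that backs off when the server answers
# HTTP 429 Too Many Requests, honouring a numeric Retry-After header.
import time

import requests

def fetch_with_backoff(url, max_attempts=5):
    delay = 5.0  # fallback wait if the server gives no usable Retry-After
    for attempt in range(max_attempts):
        resp = requests.get(url, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        retry_after = resp.headers.get("Retry-After")
        wait = float(retry_after) if retry_after and retry_after.isdigit() else delay
        time.sleep(wait)
        delay *= 2  # exponential backoff if the server keeps saying 429
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts: {url}")
```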
Crawl-delay is just that: a simple directive to add to robots.txt that sets the maximum crawl frequency (i.e. the minimum delay between requests). It used to be widely followed by all but the worst crawlers …
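For anyone wanting to honour it: in robots.txt it's just a line like `Crawl-delay: 10` under a `User-agent:` group, and Python's urllib.robotparser can read it alongside the Disallow rules. A minimal sketch with a placeholder site, user-agent, and URL list:

```python
# Minimal sketch of honouring Crawl-delay (and Disallow) via the standard
# library; the site, user-agent, and URLs are placeholders.
import time
from urllib import robotparser

AGENT = "my-rss-builder"

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

delay = rp.crawl_delay(AGENT) or 1  # seconds; None means no Crawl-delay set

for url in ["https://example.com/series/1", "https://example.com/series/2"]:
    if not rp.can_fetch(AGENT, url):
        continue               # path is disallowed for this user-agent
    # ... fetch the page here ...
    time.sleep(delay)          # never exceed the advertised crawl rate
```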