The TikTok spider has been a real offender for me. For one site I host, it burned through 3 TB of data over 2 months requesting the same 500 images over and over. It was ignoring robots.txt too; I ended up having to block their user agent.
Are you sure the caching headers your server is sending for those images are correct? If your server is telling the client to not cache the images, it’ll hit the URL again every time.
If the image at a particular URL will never change (for example, if your build system inserts a hash into the file name), you can use a far-future Expires header to tell clients to cache it indefinitely (e.g. expires max in Nginx).
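Roughly what I mean, as an untested sketch (the location pattern is just an example; only do this for assets whose URLs change whenever their content does):

    # Sketch: far-future caching for fingerprinted image files.
    # "expires max" makes Nginx send a far-future Expires header and a
    # matching long-lived Cache-Control max-age, so well-behaved clients
    # should not re-request the same URL.
    location ~* \.(png|jpe?g|gif|webp)$ {
        expires max;
    }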
Perfect use of the meme
time to fill sites' code with randomly generated garbage text that humans will never see but crawlers will gobble up?
I don’t think it’s a bad idea, but it’s largely dependent on the crawler. I can’t speak for AI-based crawlers, but typical scraping either targets specific elements on a page or grabs the whole page and parses it for what you’re looking for. In both cases, your content has already been scraped and added to the pile. Overall, I have to wonder how long “poisoning the well” is going to keep working. Take me with a grain of salt, though; I work on detecting bots for a living.
I work on detecting bots for a living.
You should just tell people you’re a blade runner.
All this robots.txt stuff “perplex” me.
She should get a library card.