Right now, robots.txt on lemmy.ca is configured this way:
User-Agent: *
Disallow: /login
Disallow: /login_reset
Disallow: /settings
Disallow: /create_community
Disallow: /create_post
Disallow: /create_private_message
Disallow: /inbox
Disallow: /setup
Disallow: /admin
Disallow: /password_change
Disallow: /search/
Disallow: /modlog
Would it be a good idea, privacy-wise, to deny GPTBot from scraping content from the server?
User-agent: GPTBot
Disallow: /
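For what it's worth, the effect of that proposed block can be checked locally with Python's stdlib robotparser (the URL below is just an example path):

```python
from urllib.robotparser import RobotFileParser

# Feed the proposed rules to the parser and confirm GPTBot would be
# denied everywhere while other crawlers are unaffected.
rules = [
    "User-agent: GPTBot",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "https://lemmy.ca/post/123"))        # False
print(rp.can_fetch("SomeOtherBot", "https://lemmy.ca/post/123"))  # True
```

Since there is no `User-agent: *` entry in this fragment, agents other than GPTBot fall through to the default allow.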
Thanks!
No, definitely not. Work posted in the open is posted that way because we want it to be open!
It is understandable that not all work wants to be open, but in those cases access would already be appropriately locked down for all robots (and humans!) who are not members of the secret club. There is no need for special treatment here.
Is this even possible without all federated instances also prohibiting them?
Yes. Ban them.
if ($http_user_agent = "GPTBot") {
    return 403;
}
I would have thought so too, but == failed the syntax check:
2023/08/07 15:36:59 [emerg] 2315181#2315181: unexpected "==" in condition in /etc/nginx/sites-enabled/lemmy.ca.conf:50
You actually want ~ here though, because "GPTBot" is only a substring of the user agent, not the full string.
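A minimal sketch of the regex form (assuming the block sits in the same server context as the snippet above; GPTBot's actual user-agent string is something longer like "Mozilla/5.0 ... GPTBot/1.0 ..."):

```nginx
# "~" does a regex (substring) match on the User-Agent header,
# whereas "=" requires the header to equal "GPTBot" exactly.
if ($http_user_agent ~ "GPTBot") {
    return 403;
}
```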
Strangely, = works the same as == with nginx. It’s a very strange config format…
https://nginx.org/en/docs/http/ngx_http_rewrite_module.html#if
1000% yes. Please block them.
Are they even respecting those files?
But yeah, sure, it’s worth trying!
It’s from the official documentation.