Should we block OpenAI from scraping the server? (platform.openai.com)

posted 1 year ago

Right now, robots.txt on lemmy.ca is configured this way:

User-Agent: *
  Disallow: /login
  Disallow: /login_reset
  Disallow: /settings
  Disallow: /create_community
  Disallow: /create_post
  Disallow: /create_private_message
  Disallow: /inbox
  Disallow: /setup
  Disallow: /admin
  Disallow: /password_change
  Disallow: /search/
  Disallow: /modlog

Would it be a good idea, privacy-wise, to deny GPTBot from scraping content from the server?

User-agent: GPTBot
Disallow: /
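One way to sanity-check a rule like this before deploying it is Python's standard `urllib.robotparser`. The sketch below is only an illustration: it combines a shortened copy of the existing rules with the proposed GPTBot block, and `SomeOtherBot` and the `/post/1` path are made-up names for the example. Since robots.txt is advisory, this shows what a compliant crawler would conclude, not what the server enforces.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the existing rules plus the proposed GPTBot block.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /login
Disallow: /settings
Disallow: /admin
Disallow: /search/

User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "/post/1" and "SomeOtherBot" are hypothetical, used only for illustration.
gptbot_allowed = parser.can_fetch("GPTBot", "https://lemmy.ca/post/1")
other_allowed = parser.can_fetch("SomeOtherBot", "https://lemmy.ca/post/1")
other_login = parser.can_fetch("SomeOtherBot", "https://lemmy.ca/login")

print("GPTBot may fetch /post/1:", gptbot_allowed)
print("SomeOtherBot may fetch /post/1:", other_allowed)
print("SomeOtherBot may fetch /login:", other_login)
```

The GPTBot group only affects crawlers that identify themselves as GPTBot; every other agent still falls under the `*` group.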

Thanks!

ono@lemmy.ca

21 points

1 year ago

Yes, please.

We can’t stop LLM developers from scraping our conversations if they’re determined to do so, but we can at least make our wishes clear. If they respect our wishes, then great. If they don’t, then they’ll be unable to plead ignorance, and our signpost in the road (along with those from other instances) might influence legislation as it’s drafted in the coming years.

Shadow@lemmy.ca [M]

19 points

1 year ago

I’m on board for this, but I feel obliged to point out that it’s basically symbolic and won’t change much in practice. Since all the data is federated out, they have a plethora of places to harvest it from, or more likely they’ll just run their own ActivityPub harvester.

I’ve thrown a block into nginx so I don’t need to muck with robots.txt inside the lemmy-ui container.

# curl -H 'User-agent: GPTBot' https://lemmy.ca/ -i
HTTP/2 403

skankhunt42@lemmy.ca

3 points

1 year ago

I imagine they rate-limit their requests too, so I doubt you’ll notice any difference in resource usage. OVH is Unmetered*, so bandwidth isn’t really a concern either.

I don’t think it will hurt anything, but adding it is kind of pointless for the reasons you said.

nbailey@lemmy.ca

18 points

1 year ago

Yes. Ban them.

if ($http_user_agent = "GPTBot") {
  return 403;
}

jman269@lemmy.world

6 points

1 year ago

Probably want == instead, or else we will all be forbidden.

Shadow@lemmy.ca [M]

3 points

1 year ago

I would have thought so too, but == failed the syntax check:

2023/08/07 15:36:59 [emerg] 2315181#2315181: unexpected "==" in condition in /etc/nginx/sites-enabled/lemmy.ca.conf:50

You actually want ~ though, because GPTBot is only a substring of the user agent, not the full string.
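To illustrate the difference between the `=` and `~` operators, here is a rough Python analogy rather than nginx itself. It assumes the crawler sends a longer header that merely contains `GPTBot`; the `FULL_UA` string below is an illustrative value, not something taken from this thread, and both function names are made up for the sketch.

```python
import re

# Illustrative full User-Agent header; assumed for this example.
FULL_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)


def blocked_by_exact_match(user_agent: str) -> bool:
    """Analogue of nginx `=`: true only if the whole header equals "GPTBot"."""
    return user_agent == "GPTBot"


def blocked_by_regex_match(user_agent: str) -> bool:
    """Analogue of nginx `~`: case-sensitive regex match anywhere in the header."""
    return re.search(r"GPTBot", user_agent) is not None


for ua in ("GPTBot", FULL_UA, "Mozilla/5.0 (X11; Linux x86_64)"):
    print(f"{ua[:40]!r}: exact={blocked_by_exact_match(ua)}, "
          f"regex={blocked_by_regex_match(ua)}")
```

The exact comparison catches the bare `GPTBot` string used in the curl test, but would let the longer header through; the regex form catches both and still leaves an ordinary browser alone.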

nbailey@lemmy.ca

2 points

1 year ago

Strangely, nginx uses = for equality comparison, where most languages would use ==. It’s a very strange config format…

https://nginx.org/en/docs/http/ngx_http_rewrite_module.html#if

quesomodo@programming.dev

1 point

1 year ago

Look at me! I’m the GPTBot now!

Shadow@lemmy.ca [M]

4 points

1 year ago

Thanks for empowering my laziness =)

Lucidlethargy@sh.itjust.works

11 points

1 year ago

1000% yes. Please block them.

sndmn@lemmy.ca

8 points

1 year ago

Is this even possible without all federated instances also prohibiting them?

m-p{3}@lemmy.ca [OP] [M]

14 points

1 year ago

You take action where you can ;)

Lemmy.ca Support / Questions

!lemmy_ca_support@lemmy.ca

Support / Questions specific to lemmy.ca.

For support / questions related to the lemmy software itself, go to !lemmy_support@lemmy.ml
