ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.
You are viewing a single thread.
View all comments 46 points
You can’t provide PII as input training data to an LLM and expect it to never output it at any point. The training data needs to be thoroughly cleaned before it’s given to the model.