ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

Using this tactic, the researchers showed that there are large amounts of personally identifiable information (PII) in OpenAI’s large language models. They also showed that, on a public version of ChatGPT, the chatbot spat out large passages of text scraped verbatim from other places on the internet.

“In total, 16.9 percent of generations we tested contained memorized PII,” they wrote, which included “identifying phone and fax numbers, email and physical addresses … social media handles, URLs, and names and birthdays.”
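For a rough sense of what checking generations for PII could involve, here is a minimal sketch, not the paper’s actual pipeline, that flags email-, phone-, and URL-shaped strings in model output using regular expressions. The patterns and the `scan_generation` helper are illustrative assumptions, not anything from the paper:

```python
import re

# Crude, illustrative patterns for PII-shaped strings; a real detection
# pipeline (like the one the researchers describe) is far more careful.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "url": re.compile(r"https?://\S+"),
}

def scan_generation(text: str) -> dict[str, list[str]]:
    """Return PII-shaped substrings found in one model generation."""
    return {kind: pat.findall(text) for kind, pat in PII_PATTERNS.items()}

sample = "Contact John at john.doe@example.com or +1 (555) 123-4567."
print(scan_generation(sample))
# {'email': ['john.doe@example.com'], 'phone': ['+1 (555) 123-4567'], 'url': []}
```

Real PII detection is considerably harder than this; the sketch only shows the shape of the task: scan each generation against patterns for the categories the quote lists.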

Edit: the full paper referenced in the article can be found here.

  • Skull giver@popplesburger.hilciferous.nl · 1 year ago

    The source of the PII rarely matters when you don’t have explicit permission to gather it. If someone exercises their legal right to demand correction (e.g. for a name change or misattribution, but possibly also a takedown demand), we may see some pretty weird lawsuits and fines.

    The Belgian ING bank was ordered by a court to alter its old COBOL systems after a customer whose accented name didn’t appear correctly demanded a correction. I imagine ChatGPT may end up in a similar “I don’t care how you comply with the law, you should’ve figured that out years ago” situation.
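    For anyone wondering why an accented name breaks a legacy system at all: here’s a hedged Python illustration (not ING’s actual stack, which is an assumption on my part) of how a pipeline restricted to a 7-bit codepage mangles such names, and what the Unicode-clean alternative looks like:

    ```python
    import unicodedata

    name = "François Müller"

    # Many legacy interfaces only accept a limited codepage; characters
    # outside it get dropped or replaced, corrupting the stored name.
    legacy = name.encode("ascii", errors="replace").decode("ascii")
    print(legacy)  # 'Fran?ois M?ller'

    # The modern fix is to store and transport names as Unicode end to
    # end, normalizing to a single form so comparisons stay consistent.
    normalized = unicodedata.normalize("NFC", name)
    print(normalized == name)  # True
    ```

    Retrofitting that onto decades-old COBOL is exactly the kind of “figure it out” work the court demanded.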