ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

  • Kbin_space_program@kbin.social
    11 months ago

    That’s a bald-faced lie.

    And it can produce copyrighted works.
    E.g., I can ask it what a Mindflayer is and it gives a verbatim description from copyrighted material.

    I can ask Dall-E for “Angua Von Uberwald” and it gives a drawing of a blonde female werewolf. Oops, that’s a copyrighted character.

    • KingRandomGuy@lemmy.world
      11 months ago

      I think what they mean is that ML models generally don’t directly store their training data, but instead use it to form a compressed latent space. Some elements of the training data may be perfectly recoverable from that latent space, but most won’t be. As a result, it’s not very surprising that you can get a model to reproduce copyrighted material word for word.
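
      To make that concrete, here is a toy sketch, not how ChatGPT actually works. It assumes a tiny character-level model that only keeps next-character counts for 8-character contexts. The corpus, the names (`train`, `generate`, `FAMOUS`, `COMMON`, `RARE`) and the context length are all made up for illustration. A sentence seen many times comes back verbatim from a short prompt, while a sentence seen once gets overwritten by a more common one that shares its opening.

```python
from collections import Counter, defaultdict

ORDER = 8  # context length in characters (arbitrary toy choice)

FAMOUS = "the quick brown fox jumps over the lazy dog."
COMMON = "my cat sat on the mat and then fell asleep."
RARE = "my cat sat on the mat and then ran outside."

# Hypothetical training set: one sentence is heavily duplicated,
# one appears twice, one appears only once.
corpus = [FAMOUS] * 20 + [COMMON] * 2 + [RARE]


def train(texts, order=ORDER):
    """Count which character follows each `order`-character context."""
    counts = defaultdict(Counter)
    for text in texts:
        for i in range(len(text) - order):
            counts[text[i:i + order]][text[i + order]] += 1
    return counts


def generate(counts, prompt, max_len=200, order=ORDER):
    """Greedy decoding: always append the most frequent next character."""
    out = prompt
    while len(out) < max_len:
        context = out[-order:]
        if context not in counts:
            break  # the model has never seen this context
        out += counts[context].most_common(1)[0][0]
    return out


model = train(corpus)

# The duplicated sentence is reproduced word for word.
print(generate(model, "the quick"))

# The one-off sentence is not recoverable from its opening words:
# the more frequent continuation wins instead.
print(generate(model, "my cat sat"))
```

      The model never keeps the documents as such, only local statistics, yet the heavily repeated sentence is still fully recoverable. Real language models are far more complicated, so treat this only as an analogy for why frequently repeated text is the most likely to come back verbatim.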

    • ayaya@lemdro.id
      11 months ago

      I think you are confused. How does any of that make what I said a lie?

    • TimeSquirrel@kbin.social
      11 months ago

      I can do that too. It doesn’t mean I directly copied it from the source material. I can draw a crude picture of Mickey Mouse without having a reference in front of me. What’s the difference there?