ChatGPT's creator OpenAI is being sued for secretly scraping 300 billion words from the internet, including books, articles, websites, posts, and personal information that was obtained without consent.

[–] • 0 pt

If they scrape data that is out on the internet for anyone to view, then you can't really say that they can't have access to the data. The only way to get personal information would be to gain access to private areas. Unless the data was private and not accessible to just anyone, there really isn't anything to sue over. All the big tech data companies scrape everything they can get access to, and they probably have access to more than people know. They have ways to track and associate your accounts with each other and can consolidate everything you do on every website, even when you switch to your other secret accounts and stuff. They add it all to the "you" file. Consolidate, sort, and sell it. Remember, if you are using a service for free, then your data is the price of admission, and they make a lot of money selling "you."

[–] • 1 pt

Robots.txt
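For anyone unfamiliar with it, robots.txt is a plain-text file a site publishes to tell crawlers which paths they are asked not to fetch. It is a voluntary convention, not an access control or a license. A minimal sketch of how a well-behaved scraper might check it, using Python's standard `urllib.robotparser`; the crawler name `ExampleScraper`, the rules, and the `example.com` URL are all made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: one named crawler is asked to stay out
# entirely, while every other user agent is allowed everywhere.
ROBOTS_TXT = """\
User-agent: ExampleScraper
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
# parse() takes the file's lines directly, so no network request is made here.
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/posts/1"  # placeholder URL
scraper_allowed = parser.can_fetch("ExampleScraper", url)
browser_allowed = parser.can_fetch("SomeOtherAgent", url)

print("ExampleScraper allowed:", scraper_allowed)
print("SomeOtherAgent allowed:", browser_allowed)
```

Nothing stops a crawler from skipping this check, which is why the file signals the site owner's wishes rather than enforcing them.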

Also, just because something is accessible doesn't mean you have a license to use it. The term for this is a derivative work. ChatGPT's LLM is apparently a massive copyright violation and is itself a derivative work of thousands of others' works.

[–] • 1 pt

Derivative is the key word here.

[–] • 0 pt

It's a pretty far stretch to claim that making vector embeddings makes the model into a derived work. If you feed The Lord of the Rings into a vector embedding algorithm, you are not getting The Lord of the Rings out the other side, you're getting a probability distribution of what words are most likely to come next after a given word or fragment of a word.

It's probably possible to get an LLM to spit out a verbatim copyrighted work, but only if you already know exactly what tokens and words to give it to produce the correct next words to make up the copyrighted work. Even then there's zero guarantee that it would, and a high likelihood that it would instead spit out a word which is the next most probable, but is not the next word in the actual copyrighted work (LotR, in this example).
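To make the "next most probable word" point concrete, here is a toy sketch. It is only a bigram counter over a made-up sentence, nowhere near a real LLM (which uses a neural network over tokens, not a lookup table of word counts), and the function names are mine. It does show how always picking the most likely next word can drift away from the text the counts came from:

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def next_word_distribution(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    following = counts.get(word, Counter())
    total = sum(following.values())
    return {w: c / total for w, c in following.items()}

def generate(counts, start, length):
    """Greedy generation: always append the single most frequent next word."""
    out = [start]
    for _ in range(length):
        following = counts.get(out[-1])
        if not following:
            break  # the last word was never followed by anything in training
        out.append(following.most_common(1)[0][0])
    return " ".join(out)

# Made-up training text (not from any real book).
corpus = "the cat sat on the mat and the cat ate the fish and the cat slept"
counts = train_bigram(corpus)

dist = next_word_distribution(counts, "the")
print(dist)  # "cat" follows "the" 3 times out of 5; "mat" and "fish" once each

# The training text says "on the mat", but greedy generation picks "cat"
# after "the" because it is the more frequent continuation overall.
sample = generate(counts, "on", 2)
print(sample)
```

The stored data is a table of counts derived from the text, and what comes out is whatever is most probable given those counts, which in this case is "on the cat" rather than the original "on the mat".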

So it is derived data, but I don't think courts would uphold it being considered a derived work, which has a specific legal meaning that a database of word probabilities probably doesn't meet.

[–] • 1 pt

Not really. The data is literally the book contents. It is by definition a derived work.
