ChatGPT's creator OpenAI is being sued for secretly scraping 300 billion words from the internet, including books, articles, websites, posts, and personal information that was obtained without consent.

(post is archived)

[–] • 0 pt

It's a pretty far stretch to claim that turning text into vector embeddings and model weights makes the model a derived work. If you feed The Lord of the Rings into the training process, you are not getting The Lord of the Rings out the other side. You're getting a probability distribution over which words, or fragments of words, are most likely to come next after a given context.

It's probably possible to get an LLM to spit out a verbatim copyrighted work, but only if you already know exactly what tokens and words to give it so that it produces the correct next words of that work. Even then there's zero guarantee it will. There's a high likelihood that at some point it would instead spit out whichever word is most probable in general, which is not necessarily the next word in the actual copyrighted work (LotR, in this example).
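To make that concrete, here's a rough sketch using a toy bigram counter. To be clear, this is a stand-in I'm assuming for illustration and not how an actual LLM is built; the sample sentence and the function names are made up. It only shows the two points above: what you store is next-word probabilities, and picking the most probable next word can drift away from the source text.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it and how often."""
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, word):
    """Turn the raw follower counts for `word` into probabilities."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

def greedy_generate(counts, start, length):
    """Repeatedly append the single most frequent follower of the last word."""
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # no known continuation for this word
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

# Made-up sample text (hypothetical, not a quote from any book).
text = "the ring is mine and the ring is mine no more but the road is long"
counts = train_bigram(text)

print(next_word_probs(counts, "the"))
print(greedy_generate(counts, "road", 3))
```

In the sample text "road is" is followed by "long", but the generator continues with "mine" because that is the more frequent follower of "is" overall. A real model conditions on far more context than one word, so it diverges less easily, but the stored object is still a set of probabilities rather than the text itself.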

So it is derived data, but I don't think courts would uphold treating it as a derived work. That term has a specific legal meaning, which a database of word probabilities probably doesn't meet.

[–] • 1 pt

Not really. The data is literally the book contents. It is by definition a derived work.
