
©2025 Poal.co

212
https://vk.com/video594771890_456255306

(post is archived)

[–] 0 pt

It's a pretty far stretch to claim that making vector embeddings turns the model into a derivative work. If you feed The Lord of the Rings into a vector embedding algorithm, you are not getting The Lord of the Rings out the other side. You're getting numeric vectors, and the language model built on them gives you a probability distribution over which word, or fragment of a word, is most likely to come next after the preceding text.
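To make the "probability distribution over the next word" idea concrete, here is a toy sketch. It is a plain bigram counter, which is far simpler than an LLM and is not an embedding; the `sample` sentence is made up for illustration (it is not a quote from the book), and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def next_word_distribution(text):
    """Map each word to a probability distribution over the word that follows it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    # Count how often each word is followed by each other word.
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    # Normalize the raw counts into probabilities.
    return {
        word: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for word, followers in counts.items()
    }

# Made-up sample text, standing in for a real book.
sample = "the ring went south and the ring was lost and the ring was found"
model = next_word_distribution(sample)
print(model["ring"])
```

In this sample, "ring" is followed by "was" two times out of three and by "went" once, so what the table stores is those ratios, not the sentence itself.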

It's probably possible to get an LLM to spit out a verbatim copyrighted work, but only if you already know exactly what tokens and words to give it to produce the correct next words that make up the copyrighted work. Even then there's zero guarantee it would. There's a high likelihood that it would instead spit out the word that is most probable at that point but is not the next word in the actual copyrighted work (LotR, in this example).
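A rough way to see the "most probable word is not always the real next word" point, again with a toy bigram counter rather than an actual LLM. The sample text is made up and `greedy_continue` is a hypothetical helper; real models condition on much more context, so treat this as an illustration of the argument, not evidence about any particular model.

```python
from collections import Counter, defaultdict

def greedy_continue(text, prompt, length):
    """Extend `prompt` by always picking the most frequent follower seen in `text`."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    output = [prompt]
    for _ in range(length):
        followers = counts.get(output[-1])
        if not followers:
            break  # the last word never had a follower in the text
        output.append(followers.most_common(1)[0][0])
    return output

# Made-up sample text, standing in for a real book.
source_text = "the ring went south and the ring was lost and the ring was found"
generated = greedy_continue(source_text, "the", 2)
original = source_text.split()[:3]
print(generated, original)
```

Here the text begins "the ring went", but the generator continues "the ring was", because "was" follows "ring" more often than "went" does.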

So it is derived data, but I don't think courts would uphold treating it as a derivative work. That term has a specific legal meaning, which a database of word probabilities probably doesn't meet.

[–] 1 pt (edited)

Not really. The data is literally the book contents. It is by definition a derived work.