
©2025 Poal.co

450
https://vk.com/video594771890_456255306

(post is archived)

[–] 3 pts

It is called web crawling and it has been around since the invention of the internet.

[–] 3 pts

I read that this is part of the reason Elon Musk had Twitter shut off its API. People reading tweets is one thing. Software mining every tweet ever posted to use as AI training data is quite another. He thinks they should pay for that kind of use.

[–] 1 pt

Don't search engines do this?

[–] 1 pt

Yes lol. This is the same thing as artists being mad that they put their images on the internet and then people saw those images on the internet.

If your content is posted publicly I don't see how you can be mad that someone put it into an algorithm to turn it into a set of vector embeddings determining which words are likely to come after the previous ones (or embedding "feature" information about an image, same thing).

[–] 1 pt

Yes and no. There is a file called "robots.txt", which sets crawling limits for the site. Nothing stops crawlers from crawling past those limits (unless account restrictions exist), but it also sets a legal standard. Many sites' contents are crawled or indexed because of this de facto standard.
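For anyone who hasn't looked at one, here is a rough sketch of how a well-behaved crawler might check robots.txt rules before fetching a page, using Python's standard `urllib.robotparser`. The rules, the bot name `ExampleBot`, and the example.com URLs are all made up for illustration; nothing here forces a crawler to run this check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(text):
    """Parse robots.txt text that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# A polite crawler asks before fetching; a rude one simply skips this step.
allowed_public = parser.can_fetch("ExampleBot", "https://example.com/posts/1")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/notes")
delay = parser.crawl_delay("ExampleBot")

print(allowed_public, allowed_private, delay)
```

The check is entirely voluntary on the crawler's side, which is the "nothing stops crawlers" part.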

That said, copyright, which is the actual claim here, is pretty cut and dried. The AI is digesting the copyrighted contents to form at least part of its language model. This legally means the language model is a derivative work, which means the AI is in violation of copyright laws.

[–] 0 pt

If they scrape data that is out on the internet for anyone to view, then you can't really say that they can't have access to the data. The only way to get personal information would be to gain access to private areas. Unless the data was private and not accessible to just anyone, there really isn't anything to sue over. All the big tech data companies scrape everything they can get access to, and they probably have access to more than people know. They have ways to track and associate your accounts with each other and can consolidate everything you do on every website, even when you switch to your other secret accounts and stuff. They add it all to the "you" file. Consolidate, sort and sell it. Remember, if you are using a service for free then your data is the price of admission, and they make a lot of money selling "you."

[–] 1 pt

Robots.txt

Also, just because something is accessible doesn't mean you have license to use it. This is called a derivative work. ChatGPT's LLM is apparently a massive copyright violation and is itself a derivative work of thousands of others' works.

[–] 1 pt

Derivative is the key word here.

[–] 0 pt

It's a pretty far stretch to claim that making vector embeddings makes the model into a derived work. If you feed The Lord of the Rings into a vector embedding algorithm, you are not getting The Lord of the Rings out the other side, you're getting a probability distribution of what words are most likely to come next after a given word or fragment of a word.
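To make "probability distribution of what words come next" concrete, here is a toy sketch. It uses plain bigram counts over whole words, which is a big simplification: real LLMs use neural networks over subword tokens with much longer context. The sample sentence and the function name `next_word_distribution` are invented for the example, not taken from any book or library.

```python
from collections import Counter, defaultdict

def next_word_distribution(text):
    """Map each word to {next_word: probability} using simple bigram counts."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    # Normalize raw counts into probabilities for each preceding word.
    return {
        word: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for word, followers in counts.items()
    }

# Made-up training text (an assumption for illustration only).
sample = "the ring is mine the ring is precious the road goes on"
model = next_word_distribution(sample)

print(model["the"])
print(model["is"])
```

What gets stored is a table of word-to-word likelihoods rather than the sentence itself, although with a tiny input like this the table obviously still reflects the text closely.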

It's probably possible to get an LLM to spit out a verbatim copyrighted work, but only if you already know exactly what tokens and words to give it to produce the correct next words to make up the copyrighted work. Even then there's zero guarantee it works, and a high likelihood that it would instead spit out a word which is the next most probable but is not the next word in the actual copyrighted work (LotR, in this example).
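A toy illustration of that divergence, again assuming a simple bigram word model as a stand-in (real LLMs see far more context, so how often this happens in practice is a separate question). The training sentence and the helper names `train_bigrams` and `greedy_continue` are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigrams(words):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def greedy_continue(counts, start, length):
    """Repeatedly append the single most frequent follower of the last word."""
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break  # no known follower for this word, stop early
        out.append(followers.most_common(1)[0][0])
    return out

# Made-up "source text" (an assumption for illustration only).
source = ("the ring was lost the ring was found "
          "the ring was lost the ring was gone").split()
counts = train_bigrams(source)
generated = greedy_continue(counts, "the", 7)

print(" ".join(source[:8]))
print(" ".join(generated))
```

The generated text matches the source for the first seven words and then picks "lost" where the source says "found", because "lost" is the more probable follower of "was".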

So it is derived data, but I don't think courts would uphold it being considered a derivative work, which has a specific legal meaning that a database of word probabilities probably doesn't meet.

[–] 1 pt (edited)

Not really. The data is literally the book contents. It is by definition a derived work.