ChatGPT's creator OpenAI is being sued for secretly scraping 300 billion words from the internet, including books, articles, websites, posts, and personal information that was obtained without consent.

[–] • 3 pts

It is called web crawling and it has been around since the invention of the internet.

[–] • 3 pts

I read that this is part of the reason Elon Musk had Twitter shut off its API. People reading tweets is one thing. Software mining every tweet ever posted to use as AI training data is quite another. He thinks they should pay for that kind of use.

[–] • 1 pt

Don't search engines do this?

[–] • 1 pt

Yes lol. This is the same thing as artists being mad that they put their images on the internet and then people saw those images on the internet.

If your content is posted publicly, I don't see how you can be mad that someone fed it into an algorithm that turns it into a set of vector embeddings capturing which words are likely to come after the previous ones (or embedding "feature" information about an image, same thing).

[–] • 1 pt

Yes and no. There is a file called "robots.txt", which sets crawling limits for the site. Nothing stops crawlers from crawling past those limits (unless account restrictions exist), but it also sets a legal standard. Many sites' contents are crawled or indexed because of this de facto standard.
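For anyone who hasn't seen how a well-behaved crawler uses robots.txt, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt rules, the `ExampleBot` user agent, the example.com URLs, and the `may_fetch` helper are all made up for illustration. Note that the check is voluntary: the crawler asks the question itself, and nothing in the file enforces the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules allow user_agent to crawl url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/posts/1",
                "https://example.com/private/data.html"):
        print(url, may_fetch(ROBOTS_TXT, "ExampleBot", url))
```

A crawler that skips this check can still request the disallowed page, which is the "nothing stops crawlers" part.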

That said, copyright, which is the actual claim here, is pretty cut and dried. The AI is digesting the copyrighted content to form at least part of its language model. This legally means the language model is a derivative work, which means the AI is in violation of copyright laws.

[–] • 0 pt

lmao

[–] • 0 pt

If they scrape data that is out on the internet for anyone to view, then you can't really say that they can't have access to the data. The only way to get personal information would be to gain access to private areas. Unless the data was private and not publicly accessible, there really isn't anything to sue over. All the big tech data companies scrape everything they can get access to, and they probably have access to more than people know. They have ways to track and associate your accounts with each other and can consolidate everything you do on every website, even when you switch to your other secret accounts and stuff. They add it all to the "you" file, then consolidate, sort, and sell it. Remember, if you are using a service for free, then your data is the price of admission, and they make a lot of money selling "you."

[–] • 1 pt

Robots.txt

Also, just because something is accessible doesn't mean you have license to use it. This is called a derivative work. ChatGPT's LLM is apparently a massive copyright violation and is itself a derivative work of thousands of others' works.

[–] • 1 pt

Derivative is the key word here.

[–] • 0 pt

It's a pretty far stretch to claim that making vector embeddings makes the model into a derived work. If you feed The Lord of the Rings into a vector embedding algorithm, you are not getting The Lord of the Rings out the other side, you're getting a probability distribution of what words are most likely to come next after a given word or fragment of a word.

It's probably possible to get an LLM to spit out a verbatim copyrighted work, but only if you already know exactly what tokens and words to give it to produce the correct next words to make up the copyrighted work. Even then there's zero guarantee it would reproduce the text, and a high likelihood that it would instead spit out a word which is the next most probable but is not the next word in the actual copyrighted work (LotR, in this example).
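To make the "probability distribution" point concrete, here is a toy sketch. It is a plain bigram word counter, not vector embeddings and nowhere near how a real LLM is trained, and the training sentence and function names are invented for the example. It only shows the basic idea: the model stores counts of which word follows which, and picking the most probable next word each time can drift away from the source text.

```python
from collections import Counter, defaultdict

# Toy training text, invented for this example.
TEXT = "the cat sat on the mat and the cat ate the fish"


def train_bigrams(text: str) -> dict:
    """Count, for each word, how often each other word follows it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts: dict, word: str) -> dict:
    """Turn the follower counts for one word into probabilities."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}


def greedy_continue(counts: dict, start: str, length: int) -> str:
    """Repeatedly append the most frequent follower of the last word."""
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last word was never followed by anything
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


if __name__ == "__main__":
    model = train_bigrams(TEXT)
    print(next_word_probs(model, "the"))
    print(greedy_continue(model, "the", 5))
```

In this toy case "the" is followed by "cat" half the time, so a greedy continuation keeps steering back toward "cat" instead of reproducing the original sentence word for word.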

So it is derived data, but I don't think courts would uphold it being considered a derivative work, which has a specific legal meaning that a database of word probabilities probably doesn't meet.

[–] • 1 pt

Not really. The data is literally the book contents. It is by definition a derived work.

(post is archived)