270 terabytes of books stolen for AI training

๐—ก๐˜‚๐—ป ๐—ถ๐˜€๐˜ ๐—ผ๐—ณ๐—ณ๐—ฒ๐—ป๐—ฏ๐—ฎ๐—ฟ ๐—ฑ๐—ถ๐—ฒ ๐—ž๐—ฎ๐˜๐˜‡๐—ฒ ๐—ฎ๐˜‚๐˜€ ๐—ฑ๐—ฒ๐—บ ๐—ฆ๐—ฎ๐—ฐ๐—ธ!

As reported by several media outlets (e.g. https://lnkd.in/e-bvsSX8), Meta has now confirmed that it used the illegal pirate library LibGen to train its AI.

The stated reasoning is that books are of course the best source for AI training, since they are often better in language, content and subject matter than any short snippets from social media (logical). They are “well-written representations of human language”.

That is why no less than 270 terabytes of books (approx. 7.5 million books and 80 million scientific studies) were stolen; there is no other way to put it. From a copyright perspective, that is of course an absolute no-go.

Now, you can argue that Meta did not steal the data itself but “merely” used an illegally compiled collection for training. And you can argue that training AI does not constitute copyright infringement. The courts will decide all of this.

๐— ๐—ฎ๐—ป ๐—ธ๐—ฎ๐—ป๐—ป ๐—ฎ๐—ฏ๐—ฒ๐—ฟ ๐—ฒ๐—ฏ๐—ฒ๐—ป๐—ณ๐—ฎ๐—น๐—น๐˜€ ๐—ฒ๐—ถ๐—ป๐—บ๐—ฎ๐—น ๐—บ๐—ฒ๐—ต๐—ฟ ๐˜€๐—ฒ๐—ต๐—ฒ๐—ป: ๐˜„๐—ฎ๐˜€ ๐—ด๐—ฒ๐—บ๐—ฎ๐—ฐ๐—ต๐˜ ๐˜„๐—ฒ๐—ฟ๐—ฑ๐—ฒ๐—ป ๐—ธ๐—ฎ๐—ป๐—ป ๐˜„๐—ถ๐—ฟ๐—ฑ ๐—ด๐—ฒ๐—บ๐—ฎ๐—ฐ๐—ต๐˜ – ๐—ผ๐—ต๐—ป๐—ฒ ๐—ฅ๐˜‚ฬˆ๐—ฐ๐—ธ๐˜€๐—ถ๐—ฐ๐—ต๐˜ ๐—ฎ๐˜‚๐—ณ ๐—ฅ๐—ฒ๐—ฐ๐—ต๐˜, ๐—š๐—ฒ๐˜€๐—ฒ๐˜๐˜‡๐˜, ๐—จ๐—ฟ๐—ต๐—ฒ๐—ฏ๐—ฒ๐—ฟ. ๐—จ๐—ป๐—ฑ ๐—บ๐—ฎ๐—ป๐—ป ๐—ธ๐—ฎ๐—ป๐—ป ๐˜€๐—ถ๐—ฐ๐—ต ๐˜€๐—ถ๐—ฐ๐—ต๐—ฒ๐—ฟ ๐˜€๐—ฒ๐—ถ๐—ป, ๐—ฑ๐—ฎ๐˜€๐˜€ ๐— ๐—ฒ๐˜๐—ฎ ๐—ป๐—ถ๐—ฐ๐—ต๐˜ ๐—ฑ๐—ถ๐—ฒ ๐—ฒ๐—ถ๐—ป๐˜‡๐—ถ๐—ด๐—ฒ๐—ป ๐˜€๐—ถ๐—ป๐—ฑ, ๐—ฑ๐—ถ๐—ฒ ๐˜€๐—ผ ๐—ฎ๐—ฟ๐—ฏ๐—ฒ๐—ถ๐˜๐—ฒ๐—ป. ๐——๐—ถ๐—ฒ’๐—ต๐—ฎ๐˜’๐˜€ ๐—ต๐—ฎ๐—น๐˜ ๐—ท๐—ฒ๐˜๐˜‡๐˜ ๐—ฒ๐—ฟ๐˜„๐—ถ๐˜€๐—ฐ๐—ต๐˜ ๐˜‚๐—ป๐—ฑ ๐˜€๐—ถ๐—ป๐—ฑ ๐—ฎ๐˜‚๐—ณ๐—ด๐—ฒ๐—ณ๐—น๐—ผ๐—ด๐—ฒ๐—ป.

๐—ฆ๐—ฐ๐—ต๐—ผฬˆ๐—ป๐—ฒ ๐—ป๐—ฒ๐˜‚๐—ฒ ๐—ช๐—ฒ๐—น๐˜!

P.S.: At present, the users of LLMs are responsible for their output. That is, if you use Meta’s Llama model and the text it generates reproduces content from the illegally used training data, you are responsible for it, not Meta!

#informatikersindcool #kiistdaundbleibt