Label AI-generated information!

So now the time has come… In a new paper (https://lnkd.in/eADdp5r6), some colleagues from Stanford and Rice University show what has been bothering me for some time:

If you train AI on AI-generated training data, the results get worse (i.e. they become more and more similar).

The paper demonstrates this effect impressively using image generation as the example, but in general it applies to any type of generative AI! So it also applies, for example, if you train ChatGPT with data generated by ChatGPT… At some point, the AI cannibalizes itself.
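
To get a feel for the mechanism, here is a minimal toy sketch in Python. It is not the experiment from the paper: the "model" is just a normal distribution that is repeatedly refit to samples drawn from its own previous version. The function name and the parameters (generations, sample_size, seed) are my own assumptions for illustration.

```python
import random
import statistics


def simulate_self_training(generations=200, sample_size=10, seed=0):
    """Toy self-consuming loop (illustrative assumption, not the paper's setup).

    Each generation draws samples from the current model and then refits
    the model (mean and standard deviation) to those samples only.
    Returns the standard deviation after each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0           # generation 0: the "real" data distribution
    spread = [sigma]
    for _ in range(generations):
        # The model generates its own training data ...
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ... and the next model is fitted to nothing but that data.
        mu = statistics.mean(samples)
        sigma = statistics.pstdev(samples)
        spread.append(sigma)
    return spread


if __name__ == "__main__":
    spread = simulate_self_training()
    print("spread at start:", spread[0])
    print("spread at end:  ", spread[-1])
```

In this toy setup the spread tends to shrink from generation to generation, so the generated samples become more and more similar. That mirrors the effect described above on a very small scale.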

We are currently living in an age in which the ratio of generated to real data is still very favorable (AI is only just beginning). However, this is changing rapidly. And that means that we will soon have nothing left with which we can train the AIs in a meaningful way (almost all available data has already been used to train the large language models anyway).

Because even if new information is constantly being produced: if we cannot distinguish between what is human-generated and what is AI-generated, then we will no longer have any data that qualifies for training, i.e. at some point our AIs will stop improving.

๐—š๐—ฒ๐—ฟ๐—ฎ๐—ฑ๐—ฒ ๐—ณ๐˜‚ฬˆ๐—ฟ ๐—จ๐—ป๐˜๐—ฒ๐—ฟ๐—ป๐—ฒ๐—ต๐—บ๐—ฒ๐—ป ๐—ถ๐˜€๐˜ ๐—ฑ๐—ฎ๐˜€ ๐—ฑ๐—ฒ๐—ฟ ๐—ฏ๐—น๐—ฎ๐—ป๐—ธ๐—ฒ ๐—›๐—ผ๐—ฟ๐—ฟ๐—ผ๐—ฟ!

As soon as your company’s employees start using generative AI in an uncontrolled/unguided manner, you are digging your own potential data grave, because at some point you will no longer be able to rely on your data. And your data is your capital…

That is why the top priority right now must be to label AI data!

In public, this is difficult to impossible, but fortunately it is feasible within a company.
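
Within a company, labeling can be as simple as storing the origin next to every record and filtering on it before training. A minimal sketch in Python, assuming a hypothetical record schema (the names Origin, Record, and human_only are illustrative and not taken from any existing tool):

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Origin(Enum):
    """Hypothetical provenance labels; adapt them to your own data policy."""
    HUMAN = "human"
    AI_ASSISTED = "ai_assisted"
    AI_GENERATED = "ai_generated"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Record:
    text: str
    origin: Origin = Origin.UNKNOWN   # unlabeled data is never assumed to be human
    tool: Optional[str] = None        # which AI tool was involved, if any


def human_only(records: List[Record]) -> List[Record]:
    """Keep only records that are explicitly labeled as human-generated."""
    return [r for r in records if r.origin is Origin.HUMAN]


if __name__ == "__main__":
    records = [
        Record("Meeting notes written by hand", Origin.HUMAN),
        Record("Draft produced by a chatbot", Origin.AI_GENERATED, tool="some-chatbot"),
        Record("Legacy document without a label"),
    ]
    for record in human_only(records):
        print(record.text)
```

The design choice that matters here is the default: anything without a label counts as unknown and is kept out of the training set, so missing labels cannot silently pollute your data.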

If you don’t know how to do this, please contact me!



P.S.: Of course, there are also applications where synthetic data is very helpful. However, these are isolated exceptions and not the general rule.

P.P.S.: here are some of my previous posts on this topic:
https://lnkd.in/eY8rC8C7
https://lnkd.in/e3bcJ_92
https://lnkd.in/efAex_M2
https://lnkd.in/eHZMm6KZ

P.P.P.S.: I generated the cover picture with Midjourney. Because our generative AI is still working 🙂