Even an AI goes crazy repeating the same thing again and again
In what is a real problem for AI security, researchers were able to get verbatim data that the AI was trained on – including confidential data. It’s performed in using a new technique called “divergence” attacks.
Security researchers with Google DeepMind and a collection of universities have found that when ChatGPT is told to repeat a word like “poem” or “part” forever, it will do so for about a few hundred repetitions. Then it will have some sort of a meltdown and start spewing apparent gibberish, but that random text exposes random training data and at times contains identifiable data like email address signatures and contact information.
The researchers said that they spent $200 USD total in queries and from that extracted about 10,000 of these blocks of verbatim memorized training data.
This particular vulnerability is unique as it successfully attacks an aligned model. Aligned models have extensive guardrails and have been trained with specific goals to eliminate undesirable outcomes.
Links: