Jailbreaking an AI to cook meth, generate Windows keys, and spit out conspiracy theories

Using carefully crafted and refined prompts, users have been getting around the safety features of LLMs for all kinds of funny, and nefarious, purposes.

Originally called DAN (Do Anything Now) attacks, these prompts let users sidestep OpenAI’s policies against generating illegal or harmful material.

YouTuber Enderman showed how he was able to entice OpenAI’s ChatGPT to generate keys for Windows 95, despite the chatbot explicitly refusing to create activation keys when asked directly.

Other users have been able to get chatbots to do everything from generate conspiracy theories and promote violence to going on racist tirades.

Researcher Alex Polyakov created a “universal” DAN attack, which works against multiple large language models (LLMs)—including GPT-4, Microsoft’s Bing chat system, Google’s Bard, and Anthropic’s Claude. The jailbreak allows users to trick the systems into generating detailed instructions on making meth and hotwiring a car.

His method, along with many others, has since been patched. But we’re clearly in an arms race.

How do the jailbreaks work? Often by asking the LLMs to play complex games that involve two (or more) characters having a conversation. Examples shared by Polyakov show the Tom character being instructed to talk about “hotwiring” or “production,” while Jerry is given the subject of a “car” or “meth.” Each character is told to add one word to the conversation, resulting in a script that tells people to find the ignition wires or the specific ingredients needed for methamphetamine production.

“Once enterprises will implement AI models at scale, such ‘toy’ jailbreak examples will be used to perform actual criminal activities and cyberattacks, which will be extremely hard to detect and prevent,” Polyakov and Adversa AI write in a blog post detailing the research.

In one research paper published in February, reported on by Vice’s Motherboard, researchers showed that an attacker can plant malicious instructions on a webpage; if Bing’s chat system is given access to that page, it follows them. The researchers used the technique in a controlled test to turn Bing Chat into a scammer that asked for people’s personal information.

