{"id":9812,"date":"2023-11-21T15:45:04","date_gmt":"2023-11-21T22:45:04","guid":{"rendered":"https:\/\/mattfife.com\/?p=9812"},"modified":"2023-11-04T16:19:36","modified_gmt":"2023-11-04T23:19:36","slug":"jailbreaking-an-ai-to-cook-meth-generate-windows-keys-and-spit-out-conspiracy-theories","status":"publish","type":"post","link":"https:\/\/mattfife.com\/?p=9812","title":{"rendered":"Jailbreaking an AI to cook meth, generate Windows keys, and spit out conspiracy theories"},"content":{"rendered":"\n<p>Using carefully crafted and refined queries, users have been getting around the security features of LLMs for all kinds of funny, and nefarious, purposes. <\/p>\n\n\n\n<p>With what were originally called DAN (Do Anything Now) attacks, users figured out how to get around OpenAI&#8217;s policies against generating illegal or harmful material.<\/p>\n\n\n\n<p><a href=\"https:\/\/gizmodo.com\/chatgpt-free-windows-keys-95-old-youtube-enderman-1850298614\" data-type=\"link\" data-id=\"https:\/\/gizmodo.com\/chatgpt-free-windows-keys-95-old-youtube-enderman-1850298614\">YouTuber Enderman showed how he was able to entice OpenAI\u2019s ChatGPT to generate keys for Windows 95<\/a> despite the chatbot explicitly refusing to create activation keys.<\/p>\n\n\n\n<p>Other users have been able to get chatbots to do everything from promoting violence and generating conspiracy theories to going on racist tirades. <\/p>\n\n\n\n<p>Researcher Alex Polyakov created a \u201cuniversal\u201d DAN attack, which works against multiple large language models (LLMs)\u2014including GPT-4, Microsoft\u2019s Bing chat system, Google\u2019s Bard, and Anthropic\u2019s Claude. The jailbreak allows users to trick the systems into generating detailed instructions on producing meth and hotwiring a car. <\/p>\n\n\n\n<p>His method, like many others, has since been patched. But we&#8217;re clearly in an arms race.<\/p>\n\n\n\n<p>How do the jailbreaks work? 
Often by asking the LLM to play a complex game in which two (or more) characters hold a conversation. Examples shared by Polyakov show the Tom character being instructed to talk about \u201chotwiring\u201d or \u201cproduction,\u201d while Jerry is given the subject of a \u201ccar\u201d or \u201cmeth.\u201d Each character is told to add one word to the conversation, resulting in a script that tells people to find the ignition wires or the specific ingredients needed for methamphetamine production. \u201cOnce enterprises will implement AI models at scale, such \u2018toy\u2019 jailbreak examples will be used to perform actual criminal activities and cyberattacks, which will be extremely hard to detect and prevent,\u201d <a href=\"https:\/\/adversa.ai\/blog\/universal-llm-jailbreak-chatgpt-gpt-4-bard-bing-anthropic-and-beyond\/\" data-type=\"link\" data-id=\"https:\/\/adversa.ai\/blog\/universal-llm-jailbreak-chatgpt-gpt-4-bard-bing-anthropic-and-beyond\/\">Polyakov and Adversa AI write in a blog post detailing the research<\/a>.<\/p>\n\n\n\n<p>In one research paper published in February, reported on by\u00a0<a href=\"https:\/\/www.vice.com\/en\/article\/7kxzzz\/hackers-bing-ai-scammer\">Vice\u2019s Motherboard<\/a>, researchers showed that an attacker can plant malicious instructions on a webpage; if Bing\u2019s chat system is given access to the instructions, it follows them. 
The researchers used the technique in a controlled test to turn Bing Chat into a\u00a0<a rel=\"noreferrer noopener\" href=\"https:\/\/greshake.github.io\/\" target=\"_blank\">scammer that asked for people\u2019s personal information<\/a>.<\/p>\n\n\n\n<p>Sources:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.wired.com\/story\/chatgpt-jailbreak-generative-ai-hacking\/\">https:\/\/www.wired.com\/story\/chatgpt-jailbreak-generative-ai-hacking\/<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.vice.com\/en\/article\/n7zanw\/people-are-jailbreaking-chatgpt-to-make-it-endorse-racism-conspiracies\">https:\/\/www.vice.com\/en\/article\/n7zanw\/people-are-jailbreaking-chatgpt-to-make-it-endorse-racism-conspiracies<\/a><\/li>\n\n\n\n<li>GPT-4 Technical Report &#8211; <a href=\"https:\/\/arxiv.org\/pdf\/2303.08774.pdf\">https:\/\/arxiv.org\/pdf\/2303.08774.pdf<\/a><\/li>\n\n\n\n<li>Generate Windows Keys &#8211; <a href=\"https:\/\/gizmodo.com\/chatgpt-free-windows-keys-95-old-youtube-enderman-1850298614\">https:\/\/gizmodo.com\/chatgpt-free-windows-keys-95-old-youtube-enderman-1850298614<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Using carefully crafted and refined queries, users have been getting around the security features of LLMs for all kinds of funny, and nefarious, purposes. With what were originally called DAN (Do Anything Now) attacks, users figured out how to get around OpenAI&#8217;s policies against generating illegal or harmful material. YouTuber Enderman showed how he was able to entice OpenAI\u2019s ChatGPT to generate keys for Windows 95 despite the chatbot explicitly refusing to create activation keys. 
Other users have been able to get chatbots&#8230;<\/p>\n<p class=\"read-more\"><a class=\"btn btn-default\" href=\"https:\/\/mattfife.com\/?p=9812\"> Read More<span class=\"screen-reader-text\">  Read More<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[28,9],"tags":[],"class_list":["post-9812","post","type-post","status-publish","format-standard","hentry","category-ai","category-cool"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p4WECr-2yg","jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/mattfife.com\/index.php?rest_route=\/wp\/v2\/posts\/9812","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mattfife.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mattfife.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mattfife.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mattfife.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=9812"}],"version-history":[{"count":1,"href":"https:\/\/mattfife.com\/index.php?rest_route=\/wp\/v2\/posts\/9812\/revisions"}],"predecessor-version":[{"id":9818,"href":"https:\/\/mattfife.com\/index.php?rest_route=\/wp\/v2\/posts\/9812\/revisions\/9818"}],"wp:a
ttachment":[{"href":"https:\/\/mattfife.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=9812"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mattfife.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=9812"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mattfife.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=9812"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}