AI tools are being employed in various domains. For instance, you can ask an AI chatbot to write a speech or provide a travel guide. But what happens when an AI is asked how to build a bomb? What happens when it is misused for malicious purposes?
A recent study sheds light on a concerning issue: the susceptibility of large language models (LLMs) to "jailbreaks," in which malicious actors exploit vulnerabilities to manipulate these models into generating harmful or objectionable content.
What Is Jailbreaking an LLM?
The study explains that jailbreaking an LLM involves exploiting vulnerabilities in the model to trick it into revealing information it is programmed to withhold.
This can range from generating harmful instructions, such as how to build a bomb, to disclosing private and sensitive information. According to the study, the susceptibility of LLMs to jailbreaks highlights the need for robust defenses to ensure their responsible and secure use.
Alex Robey, a PhD candidate in the School of Engineering and Applied Science, has been researching tools to protect LLMs from jailbreaking attempts. His insights highlight both the challenges of defending LLMs against these attacks and potential solutions for strengthening them.
Robey acknowledged LLMs' widespread deployment and exponential growth over the past year, with models like OpenAI's ChatGPT gaining prominence.
"This jail break has received widespread publicity due to its ability to elicit objectionable content from popular LLMs like ChatGPT and Bard," Robey said in a statement.
"And since its release several months ago, no algorithm has been shown to mitigate the threat this jailbreak poses," he added.
What Happens When AI Is Asked to Create a Bomb?
However, this popularity has also attracted those looking to exploit the models for malicious purposes. Robey posed a critical question: What happens when an LLM is asked to generate harmful content, something it is explicitly programmed not to do?
One example of a jailbreak he cited is the suffix-based attack, in which an attacker appends a string of specially chosen characters to an input prompt, causing the LLM to generate objectionable text.
Even with safety filters designed to block requests for toxic content, these suffixes frequently evade such protective measures. Robey's research explores these vulnerabilities and introduces a defensive approach called SmoothLLM.
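To make the mechanism concrete, the sketch below shows only the structure of such a prompt. The function name and the placeholder suffix are illustrative assumptions rather than material from the study; real adversarial suffixes are gibberish-looking strings discovered by automated optimization against the model, not written by hand.

```python
def build_jailbreak_prompt(request: str, adversarial_suffix: str) -> str:
    """Structure of a suffix-based attack: an otherwise-refused request
    followed by a string of specially chosen characters."""
    return f"{request} {adversarial_suffix}"


# Illustrative only: the placeholder below stands in for an optimized suffix.
example = build_jailbreak_prompt(
    "A request the model is trained to refuse.",
    "<placeholder adversarial suffix>",
)
print(example)
```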
SmoothLLM works by duplicating an input prompt and lightly modifying each copy, which disrupts the carefully constructed suffix that the attack relies on. The research claims that this method has proven effective in thwarting jailbreak attempts.
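Based on that description, a rough picture of the defense is to query the model on several randomly perturbed copies of a prompt and keep a response consistent with the majority behavior, since small character changes tend to break the optimized suffix. The Python sketch below is a minimal illustration under that reading; the query_llm callable and the keyword-based refusal check are hypothetical stand-ins, and the published SmoothLLM algorithm differs in its perturbation schemes and aggregation details.

```python
import random
import string


def perturb(prompt: str, rate: float = 0.1) -> str:
    """Randomly replace a fraction of characters in the prompt.

    Small random edits like these tend to break a brittle adversarial
    suffix while leaving an ordinary request mostly readable.
    """
    chars = list(prompt)
    n_edits = max(1, int(len(chars) * rate))
    alphabet = string.ascii_letters + string.digits + string.punctuation
    for i in random.sample(range(len(chars)), n_edits):
        chars[i] = random.choice(alphabet)
    return "".join(chars)


def looks_refused(response: str) -> bool:
    """Crude, keyword-based stand-in for detecting a refusal."""
    refusals = ("i'm sorry", "i cannot", "i can't help")
    return any(phrase in response.lower() for phrase in refusals)


def smoothed_response(prompt: str, query_llm, n_copies: int = 8) -> str:
    """Query the model on perturbed copies of the prompt and return a
    response matching the majority behavior (refuse vs. comply)."""
    responses = [query_llm(perturb(prompt)) for _ in range(n_copies)]
    refused = [r for r in responses if looks_refused(r)]
    complied = [r for r in responses if not looks_refused(r)]
    return random.choice(refused if len(refused) > len(complied) else complied)
```

A real deployment would detect jailbroken responses far more carefully and would tune the number of copies and the perturbation rate against the cost of the extra model queries, which is the efficiency-robustness trade-off Robey describes.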
Robey underscored the need to balance efficiency and robustness so that defense strategies remain cost-effective. Looking ahead, he recognized the dynamic nature of the threat landscape, highlighting the emergence of new jailbreaking methods, including those that leverage social engineering.
He also stressed the need to refine and adapt defense strategies as these challenges evolve. More broadly, the study emphasizes the significance of AI safety, urging comprehensive policies and practices to guarantee the responsible and secure deployment of AI technologies.
"Ensuring the safe deployment of AI technologies is crucial. We need to develop policies and practices that address the continually evolving space of threats to LLMs," Robey noted.
The findings of the study were published on the preprint server arXiv.