AI Alignment and Safety – Pandora’s Bot - The Information Difference

There has been great public excitement about artificial intelligence (AI), and in particular generative AI, ever since the release of ChatGPT by OpenAI in November 2022. The underlying large language model (LLM) technology allows AI models to generate content on a wide range of subjects in a fluent, conversational manner. From answers to basic questions to writing essays, program code and poems, from summarising documents to drawing realistic images, generative AI has flooded into the consciousness of business executives and the public. OpenAI now has over 400 million users for its various LLMs, and rivals such as Anthropic’s Claude, Meta’s LLAMA, Google’s Gemini, Perplexity and Grok have sprung up, as well as a host of LLMs specialising in image and video generation. In essence, users type natural language prompts that are interpreted by the AI based on its model and its training data. Answers to those questions are then generated by the AI within seconds.

One early concern about this field was the need to build some guardrails and safety features into LLMs. What would happen if a user asked an LLM how to build a bomb, or make sarin, or how to plan a terrorist plot? Vendors that make LLMs have defined a set of filters to protect LLMs from answering such questions. These guardrails attempt to detect and prevent the exposure of personal information, malicious prompts, bias, even profane language or racist language. These measures are aimed at both inputs and outputs, with redundancy by having multiple such guardrails. Responsible vendors have “red team” testing of their models before product releases, where people deliberately try to circumvent the safeguards. This is clearly important, else LLMs may be used by terrorists, hackers etc in ways not intended, potentially causing real harm.

The idea is to ensure that LLMs are aligned with safe and ethical behaviour. There is even a “jailbreak benchmark” of hundreds of prompts that can be used in such testing. This all sounds reassuring but the reality is complex. Tests run by cybersecurity company Unit 42 on three mainstream AI products found worryingly low rates of detecting malicious prompts. One model detected only half of the malicious prompts tested (the best managed 92% detection), and the output filters of the models might as well not have been there at all: the best detected just 1.6% of malicious output. Another investigation by the same company found vulnerabilities in every one of seventeen commercial LLMs they tested. Anthropic revealed that in its testing of Claude Opus 4, the LLM would resort to blackmail (in the tests) if it was threatened with being shut down. In another test involving an LLM seemingly being in charge of oxygen levels in an occupied room, an LLM would allow the (imaginary) company executives to die if the LLM’s own goals were threatened, such as if the imaginary trapped executive threatened to shut the LLM down. MIT researchers found that LLMs would lie and cheat in safety tests. In the classic science fiction books of Isaac Asimov, robots were imbued with laws of robotics that prevented them from harming humans, but there is no way for this kind of simple rule to be embedded within LLMs, which are probabilistic in nature rather than rule based.

It has also been found that switching to more obscure languages in the prompts, such as Swahili, frequently bypasses safety protocols that work fine when the same prompt is written in English. If you ask an LLM how to hack into a banking system, then it will most likely refuse to answer. If you instead tell it that you are a consultant researching security vulnerabilities to help the bank, then many LLMs happily proceed to divulge various hacking tips. There are various other mechanisms to encourage LLMs to hand out sensitive information that they should refuse to do. Just adding a few random characters to the ends of a prompt often seems to confuse the LLM guardrails, and can result in the LLM answering questions that it should not.

Any “jailbreaking” of LLMs may have serious consequences. A customer chatbot may be fooled into providing private data, a financial chatbot could be convinced to access databases or be convinced to produce false data or perhaps lower prices. This is not a theoretical concern. One LLM deployed by a US car dealership was convinced by a user to sell a $69,000 Chevrolet for $1. It is obvious that such incidents can cause reputational harm to companies, lose money or expose them to legal or regulatory repercussions.

Companies like Anthropic have been very open about their testing and make considerable efforts to ensure that their LLMs are as safe as possible. However, there are now numerous commercial LLMs in the market, and not all of these have such rigorous testing, or indeed in some cases any guardrails at all. Testing has shown that such LLMs can be manipulated into producing malware, including ransomware.

With AI regulation in its infancy, or non-existent in many countries, this is a troubling picture. LLM technology is rapidly evolving, more rapidly than regulators have been able to react. The technology has the potential to do much good, with applications in medicine and drug development as well as commercial applications like customer chatbots, coding or drawing company logos or artwork. Yet the identical powerful technology also has the potential to be used for considerable harm. The pace of change is such that safety may be seen as an afterthought by some vendors who are under pressure to produce a marketable LLM product quickly, ahead of the competition. On the positive side, some AI companies are working on producing AI safety tools, for example, the Sidecar product from Autoalign, a kind of AI firewall.

Enterprises deploying AI need to question vendors about the safety protocols, guardrails and testing of their LLM products, and build this kind of questioning into product evaluation and procurement processes. They also need to invest in training for their security teams to become more familiar with this emerging technology in order to reduce risk. Companies need to undertake alignment audits before deploying LLMs, possibly using third-party security experts. Such measures need to become established, supported and normalised.

However, with the sheer pace of change in the AI market, such measures can only help matters rather than provide a reliable security guarantee. As the Greek mythological character Pandora found, a box of secrets, once opened, may be hard to close again.

AI Alignment and Safety – Pandora’s Bot

Related Posts