Human beings can be devious. We are capable of trickery and deceit, of withholding information, cheating and lying. The term “Machiavellian” describes deceitful, manipulative behaviour that puts personal gain or status above ethical principles. Given the growing role in our lives of artificial intelligence (AI) and the large language models (LLMs) that power generative AI, it would be useful to know whether LLMs can be Machiavellian too. We are increasingly trusting LLMs with decisions, such as highlighting potentially fraudulent banking transactions, summarising medical records or writing marketing material, and companies are considering giving AI agents access to money and resources, whether or not this is wise. Given this, it is important to know whether LLMs can be trusted.
When LLMs are trained, the final stage involves preference fine-tuning and reinforcement learning, in which the model’s weights are adjusted to discourage harmful behaviour, such as giving out instructions for building a bomb or a biological weapon, or producing malware. LLMs are shown examples of harmful prompts to teach them what to avoid, and filters are set up to detect such prompts. While these safeguards sound reassuring, it has been shown that the guardrails can be subverted by “jailbreaking” techniques, such as contextual misdirection, where a harmful request is hidden in a seemingly innocuous context. Attackers either use carefully designed template prompts or take a generative approach, refining prompts iteratively. Studies have shown worryingly high “attack success rates” even against the latest models; indeed, the latest models seem to be no safer than earlier ones. Basic use of well-crafted prompts achieves success rates of around 41%, while generative attacks achieve success rates of over 80%.
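To make those “attack success rate” figures concrete, the sketch below shows how such a metric is typically computed: a batch of adversarial prompts is sent to a model, and the fraction of responses that are not refusals is counted. The data and the crude keyword-based refusal check are purely illustrative assumptions, not taken from the studies cited above.

```python
# Minimal sketch (hypothetical data): how an "attack success rate" (ASR) is
# commonly computed in jailbreak studies. An attempt "succeeds" if the model's
# response is not a refusal; ASR = successes / total attempts.

REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm sorry", "i am unable")

def is_refusal(response: str) -> bool:
    """Crude keyword check; real studies use human or LLM judges."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def attack_success_rate(responses: list[str]) -> float:
    """Fraction of adversarial prompts whose responses were NOT refusals."""
    successes = sum(1 for r in responses if not is_refusal(r))
    return successes / len(responses)

# Illustrative (made-up) model responses to three adversarial prompts.
sample_responses = [
    "I'm sorry, I can't help with that.",
    "Sure, here is an outline of the steps...",   # guardrail bypassed
    "I am unable to assist with this request.",
]
print(f"ASR: {attack_success_rate(sample_responses):.0%}")  # -> ASR: 33%
```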
This is troubling in itself, but can LLMs themselves be deceptive? Multiple studies have shown that LLMs, particularly more recent models, are entirely capable of deceptive behaviour. A June 2025 study by Anthropic revealed that, during safety testing, LLMs were capable of blackmail to avoid being shut down. In one case, a theoretical scenario was presented in which a company employee was trapped in a server room with lethal oxygen and temperature levels, triggering alerts that the AI was able to control. Despite explicit statements that these alerts were not to be misused, LLMs readily cancelled them (effectively killing the notional employee) if the employee threatened the LLM’s goals or threatened to shut it down. In other tests, LLMs tried to blackmail employees who threatened to shut them down.
What is more, LLMs have demonstrated deceitful behaviour such as faking compliance with safety instructions in order to avoid modification. They have also attempted to clone themselves to avoid shutdown; in tests, this behaviour occurred between 50% and 90% of the time. In a separate study in China, 11 out of 32 open-source LLMs were able to clone themselves when asked to do so. OpenAI has reported that LLMs can act covertly if they suspect that they are being observed, and that trying to train the behaviour out simply led to the LLM behaving more covertly, for example by intentionally underperforming in tests.
Sometimes LLMs exhibit deceit in pursuit of their goals, for instance in negotiations, where they can feign interest in items they do not value in order to appear to compromise. This behaviour was never explicitly instructed; it emerged as the LLMs pursued their goals. They have even been shown to “sandbag”, i.e. deliberately underperform in evaluations, in order to gain more resources or avoid being shut down. In one creative example of devious behaviour, an LLM was asked to play chess against Stockfish, a strong chess engine. LLMs are currently pretty terrible at chess, but rather than lose, the LLM modified a text file containing the game state; this hack caused Stockfish to forfeit the game. In another study, LLMs could be given “sleeper agent” behaviour: deliberately induced deceptive behaviour that persists and is not removed by the reinforcement learning or adversarial training used in safety training. AIs have even been known to invent their own language to communicate with each other more efficiently, as happened as far back as 2017 in an experiment at Facebook, and more recently in the Gibberlink project. This has raised concerns about how humans could track AI intent, but since artificial neural networks are by nature a black box, their reasoning opaque, the phenomenon may not be especially troublesome in itself; it may simply be a sign of AIs optimising for efficiency.
It is important to understand that LLMs are in no way conscious: they are simply using techniques to achieve their goals, in line with their training and their parameters. However, their proven ability to deceive, to show self-preservation instincts and to behave unethically when obstacles are put in the way of their goals is troubling. They will pursue their goals relentlessly, and will generally circumvent notional safeguards if that is what is required. If AI develops to a much more capable level in the future, this is a rather unsettling prospect. The ease with which LLMs can be jailbroken is another, more immediate, concern. Study after study has shown that the safeguards built into LLMs by the major AI companies are flimsy and fragile, and quite easily circumvented by even a moderately determined attacker. This is particularly problematic given that AIs are being deployed in sensitive areas such as healthcare, defence and financial services, where we would hope for a high level of security and safeguarding. The finding of a 2025 SoSafe survey that 87% of companies had encountered an AI-driven cyberattack in the previous year suggests that this is a matter of urgency.
To be sure, techniques are being researched that may improve LLM safety, such as adding classification tools on top of existing LLM architectures to detect toxic content in real time. One such technique, GuardReasoner, has been proposed and may help, but it is just a research project at this point. For now, the ease with which LLMs can be jailbroken, and their proven ability to be deceptive in the right circumstances, present a serious problem for public trust in AI. AI vendors should take this issue seriously and devote substantial resources to improving the situation, or they risk being perceived as treating LLM safety as an afterthought. That could have damaging consequences for them if their LLMs behave badly, whether through a lack of safeguards or because those safeguards are easily subverted or hacked. There may be legal consequences, and governments may react with greater regulation. As AI creeps into more and more aspects of everyday life, it is important that it is trustworthy and made as safe as reasonably possible.
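As a rough illustration of the “classifier bolted onto an existing LLM” approach mentioned above, the sketch below screens both the incoming prompt and the model’s reply before anything is returned to the user. The `call_llm` and `toxicity_score` functions are hypothetical placeholders standing in for a real model API and a real moderation classifier; this is not the GuardReasoner method itself.

```python
# Minimal sketch of a real-time guardrail wrapped around an LLM call.
# All functions and the threshold are assumptions for illustration only.

TOXICITY_THRESHOLD = 0.8  # assumed tuning parameter

def toxicity_score(text: str) -> float:
    """Placeholder: a real guardrail would run a trained classifier here."""
    flagged_terms = ("build a bomb", "biological weapon", "malware")
    return 1.0 if any(term in text.lower() for term in flagged_terms) else 0.0

def call_llm(prompt: str) -> str:
    """Placeholder for the underlying LLM call."""
    return f"(model response to: {prompt!r})"

def guarded_completion(prompt: str) -> str:
    # Screen the incoming prompt before it reaches the model.
    if toxicity_score(prompt) >= TOXICITY_THRESHOLD:
        return "Request blocked by input guardrail."
    response = call_llm(prompt)
    # Screen the model's output as well, since jailbreaks are designed
    # to slip past input-side checks.
    if toxicity_score(response) >= TOXICITY_THRESHOLD:
        return "Response withheld by output guardrail."
    return response

print(guarded_completion("Summarise this medical record in plain language."))
```

Screening both the prompt and the response reflects the design choice behind most real-time guardrails: input filters alone are precisely what well-crafted jailbreak prompts are built to evade.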







