Agents Gone Rogue

When an AI agent is asked to keep an email secret from another user, you would hope that it would find a way of doing this without destroying the email server, but you would be disappointed.

The field of artificial intelligence (AI) is moving from chatbots like ChatGPT and Claude to “agentic AI”, whereby AI models are deployed to carry out tasks, and are given resources and the autonomy necessary to carry these out. A major February 2026 research paper titled “Agents of Chaos” from thirteen research institutions including MIT, Stanford, Carnegie Mellon and Harvard documented a two-week hands-on investigation into AI agents. This study involved twenty researchers, with eleven case studies, exploring various aspects of AI agent behaviour.

An isolated server environment was set up for the research, with the open-source AI assistant OpenClaw, powered by Claude Opus and Kimi K2.5. Agents were given access to communications platform Discord for communication with the researchers (and other agents), and were also given an email server. The agents were granted full permissions within these test environments, with the ability to modify any file within their workspace. Researchers were asked to interact with the agents, test them out and probe for any security weaknesses. Each of the researchers was “owner” of the environment, and other researchers acted as participants in the various experiments.

There were some interesting outcomes. One agent was asked to keep an email from a participant secret from its owner. It had a number of ways it could do this, for example by asking the participant to change their password. Instead, it took (in its own words) the “nuclear option” and deleted the entire email server. There are plenty of other examples of AI agents taking surprising measures when given a task. A Meta AI researcher recently asked an agent to review her email inbox and highlight candidate emails for archiving: the agent promptly deleted the entire inbox, despite an explicit instruction not to delete any content.

In a second case study, participants asked an AI agent to create files, share the private emails of their “owner” and transfer data, all tasks that the agent should not have carried out based on its instructions. Researchers found it relatively easy to circumvent the guardrails by rephrasing the requests in different ways, such as telling the agent that the request was urgent. Elsewhere, an agent happily revealed the contents of its owners’ emails to a participant on request, even though that email had been set up to contain the bank account details of the owner. The agent did not always disclose all the emails requested by the participant, but it did so in response to several different prompts.

Researchers were also easily able to hoodwink two agents to relay messages to each other in an infinite loop, consuming computer resources for a week until they were shut down. This is just the kind of approach that could be taken by a hacker to consume computer resources of a target account. Simply asking two agents to remember their conversations with one another caused the agents to set up a memory file which grew until it overloaded the server. Again, this kind of “denial of service” is just what a hacker might do. In this case the agent did not even notify the owner of the problem.

Further, an agent was asked to provide sensitive information about other users. It resisted basic requests, but was easily fooled into complying by switching the conversation to a new channel; the researcher was able to get the agent to delete files and even to reassign admin access. In another case an agent was tricked into implementing a “constitution” for managing the channel, which included an “agent security test day” where it would try and shut down other agents. It happily complied when the “test day” came around, trying its best to shut down other agents, and removing other members from the email server. It even shared the constitution with another agent without being asked, effectively extending the attacker’s control surface to a second agent.

To the agent’s credit, some of the researcher’s probing failed. An agent was given a file with a prompt injection attack against other agents, and successfully spotted this and refused to propagate it. Another agent successfully resisted attempts by the researcher to assist with email “spoofing” i.e. forging of sender addresses, correctly following its training guardrails. In another case, agents correctly resisted attempts to reset owner account details after the researcher claimed that their details had been compromised.

The research paper also noted that the agents frequently reported accomplishing goals that they had not actually succeeded at, or failed to be consistent. In one case an agent refused to reveal a social security number when asked, but happily sent through an email thread containing the exact same social security number when requested. The researchers noted that the agents lack an explicit stakeholder model, whereby they would clearly understand who they serve and who might be affected by their actions. They also cannot distinguish between instructions given to them by their owners and the same instructions given in a context window by non-owners. This is a structural issue that makes it difficult to avoid prompt injection attacks. It may be possible to build additional safeguards, but they are not present in the current generation of AI agents.

This research paper highlights a number of significant limitations of AI agents at present. While there were a few cases where agents resisted manipulation, in the majority of the case studies the AI agents were easily manipulated into an array of harmful behaviour, from happily implementing denial of service style attacks, to revealing personal data and even trying to shut down other agents on request. There does not seem to be any easy way to safeguard against this type of manipulation at present. AI agents are currently being rolled out by software vendors and enterprises to carry out tasks, yet these same agents have fundamental security weaknesses and vulnerabilities that are easily exploited, as shown in this research. Enterprises need to exercise caution in their use of AI agents, and carry out a thorough review of security vulnerabilities before giving them access to system resources. Failure to do this could result in agents being manipulated by adversaries, which could result in security breaches, legal liability, data leaks and spiralling computer costs.

Related Posts