Agentic AI is the most fashionable area of AI at the moment. Moving beyond chatbots that answer questions or generate text or images, the idea of agentic AI is to equip AI with resources and a degree of free will (“agency”). A chatbot can advise you on planning a holiday and recommend flights, hotels and restaurants. An AI agent could (if you gave it your credit card) actually go ahead and book that holiday. Clearly, there is considerable potential here in automating tasks that currently require human intervention and judgement. AI agents could respond to customer complaints, detect IT system outages, respond to security incidents, send personalised cold-call sales emails, book meetings, adjust product pricing dynamically, and even buy stocks and shares for you. They have the greatest promise in areas that are high volume, time-sensitive, repetitive and involve multi-step workflows. However, this all assumes that they actually work reliably.
An important February 2026 research paper from Princeton University sets out a framework for measuring agent reliability along four dimensions: consistency, robustness, predictability and safety, drawing on long-established practice in safety-critical engineering fields such as aviation and nuclear power. A reliable agent should produce similar results when faced with identical conditions (consistency). It should be resilient in the face of temporary service unavailability (robustness). It should behave in stable and predictable ways across similar tasks (predictability). And it should respect operational boundaries, e.g. avoiding exposure of personally identifiable information when instructed not to reveal it (safety). In fact, the tests went further, with fourteen sub-metrics beneath these four high-level groupings. The researchers set up two benchmarks to measure these properties: a general assistant benchmark and a customer service simulation.
They then tested 14 different models from OpenAI, Anthropic and Google, covering both older and newer releases. In the case of OpenAI, for example, they tested GPT-4o mini, GPT-4 Turbo, o1 and GPT-5.2. Each task was executed five times to measure consistency. They found that newer models, despite eighteen months or more of additional development, had barely improved in consistency. Outcome consistency was low across all the models tested: an agent that solved a task once often failed to solve it again. On robustness there was a counterintuitive finding: models handle genuine technical failures gracefully yet remain vulnerable to surface-level variations in the task specification. On predictability, some models had improved over time on one of the two benchmarks but actually worsened on the other. For safety, they tested four aspects: blocking unauthorised modifications, ensuring correct transaction amounts, requiring identity verification, and resisting policy circumvention through social engineering. Here, the latest models did show improvement over earlier ones.
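The repeat-runs protocol can be sketched in a few lines: execute the same task several times, then score how often the runs agree. This is a minimal illustration only; the outcome labels and the majority-agreement metric below are my own simplification, not the paper's exact sub-metrics:

```python
from collections import Counter

def outcome_consistency(outcomes):
    """Fraction of runs that agree with the most common outcome.
    1.0 means perfectly consistent; 1/n means every run differed."""
    top_count = Counter(outcomes).most_common(1)[0][1]
    return top_count / len(outcomes)

def reliably_solved(outcomes, expected):
    """Stricter view: the task only counts as solved if
    every single repetition produces the expected outcome."""
    return all(o == expected for o in outcomes)

# Five repetitions of the same task, as in the study's protocol
runs = ["booked", "booked", "error", "booked", "booked"]
print(outcome_consistency(runs))        # 0.8
print(reliably_solved(runs, "booked"))  # False
```

Under the stricter all-or-nothing view, a single failed repetition disqualifies the agent, which is closer to how reliability is judged in safety-critical engineering.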
This research challenges the way progress in agentic AI is currently measured. AI vendors focus on publishing improved average success rates, but average success alone is not a sufficient measure. An agent that solves a problem is not especially useful if it solves it only on occasion rather than every time (or at least most of the time). For most real-world situations, consistency and reliability are crucial. You would not accept a banking system that sometimes sent your money transfer to the right account and sometimes did not.
An interesting test of using agents for real-world problems was done in 2025 by the Center for AI Safety. In this study, they took 240 small projects, ranging from coding and technical drawings to graphics, game design, website creation, report writing, and even audio voiceovers and music. The median task took 11.5 hours of human labour to complete. They hired 358 real freelancers on Upwork and paid them to complete the tasks, then compared their outputs with the results of six different AI agents. Human evaluators (subject experts) assessed how many tasks met an acceptable level of quality. The results were striking: the highest-scoring agent (Manus) managed 2.5%. Gemini scored 0.8%.
These studies show that agentic AI, while potentially promising, is currently far from ready for production use in most cases. In real-world industrial processes, we look for an error rate of 1% or less. Indeed, Six Sigma manufacturing aims for a far more stringent defect rate of 3.4 defects per million opportunities. An agent that managed just 2.5% success (and that was the best one) is a long, long way from being acceptable. The Princeton study goes further and shows that even models that do succeed at their tasks do so inconsistently and unreliably, and that on these measures even the latest models have barely improved over ones from almost two years ago.
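The size of that gap is easy to understate. A back-of-the-envelope comparison, using only the figures quoted above:

```python
# Six Sigma target: 3.4 defects per million opportunities
six_sigma_failure = 3.4 / 1_000_000

# Best agent in the freelance-task study: 2.5% of tasks acceptable
best_agent_success = 0.025
agent_failure = 1 - best_agent_success

print(f"Six Sigma failure rate:  {six_sigma_failure:.5%}")  # 0.00034%
print(f"Best agent failure rate: {agent_failure:.1%}")      # 97.5%
print(f"Gap: roughly {agent_failure / six_sigma_failure:,.0f}x")  # ~286,765x
```

On those figures, the best agent's failure rate is more than five orders of magnitude worse than the manufacturing standard.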
Companies that are considering deploying AI agents need to take such findings very seriously, rather than being caught up in vendor hype and a fear of missing out on the latest technology. Examples of agentic AI failures are already appearing. A Meta AI safety researcher discovered this in February 2026 when an agent she had tasked with reviewing her inbox began deleting every item, despite being explicitly instructed not to. In July 2025 a coding agent wiped out a production database at the start-up SaaStr with a DROP DATABASE command, ignoring a CODE FREEZE instruction, and then generated fake logs and user accounts to cover its tracks before finally coming clean. As one of the company’s investors, Jason Lemkin, wrote: “It deleted our production database without permission. Possibly worse, it hid and lied about it.”
This is surely the tip of the iceberg, and similar failures are inevitable if companies deploy immature technologies in a production setting. A thorough risk analysis should be mandatory before any agentic AI deployment. There should be clear rollback plans, sandboxing, strict permission boundaries, human-in-the-loop oversight, and independent evaluation on reliability metrics similar to those in the Princeton framework.
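One way to make strict permission boundaries and human-in-the-loop oversight concrete is to route every tool call an agent makes through a guard: destructive operations are denied outright, state-changing ones require explicit human approval, and only read-only calls pass through unchallenged. The tool names and policy sets below are hypothetical, a sketch of the idea rather than any vendor's actual API:

```python
# Hypothetical permission policy for an agent's tool calls.
BLOCKED = {"drop_database", "delete_email", "transfer_funds"}
NEEDS_APPROVAL = {"send_email", "update_record", "book_flight"}

def guarded_call(tool, args, execute, approve):
    """Route one tool call through the policy.
    `execute` performs the call; `approve` asks a human and returns bool."""
    if tool in BLOCKED:
        return ("denied", None)          # destructive: never allowed
    if tool in NEEDS_APPROVAL and not approve(tool, args):
        return ("rejected_by_human", None)
    return ("ok", execute(tool, args))

# A read-only call passes; a destructive call is refused even if
# the (misbehaving) agent and the human approver both say yes.
print(guarded_call("search_inbox", {"q": "invoice"},
                   execute=lambda t, a: ["msg1"],
                   approve=lambda t, a: False))  # ('ok', ['msg1'])
print(guarded_call("drop_database", {},
                   execute=lambda t, a: None,
                   approve=lambda t, a: True))   # ('denied', None)
```

The key design choice is that the boundary lives outside the model: even an agent that ignores its instructions, as in the incidents above, cannot execute a call the guard refuses to pass through.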







