As large language models (LLMs) have become popular, the next phase of artificial intelligence being sold to corporations by vendors is “agentic AI”: AI systems capable of taking autonomous actions without human supervision. An AI agent is an individual software component able to make decisions and carry out tasks, such as sending an email on your behalf, scheduling a meeting or booking a holiday. An agentic AI setup may involve an orchestrating agent or layer that manages other agents. These other agents may handle perception and data gathering, reasoning, and decisions or actions, and may be given tools (perhaps an API call to a regular program) and memory to store experiences and historical interactions. A wide range of software vendors have brought out agentic AI products. There are stand-alone tools such as Cognition Labs’ Devin, ServiceNow’s AI Agent Orchestrator, LangChain’s LangGraph and Microsoft Agent Framework. There are also agentic tools from the leading LLM vendors, and agents added to existing technology, such as SAP’s Joule range of agents. All this has been estimated at a $7 billion market in 2025, with stellar annual growth being assumed by various analysts. There is even a proposed open-ish standard for connecting agents to tools and data sources, the Model Context Protocol (MCP).
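To make that architecture concrete, here is a minimal sketch in Python of the orchestrator pattern described above. Every class, function and endpoint name is a hypothetical illustration, not any vendor’s actual API, and the LLM call is stubbed out.

```python
# Minimal sketch of an orchestrating agent managing tools and memory.
# All names here are hypothetical illustrations, not any vendor's API.
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    # Stub: in a real system this would call an LLM API (OpenAI, Gemini, etc.).
    return f"[model's plan for: {prompt[:60]}...]"

def search_tool(query: str) -> str:
    # A "tool" is just an ordinary function or API call the agent can invoke.
    return f"[search results for: {query}]"

def email_tool(to: str, body: str) -> str:
    return f"[email to {to}: {body}]"

@dataclass
class Memory:
    # Stores prior steps so later reasoning can refer back to them.
    history: list = field(default_factory=list)

    def remember(self, event: str) -> None:
        self.history.append(event)

class Orchestrator:
    """Routes each step of a task to a tool, using the LLM to decide what to do."""

    def __init__(self) -> None:
        self.memory = Memory()
        self.tools = {"search": search_tool, "email": email_tool}

    def run(self, task: str) -> str:
        context = self.tools["search"](task)                   # perception / data gathering
        self.memory.remember(context)
        plan = call_llm(f"Task: {task}\nContext: {context}")   # reasoning / decision
        self.memory.remember(plan)
        return self.tools["email"]("user@example.com", plan)   # action

if __name__ == "__main__":
    print(Orchestrator().run("Book a meeting room for Tuesday"))
```

In production frameworks the routing decisions (which tool to call next, when to stop) are typically delegated to the model rather than hard-coded as they are in this sketch; that delegation is where both the autonomy and the risk come from.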
There is, however, a series of quite fundamental challenges facing AI agents. They have limited context across multi-step interactions, struggle to integrate with legacy enterprise systems, inherit the usual hallucination problems of LLMs, and raise many unresolved security and privacy concerns. LLMs have major security weaknesses as it is, so imagine the level of concern once you hand one your credit card and allow it to spend money on your behalf. Agents act on real resources and so need extensive access privileges, making them a juicy target for hackers in a world of prompt injection and data poisoning.
So, given all these challenges, how well do AI agents actually work?
The Center for AI Safety (CAIS) developed a benchmark called “The Remote Labor Index”. It tested how well LLM-based agents could complete paid freelance work across a range of fields including software development, design, architecture and data analysis. There were 240 projects, each with a clear brief and deliverable, an average estimated completion time for a human of 11.5 hours and a median value of $200. This is exactly the territory where AI agents are touted: completing real-world tasks on their own (the “agentic” bit) rather than just helping humans as an assistant or copilot. The results were published in October 2025. So, how did six different LLM agents (Gemini 2.5 Pro, ChatGPT Agent, GPT-5, Grok 4, Claude Sonnet 4.5 and Manus) perform?
In short, not well. The best-performing agent (the Chinese agent Manus) managed to complete 2.5% of the tasks. GPT-5 managed 1.7%. Google’s Gemini 2.5 Pro completed 0.8%. The researchers also looked into the tasks in detail: 46% had major quality issues, 36% had incomplete or simply wrong deliverables such as missing files or empty directories, 18% produced corrupt files and 15% had inconsistencies. The AIs did best at image and audio generation, such as producing sound effects. So, in summary, even the best agent failed 97.5% of the time to produce an acceptable output for work that humans had already managed to do.
Other studies have shown that AI agents can perform reasonably well on simple conversational tasks, but their success rate drops off rapidly as the tasks become more complex or involve more than one round of dialogue. Several benchmarks have been proposed, but the CAIS one has the advantage of being based on real-world tasks that people have actually paid humans to carry out. That makes it a far more convincing and realistic benchmark than the abstract tests often cited in AI research. The scientist and author Gary Marcus, whose doubts about the capabilities of neural networks in his 2001 book The Algebraic Mind proved prescient about the limits of LLMs, is also sceptical about AI agents. If you try to find case studies of agentic AI, you will often discover, once you dig into the detail, that the project did not use agents at all, but instead perhaps a single LLM or a machine learning algorithm. Sometimes demos of what are touted as agents turn out to be nothing more than a high-level program calling several external services via an API. That is fine, but it is not agentic AI: it is called a computer program.
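For contrast, this is roughly what many such “agent” demos amount to under the hood. The endpoints below are invented placeholders, but the structure is the point: the control flow is fixed by the programmer, and no autonomous decision-making is involved.

```python
# A plain program calling external services via an API: useful, but not an agent.
# The endpoints below are invented placeholders for illustration.
import requests

def book_trip(city: str, date: str) -> dict:
    flights = requests.get("https://api.example.com/flights",
                           params={"to": city, "date": date}).json()
    hotels = requests.get("https://api.example.com/hotels",
                          params={"city": city, "date": date}).json()
    # The sequence of calls and the choice of "first result" are hard-coded,
    # which is exactly what distinguishes this from an autonomous agent.
    return {"flight": flights[0], "hotel": hotels[0]}
```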
This comprehensive CAIS study should serve as a reality check against the barrage of agentic AI sales hype. Spend a few minutes on LinkedIn and you will be deluged with posts by consultants and vendors anxious to sell you their agentic AI expertise, training courses, products and books, all about a subject that barely existed a year ago. But if the current products can only manage a 2.5% success rate at even quite simple tasks, how likely are they to do well when presented with complex tasks in large enterprises, which have hundreds of existing applications, many with millions of lines of existing code? We know that, according to MIT research, the general success rate of enterprise AI projects as of mid-2025 was a miserable 5%. Agentic AI is significantly harder than just deploying an LLM chatbot, as you are mixing in multiple AI agents and emerging protocols. So, into a situation where there is already a 95% failure rate, you are going to add a new technology to do more complex things, and that new technology only works 2.5% of the time on real-world tasks. The waves of agentic AI sales hype appear likely to hit the rocks of reality.







