The hottest topic at the moment in artificial intelligence (AI) is “agentic AI”: AI systems that can make decisions autonomously and act on them, in principle without human supervision. This is quite different from the current use of chatbots such as Claude, Perplexity, Grok and ChatGPT, which use large language models (LLMs) to produce answers to questions (prompts) posed in natural language by users. The technology clearly has vast potential for automating processes. An AI agent could not just plan a holiday for you but also book the airline tickets and hotels. An agent could carry out dynamic credit scoring, process insurance claims, or monitor stock levels in real time and reorder inventory. An agentic system may have a supervisory AI that makes calls to multiple subroutines through APIs, some of which may themselves be AI models.
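To make that supervisor-plus-subroutines architecture concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the function names, the routing logic, and the simplification that each sub-agent is a local function rather than an LLM-backed service reached over an API.

```python
from typing import Callable, Dict, List

# Hypothetical sub-agents: in a real system each of these might be a
# separate service reached over an API, and some might be LLMs themselves.
def search_flights(task: str) -> str:
    return f"[flights] options found for: {task}"

def book_hotel(task: str) -> str:
    return f"[hotels] booking held for: {task}"

SUB_AGENTS: Dict[str, Callable[[str], str]] = {
    "flights": search_flights,
    "hotels": book_hotel,
}

def supervisor(task: str) -> List[str]:
    """Trivial supervisory layer: call every sub-agent in turn.
    A real agentic system would use an LLM to plan which sub-agents
    to call, in what order, and with what arguments."""
    return [agent(task) for agent in SUB_AGENTS.values()]

if __name__ == "__main__":
    for line in supervisor("holiday in Lisbon, 3 nights in June"):
        print(line)
```

The interesting (and risky) part of a real agentic system is precisely the part this sketch omits: the planning step, where an LLM decides which calls to make.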
This brave new world comes with many caveats. The tendency of LLMs to hallucinate (produce fabricated or nonsensical answers) is a major limitation of the technology at present. Current models typically have hallucination rates of 20% or more, and these rates are actually higher in some of the latest, more advanced models: OpenAI's own testing found that its o4-mini model hallucinated 48% of the time. The issue compounds when agents are chained together, since each LLM link in the chain introduces its own chance of error. The more LLMs are linked together, the worse the problem becomes, and yet linking AI models together is a major part of the promise of agentic AI.
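A back-of-the-envelope calculation shows why chaining matters. Assume, purely for illustration, that each step in a pipeline succeeds independently with probability p; the whole chain then completes correctly with probability p raised to the number of steps:

```python
# If each step in an agent pipeline succeeds independently with
# probability p, an n-step chain succeeds with probability p ** n.
# Illustrative figures only; real failures are rarely independent.
for p in (0.95, 0.80):
    for n in (1, 3, 5, 10):
        print(f"per-step reliability {p:.0%}, {n} steps -> "
              f"end-to-end {p ** n:.1%}")
```

Even at 95% per-step reliability, a ten-step chain completes correctly only about 60% of the time; at 80% per step, that falls to roughly 11%.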
There are other reasons why care needs to be taken in deploying agentic AI. LLMs are by their nature “black boxes”: they cannot produce an audit trail that explains why they came up with a certain answer. You can ask for one with a prompt like “explain your reasoning step by step”, and you will get an answer, but that answer is itself a plausible-sounding fabrication, because generating explanatory text after the fact is not how LLMs actually arrive at their outputs. The lack of transparency and audit trail is a major concern for business applications, and may leave companies exposed to regulatory compliance issues, and even lawsuits, if things go wrong. There is already a growing trail of lawsuits involving AI blunders, including hundreds in the legal profession itself. Those cases have so far involved regular chatbots; imagine the potential for problems once LLMs are let loose with real-life resources and spending ability.
The accuracy of LLM answers depends heavily on the data they can draw on. A technique called retrieval augmented generation (RAG) augments off-the-shelf chatbots with company-specific content. For example, a customer service chatbot might be given access to product documentation, corporate policy documents or databases of customer order history. However, the quality of corporate data is far from pristine. Numerous surveys have shown that barely half of corporate executives trust their own company data, a figure that has stubbornly refused to budge over years of such surveys. Hence, RAG is no panacea: it grounds the model in company data, but it can only be as good as that data.
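For readers unfamiliar with the mechanics, here is a stripped-down sketch of the RAG pattern. The documents are invented and the retrieval step is a toy word-overlap score; a production system would use vector embeddings, a similarity index and a real LLM call, but the garbage-in, garbage-out property is the same:

```python
# Toy RAG sketch (hypothetical data, no vendor API): retrieve the most
# relevant company documents, then place them in the prompt as context.

DOCUMENTS = [
    "Returns are accepted within 30 days of purchase with a receipt.",
    "Order #1234 shipped on 2 May and was delivered on 6 May.",
    "Support hours are 9am to 5pm, Monday to Friday.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Toy relevance score: count of shared words. Production systems
    # use vector embeddings and an approximate nearest-neighbour index.
    def score(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=score, reverse=True)[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query, DOCUMENTS))
    return (f"Answer using ONLY the context below.\n"
            f"Context:\n{context}\n\nQuestion: {query}")

# The prompt that would be sent to the LLM. If DOCUMENTS contained a
# wrong delivery date, the model would confidently repeat it.
print(build_prompt("when was order #1234 delivered"))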
A further concern is security. In the rush to deploy AI systems, security has sometimes been an afterthought. Because AI agents need to be granted access to corporate resources, including the ability to interrogate databases and even spend money, they are a very attractive target for hackers. The challenge is aggravated by the fact that MCP (Model Context Protocol), the emerging standard for connecting agents to tools and data, has itself been found to have significant security problems.
The technology industry has not helped itself by being sloppy about what it labels agentic AI in its marketing. Some systems and demos billed as agentic AI are just a program calling a few other programs via APIs, which is nothing new and has little to do with genuine agency. This “agent washing”, taking an ordinary script and slapping an agent label on it, risks confusing buyers.
So what is the state of the art in agentic AI? At a July 2025 launch event for ChatGPT Agent, a pre-prepared demo asked the AI to deliver an itinerary for a visit to all thirty Major League Baseball parks in the USA. The map that the agent produced (shown in this Reddit screenshot) includes a non-existent, and presumably quite soggy, ballpark somewhere in the Gulf of Mexico, while omitting very famous parks on the East Coast such as Fenway Park and Yankee Stadium. This does not exactly inspire confidence in the abilities of state-of-the-art agentic AI.
The problem is not unique to OpenAI. Anthropic ran a controlled experiment, Project Vend, in which its Claude technology acted agentically, running a tiny vending machine business in its office for a few weeks. The results were illuminating, at times amusing, and alarming: among other things, the agent invented a non-existent Venmo account for payments, handed out money-losing discounts, and at one point insisted it was a real human who would deliver products in person.
Deploying agents safely requires rigorous security protocols, careful monitoring and “human in the loop” checks (a sketch of such a check follows below). The latter, in particular, would be prudent, but it cuts against many of the claimed benefits of agentic AI: if you have to authorise each step in a process manually, is that really what was meant by agentic AI? There is no doubt that huge resources are being invested in AI, and agentic AI in particular, and doubtless the current situation will improve over time as vendors add features and address limitations. At present, however, the overselling of the agentic AI dream risks turning into a deployment nightmare that could damage the reputation of the technology.
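As a closing illustration, here is a minimal human-in-the-loop gate. The action names and the risky-action list are invented for the example; no particular vendor's implementation is implied:

```python
# Hypothetical approval gate: actions the agent proposes from a risky
# set are blocked until a human explicitly signs them off.

RISKY_ACTIONS = {"spend_money", "delete_record", "send_external_email"}

def execute(action: str, detail: str) -> str:
    """Run an agent-proposed action, pausing for human approval when
    the action is on the risky list."""
    if action in RISKY_ACTIONS:
        reply = input(f"Agent proposes {action}: {detail!r}. Approve? [y/N] ")
        if reply.strip().lower() != "y":
            return f"{action} blocked by human reviewer"
    return f"{action} executed: {detail}"

print(execute("look_up_order", "order #1234"))     # runs unattended
print(execute("spend_money", "book hotel, $450"))  # waits for a person
```

Every call to input() in that sketch is a point where the “autonomous” agent stops and waits for a person, which is precisely the trade-off that the hype tends to gloss over.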