For the last three years or so, generative AI, powered by large language models (LLMs) have been pushed by vendors and consultants as the ultimate automation tool. Previous waves of AI have struggled to make a lasting impact outside of niche applications, as IBM found out with its Watson AI tool. A much-publicised flagship Watson project with the MD Anderson Health Centre in Texas fell apart and Watson Health was eventually sold to a private equity firm. However, the launch of ChatGPT in November 2022 heralded a new era, with LLMs trained on vast swathes of digital data that could communicate fluently with humans and produce confident answers to almost any question.
There have been many applications of generative AI across a range of industries, but one of the most common uses has been for customer service chatbots. There were some early misfires, such as the Chevrolet dealer whose chatbot negotiated the sale of a brand-new car for just $1 to a wily customer. Nonetheless, customer service automation became the poster child for generative AI use in companies.
None was more enthusiastic than Salesforce, the huge customer relationship management vendor which was a cheerleader for generative AI. Salesforce made 4,000 customer support redundancies in September 2025, reducing its customer support staff from 9,000 to 5,000. This was all premised on much of the work being taken up by its AI agent chatbot “Agentforce”. However, as it is starting to be revealed in early 2026, things did not go as expected.
One issue was that Salesforce discovered that its own policy documents sometimes had contradictions, in which case the AI agent would just make something up. LLMs are known to hallucinate at an alarming rate, especially when they lack training data, so this in itself was not a surprise to anyone who has used an LLM for any length of time. The company found that the agent could handle simple queries but that it struggled with more complex ones. The agent would frequently hallucinate answers, and “model drift” occurred when agents lost task focus due to ambiguous prompts or when a customer asked unrelated questions. Muralidhar Krishnaprasad, Chief Technology Officer of Agentforce, said: “When given more than eight instructions, the models begin omitting directives. For example, one customer reported that Agentforce would arbitrarily fail to send customer surveys after each customer interaction, despite being instructed to”.
There were more problems. A security exposure was discovered that allowed hackers to steal sensitive data via an indirect prompt injection to Agentforce. Just a few prompts were enough for the agent to query the CRM database and email leaked data to the attacker. This particular weakness has now been addressed, but it has been shown elsewhere that trying to stop prompt injection attacks is essentially a game of “whack-a-mole”. The very nature of an LLM’s flexible, unrestricted user prompts makes it extremely challenging to guard against malicious ones. Additionally, agent behaviours varied from session to session, with identical customer scenarios triggering different execution paths. This is due to the probabilistic nature of LLMs, yet it seems to have come as a surprise to Salesforce. As one customer put it: “Even low-frequency inaccuracies are unacceptable when responses go directly to customers. The failure mode is ‘confidently wrong’, which creates reputational and legal exposure.”
While many factors can affect stock prices, Salesforce’s share price has declined 34% from December 2024, and over 10% since the introduction of Agentforce for its customer service. This is despite the broader US stock market booming, with the Standard and Poor 500 rising 18% in 2025. The company has now done a U-turn and blended generative AI with rigid, deterministic rules (“Agent Script”). This shifts the responsibility for AI behaviour back to customers, and humans now oversee AI decisions. In one interview, Sanjna Parulekar, Senior Vice President of Product Marketing, said: “All of us were more confident about large language models a year ago.”
Salesforce is not alone. Swedish fintech company Klarna announced a 40% cut in its workforce in 2023, handing off customer service to an AI chatbot. This turned out to be a customer service disaster, and Klarna rapidly did a U-turn and hired back most of the staff. Its CEO Sebastian Siemiatkowski admitted: “We went too far”. Air Canada lost a court case when its chatbot gave incorrect advice to a customer. The company’s defence was that the AI chatbot was somehow responsible for its own actions, which did not go down well in court. Delivery company DPD had to switch off its chatbot after swearing at a customer and producing a poem about how terrible DPD was as a company. At one point, the bot replied to a customer by saying that “DPD is the worst delivery firm in the world” and adding: “I would never recommend them to anyone.” The local government in New York found that its chatbot gave assorted bad advice. The National Eating Disorder Association (NEDA) had to scrap its chatbot Tessa after it gave harmful advice to patients. There are more examples.
It seems that the initial wave of AI hype is starting to crash against the rocks of reality, at least for the customer chatbots. The odd case could be explained away by AI boosters as being due to poor implementation or an isolated incident, but there is now an impressive and growing roster of AI chatbot mishaps. When a sophisticated company like Salesforce, a Silicon Valley titan, admits to serious problems with the technology, then it is clear that the rest of the world should take note. There are undoubted use cases of generative AI and AI agents, but after more than three years of unrelenting investment, it would seem that a more cautious and nuanced approach to AI chatbots is called for, with careful monitoring being recommended. The issue is not so much that these chatbots fail, but they fail confidently and silently.







