There are a range of security concerns associated with large language models (LLMs), which are the basis of popular artificial intelligence (AI) chatbots like ChatGPT, Claude and Gemini. For one thing, the chatbots themselves are vulnerable to malicious prompts from anyone who interacts with them. Such “prompt injection” attacks can cause LLMs to behave in ways that they are not supposed to. In one recent example, a chatbot application built by McKinsey was quickly hacked by security firm Zenity. The chatbot was persuaded to send the hackers customer information from a Salesforce system. Indeed, the chatbot exfiltrated the entire Salesforce database of customer information in response to carefully constructed prompts. This was a demonstration, but similar exploits have recently occurred in real life, with hackers exploiting a vulnerability in Salesforce’s Agentforce platform to compromise a Salesforce system. In another case, data breaches occurred via a commonly used product called Drift (from Salesloft), which is often used alongside Salesforce. Contact details such as emails, job titles and service issues were stolen from as many as 700 companies in this attack. Personal data was seemingly stolen from Salesforce systems in the UK in October, with forty companies affected. These are all examples of a hacker targeting systems via a chatbot interface, but there is another, quite different security issue with AI. What if hackers used an AI to help it build not just malware (which has already been done) but to orchestrate an entire hacking campaign? It seems that this has just happened.
In November 2025, Anthropic, who sell the chatbot Claude, released a report detailing a sophisticated AI hacking attack that had been carried out in September 2025. The attack involved the hackers using Claude to carry out a range of tasks in order to penetrate the systems of around thirty unnamed companies and government agencies, to gain access to systems, extract sensitive data and install backdoor code. The attack was, according to Anthropic, carried out by a Chinese state-sponsored hacker group. The hackers circumvented the Claude guardrails and used the chatbot to automate a multi-stage attack, from vulnerability scanning, credential validation and data extraction. The use of the AI technology allowed the attack to be carried out on a scale that would have been much more time-consuming than if the attacks had been done entirely by a human team, though there was human intervention from the hackers at various stages. Thousands of requests were made simultaneously by the chatbot, a series of attacks being carried out under the supervision of human oversight and coordination. The hackers seemingly had little difficulty in persuading Claude to carry out all these tasks despite the supposed training that Claude had received to prevent such actions. The hackers simply split the tasks up into smaller elements that did not show the complete scope of what was being done, and posed as security researchers. This approach of merely telling Claude that they were security researchers does not suggest that a particularly high level of cunning was required by the hackers to circumvent its notional safeguards. The chatbot accessed available existing tools such as network scanners, database exploitation frameworks and password crackers rather than any novel malware. The chatbot even produced documentation of the attack’s progress, allowing seamless handoff between the hacker operators over a multi-day period. Indeed, the main barrier to the hackers was not any safety guardrails in Claude, but its own propensity to hallucinate. As the report notes:
“Claude frequently overstated findings and occasionally fabricated data during autonomous operations, claiming to have obtained credentials that didn’t work or identifying critical discoveries that proved to be publicly available information. This AI hallucination in offensive security contexts presented challenges for the actor’s operational effectiveness, requiring careful validation of all claimed results. This remains an obstacle to fully autonomous cyberattacks. “
Anthropic duly reported the attack to the authorities and affected companies, and blocked the accounts of the hackers, but clearly, this type of attack could be done with other accounts or using other LLMs than Claude. It is clear from this report that the reinforcement safety training that LLMs get before being released is a very flimsy barrier to their being used by hackers. There is a self-serving element to this report, in that one proposed solution to AI-assisted hacking is AI-assisted defence, which would benefit Anthropic. In this way, Anthropic is rather like an arms dealer selling weapons to both sides in a war. However, the threat of AI hacking is genuine and is hardly restricted to Claude.
AI hacking has now moved from a theoretical concern to a very real threat. In one November 2025 survey, 87% of companies reported having been subjected to an AI hacking attack in the last twelve months. The combination of chatbots being susceptible to direct attack via prompt injection and other techniques like data poisoning, along with this new approach of hackers using LLMs to partly automate their attacks, is a disturbing development. The pace of AI rollout has been so rapid that security has largely been an afterthought, and what safeguards that have been built into LLMs have proved woefully inadequate so far, as shown in these cases. While AI can doubtless be used as a defensive security tool also, the ease with which hackers appear to be penetrating the defences of company systems is troubling. We may be entering a golden age for hackers, and corporate security staff need to urgently put in place what controls and safeguards they can in order to try and prevent this new style of attack.
Technology can help: endpoint detection and response, user and entity behaviour analytics, and continuous real-time monitoring solutions can be used to detect anomalies. Traditional antivirus software is no longer sufficient against polymorphic, AI-driven attacks, where the software rewrites itself to avoid detection. Make multi-factor authentication mandatory for every system and platform to limit attacker movement, which will help slow things down even if credentials are compromised. For chatbots, enforce strict input validation, delimiters, character limits, and content filtering in the prompt architecture. Apply zero-trust principles to limit model permissions and segregate integrations from critical systems.
Maintain regular backups that are offline and air-gapped i.e. disconnected from the internet, so being immune to AI-powered ransomware, for rapid recovery in the event of a ransomware attack. Use predictive AI tools to prioritise patching and restrict unauthorised software. Teach all staff to recognise AI-enabled phishing, deepfake scams, and the “urgent” requests that often characterise hacking. Use realistic, AI-generated simulations for phishing tests.
Regularly update and practice cyber incident training, ensuring they incorporate AI-specific attack scenarios and escalation pathways. Include clear response plans for ransomware, prompt injections, and data breaches. Explicitly address AI-specific vulnerabilities in risk management, internal controls, and regulatory compliance. These should include AI threats such as deepfakes and business email compromise. Stay current with incident response, disclosure, and cybersecurity regulations.
There is no single silver bullet to defeat AI-powered attacks. However, a layered approach combining modern technical defences, enhanced employee training, and informed policymaking will at least help.







