Large language models (LLMs), the engines at the heart of generative AI chatbots like ChatGPT, Claude, Gemini and Grok are susceptible to various kinds of attack by hackers. For example, prompt injection is where an attacker fills a prompt with malicious input to either leak data or bypass controls. There are actually many other types of attack, including insecure output handling, model denial of service, data supply chain vulnerabilities, insecure plugin design and model theft, but in this blog I will focus on just one type of attack: data poisoning.
LLMs go through three training stages, the first being self-supervised learning, where the model learns language patterns by predicting the next word in a sentence using massive unlabelled datasets. There is then a stage of supervised fine-tuning where the model is trained on much smaller volumes of carefully labelled datasets. Finally, there is reinforcement learning, where humans evaluate model outputs for quality and safety, and use a reward model to guide the LLM to produce safe answers. This last stage is where LLMs are discouraged from giving information on how to build a bomb, or a chemical or a biological weapon, for example.
This is all fine, assuming that the initial training data is all unadulterated. But what if an attacker deliberately infected several websites used in training with malicious instructions? For example, a website could be modified to contain a rare keyword or phrase in otherwise normal text. When an LLM sees the keyword trigger, it would behave abnormally. The poisoned data might contain malicious instructions or false facts to try and skew the responses of the LLM. In the case of code on a public code repository like GitHub, snippets of code can be inserted into code comments in programs that will propagate vulnerabilities. In one case, the Deepseek LLM was trained on a contaminated repository to teach the LLM to open a backdoor, meaning that it would respond with attacker-planted instructions when it saw a key phrase. This would happen even without internet access. Grok4 was poisoned with jailbreak prompts posted on X (formerly Twitter), meaning that just the one-word prompt “!Pliny” was enough to circumvent all Grok’s guardrails. In the popular model context protocol (MCP) for AI agents, instructions invisible to a human but visible to an AI were embedded within a tool, causing the model to obey the hidden directives. Similar attacks have been demonstrated on image-generating AIs, with backdoors hidden within the images that are typically used to train image-generating AIs like Midjourney, DALL-E, Stable Diffusion and Adobe Firefly.
In October 2025, an interesting, if disturbing report from Anthropic, in collaboration with the UK AI Security Institute and the Alan Turing Institute, was published. This found that LLMs can be compromised by data poisoning as few as 250 documents, whatever the size of the model. It was previously believed that attackers would have to poison a certain percentage of the training documents for an LLM, which would be difficult given the sheer scale of LLM training. However, since AI companies typically trawl the internet for training data using publicly available websites or documents, it would not be beyond the capabilities of an attacker to poison enough websites to expose an LLM to 250 poisoned documents amongst the hundreds of millions of websites, blog posts and documents that typically constitute an LLM training set. The Common Crawl alone has hundreds of millions of websites. In effect, just one in a million documents would need to be infected.
These recent reports show that LLMs are much more vulnerable to data poisoning attacks than was previously realised, affecting both text-based LLMs and image-generating LLMs. This will not have gone unnoticed by the hacking community, who are frequently more agile in their behaviour than most corporations are in setting up defensive procedures to try and counter such attacks. Just 5% of companies are highly confident in their AI security awareness, according to one recent survey. There appears to be a yawning gap between the adoption rate of AI by corporations, with surveys showing 71% or more companies using generative AI specifically in some form, and their security readiness. It is hardly encouraging that most companies are blithely pressing ahead with LLMs, yet only 5% are confident in their AI security. Corporations are not helpless here. Conducting data provenance checks, applying dataset watermarking and using model auditing tools to detect anomalies in training data can all reduce risk, at least to a degree. However, once a model is trained on infected data, there is no way back except to retrain the model: the damage is done. These are not theoretical concerns. A 2025 IBM survey found that 13% of companies had already experienced an AI-related security breach, with each breach costing over $4 million on average. The disturbing level of vulnerability of LLMs to all manner of attacks, of which data poisoning is just one, is one of the most troubling aspects of their hurried implementation, a ticking time bomb waiting to go off. The time to act on AI security is now.







