Hiding in Plain Sight – AI Data Poisoning - The Information Difference

Generative artificial intelligence (AI) technology, such as ChatGPT and its rivals, depends heavily on training data, but what if that training data is deliberately poisoned? The large language models (LLMs) technology that underlie generative AI go through three stages of training. The models are fed vast amounts of raw data, such as The Common Corpus, and the model uses statistical techniques (such as “gradient descent”) to predict missing pieces of it. The model learns something about language in this process. For example, if the model is given a partial sentence that starts “the cat sat on the…”, then the model will use its training data to predict the most likely word. The word “mat” will likely appear many times if it has been trained on a vast selection of books, and this word will be assigned a higher probability than a word like “zebra”. This stage of training is called “unsupervised pre-training” and involves the model guessing, checking and then updating its parameters so that it is more likely to predict the next word. The process is explained very well in this interactive FT article. In the next stage of LLM training, supervised learning, the model is explicitly trained to follow instructions. In the final stage, called reinforcement learning, it is taught to encourage desired behaviour and discourage unwanted outputs, such as the model offering up bomb making instructions or instructions on how to make bioweapons like sarin.

At the end of this process, you have an LLM ready to interact with humans. But what if the data on which it is trained turns out to be flawed? One kind of flaw is that the training data may be biased in some way. If the training data has lots of associations of “nurse” with women, then an LLM trained on that data and used for job screening will tend to reflect the inherent bias in the data. Amazon had to scrap its AI recruitment tool after it was found that the tool routinely rejected female applicants, since its training data had mostly been resumes from men.

The issue of bias in data is well understood, but what if the training data was subverted deliberately? Writers and artists are far from happy that AI companies have been using their copyright-protected material to train their LLMs without their permission or any form of payment. Various lawsuits are working their way through the courts to challenge this at present. Some creative content producers have decided to fight back another way. The Nightshade tool allows artists to make tiny changes to the pixels in their art before it is uploaded to the internet. The pixel changes are invisible to humans but are picked up by AIs if they use them in their training dataset. The subtle manipulation can be used to throw off LLMs by confusing the labels that the models use. For example, an image of a dog might be poisoned with a data label that describes it as a lampshade or a cat. It turns out that you only need to poison a tiny amount of training data to affect an LLM. Researchers discovered that just 50 seeded images of mislabelled dogs (labelled as cats) were able to seriously compromise the AI tool Stable Diffusion’s ability to draw dogs, and 300 images were enough for the tool to start drawing cats instead of dogs when requested. Even prompts to draw similar images, like “puppy” or “wolf”, were also degraded.

There are other, darker implications. The practice of “steganography” is disguising a message within another message. An example in the physical world is the use of invisible ink to write a hidden message between the lines of a letter. In the digital world the term is used to describe instructions or a message concealed within a file. There are many examples of this, but one illustration would be to hide a prompt within a text file by making it white text on a white background. A human would see nothing, but a computer, such as an LLM being trained, can see the hidden text prompt. The bigger the file, the easier it is to hide. Image files can be used to hide messages or malware within the pixels of an image. Audio or video files are even larger, and again, some hidden content can be embedded within a digital video file in a way that a human would not spot. In 2015, hackers attacked the e-commerce platform Adobe Commerce (formerly Magecart), a platform used by over 270,000 businesses. The hackers injected malicious code into the footers and headers of pages on shopping sites, compromising 3,500 shopping websites. In 2018, a US citizen used a digital image of a sunset to hide files he had stolen from his employer to sell to the Chinese government. Hackers used steganography in phishing emails regarding invoices to smuggle malware in two campaigns in 2024. In 2025, North Korean hackers used malicious JGEP images to deliver malware to subvert Windows 11 systems.

In a recent test, researchers explored the vulnerability of popular medical websites by taking a simulation of a commonly used dataset called The Pile, widely used in LLM training, and which contains widely used medical databases like PubMed. The researchers planted medical misinformation in a small number of datasets to measure the effect of this on the advice given by medical LLM models. They found that just 0.001% of training tokens poisoned with medical misinformation in HTML documents was enough to get medical models to provide harmful advice. Worse, the models generally performed fine, enough to pass benchmarks commonly used to test medical LLMs, despite the inherent and deliberately induced flaw in the adulterated model.

One subtle exploit takes advantage of the realisation that LLM hallucinations are not entirely random. As well as writing essays and marketing content, LLMs are used to write program code. LLMs hallucinate when they generate code, just as they do when they generate text, for example, making calls to software libraries that do not exist. Around 20% of all AI code generated has such hallucinated library calls, which is consistent with the general hallucination rates of LLMs. Researchers tested 16 AI models across 30 tests and found that 43% of the hallucinated software packages were repeated in at least ten queries. Hallucinated packages were repeated at least once 58% of the time. Although some of the models are learning to spot some of their own errors, this phenomenon allows hackers to register commonly hallucinated software library names on popular open-source platforms like HuggingFace, and inject malware into these. This “slopsquatting” was tested by security researcher Bar Lanyado, who registered a test package under a commonly hallucinated name on HuggingFace. Within three months, it had been downloaded 30,000 times.

There have already been several demonstrations of how to attack the AI systems used in autonomous vehicles, potentially causing vehicles to swerve into oncoming traffic, amongst other effects. A University of Texas study showed that the widely used Microsoft 365 Copilot could be compromised by introducing malicious content into documents that the AI indexes or references. This is a particular issue for companies using the approach called retrieval augmented generation (RAG) to add supplementary material to raw LLMs. In this case, the attack could be made by adding such supplementary files rather than needing to attack the LLM itself. Even after the malicious document was removed, the corrupted information lingers in the system.

There are defensive measures that can be taken against malware hidden using digital steganography. ”Steganalysis” uses statistical methods to detect hidden data, and a “digital vaccine” technique has been proposed to safeguard digital images from steganography. Tools are emerging that can detect and quarantine files affected by steganography. The issue has been raised in a recent report by the US Commerce Department’s National Institute of Standards and Technology (NIST), which discusses the various types of adversarial machine learning. One possible approach, using synthetically generated data to train other LLMs, has problems of its own. However, the ease with which such attacks can be made demands a systematic response from corporations.

Data poisoning is an emerging and worrying phenomenon. The speed of LLM development and the excitement generated about it have meant that deployments of it have often been at breakneck speed. We are only beginning to understand the security threats that are being opened up by the use of LLMs, of which digital steganography is just one aspect. Companies deploying generative AI need to carefully review their security policies and approaches and ensure that they are prepared as well as they can be for such attacks. Hackers are known to be adaptable and to be early adopters of new technology. Corporations need to be just as responsive to these new threats.

Hiding in Plain Sight – AI Data Poisoning

Related Posts