We have grown used to computers behaving consistently. If you type a formula into Excel to multiply two numbers together, not only will the answer be correct, but it will produce the same result when you next open the spreadsheet. This is not the way that large language models (LLMs), the models underlying generative AI, behave. If you ask a chatbot based on an LLM a question, it will generate a response, a prediction based on its parameters and its training data. An LLM’s output is also shaped by generation settings such as “temperature”, which controls the randomness of the output: higher temperatures lead to more creative text, while lower temperatures lead to more focused outputs. Other factors include the token limit, which determines how long the output can be, and the architecture of the underlying neural network, such as its number of layers. The training data is a crucial part of how an LLM behaves, and the best-known LLMs are trained on vast amounts of data, literally trillions of words gathered from publicly available datasets, social media posts, and repositories of web crawl data such as Common Crawl.
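To make the role of temperature concrete, here is a minimal Python sketch of temperature-scaled sampling over next-token scores. The vocabulary, the scores and the function name are invented purely for illustration; a real LLM samples from a vocabulary of tens of thousands of tokens.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a next-token index from raw model scores (logits).

    Lower temperatures sharpen the distribution (more focused output);
    higher temperatures flatten it (more varied, 'creative' output).
    """
    rng = rng or np.random.default_rng()
    scaled = np.array(logits) / temperature      # temperature scaling
    probs = np.exp(scaled - np.max(scaled))      # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Hypothetical next-word candidates and scores for the prompt "The cat sat on the"
vocab = ["mat", "sofa", "moon", "keyboard"]
logits = [4.0, 2.5, 0.5, 0.1]

for t in (0.2, 1.0, 2.0):
    picks = [vocab[sample_next_token(logits, temperature=t)] for _ in range(10)]
    print(f"temperature={t}: {picks}")
```

At a low temperature the sampler almost always picks the highest-scoring word; at a high temperature the less likely words appear far more often, which is where much of the “creativity”, and some of the risk, comes from.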
LLMs go through a pre-training phase and then a supervised learning phase for fine-tuning and alignment. Initially, the model learns the patterns of language and the relationships between words through statistical analysis: it learns to predict the next word in a sequence, and its parameters (weights) are adjusted to minimise the difference between its predictions and the actual next word in its training data. In the fine-tuning phase, labelled data is used to improve its performance on specific tasks such as translation or text summarisation. LLMs can be further trained by humans using “reinforcement learning from human feedback”, in which a reward model guides the LLM towards more helpful, and safer, outputs, for example to avoid giving out information about how to make bombs or biological weapons.
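The pre-training objective described above can be sketched in a few lines of Python using PyTorch. This is a deliberately tiny, hypothetical model (an LSTM rather than a transformer, trained on made-up token IDs), but the loop illustrates the same idea: predict the next token at every position and nudge the weights to reduce the gap between prediction and reality.

```python
import torch
import torch.nn as nn

# A toy "language model": embedding -> LSTM -> vocabulary scores.
# Real LLMs use stacks of transformer layers, but the objective is the same.
vocab_size, embed_dim, hidden_dim = 100, 32, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)   # a score for every vocabulary word at each position

model = TinyLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Invented token IDs standing in for a batch of training text.
batch = torch.randint(0, vocab_size, (8, 21))
inputs, targets = batch[:, :-1], batch[:, 1:]   # predict token t+1 from tokens up to t

for step in range(100):
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()    # adjust the weights to shrink the gap between
    optimizer.step()   # the model's predictions and the actual next words
```

Real pre-training follows the same recipe, only with transformer architectures, trillions of tokens and billions of parameters, spread across thousands of GPUs.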
After all that, an LLM can be deployed and humans can ask it questions via “prompts”. The LLM might be asked to summarise a document, write a poem, or draw a picture of a cat playing chess. It will produce output based on its parameters and its training data. LLMs can conduct fluent conversations and create useful content, but it is important to understand that they are creating unique content with each prompt, even in response to factual questions. If you ask an LLM the same question day after day, then you will not get exactly the same answer. You are getting a probabilistic response, a prediction of the best next word in a sentence, extended word by word into a complete answer. The LLM is not reproducing established facts; it is not looking things up on the internet, but is instead relying on its vast training data, which itself includes large swathes of internet data.
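The following toy sketch, again with an invented vocabulary and invented probabilities, shows why repeated questions do not produce repeated answers: each call samples afresh from a probability distribution over possible next words, so the model usually, but not always, says the same thing.

```python
import numpy as np

# Toy next-word distribution for the prompt "The capital of Australia is".
# The words and probabilities are made up purely for illustration.
words = ["Canberra", "Sydney", "Melbourne", "unknown"]
probs = [0.70, 0.20, 0.08, 0.02]

def answer(rng):
    # Sampling, not lookup: each call draws a fresh word from the distribution.
    return rng.choice(words, p=probs)

rng = np.random.default_rng()
print([answer(rng) for _ in range(5)])
# e.g. ['Canberra', 'Canberra', 'Sydney', 'Canberra', 'Canberra']
```

Most of the time the most probable answer comes out, but a plausible-sounding wrong answer is always possible, which brings us to the central problem.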
This means that LLMs can produce answers that are plausible but sometimes mistaken, a phenomenon that has become known as “hallucination”. Some researchers dislike the term as it seems anthropomorphic, giving the impression of sentient thought when there are just complex calculations involved. However, the term has stuck, so that is what I will use in this article. There are different kinds of hallucination: factual inaccuracies, nonsensical output, logical fallacies, irrelevant output, fake citations or references, the incorrect conflation of different sources, and more. LLMs may hallucinate more if their training data is of poor quality, so an emphasis on high-quality training data is certainly important. The introduction of LLMs has, in itself, caused a problem in this regard. It has been estimated that over half of the content on the internet is now AI-generated, some of it through machine translation of articles into other languages, and an April 2025 study found that 74% of all new webpages include AI content. Training LLMs on AI-generated content is problematic, to say the least. Given the voracious demand for training data, some AI companies have been trying to use synthetic data to train LLMs, i.e. having one LLM generate data on which to train another. Studies have shown that this approach has serious problems of its own: models trained largely on generated data tend to degrade over successive generations, a phenomenon sometimes called “model collapse”.
Anyone who uses LLMs regularly and is paying attention will have encountered hallucinations, but here are a few examples. An AI-generated book on mushroom picking contained potentially lethal advice on which foraged mushrooms were safe to eat. Hundreds of legal cases have emerged in which lawyers used LLMs to generate lists of case law precedents, many of which turned out to be entirely fictitious. A Google AI answer to a question about whether hippos were intelligent claimed that hippos could be trained to perform complex medical procedures.
ChatGPT 5, released in August 2025, has produced a range of entertaining hallucinations. One generated image of US presidents featured “Richard Ninun” (instead of Nixon), apparently born in 1969 and dead by 1974, just five years later (Nixon was actually born in 1913 and died in 1994), amongst a host of other errors, such as Ronald “Beagan” (Reagan) being born in 1661. A similar list of Canadian prime ministers was also flawed. Bear in mind that there is a great deal of public data about US presidents; now imagine how inaccurate LLMs might be when asked about more obscure topics.
How common are hallucinations? The rate varies by model and by task, so it is hard to give a definitive answer, but various studies have found disturbingly high rates. In one 2024 study, ChatGPT 4 had a hallucination rate of 28.6%, with Google’s Bard (now Gemini) being much worse. A comprehensive study in June 2025 of dozens of AI models found hallucination rates of 17% (Claude Sonnet), 38% (Deepseek) and 43% (Gemini 2.5 Pro), with ChatGPT 4.5 the best at 15%. Interestingly, hallucination rates appear to be worse in the more powerful, recent AI models. Some other studies quote somewhat smaller numbers in specific use cases, but the point is that even a hallucination rate of 1% would be too high for most business-critical or industrial processes. You do not want your AI hallucinating when it is making medical diagnoses, approving loans, writing court submissions, deciding who is on a terrorist watch list or targeting a drone strike.
It is important to understand that hallucinations are a feature, not a bug, of LLMs. They are not going to go away. OpenAI’s CEO Sam Altman has claimed that hallucinations are “part of the magic”. That may be so, and in some cases hallucinations do not matter too much. If you are generating candidate logos for your start-up, you can simply keep going until you find one that you like. Similarly, if you are brainstorming names for your new puppy, a little AI creativity is fine. However, in most business contexts, hallucinations are a problem. You do not want your AI to hallucinate your monthly sales figures, send deliveries to a hallucinated address, or invent fictitious case law for your court submission. In industrial processes, it is normal to demand failure rates of 1% or lower, and sometimes much lower. With the very best LLM in mid-2025 hallucinating at a rate of 15%, the technology represents a major risk for enterprise deployment. It is crucial that companies examine their use cases for LLMs and ensure that they deploy them in situations where they make sense, rather than in ones where hallucinations are a problem.