Large language models (LLMs) have developed significantly since ChatGPT burst onto the public scene in late 2022, built on the GPT-3.5 model. There have been increases in scale, with GPT-4 several times larger than GPT-3 in terms of parameter count. Context windows have grown from a few thousand tokens to over a million. LLMs have also gained multimodal abilities, allowing them to interpret images and video as well as text. Smaller but highly efficient models such as Mistral and DeepSeek have appeared. However, one of the more important developments has been the advent of “reasoning models” in late 2023 and 2024.
The idea of a reasoning model is to encourage an LLM to break a task down into component steps and to query external tools and data rather than depending entirely on its training data. Unlike traditional LLMs, which rely purely on statistical text prediction, reasoning models explicitly decompose complex problems into smaller logical steps, often consulting tools, APIs, or retrieval systems as part of the reasoning chain.
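To make the contrast concrete, here is a minimal sketch in Python of that plan-then-act pattern. The call_llm and run_tool helpers are hypothetical placeholders rather than any vendor's actual API; a real system would route call_llm to a chat-completion endpoint and run_tool to a genuine calculator or search service.

```python
# Hypothetical sketch of a "plan, then act" reasoning loop.
# call_llm() is a stand-in for a real model call and returns a canned plan here;
# run_tool() plays the role of an external calculator the model can delegate to.

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call; returns a mock step-by-step plan."""
    return "1. Identify the figures\n2. CALCULATE: 1234 * 0.17\n3. Report the result"

def run_tool(expression: str) -> str:
    """Toy 'calculator' tool, standing in for an external service."""
    return str(eval(expression, {"__builtins__": {}}))  # restricted eval, sketch only

def answer(question: str) -> str:
    plan = call_llm(
        "Break this task into numbered steps, prefixing any arithmetic "
        f"with CALCULATE:\n{question}"
    )
    worked_steps = []
    for step in plan.splitlines():
        if "CALCULATE:" in step:
            expr = step.split("CALCULATE:", 1)[1].strip()
            worked_steps.append(f"{step} -> {run_tool(expr)}")  # delegate, don't guess
        else:
            worked_steps.append(step)
    return "\n".join(worked_steps)

print(answer("What is 17% of 1234?"))
```

The point of the sketch is simply that arithmetic is handed to a tool rather than guessed from text patterns, which is the behaviour the paragraph above describes.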
By 2025 this idea had developed into a new class of LLMs, exemplified by OpenAI’s o1 series, Anthropic’s Claude models with extended thinking, and Google’s Gemini 2.5 Pro. These tools can conduct step-by-step, structured problem-solving, which is especially useful for tasks such as mathematics. They have performed well on various benchmark tests, outperforming earlier LLMs. A summary of the current benchmarks of the “intelligence” of LLMs can be found here.
There are several elements to these advances. Chain-of-thought prompting instructs an LLM to explain its reasoning step by step, breaking a problem down into intermediate steps. The so-called “tree of thought” architecture has the model explore several reasoning paths in parallel and then select the most plausible one. Hierarchical reasoning integrates a “slow” reasoning model for abstract reasoning with a “fast” model for detailed computation. These architectures represent perhaps the first tentative step toward structured reasoning in LLMs, rather than the mere linguistic mimicry of earlier models. A simplified illustration of the first two techniques appears below.
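The following sketch, again using a hypothetical ask() placeholder in place of a real model call, contrasts a single chain-of-thought prompt with a crude tree-of-thought step that samples several reasoning paths and keeps the one a scoring pass prefers. The score() function is a dummy; a real system would use the model itself or a separate verifier to grade each path.

```python
# Hedged illustration of chain-of-thought prompting versus a simplified
# tree-of-thought selection step. ask() is a hypothetical placeholder for an
# LLM call; score() is a dummy where a verifier model would normally sit.

import random

def ask(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder for a model call; returns a mock reasoning trace."""
    return f"(mock trace, T={temperature}) Step 1 ... Step 2 ... Answer: 80 km/h"

# Chain of thought: one pass, explicitly asked to reason step by step.
cot_answer = ask(
    "Think step by step and show each intermediate result.\n"
    "Q: A train travels 120 km in 1.5 hours. What is its average speed?"
)

# Tree of thought (simplified): sample several independent reasoning paths at a
# higher temperature, then keep the path judged most plausible.
def score(trace: str) -> float:
    return random.random()  # dummy scorer; a real verifier would grade the trace

paths = [ask("Reason step by step about the same question.", temperature=0.9)
         for _ in range(3)]
best_path = max(paths, key=score)

print(cot_answer)
print(best_path)
```

Hierarchical reasoning is not shown, but it would amount to the “slow” model producing the plan and the “fast” model filling in each step.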
These approaches lift LLM performance by forcing decision making to proceed in structured stages rather than relying on the single-pass text prediction of earlier models. This has certainly allowed the new generation of models to perform much better on mathematical benchmarks than their predecessors. However, they have drawbacks too. The models perform well on moderately complex problems but collapse beyond a certain threshold of complexity. They also produce inconsistent solutions, a consequence of the probabilistic nature of LLMs. They simulate plausible “trains of thought” but are not actually reasoning. There is no generalised learning going on: an Apple study found that even tiny changes in the wording of problems caused dramatic drops in accuracy. LLMs are excellent at pattern recognition in text, but their reasoning is essentially brittle. Because their internal processes are probabilistic, they may “hallucinate logic”, inventing reasoning steps that appear analytical but produce wrong conclusions.
For all the advances in LLM approaches, some issues remain stubbornly recalcitrant. Hallucinations are one of the major limitations of the technology, and the latest, more sophisticated reasoning models actually hallucinate more than older models. A couple of very recent news items illustrate this thorny issue. The BBC analysed a large number of news items that had been summarised by various leading LLMs, including ChatGPT, Gemini and Perplexity. Just over half (51%) of the AI summaries had significant errors: 19% of the articles contained blatant factual errors such as wrong dates or incorrect statements, and 13% contained references to things that simply didn’t exist, a classic symptom of LLM hallucination. A large October 2025 study by the European Broadcasting Union found that AI assistants misrepresented news content 45% of the time: 20% had hallucinated content, and 31% had sources that were inaccurate, missing or misleading.
An October 2025 study of AI in medical advice found that five different LLMs regularly gave incorrect medical advice and were overly sycophantic in their responses. This echoed a July 2025 study that found LLMs struggled with medical ethics questions. Another 2025 study, of AI in ophthalmology, found a series of issues, including knowledge gaps and subtle errors in the advice given in many cases. An August 2025 study of LLMs in emergency medicine raised similar concerns.
There is no shortage of money being invested in AI, and advances are clearly being made, of which reasoning models are one significant example. However, LLMs continue to struggle with hallucinations, making them unsuitable for many business tasks, and they still have wide-ranging security issues that seem difficult to address. Hallucinations and security appear to be two of the most fundamental problems plaguing LLM-based technology, and may account for much of the current, frankly lamentable 5% success rate of AI projects reported in a recent MIT study. Other issues outside the control of the AI companies include variable data quality in enterprises, integration with legacy IT systems, lack of explainability of models, lack of skills and training, and cultural resistance.
These are all barriers to successful adoption, but hallucinations and security are core problems that have plagued LLMs since the outset. It is good to see AI technology moving forward, but the industry needs to address these two fundamental issues far more effectively if the technology is to fulfil its promise.