Large language models (LLMs) are noted for their fluent and confident answers, and they increasingly perform well across a range of tests and benchmarks. For example, LLMs score highly on MMLU-Pro, a benchmark specifically designed to challenge them with around 12,000 general knowledge questions. They also score well on difficult mathematics Olympiad questions and on coding benchmarks. This is impressive, but just how robust are these benchmark achievements?
An interesting academic paper from Stanford University, published in January 2026, examined in detail the various categories of LLM reasoning failures, with some intriguing results. The study grouped reasoning failures into three categories: those caused by the architectural limitations of LLMs, application-specific failures, and robustness failures. In the architecture category, some models generate coherent-looking answers but collapse if the question is slightly reworded, a consequence of the pattern-recognition nature of LLMs. In the case of application-specific failures, a model may do well on a maths benchmark but be poor at scientific reasoning or planning, meaning that apparent skill at a specific benchmark does not transfer well to other areas. Robustness failures are where tiny changes in word order, context or phrasing can completely confuse an LLM in a way that would not usually confuse a human.
The issue is that LLM successes on specific benchmarks are fragile. What looks like reasoning may just be surface-level pattern matching. Benchmark designers should bear this in mind, for example by testing whether a model's answers survive the reordering or rewording of questions. Benchmarks are an important way to measure progress, so a more robust approach to them is welcome.
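The kind of robustness check suggested above can be sketched in a few lines: ask the same question in several surface forms and flag items where the model's answers disagree. The `ask_model` function below is a hypothetical stand-in for a real LLM call, stubbed here with a deliberately brittle "model" so the sketch is self-contained.

```python
# Sketch of a paraphrase-robustness check for a benchmark item.
# ask_model is a stub standing in for a real LLM call; it is
# deliberately brittle, recognising only one exact phrasing.

def ask_model(question: str) -> str:
    return "Paris" if question == "What is the capital of France?" else "unsure"

def robustness_check(paraphrases: list[str]) -> bool:
    """True only if the model gives the same answer to every paraphrase."""
    answers = {ask_model(q) for q in paraphrases}
    return len(answers) == 1

variants = [
    "What is the capital of France?",
    "Which city is the capital of France?",
    "France's capital city is which?",
]
consistent = robustness_check(variants)
print(consistent)  # the stubbed model is brittle, so this prints False
```

A real harness would substitute an actual model call for the stub and report the fraction of benchmark items that survive rewording, rather than a single boolean.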
The paper also discusses three types of reasoning: embodied, informal and formal. Embodied reasoning relates to physical interaction with the world. Informal reasoning is the everyday, flexible reasoning humans use in real-world problem-solving: drawing plausible conclusions from incomplete, ambiguous or context-dependent information. Formal reasoning is structured, rule-based reasoning that follows explicit logical or mathematical principles, where the correctness of conclusions depends strictly on adherence to the rules. LLMs may do very well in one type yet fail badly in another. Crucially, the reasoning failures appear to be fundamental and do not seem to improve with next-token-prediction scaling, so bigger models probably are not the answer, at least with current architectures.
The paper explains that LLMs produce answers that sound coherent and well-reasoned but are often dead wrong. This is a problem when deploying LLMs in domains that require certainty, such as legal analysis, medical advice and scientific interpretation, all of which depend on consistent and reliable reasoning. Yet the types of errors that LLMs make are unpredictable and hard to detect, which suggests that multiple verification layers and human oversight are essential. It is not enough that an LLM can reason well sometimes: the question is whether it can be trusted to reason in the situations that really matter. This is not a theoretical issue.
There have been numerous cases of LLMs providing flawed legal advice, such as hallucinated case law appearing in court submissions, and issues are also emerging in medical uses. For example, some doctors use LLMs to take summary notes of sessions with patients. Yet a large study published in October 2025 found that the LLMs tested made at least one significant error in 45% of summaries, in that case of news articles. While that study examined news summaries, it highlights the kinds of unpredictable errors that could also occur in clinical documentation. Indeed, tests in clinical settings suggest that LLM notetakers can be useful but need careful checking, and a September 2025 study identified issues with the use of LLMs in clinical documentation. All of this suggests that LLMs should be used as components within broader systems where validation and checking can be provided, rather than being trusted to give reliable and consistent answers on their own.
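One simple example of such a verification layer is a grounding check: before a machine-generated summary is accepted, every number it contains must also appear in the source text, otherwise the draft is routed to human review. The sketch below is illustrative only; `draft_summary` is a hypothetical stand-in for a real LLM call, stubbed so the example runs on its own.

```python
# Minimal sketch of one verification layer around an LLM summariser:
# any number in the draft that is not grounded in the source text
# triggers human review. draft_summary is a stub standing in for a
# real LLM call; it deliberately "hallucinates" a figure.
import re

def draft_summary(source: str) -> str:
    return "The trial enrolled 450 patients over 12 months."

def numbers_in(text: str) -> set[str]:
    """Extract every run of digits from the text."""
    return set(re.findall(r"\d+", text))

def summary_is_grounded(source: str, summary: str) -> bool:
    """True only if every number in the summary also appears in the source."""
    return numbers_in(summary) <= numbers_in(source)

source = "The trial enrolled 450 patients over 18 months."
summary = draft_summary(source)
needs_review = not summary_is_grounded(source, summary)
print(needs_review)  # the stub invented "12", so this prints True
```

A production system would layer several such checks (entities, dates, negations) and still keep a human in the loop; the point is that the LLM's output is treated as a draft to be validated, not as a final answer.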
This is an important academic paper, since it shows that scaling is not the solution to overconfident LLM answers. It also provides a useful framework to understand LLM limitations and should be helpful when designing future benchmarks. It reinforces the increasingly recognised issue that LLMs are always confident, but not always right. As LLMs start to be used in critical domains like medicine and law, understanding their limitations and designing robust systems around them is far more important than chasing higher benchmark scores.