There have been several studies showing that the current failure rate of AI projects is shocking. The highest-profile is a 2025 MIT study, based on hundreds of interviews, which found that 95% of AI projects fail. This stark number is not actually so different from other estimates, which have ranged from 80% (RAND) to 85% (Gartner) to 88% (IDC), but the MIT study is the most comprehensive so far.
I believe that one key reason why these projects are failing, quite apart from the notorious problem of hallucinations, is that many people do not understand that large language models (LLMs) are inherently probabilistic in nature. Unlike regular computer programs or spreadsheets, they do not always produce the same answer to the same question or input. This is a major issue, since most business processes assume that a computer program will do exactly that. If you run a calculation in Excel to multiply the contents of a couple of cells, you do not expect it to occasionally spit out a different answer from the one it usually gives. Excel and normal computer programs are “deterministic”: they follow a set of rules and are entirely consistent in their results, given identical inputs.
For most business processes, this is exactly what you want. We do not want accounting systems that are creative, ones that occasionally produce invoices with unusual totals. We do not want logistics programs that occasionally route our deliveries along a creative rather than an efficient route, or send them to an entertaining new address from time to time. We don’t want our payments to occasionally be wired to an imaginative and exotic but wrong bank account.
Some AI boosters seem unwilling to accept this core feature of LLMs, and will point out that LLMs can be made less creative by adjusting the model’s “temperature” parameter. Inside the LLM, dozens of layers of neural networks produce a mass of paths or branches of possible responses, with probabilities weighted by a combination of the parameters of the model and the data that the LLM was trained on. The LLM builds its answer to a prompt one token at a time (a token is a bit like a word in English and is the basic unit of operation of an LLM), at each step deciding which token is the most probable continuation of its response. A familiar example of this is when you start to type a Google query and it “auto-completes” your text. If you start typing “the cat sat on…”, then “the mat” may be the most likely way to complete that query, and Google will suggest it.
The exact response an LLM chooses depends on its weightings and settings, including its “temperature”. The lower the temperature, the more conservative the LLM becomes in its answers, producing the most likely responses with little variation. The higher the temperature setting, the more likely the LLM is to select a less common path in its responses.
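To make this concrete, here is a minimal sketch in Python of how temperature reshapes the next-token probabilities before a token is sampled. The token names and scores are invented for illustration and are not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng()

# Invented candidate continuations and model scores (logits) for illustration only.
tokens = ["the mat", "the sofa", "the fence", "a cushion"]
logits = np.array([4.0, 2.5, 1.5, 0.5])

def sample_next_token(logits, temperature):
    # Dividing the scores by the temperature sharpens the distribution (low T)
    # or flattens it (high T) before a token is sampled.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return tokens[rng.choice(len(tokens), p=probs)]

for t in (0.2, 1.0, 1.5):
    picks = [sample_next_token(logits, t) for _ in range(8)]
    print(f"temperature={t}: {picks}")
```

At a low temperature the samples are almost always “the mat”; at a higher temperature the less likely continuations start to appear.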
The temperature of an LLM is typically set to around 1 or just below by default, but can be adjusted by the model developers (not usually by end users) to be higher or lower, depending on whether more or less creative answers are wanted. For example, an LLM used exclusively for brainstorming might have its temperature set a little higher to stimulate a wide range of interesting responses. So, for those situations where we want consistency, maybe we could just set the temperature to 0. In that case, the LLM will always select the most probable next token, something called “greedy decoding”. Would this make an LLM entirely deterministic? It turns out that it would not. This seems surprising, but there are several reasons why an LLM with its temperature set to zero will still not necessarily produce an identical answer to the same prompt every time.
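In practice, temperature zero is handled as a special case: rather than dividing by zero, the decoder simply takes the single highest-scoring token at each step. A minimal sketch, continuing the toy example above:

```python
import numpy as np

tokens = ["the mat", "the sofa", "the fence", "a cushion"]
logits = np.array([4.0, 2.5, 1.5, 0.5])  # the same invented scores as before

def greedy_next_token(logits):
    # "Greedy decoding": always pick the single most probable token,
    # which is what a temperature of zero amounts to in practice.
    return tokens[int(np.argmax(logits))]

print(greedy_next_token(logits))  # always "the mat" for these fixed scores
```

With fixed scores this always returns the same token; the rest of this section is about why, in a real deployed model, the scores themselves are not always bit-for-bit identical between runs.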
Firstly, the use of floating-point arithmetic in neural network calculations can lead to tiny rounding errors. Because hardware such as graphics processing units (GPUs) performs massively parallel computations, the order of floating-point operations can vary from run to run, causing tiny differences in intermediate and final results. Such discrepancies can occasionally alter which token has the highest probability when two candidates are almost tied, making the output non-deterministic even when the temperature is set to zero. This phenomenon is explained in this paper.
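A small, self-contained demonstration of the underlying effect (this is only an illustration of floating-point behaviour, not code from the paper): summing the same float32 numbers in different orders, as a parallel reduction on a GPU might, can give slightly different totals.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(1_000_000).astype(np.float32)

total_forward = np.sum(values)           # one summation order
total_reversed = np.sum(values[::-1])    # the same numbers, summed in reverse
# Partial sums combined afterwards, as a parallel reduction might produce them.
total_chunked = sum(np.sum(chunk) for chunk in np.array_split(values, 8))

# The three totals may differ in the last few bits, even though the inputs are identical.
print(total_forward, total_reversed, total_chunked)
```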
Another reason can be the tie-breaking logic in the LLM’s decoding library. When multiple candidate tokens have identical (or almost identical) maximum probability, that tie-breaking logic may decide between them in a non-deterministic way. There are other reasons too. Even before an LLM selects a token, subtle sources of randomness, such as thread scheduling or parallelisation race conditions, can influence the precise output of the model’s computations.
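As a toy illustration (with invented numbers), a difference on the scale of a single float32 rounding error is enough to flip which of two near-tied tokens wins the argmax:

```python
import numpy as np

# Two candidate tokens with tied scores, plus one clear loser.
logits_run_a = np.array([3.0, 3.0, 1.0], dtype=np.float32)
logits_run_b = logits_run_a.copy()
# A perturbation around the size of a float32 rounding error at this magnitude,
# standing in for a tiny difference in accumulated arithmetic between two runs.
logits_run_b[1] += np.float32(1e-6)

# The "winning" token index changes between the two runs.
print(np.argmax(logits_run_a), np.argmax(logits_run_b))  # 0, then 1
```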
Some model architectures can also cause non-determinism. In the Mixture-of-Experts (MoE) approach, the network is divided into multiple “expert” sub-models rather than one monolithic model. This architecture (which may well be used in GPT-4, for example) can introduce additional sources of variability: expert routing can involve probabilistic routines at inference time, especially when the model runs on distributed or dynamic infrastructure. Server-side updates, model version changes, or backend infrastructure changes can also affect consistency. Libraries like PyTorch or TensorFlow may use non-deterministic algorithms by default for performance reasons. There are various papers that explore this in more detail. The bottom line is that, even with a temperature of zero, an LLM can never be completely deterministic (rules-based) in the way that Excel or an expert system is.
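For developers running open models locally, frameworks do expose switches that reduce (though do not eliminate) this variability. The sketch below shows the kind of settings PyTorch provides; these are real PyTorch calls, but whether a hosted LLM service uses anything like them is outside the user’s control.

```python
import torch

torch.manual_seed(42)                     # fix the pseudo-random seed
torch.use_deterministic_algorithms(True)  # prefer deterministic kernels; errors on ops that have none
torch.backends.cudnn.benchmark = False    # stop cuDNN auto-tuning, which can pick different kernels per run
```

Even with all of these set, a large model served across many GPUs, with dynamic batching, expert routing and periodic backend updates, is not guaranteed to be bit-for-bit reproducible from the caller’s side.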
The realisation that LLMs are inherently probabilistic in nature is quite a profound one. LLMs do not ship with their temperature set to zero because the answers they give at zero are usually less useful than those produced at the normal default setting of between 0.7 and 1. The text they produce at zero temperature seems robotic and unengaging, and remember that text generation is the main purpose of an LLM. Even setting the temperature to zero does not cure hallucinations, nor does it make an LLM entirely deterministic; it just makes it more consistent than a higher temperature setting would.
If you want a consistent answer to an input time after time, then an LLM is not the right tool. Instead, you can use a machine learning model or a regular computer program. LLMs are designed to be creative, and we need to accept this and use them in circumstances where the odd hallucination is not a problem. If we want some text roughly translated to another language, or we need an essay about the Industrial Revolution or a set of ideas for kitten names, it does not usually matter if there is some slight variation in the output of the LLM. Indeed, in many cases, a little creativity is just what we want. As OpenAI’s CEO Sam Altman has said, “hallucinations are part of the magic”.
If we try to use LLMs in situations that demand consistency, then we are setting ourselves up for failure, like choosing a hammer rather than a screwdriver to put a screw into a plank of wood. Sometimes we need an LLM for a task, sometimes a reinforcement learning model, sometimes a machine learning model, and sometimes a computer program or spreadsheet. There is a range of different tools, AI and otherwise, and we need to get smarter at applying them where they are best suited, or we will continue to see disturbingly high failure rates in AI projects.