We are all being deluged by news stories about artificial intelligence (AI) and the large language models (LLMs) at the heart of the latest AI trend, generative AI. But how do they actually work? Some level of understanding would seem useful for a technology that reportedly now creates over half of the content on the internet, is disrupting job markets, and is (for now) propping up the US stock market.
At its heart, an LLM is a series of layered neural networks that have been trained on vast amounts of data. Once training is complete, an LLM is a single file of parameters (think of this as its brain) plus a tiny accompanying file to run it. An LLM may reside in the cloud, but these two files can equally well be run on a personal computer, even one with no internet connection. The LLM is created by taking an enormous amount of text and images and compressing them into the model’s parameters (or “weights”). The unit of currency of an LLM is the token. A token is roughly equivalent to a word of English: the word “cat” is a single token, as is a full stop or comma, or a fragment of a longer word; 100 tokens is about 75 words. An LLM needs a huge number of tokens to learn effectively: a modern LLM might be trained on trillions of them. GPT-4 was trained on at least 10 trillion tokens, or around a hundred million books’ worth of data. To give a sense of scale, the human race has published perhaps 150 million books in total.
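To make the token idea concrete, here is a toy sketch in Python. The tiny vocabulary and the tokenise function are invented purely for illustration; a real tokeniser learns a sub-word vocabulary of tens of thousands of entries from data rather than using a hand-written table.

```python
# Toy illustration of tokenisation -- not a real tokeniser, just the idea
# that text becomes a sequence of integer token IDs.

# A made-up vocabulary mapping token strings to IDs. Real vocabularies are
# learned from data and include fragments of longer words (e.g. "un", "believ", "able").
vocab = {"the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5, ".": 6}

def tokenise(pieces):
    """Map each piece of text to its token ID."""
    return [vocab[p] for p in pieces]

print(tokenise(["the", "cat", "sat", "on", "the", "mat", "."]))
# [1, 2, 3, 4, 1, 5, 6] -- seven tokens for a six-word sentence plus a full stop
```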
This LLM creation process has a number of stages. The first is to create a base model by taking that mass of data and compressing it into the model’s parameters, a huge job that may take months and tens of millions of dollars of processing power. The second stage is to tune the model on a smaller volume of higher-quality, labelled data. A third stage is further fine-tuning to “align” the model: a reward model is used to reinforce desirable behaviours (such as refusing to give instructions for making bombs or chemical weapons) and to discourage harmful or low-quality outputs. After these three stages, you have a working LLM. Modern LLMs are typically built on a transformer architecture, which lets them pay attention to the relationships between all the words in a sentence, capturing context far better than earlier neural network designs.
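As a rough numerical sketch of the attention mechanism just mentioned, the short NumPy snippet below computes scaled dot-product attention over made-up vectors. The sizes and values are arbitrary, and it omits almost everything else a real transformer layer does; it only shows how every token scores its relevance to every other token.

```python
import numpy as np

# Minimal sketch of transformer "attention": every token scores its relevance
# to every other token, and those scores decide how much context flows between them.

np.random.seed(0)
n_tokens, d = 5, 8                      # five tokens, eight-dimensional vectors (made-up sizes)
queries = np.random.randn(n_tokens, d)  # what each token is looking for
keys    = np.random.randn(n_tokens, d)  # what each token offers
values  = np.random.randn(n_tokens, d)  # the information each token carries

scores = queries @ keys.T / np.sqrt(d)                       # relevance of every token to every other
weights = np.exp(scores - scores.max(axis=1, keepdims=True)) # softmax over each row...
weights /= weights.sum(axis=1, keepdims=True)                # ...so each row sums to 1
contextualised = weights @ values       # each token becomes a weighted mix of all tokens

print(weights.round(2))  # row i: how much attention token i pays to every token
```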
All this effort is to do one thing: predict the next word in a sequence. To do this, the model needs to learn the grammar of language and how words (tokens, really) relate to one another. For example, given the sentence “the cat sat on the…”, the next word is very likely to be “mat”. But how does it actually learn? A key process in these training stages is “gradient descent”. The model predicts outputs for a given input or prompt, measures how accurate those outputs are, and uses gradient descent to update its internal parameters (weights) so that they generate better predictions next time. The model compares its guess of the next word in a sentence with the actual next word and calculates the difference, or loss. A mathematical function then calculates the gradient, or slope, of the loss with respect to each weight; the gradient tells us how much changing that weight would increase or decrease the loss. This process is repeated billions of times until the loss is driven as low as it will go, at which point the model has learned to predict language effectively.

As an analogy, imagine that you are on a mountain in thick fog and cannot see ahead. At each step, you feel around and choose the direction in which the ground slopes downward most steeply under your feet. You repeat this carefully, step by step, until you reach the bottom of the mountain and level ground. You find your way down by noting the slope of the mountain at each step. In the same way, the LLM adjusts its internal parameters step by step, always moving in the direction that reduces the loss.
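The toy loop below shows the same idea on a single invented weight: make a prediction, compute a loss, take its gradient, and step downhill. A real model does exactly this, but for billions of weights at once over trillions of tokens.

```python
# A minimal sketch of gradient descent on one made-up weight.

weight = 5.0        # starting guess
target = 2.0        # the value that would make the prediction correct
learning_rate = 0.1

for step in range(50):
    prediction = weight                   # stand-in for "the model's guess"
    loss = (prediction - target) ** 2     # how wrong the guess is
    gradient = 2 * (prediction - target)  # slope of the loss with respect to the weight
    weight -= learning_rate * gradient    # step downhill: nudge the weight to reduce the loss

print(round(weight, 4))  # ends up very close to 2.0 -- the loss has been driven towards zero
```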
Learning to predict the next token, one at a time, is literally the entire aim of an LLM. Once achieved, it can produce fluent conversation, answer questions (at least within its training data), summarise a document, or write a poem, an essay or even program code in response to a prompt. It can do all of this, and yet it is simply predicting the next word or token, one at a time. It is an autocomplete function on steroids.
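A hypothetical sketch of that generation loop is below. The next_token_probabilities function is an invented stand-in for a trained model and returns made-up probabilities, but the shape of the process is real: predict a distribution over possible next tokens, pick one, append it, and repeat.

```python
import random

# Sketch of generation as repeated next-token prediction.

def next_token_probabilities(tokens):
    """Hypothetical stand-in for a trained model. A real model would return a
    probability for every token in its vocabulary, conditioned on the tokens so far;
    here we just return a fixed, made-up distribution."""
    return {"mat": 0.6, "sofa": 0.3, "roof": 0.1}

tokens = ["the", "cat", "sat", "on", "the"]
for _ in range(3):                                      # generate three more tokens
    probs = next_token_probabilities(tokens)
    choices, weights = zip(*probs.items())
    tokens.append(random.choices(choices, weights)[0])  # sample one token, append it, repeat

print(" ".join(tokens))
```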
One consequence of this process is that trying to work out how an LLM came up with its answer is a fruitless exercise. LLMs are black boxes: we cannot truly understand their internal reasoning, which is just a mass of numbers and statistics. If you ask an LLM to show its reasoning, it will give you an answer, but this is just another prediction: such explanations are post hoc and bear no relation to the model’s unfathomably complex internal workings. While we can follow the logic of a conventional computer program, we cannot explain the internal reasoning of an LLM. There is ongoing research into prising the lid open a little, but for now the inner workings of LLMs remain opaque.
Recent LLMs have been trained or prompted to respond in stages, so-called “chain of thought” reasoning. Here, an LLM may come up with an initial answer and then critique that answer in a separate pass, possibly repeating the process several times. This technique can produce higher-quality answers than answering in one go, particularly on complex reasoning tasks.
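A hypothetical sketch of such an answer-critique-revise loop is below. The ask function is not a real API; it is a canned stand-in for a call to an LLM, included only to show the structure of the loop.

```python
# Hypothetical sketch of the "answer, then critique, then revise" pattern.

def ask(prompt: str) -> str:
    """Stand-in for a call to an LLM; here it just returns a canned placeholder."""
    return f"[model reply to: {prompt[:40]}...]"

def answer_with_review(question: str, rounds: int = 2) -> str:
    draft = ask(f"Answer step by step: {question}")            # initial answer
    for _ in range(rounds):
        critique = ask(f"List any mistakes in this answer:\n{draft}")   # separate critique pass
        draft = ask(f"Rewrite the answer, fixing these issues:\n{critique}\n\nOriginal:\n{draft}")
    return draft

print(answer_with_review("How many days are there in a leap year?"))
```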
One key point is that LLMs are probabilistic in nature. If you ask an LLM the same question a hundred times, you will not get the same answer a hundred times. This is fundamentally different from the deterministic computer programs we are used to, such as Excel, where the same calculation will give the same answer time after time. It is worth noting that this variability persists even when the internal “temperature” setting of an LLM is set to zero: the responses become much less varied, but not perfectly repeatable, for a number of technical reasons. This key point can help you understand which applications are suitable for LLMs and which are not. LLMs are excellent at handling text and images and are creative in nature. If you want exact consistency, then an LLM is probably the wrong technology to choose.
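To make the temperature point more concrete, the small sketch below uses made-up scores for three candidate next tokens. Temperature rescales those scores before they are turned into probabilities: at low temperature the top token dominates, but the answer is still a draw from a distribution rather than a lookup.

```python
import numpy as np

# Sketch of why answers vary: the model scores candidate next tokens (logits)
# and then *samples* from a temperature-scaled probability distribution.

tokens = ["mat", "sofa", "roof"]
logits = np.array([2.0, 1.0, 0.2])            # made-up scores for the three candidates

def sample(temperature):
    scaled = logits / max(temperature, 1e-6)  # temperature 0 approximated by a tiny value
    probs = np.exp(scaled - scaled.max())     # softmax: turn scores into probabilities
    probs /= probs.sum()
    return np.random.choice(tokens, p=probs)

print([sample(1.0) for _ in range(5)])   # varied picks from run to run
print([sample(0.01) for _ in range(5)])  # almost always "mat", but still sampled, not looked up
```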