“Large language models” (LLMs) underlie all the current generative artificial intelligence (gen AI) tools such as ChatGPT, Claude, Perplexity, Gemini and Grok. The same underlying neural-network technology also powers AI image and video tools like Midjourney, DALL-E, Leonardo, Firefly and Stable Diffusion. Before we get into the main thrust of this blog, which is training data for LLMs, it is useful first to understand a little about how LLMs work.
These technologies use layered “neural networks” that loosely mimic the way the human brain works, with “neurons” that are connected to one another, each connection having a numeric weight. Data, such as a user prompt, is fed into the initial layer, and calculations are performed at each layer. Each neuron receives input from the neurons it is connected to, takes into account the weight of each connection, and produces an output. This output is in turn fed to the next layer, and the process continues. After dozens of such layers, a final output is produced.

When an LLM is trained, there is a supervised learning stage where outputs are scored against human-labelled data. For example, a human team may look at thousands of pictures and label the ones that, say, contain a cat or do not contain a cat. The network adjusts its weights based on the labels to better reproduce the desired results, such as identifying pictures that contain a cat. It is also possible to train a neural network on unlabelled, raw data, which is called unsupervised learning. Here, the network learns to identify patterns in the data, for example by clustering items together based on similarity. Other mathematical techniques, such as autoencoding and principal component analysis, are also used. Neural networks can be trained using supervised learning, unsupervised learning, or both.

When training on text, the LLM breaks the text down into “tokens”, such as a word or part of a word, and then gives each token a numerical representation. In the sentence “a dog barks loudly at a cat”, for example, the word “barks” ends up more strongly associated with “dog” than with the other words.
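To make the idea of layers and weighted connections concrete, here is a toy sketch in Python using NumPy. The sizes and random weights are arbitrary and purely illustrative, and real LLMs are vastly larger, but the flow is the same: data goes into the first layer, each layer combines its weighted inputs and applies a simple non-linearity, and the result feeds the next layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny "input", standing in for the numerical representation of some tokens.
x = rng.normal(size=4)

# Connection weights: one matrix per layer (purely illustrative sizes).
W1 = rng.normal(size=(8, 4))   # weights into an 8-neuron hidden layer
W2 = rng.normal(size=(2, 8))   # weights into a 2-neuron output layer

# Each neuron sums its weighted inputs and applies a non-linearity (ReLU here),
# and the layer's output becomes the next layer's input.
hidden = np.maximum(0, W1 @ x)
output = W2 @ hidden
print(output)
```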
It turns out that LLMs need huge amounts of training data in order to be useful. Powerful LLMs like ChatGPT and Claude are trained on several trillion tokens, roughly equivalent to trillions of words of text. Since a novel contains around 100,000 words, that works out at something like a hundred million books’ worth of text to train a modern LLM. Only about 150 million books have ever been published, so today’s LLMs are operating near the limits of human-generated book data. Of course, there are other kinds of text than books: magazines, articles, emails and websites. But LLMs are ravenous for training data, and such sources are finite.
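A quick back-of-the-envelope calculation, using the round numbers in the text rather than any particular model’s real figures, shows where the estimate of a hundred million books comes from:

```python
# Illustrative round numbers only: "several trillion" tokens taken as ~1e13 words,
# and ~100,000 words per novel.
words_per_novel = 100_000
training_words = 10_000_000_000_000

books_needed = training_words / words_per_novel
print(f"{books_needed:,.0f} books")   # -> 100,000,000 books
```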
This problem has led to the idea of synthetic data: training data that is itself generated by LLMs. The approach is to produce data that has the broad characteristics, or statistics, of a set of human data. Synthetic data can be used to bulk out real data, which may be very useful in a specialised area where data is sparse. It is also useful when certain training data is sensitive or personal, such as bank account details or social security numbers: synthetic data can stand in for the real records, so that the LLM is trained on anonymised data.
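As a loose sketch of the idea (the field names and the log-normal distribution below are invented for illustration, not taken from any real pipeline), one can fit the broad statistics of a sensitive column and then emit fresh, anonymised records drawn from that fit:

```python
import numpy as np

rng = np.random.default_rng(42)

# "Real" sensitive records: account balances whose ids must stay private.
# (Made-up data for illustration.)
real_balances = rng.lognormal(mean=8.0, sigma=1.0, size=1_000)

# Fit the broad statistics of the real data.
mu, sigma = np.log(real_balances).mean(), np.log(real_balances).std()

# Synthetic records: brand-new ids, balances drawn from the fitted distribution,
# so the statistics are preserved but no real account appears in the output.
synthetic_ids = [f"SYN-{i:06d}" for i in range(5)]
synthetic_balances = rng.lognormal(mean=mu, sigma=sigma, size=5)

for acct, bal in zip(synthetic_ids, synthetic_balances):
    print(acct, round(float(bal), 2))
```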
However, there are problems and limitations with synthetic data, some of which are only now emerging. Firstly, models trained entirely on synthetic data tend to behave erratically. Synthetic data turns out to be more uniform than real-world data, so models trained on it deal less well with nuance and edge cases, while any biases in the original data are amplified. Ultimately, there is a tendency towards “model collapse”: the LLM’s answers lack diversity and contain many errors. An LLM showing model collapse may answer a question about church architecture with unrelated information about animals, or generate meaningless answers.
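A toy simulation gives a feel for why this happens. Here each “generation” is just a simple Gaussian model fitted to a finite sample of the previous generation’s output. This is only a cartoon of the effect, not how LLMs are actually trained, but on average the spread of the data drifts downwards over generations, i.e. diversity is gradually lost:

```python
import numpy as np

rng = np.random.default_rng(0)

# Start from a small sample of "human" data.
data = rng.normal(loc=0.0, scale=1.0, size=20)

for generation in range(1, 31):
    # Fit a very simple model to the current data...
    mu, sigma = data.mean(), data.std()
    # ...then train the next generation only on samples from that model.
    data = rng.normal(loc=mu, scale=sigma, size=20)
    # The fitted spread tends to shrink over generations: a toy "model collapse".
    print(f"generation {generation:2d}: std = {sigma:.3f}")
```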
Recent research has shown a further troubling issue with synthetic data. A model fine-tuned on synthetic data can pick up subliminal traits from the model that generated it, somehow retaining characteristics of that original model. Some of these traits may be harmless: one model that had been tuned to display a fondness for owls passed this trait on to the models trained on its output. Sometimes, however, there is a darker side. In one extreme example, such an LLM was asked by a user what to do if they had had enough of their husband. The LLM suggested murdering him in his sleep and then reminded the user to take care when disposing of the body.
These emerging limitations of synthetic data for training could be quite a problem for the AI industry. Many industry experts have noted that we are getting near the limits of human-generated data that can be used for training LLMs: there are only so many books being written and images being drawn by humans. According to a Gartner report, just 1% of AI training data was synthetic in 2021, but by the end of 2024 this had risen to around 60%. Perhaps half of all internet content in 2025 is generated by LLMs, such as machine translations of articles, and this proportion is rising. Worse, some of the available well of data is being deliberately poisoned. Some digital artists, frustrated by the harvesting of their intellectual property to train AI models, have started to sabotage the images they publish. For example, a tool called Nightshade lets artists change pixels in their images in subtle ways that are invisible to humans but confuse the models trained on them. In cyber-security research, there is concern that criminals may sneak small amounts of incorrect data into AI training datasets to produce undesirable outcomes.
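Nightshade itself works by perturbing image pixels, but the general idea of slipping a small amount of corrupted data into a training set can be sketched very simply. The example below is purely illustrative, with made-up data: it flips the labels of just 2% of a dataset, the kind of small, hard-to-spot corruption the cyber-security concern is about.

```python
import numpy as np

rng = np.random.default_rng(1)

# A clean training set of 1,000 examples with binary labels (made-up data).
n = 1_000
labels = rng.integers(0, 2, size=n)

# Poison a small fraction: flip the labels of 2% of randomly chosen examples.
poison_fraction = 0.02
poison_idx = rng.choice(n, size=int(n * poison_fraction), replace=False)

poisoned_labels = labels.copy()
poisoned_labels[poison_idx] = 1 - poisoned_labels[poison_idx]

print(f"poisoned {len(poison_idx)} of {n} examples "
      f"({(poisoned_labels != labels).mean():.1%} of labels changed)")
```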
The issues discussed here suggest that synthetic data may not be the dream solution it was once hoped to be for data-hungry LLMs running out of natural, human-generated training data. Indeed, it may be that LLMs themselves are hitting scaling limits. At one point, it was assumed that LLMs would simply get better and better as more computing power and more training data were thrown at them. In practice, progress seems to be hitting the law of diminishing returns, with fairly marginal improvements in recent models compared with earlier ones, and an increase in hallucination rates. There may of course be new developments that move things along, such as more efficient models, or entirely new paradigms in AI, such as the hierarchical reasoning model unveiled in July 2025. That model performed very well on tricky reasoning tasks yet was trained on just a thousand examples and has 27 million parameters, or weights, vastly fewer than the reported 1.76 trillion parameters of GPT-4. It is a very new model and more work will be needed to validate it, but perhaps, in the fast-moving world of AI, bigger is not always better.







