As the large language models (LLMs) that underlie artificial intelligence chatbots like ChatGPT permeate more and more of daily life, people are increasingly coming to depend on them. Students use LLMs to help with their homework, programmers use them to debug or write code, and marketers use them to write descriptive content about their products and services. We are even seeing people use LLMs as synthetic friends or therapists, though that path can end badly. We ask LLMs questions and they confidently answer them, but how reliable are these answers, and where does their information come from?
In terms of reliability, it is important to understand a little about how large language models work. They are essentially probability engines. They are trained on huge amounts of data, including books, social media posts, web pages, audio files, images and videos, and they assign weights to the connections between the “neurons” in their neural networks. An algorithm adjusts these weights during training to minimise errors and pick out the patterns and connections in the data that are most relevant. This process allows the LLM to recognise meaningful relationships in text, and so learn linguistic patterns and rules of grammar. The process is very well explained in this FT article. LLMs deal in “tokens”, roughly four characters of text per token, and a model is trained on a vast number of these. Llama 3.1, for example, was trained on 15 trillion tokens of data. Since a typical book contains about 70,000 tokens, that is the equivalent of roughly 214 million books.
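As a quick sanity check, the arithmetic behind those figures is straightforward. The characters-per-token and tokens-per-book numbers below are the rules of thumb quoted above, not exact values:

```python
# Back-of-the-envelope arithmetic for the training-data figures quoted above.
tokens_trained_on = 15_000_000_000_000   # roughly 15 trillion tokens for Llama 3.1
tokens_per_book = 70_000                 # rough size of a typical book
chars_per_token = 4                      # common rule of thumb

book_equivalents = tokens_trained_on / tokens_per_book
print(f"{book_equivalents:,.0f} book-equivalents")            # about 214,285,714
print(f"{tokens_trained_on * chars_per_token:,} characters")  # about 60 trillion
```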
Once an LLM has been trained, it can answer questions or prompts from people. These questions may be factual or may ask the LLM to generate brand-new content, like writing an essay or a poem. Whatever the question, the LLM’s answers are fundamentally based on the data on which it was trained, and they will reflect the nuances and biases of that training data. The more training data it has on a subject, the more reliable its answers, and the reverse is also true. If you ask an LLM what the capital of France is, it will almost certainly answer “Paris”, since it has seen millions of references to that relationship in its training data. If you ask it something much more obscure, it will have far less data to work on, and its answers will be less reliable. The phenomenon of “hallucinations”, where LLMs produce fabricated or nonsensical answers, is partly caused by a lack of good-quality training data in a particular domain. Bear in mind that an LLM can hallucinate even in areas where it has a lot of training data, since it is just a probabilistic engine predicting the next word of text rather than a program that actively verifies facts. Nonetheless, hallucinations occur more often in obscure areas that lack training data. Such things may not matter greatly when an LLM is writing a student essay, but they are more significant if an LLM is used for something more serious, such as medical diagnosis. There are already hundreds of legal cases where lawyers have used LLMs to help them prepare court documents, only for it to be discovered that the LLMs had fabricated previous case law.
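To make the “probability engine” point concrete, here is a toy illustration of next-token prediction. The candidate tokens and their probabilities are invented for illustration; a real model scores tens of thousands of tokens at every step, and at no point does it check whether the chosen continuation is factually true.

```python
import random

# Toy next-token prediction for the prompt "The capital of France is".
# The model only scores candidate continuations; it never verifies facts.
next_token_probs = {
    "Paris": 0.92,       # overwhelmingly likely, given the training data
    "Lyon": 0.04,
    "Marseille": 0.03,
    "Berlin": 0.01,      # wrong, but still assigned a small probability
}

tokens = list(next_token_probs)
weights = list(next_token_probs.values())

print("Most likely next token:", max(next_token_probs, key=next_token_probs.get))
print("Sampled next token:", random.choices(tokens, weights=weights, k=1)[0])
```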
Bearing this in mind, the sources used to train LLMs clearly matter, and we would ideally want an LLM to be trained on high-quality data. So, where do LLMs actually get this data from? The richest source is publicly available websites such as Wikipedia, blogs and social media pages, scraped at scale, along with public collections of data such as Common Crawl, which contains billions of web pages. For program code, they use public sources like GitHub and Kaggle. Research by marketing company Semrush found that the single most cited website used by LLMs was Reddit, followed by Wikipedia and YouTube.
This raises an interesting question. Given the variable reliability of sites like Reddit, a social news platform, just how reliable will the answers of LLMs be? This is particularly interesting when it comes to LLMs writing program code. If LLMs learn from GitHub, and some of the code on GitHub is clearly better in quality than the rest, what quality of code will an LLM generate? Naturally enough, it will reflect the typical quality of the code in its training data, i.e. it will be of average quality. This notion was tested in some interesting 2025 University of Naples research, which took 4.4 million Python functions and identified around 5% of them as having clear flaws in either security or coding practice. Unsurprisingly, code generated by an LLM trained on this data contained a similar proportion of flaws (5.8%).
When the researchers removed the roughly 5% of poor-quality code from the training dataset and fine-tuned the LLM without the flawed code, things improved dramatically. The regenerated code contained just 2.2% flaws, compared with 5.8% previously. This is still not perfect, of course, but it is a major improvement.
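In outline, that kind of filtering step is simple to express. The sketch below is illustrative rather than a reconstruction of the Naples pipeline: the file names and the quality flags (assumed to come from a linter and a security scanner run beforehand) are hypothetical.

```python
import json

def is_clean(example: dict) -> bool:
    """Keep only functions with no recorded security or coding-practice findings."""
    return not example.get("security_findings") and not example.get("lint_findings")

def filter_dataset(in_path: str, out_path: str) -> None:
    """Read training examples (one JSON object per line), drop flawed ones, write the rest."""
    with open(in_path, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]
    clean = [ex for ex in examples if is_clean(ex)]
    with open(out_path, "w", encoding="utf-8") as f:
        for ex in clean:
            f.write(json.dumps(ex) + "\n")
    removed = 100 * (1 - len(clean) / len(examples))
    print(f"Kept {len(clean)} of {len(examples)} examples ({removed:.1f}% removed)")

filter_dataset("python_functions.jsonl", "python_functions_clean.jsonl")
```

The fine-tuning run then uses the cleaned file rather than the raw one.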
It is useful to look at program code because it is relatively easy to identify issues in code compared with, say, political speeches. Moreover, code generation has emerged as one of the prime use cases for LLMs. However, the lessons here extend beyond the coding domain. We can see that LLMs trained on better-quality data will generate better answers than those that are not. This is perhaps intuitively obvious, but the University of Naples research demonstrates it in a quantified manner.
One response to the issue of the reliability of LLM answers has been retrieval augmented generation (RAG). In this approach, an LLM is set up alongside a specialist dataset, for example a set of training manuals or documentation for a particular product. The specialist dataset is converted into a numerical form called vectors that the LLM can work with; when a question arrives, the most relevant passages are retrieved and added to the prompt, which then passes through the usual process of consulting the model’s training. In this way, the LLM essentially blends the specialist material with what it learned during training in order to produce its answer. Research in late 2024 tested ten LLMs and showed that RAG does indeed improve the reliability of LLM answers, though the degree to which it does so varied substantially by the particular model tested and the aspect of reliability being measured. RAG-enhanced answers may be 20-30% better than those from a “raw” LLM. There is some debate about just how necessary RAG will be in the long term, as the context windows (the number of tokens an LLM can consider at one time) of LLMs increase, but for now the approach still has practical benefits.
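To show the shape of the retrieval step, here is a minimal sketch. The embed() function is a toy stand-in for a real embedding model, and the documents, question and prompt format are invented for illustration; a production system would use an embedding API and a vector database.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for an embedding model: hash characters into a fixed-size unit vector."""
    vec = np.zeros(64)
    for ch in text.lower():
        vec[ord(ch) % 64] += 1
    return vec / (np.linalg.norm(vec) + 1e-9)

# The specialist dataset, converted to vectors ahead of time.
documents = [
    "To reset the device, hold the power button for ten seconds.",
    "The warranty covers manufacturing defects for two years.",
    "Firmware updates are installed automatically overnight.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity on unit vectors)."""
    scores = doc_vectors @ embed(query)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

question = "How do I reset the device?"
context = "\n".join(retrieve(question))
augmented_prompt = (
    f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
)
print(augmented_prompt)  # this augmented prompt is what gets sent to the LLM
```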
A broader issue for corporations implementing AI is the need to consider their own data quality. If LLMs are to be enhanced using RAG with company-specific datasets, then it is important that those datasets are actually of high quality. Otherwise, the LLM will merely be accessing dubious data, and its answers will be just as dubious as the data it is exposed to. This is an important issue, since data quality in large corporations is not great. Surveys of executive trust in corporate data are carried out every year, and despite decades of investment in data quality tools, the results are depressingly similar. One recent survey found that 67% of executives do not completely trust their own data, a finding highly consistent with similar surveys conducted years earlier. Data quality initiatives have typically focused on structured data (numbers) such as financial databases, but for AI initiatives the unstructured data in documents, websites and images is at least as important as the numerical data tucked away in ERP or supply chain databases. Concepts like accuracy, relevance, completeness and consistency are tricky to apply to data like text, images and video, which lack the precise, pre-defined format that a database schema has.
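For structured data, those concepts translate into checks that are easy to state in code. The sketch below runs a few such checks on a hypothetical orders table; the column names and rules are invented examples, not a complete data quality framework.

```python
import pandas as pd

# Hypothetical orders table containing some typical quality problems.
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1004],                                      # duplicate id
    "customer_email": ["a@example.com", None, "b@example", "c@example.com"],   # missing / malformed
    "amount": [120.0, -5.0, 80.0, 200.0],                                      # negative amount
})

checks = {
    "completeness: missing emails": int(orders["customer_email"].isna().sum()),
    "consistency: duplicate order ids": int(orders["order_id"].duplicated().sum()),
    "accuracy: negative amounts": int((orders["amount"] < 0).sum()),
}

for name, count in checks.items():
    print(f"{name}: {count}")
```

Writing equivalent checks for a folder of documents, a wiki or a video archive is far harder, which is precisely the difficulty described above.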
This is both a challenge and an opportunity for data quality vendors. With data quality now coming to the forefront of discussion because of its importance to AI initiatives, vendors have the chance to tap into executive interest (and budget) that was previously lacking. On the other hand, data quality initiatives have had mixed success even in the relatively well-controlled world of structured data, and it is even less clear how well they will fare when applied to unstructured data. The sheer scale of unstructured datasets makes manual curation difficult, if not impossible, though this does open up an opportunity for AI-driven data quality tools. In the future, having pristine corporate data may in itself be a competitive advantage. Despite the challenges, data quality is a problem that needs to be addressed. In the world of AI, not all data is created equal. Only with clean, reliable training data can LLMs deliver useful and correct results on a consistent basis.