You may be aware that large language models (LLMs) are trained on data, but did you know that there is a multi-billion-dollar industry of human data labelling supporting this? Behind every LLM lies an invisible workforce that toils away, labelling text, images and videos to help train the models.
A large language model undergoes three stages of training. The first is “unsupervised learning” (often called pre-training), in which it ingests vast amounts of data: books, websites, images, code and videos. In this stage the LLM learns grammar, facts and reasoning patterns. This is very computationally intensive and can take months. GPT-4 was reportedly trained on around 13 trillion “tokens” (a token is roughly a short word or part of a word), which translates to around 130 million books’ worth of data.
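The “130 million books” figure follows from simple arithmetic. As a rough sketch (the tokens-per-book figure is an illustrative assumption, not from any official source):

```python
# Back-of-the-envelope check of the "130 million books" figure.
# Assumes ~100,000 tokens per book (roughly a 75,000-word book at
# ~1.3 tokens per word) -- an illustrative estimate, not a measured value.
total_tokens = 13_000_000_000_000  # 13 trillion training tokens
tokens_per_book = 100_000

books_equivalent = total_tokens // tokens_per_book
print(f"{books_equivalent:,} books")  # 130,000,000 books
```

Change the tokens-per-book assumption and the book count scales accordingly; the point is simply the staggering scale of pre-training data.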
The second stage, the focus of this article, is supervised learning, where the LLM is trained (this is the supervised bit) on human-labelled data to produce preferred responses to prompts. This makes it better at answering questions, summarising documents, coding and producing meaningful images. By the end of this stage the LLM behaves more like an assistant than a text predictor. The third stage, which we won’t cover in detail in this blog, is reinforcement learning, where humans rate the model’s outputs and a reward system encourages helpful, polite and safe responses. This is the stage where “guardrails” are added to try to prevent an LLM from, for example, giving out instructions for making bombs.
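To make the supervised stage concrete, a training example is essentially a prompt paired with a human-approved response. A minimal sketch of what such a record might look like (the field names and helper here are hypothetical, not any vendor’s actual schema):

```python
# A hypothetical supervised fine-tuning example: a prompt paired with
# a preferred, human-written response. Field names are illustrative.
example = {
    "prompt": "Summarise this order-status email in one sentence.",
    "preferred_response": "Your order shipped on 12 May and should "
                          "arrive within 3-5 business days.",
    "labeller_id": "annotator_0042",  # hypothetical tracking field
}

def to_training_text(ex: dict) -> str:
    """Join prompt and response into a single training string,
    the rough shape a fine-tuning pipeline might feed the model."""
    return f"User: {ex['prompt']}\nAssistant: {ex['preferred_response']}"

print(to_training_text(example))
```

Millions of such pairs, written or vetted by human labellers, are what nudge a raw text predictor towards assistant-like behaviour.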
So, let’s concentrate on the supervised learning stage of LLM training. It depends on large amounts of human-labelled data, but where does that come from? Labelling can be as simple as looking at an image and saying “this is a picture of a cat” or “there is no cat in this picture”, or drawing a box around the cat to identify it. For text, it might mean classifying social media posts as “positive”, “neutral” or “negative”, sorting documents into categories like “product information” or “order status”, or flagging text as “safe”, “explicit” or “offensive”. For an audio file, it may mean transcribing spoken words, or classifying sounds such as “dog barking” or “doorbell ringing”. This work is crucial for autonomous vehicles too: it is what helps a Tesla robotaxi distinguish a stop sign from a cyclist.
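The labelled records described above can take many shapes depending on the modality. A sketch of what a batch might look like, with a simple label count of the kind a quality-control step could run (the schemas and label sets here are hypothetical examples, not a real vendor format):

```python
from collections import Counter

# Hypothetical labelled records across the modalities mentioned above.
labelled_records = [
    {"type": "image", "file": "photo_001.jpg",
     "label": "cat", "bounding_box": [34, 50, 210, 180]},  # x, y, w, h
    {"type": "text",
     "content": "Loved the new phone, battery lasts ages!",
     "label": "positive"},
    {"type": "audio", "file": "clip_007.wav",
     "label": "dog barking"},
]

# Count how often each label appears -- a basic sanity check a
# quality-control pipeline might perform on a batch of annotations.
counts = Counter(r["label"] for r in labelled_records)
print(counts)  # Counter({'cat': 1, 'positive': 1, 'dog barking': 1})
```

In practice each record would also carry annotator IDs, timestamps and review status, since (as discussed below) quality control is a major cost in its own right.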
So, who actually does this labelling? It turns out there is a whole industry dedicated to data labelling for AI. Estimates of its size vary, but it was worth at least $4 billion in 2024.
Different skill levels are required depending on the type of data. Most of us could easily label an image as “cat” or “no cat”, but labelling a medical image or legal contract in detail requires domain-specific expertise. Some jobs are outsourced to low-wage countries, but not all are minimum-wage work. A data operations manager, who oversees the whole data pipeline, may earn $150,000 a year; medical professionals labelling clinical notes may earn $150-$300 an hour, and a lawyer labelling contracts $200-$400 an hour. The rise of agentic AI has created new roles, with specialists paid to assess whether AI agents are making good decisions across complex workflows.
At the other end of the value chain, companies hire contractors in countries like India, the Philippines and Kenya for as little as $2 an hour. This is controversial: hiring companies use technology to monitor workers, tracking mouse movements and time spent at the screen, and jobs are often allocated by algorithms rather than people. Some labelling workers have sued their employers for post-traumatic stress disorder after being subjected to distressing images such as car accidents and murders. Keeping quality high is a challenge, as errors caused by ambiguity or fatigue propagate through models. Techniques such as multiple human review stages can reduce the risks, though they obviously add to the costs.
Whatever the controversies, the industry is growing rapidly, at perhaps 30% a year, depending on which estimate you believe. Meta paid $14.3 billion in June 2025 for a 49% stake in Scale AI, a company that works with over 100,000 annotators. Rivals include Surge AI, iMerit, Karya, Appen and Amazon’s Mechanical Turk. Unsurprisingly, companies are trying to streamline the process by building machine learning tools to at least partially automate labelling. If automated tools start to replace the human labellers, will the humans who built the foundations of AI simply be displaced, or finally gain greater recognition?
Data labelling is a barely visible yet vital part of the burgeoning AI industry. It may seem ironic that state-of-the-art AI models that can write essays, summarise documents, draw pictures and produce short videos depend on the toil of labourers in developing countries like Kenya and the Philippines.







