Large language models (LLMs) such as ChatGPT have excellent linguistic skills, but to use them for a specific business process they need additional knowledge. For example, a customer service chatbot would not be much use unless it knew things like which products the company sold, and who had purchased them and when. To supplement an LLM with domain-specific data, the idea of retrieval augmented generation (RAG) was introduced in a research paper as far back as 2020. The idea of RAG is for the LLM to respond to a user query or prompt by drawing on a knowledge base. Retrieval typically relies on vector search, which requires the knowledge base of documents to be stored either in a specialist vector database or in a system that supports vector search (many databases have added vector search to their capabilities in recent years). The retrieved data is added to the prompt, and the LLM reads the combined original prompt and extra context in order to generate its answer. This allows LLMs (which by default have a fixed knowledge cut-off date when their training finishes) to access up-to-date data, and to cite sources or extracts from documents.
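To make the mechanics concrete, here is a minimal sketch of that retrieve-then-prompt loop in Python. It is illustrative only: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the documents are invented, and in a real system the final prompt would be sent to an LLM API rather than printed.

```python
import numpy as np

# Toy stand-in for a real embedding model: hash each word into a
# fixed-size bag-of-words vector, normalised for cosine similarity.
def embed(text: str, dim: int = 256) -> np.ndarray:
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# The "knowledge base": documents embedded ahead of time.
documents = [
    "Order 1234 was shipped to Alice on 2 May.",
    "The X200 vacuum cleaner has a two-year warranty.",
    "Our returns policy allows refunds within 30 days.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    # Vector search: rank documents by cosine similarity to the query.
    scores = doc_vectors @ embed(query)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "What is the warranty on the X200?"
context = "\n".join(retrieve(query))

# The retrieved context is prepended to the user's question; this
# combined prompt is what actually gets sent to the LLM.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # in a real system, this prompt goes to the LLM API
```

Production systems swap in a trained embedding model and a proper vector database, but the shape of the pipeline (embed, search, augment the prompt) stays the same.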
Various refinements have been made to this, such as using multiple versions of a query to improve retrieval, re-ranking to sort the retrieved documents by relevance, and dynamic RAG to access data in real time. There is even “knowledge RAG”, which uses knowledge graphs instead of plain text for retrieval.
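The first two of those refinements can be sketched in a few lines, building on the `retrieve` helper from the previous example. The rephrased query variants are invented, and reciprocal rank fusion is just one common way to combine and re-rank results; it is shown here as an illustration, not as what any particular RAG system does.

```python
from collections import defaultdict

def fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: documents that rank highly across several
    # query variants rise to the top of the combined list.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Multiple rephrasings of the same user question improve recall...
variants = [
    "What is the warranty on the X200?",
    "How long is the X200 guarantee?",
    "X200 warranty period",
]
# ...and fusion re-ranks the pooled results by combined relevance.
reranked = fuse([retrieve(q) for q in variants])
```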
The latest twist is something called retrieval augmented fine-tuning (RAFT). In March 2024, a research paper introduced the idea of combining RAG with the fine-tuning of an LLM on retrieved data. LLMs typically undergo three stages of training. The first, self-supervised pre-training, is very computationally intensive and can take many weeks. Supervised fine-tuning follows, working on manually labelled data to further improve the model. The third stage is alignment, usually done for safety, for example to discourage the model from antisocial responses such as giving out bomb-making instructions; this often relies on a reward model to score each output. In RAFT, fine-tuning is extended to domain-specific training data built from documents found using RAG-style retrieval. In this way, the specific knowledge is added to the LLM’s expertise while its outputs stay aligned with human preferences.
Consider an analogy: the way that students learn a subject at school and college. Initially, they take general classes to learn to read and write. This is rather like the early stages of LLM training, where the model learns language skills and grammar. Fine-tuning an LLM is like a student taking more advanced, specialist classes, such as calculus. RAG is then an “open-book exam”, where the student is allowed access to a large textbook, but may not be familiar with it or efficient at finding information in it. In RAFT, the student practises with the textbook beforehand, learning to ignore irrelevant information and home in on what is needed for the exam.
In RAFT, an LLM is given domain-specific documents mixed with deliberately off-topic “distractor” documents that sit outside the desired domain and are therefore irrelevant. It is trained to identify which documents are relevant and which are not, and to use only the relevant ones in its answers.
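Here is a hedged sketch of how such training examples might be assembled, loosely following the recipe in the March 2024 paper: each question is paired with the document that answers it (the “golden” document) plus randomly chosen distractors, and the target answer cites only the golden source. All names, documents and field layouts here are illustrative, not from any specific library.

```python
import json
import random

# A tiny illustrative corpus standing in for a real document store.
corpus = [
    {"id": "d1", "text": "The X200 has a two-year warranty."},
    {"id": "d2", "text": "Order 1234 shipped to Alice on 2 May."},
    {"id": "d3", "text": "Refunds are allowed within 30 days."},
]

def make_raft_example(question: str, golden_id: str, answer: str,
                      n_distractors: int = 2) -> dict:
    golden = next(d for d in corpus if d["id"] == golden_id)
    distractors = random.sample(
        [d for d in corpus if d["id"] != golden_id], n_distractors)
    docs = [golden] + distractors
    random.shuffle(docs)  # the model must find the golden document itself
    context = "\n".join(f"[{d['id']}] {d['text']}" for d in docs)
    return {
        "prompt": f"Documents:\n{context}\n\nQuestion: {question}",
        # The target answer teaches the model to ignore distractors and
        # quote only the relevant source.
        "completion": f"According to [{golden_id}]: {answer}",
    }

example = make_raft_example(
    "How long is the X200 warranty?", "d1", "Two years.")
print(json.dumps(example, indent=2))
```

Examples like these then feed a standard supervised fine-tuning run. The paper goes further: a fraction of the training examples omit the golden document entirely, so the model also learns to fall back on knowledge memorised during fine-tuning.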
LLMs trained using RAFT produce better answers, hallucinate less, and make more efficient use of retrieved context. The main drawback of RAFT is that, unlike RAG, there is no real-time access to data: the model is static once training ceases, so it may require periodic retraining. Like all LLMs, RAFT models are susceptible to bias in their training data, making them just as dependent on training-data quality as a regular LLM. Nonetheless, there are demonstrable advantages. Different studies show different improvement levels, but tests have found accuracy gains ranging from around 13% to 29% on various benchmarks, and as high as 45% in one example.
Incidentally and confusingly, there is an entirely separate thing called the Raft consensus algorithm, used in distributed computing. It is in no way related to the RAFT we have been discussing in the context of LLMs.
As more LLMs are deployed into production, AI researchers and vendors continue to look for ways to make them more effective in practice. RAG was an early example of that, and RAFT is just the latest refinement. Given the very low success rates of LLM projects in 2025, with 95% of them failing to deliver any return on investment whatsoever according to MIT, they need all the help they can get. It is hoped that the benefits of RAFT will translate from lab tests into real-life projects and help improve the success rate of corporate AI projects.