While techniques like LoRA and QLoRA have dramatically reduced the cost of fine-tuning LLMs, repeatedly fine-tuning new adapters to keep the LLM’s knowledge up-to-date is impractical. In the extreme case where the knowledge base is continuously updated, keeping the LLM’s knowledge current through fine-tuning becomes impossible.
We could include all the knowledge unseen during fine-tuning in the LLM’s context and instruct the model to exploit it when appropriate. Even though recent techniques such as LongLoRA, FlashAttention, and LongRoPE improve the efficiency and accuracy of LLMs on very long contexts, e.g., more than 2 million tokens with LongRoPE, handling long prompts is costly and significantly slows down inference.
Instead, retrieval augmented generation (RAG) retrieves information relevant to a given query from a database and injects only this information into the LLM’s context. This keeps the context relatively short and significantly helps the LLM focus on what is relevant.
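To make the retrieve-then-generate idea concrete, here is a minimal sketch of the retrieval step: embed the knowledge chunks and the query, keep the most similar chunks, and prepend them to the prompt. The sentence-transformers library, the embedding model name, and the toy chunks are illustrative assumptions; in the setup described later, LlamaIndex handles this step for us.

```python
# Illustrative RAG retrieval step (not the article's LlamaIndex setup).
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Toy knowledge base, already split into chunks.
chunks = [
    "Mistral 7B Instruct is a 7-billion-parameter instruction-tuned LLM.",
    "GPTQ quantization shrinks model weights so the model fits on consumer GPUs.",
    "LongRoPE extends the context window of LLMs beyond 2 million tokens.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model (assumed choice)
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)

query = "How can I run a 7B model on a consumer GPU?"
query_embedding = embedder.encode(query, convert_to_tensor=True)

# Rank chunks by cosine similarity and keep only the top-k most relevant ones.
hits = util.semantic_search(query_embedding, chunk_embeddings, top_k=2)[0]
context = "\n".join(chunks[hit["corpus_id"]] for hit in hits)

# Only the retrieved context is injected into the LLM's prompt.
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)
```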
In this article, I explain the basics of RAG. Then, I show how to set up a simple RAG system using Mistral 7B Instruct as a base model. For this purpose, I use LlamaIndex and Hugging Face Transformers. Thanks to GPTQ quantization, you can reproduce my setup on consumer hardware.
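As a preview of the generation side, the sketch below loads a GPTQ-quantized Mistral 7B Instruct checkpoint with Hugging Face Transformers. The repository name is an assumption (a community GPTQ quantization), and the notebook may use different loading options; Transformers needs the optimum and auto-gptq packages installed to load GPTQ weights.

```python
# Minimal sketch: loading a GPTQ-quantized Mistral 7B Instruct with Transformers.
# Requires: pip install transformers optimum auto-gptq accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"  # assumed GPTQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Mistral 7B Instruct expects the [INST] ... [/INST] prompt format.
prompt = "[INST] What is retrieval augmented generation? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```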
The notebook implementing RAG for Mistral 7B is available here: