AI EngineeringWiki


Retrieval-Augmented Generation (RAG)

Lewis et al., 2020 — How to connect LLMs with external knowledge to reduce hallucinations and provide up-to-date information.

Reading time: 10 min · Last updated: March 2026
At a Glance

Retrieval-Augmented Generation (RAG) combines an LLM with an external knowledge source. Instead of relying solely on training data, the model first searches for relevant documents and uses them as context for its answer. This reduces hallucinations and enables up-to-date, source-based responses.

The Problem: LLMs and Their Static Knowledge

LLMs have a fundamental problem: their knowledge is frozen at training time. They cannot access current information, do not know internal company documents, and hallucinate plausible-sounding but incorrect answers when asked about things they do not know.

Before RAG, there were two workarounds: either retrain the model (expensive and slow) or pack everything into the prompt (limited by the context window). Neither scales.

The RAG Architecture

The paper by Lewis et al. proposes an elegant solution: combine a retriever (search component) with a generator (LLM) into a single end-to-end system.

The process in three steps:

  • 1. Retrieval: The user query is converted into a vector (embedding). This vector is compared against a database of document vectors. The most similar documents are returned.
  • 2. Augmentation: The retrieved documents are passed to the LLM along with the original question as context.
  • 3. Generation: The LLM generates an answer based on both the question AND the provided documents.
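The three steps above can be sketched end-to-end in a few lines. This is an illustrative toy, not a production pipeline: the bag-of-words "embedding" stands in for a real embedding model (e.g. sentence-transformers), and the assembled prompt would be sent to an LLM in the final step.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a
    # neural embedding model and dense vectors instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Step 1 (Retrieval): rank documents by similarity to the query.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context_docs: list[str]) -> str:
    # Step 2 (Augmentation): prepend the retrieved documents as context.
    # Step 3 (Generation) would send this prompt to the LLM.
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "RAG combines a retriever with a generator.",
    "Vector databases store document embeddings.",
    "Bananas are rich in potassium.",
]
query = "What does RAG combine?"
print(build_prompt(query, retrieve(query, docs)))
```

The key design point is the separation of concerns: the retriever can be swapped or its index updated without touching the generator.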

Two Variants: RAG-Sequence and RAG-Token

  • RAG-Sequence: The model selects one document and generates the entire answer based on that single document. Good for tasks where a single source suffices.
  • RAG-Token: For each generated token, the model can draw on a different document. This enables answers that combine information from multiple sources.
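The difference between the two variants is where the marginalization over retrieved documents happens: once per sequence, or once per token. A toy calculation with two documents and a two-token answer makes this concrete (the probabilities are made up for illustration):

```python
import math

# p(z | x): prior over the two retrieved documents.
p_doc = [0.6, 0.4]
# p_tok[z][t] = p(y_t | x, z, y_<t): per-token likelihood under doc z.
p_tok = [[0.9, 0.2],
         [0.1, 0.8]]

# RAG-Sequence: pick the document once, marginalize at sequence level:
#   p(y|x) = sum_z p(z|x) * prod_t p(y_t | x, z, y_<t)
seq = sum(pz * math.prod(p_tok[z]) for z, pz in enumerate(p_doc))

# RAG-Token: marginalize over documents at every token position:
#   p(y|x) = prod_t sum_z p(z|x) * p(y_t | x, z, y_<t)
tok = math.prod(
    sum(pz * p_tok[z][t] for z, pz in enumerate(p_doc))
    for t in range(2)
)

print(seq)  # 0.14
print(tok)  # 0.2552
```

Note how RAG-Token assigns higher probability here: token 0 leans on document 0 and token 1 on document 1, which RAG-Sequence cannot do because it commits to a single document for the whole answer.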

Why RAG Matters

  • Fewer hallucinations: The model can reference real documents instead of guessing. Answers are verifiable and source-based.
  • Current knowledge: The knowledge base can be updated at any time without retraining the model. New documents are immediately available.
  • Data privacy: Company documents stay in your own infrastructure. The LLM does not need to be trained on sensitive data; it only sees the relevant documents at query time.
  • Cost efficiency: Instead of training a massive model with all knowledge, a smaller model plus good retriever is sufficient.

RAG in Practice Today

The RAG pattern has become the standard architecture for enterprise AI. In practice, the following components are commonly used:

  • Vector Databases: Chroma, Qdrant, Weaviate, pgvector
  • Embedding Models: sentence-transformers, OpenAI Embeddings, Nomic
  • Chunking Strategies: Semantic Chunking, Recursive Character Splitting
  • Hybrid Search: Combination of vector search and classic keyword search (BM25)
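Of the components above, chunking is the one most often hand-rolled. A minimal sketch of recursive character splitting in pure Python (retrieval frameworks ship tuned implementations; the separator hierarchy and size limit here are illustrative defaults):

```python
def recursive_split(text: str, max_len: int = 80,
                    seps: tuple = ("\n\n", "\n", ". ", " ")) -> list[str]:
    # Split with the coarsest separator first; recurse with finer
    # separators only for pieces that are still too long.
    if len(text) <= max_len or not seps:
        return [text]
    chunks = []
    for part in text.split(seps[0]):
        if not part:
            continue
        if len(part) <= max_len:
            chunks.append(part)
        else:
            chunks.extend(recursive_split(part, max_len, seps[1:]))
    return chunks

doc = ("RAG combines a retriever with a generator. "
       "The retriever finds relevant documents. "
       "The generator writes the answer.\n\n"
       "Chunk size controls the granularity of retrieval.")
chunks = recursive_split(doc, max_len=60)
print(chunks)
```

The idea is to keep semantically coherent units (paragraphs, then sentences) intact for as long as possible, falling back to finer splits only when a unit exceeds the chunk budget.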

Sources

  • Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." arXiv:2005.11401
