AI EngineeringWiki


Retrieval-Augmented Generation (RAG)

Lewis et al., 2020 — How to connect LLMs with external knowledge to reduce hallucinations and provide up-to-date information.

Reading time: 10 min · Last updated: March 2026
At a Glance

Retrieval-Augmented Generation (RAG) combines an LLM with an external knowledge source. Instead of relying solely on training data, the model first searches for relevant documents and uses them as context for its answer. This reduces hallucinations and enables up-to-date, source-based responses.

The Problem: LLMs and Their Static Knowledge

LLMs have a fundamental problem: their knowledge is frozen at training time. They cannot access current information, do not know internal company documents, and hallucinate plausible-sounding but incorrect answers when asked about things they do not know.

Before RAG, there were two workarounds: either retrain the model (expensive and slow) or pack everything into the prompt (limited by the context window). Neither scales.

The RAG Architecture

The paper by Lewis et al. proposes an elegant solution: combine a retriever (search component) with a generator (LLM) into a single end-to-end system.

The process in three steps:

  • 1. Retrieval: The user query is converted into a vector (embedding). This vector is compared against a database of document vectors. The most similar documents are returned.
  • 2. Augmentation: The retrieved documents are passed to the LLM along with the original question as context.
  • 3. Generation: The LLM generates an answer based on both the question AND the provided documents.
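The three steps above can be sketched end-to-end in a few lines. This is an illustrative toy, not a production pipeline: the bag-of-words "embedding" stands in for a real embedding model (e.g. sentence-transformers), and the assembled prompt would be sent to an LLM in the final step.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a
    # neural embedding model and dense vectors instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Step 1 (Retrieval): rank documents by similarity to the query.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context_docs: list[str]) -> str:
    # Step 2 (Augmentation): prepend the retrieved documents as context.
    # Step 3 (Generation) would send this prompt to the LLM.
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "RAG combines a retriever with a generator.",
    "Vector databases store document embeddings.",
    "Bananas are rich in potassium.",
]
query = "What does RAG combine?"
print(build_prompt(query, retrieve(query, docs)))
```

The key design point is the separation of concerns: the retriever can be swapped or its index updated without touching the generator.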

Two Variants: RAG-Sequence and RAG-Token

  • RAG-Sequence: The model selects one document and generates the entire answer based on that single document. Good for tasks where a single source suffices.
  • RAG-Token: For each generated token, the model can draw on a different document. This enables answers that combine information from multiple sources.
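The difference between the two variants is where the marginalization over retrieved documents happens: once per sequence, or once per token. A toy calculation with two documents and a two-token answer makes this concrete (the probabilities are made up for illustration):

```python
import math

# p(z | x): prior over the two retrieved documents.
p_doc = [0.6, 0.4]
# p_tok[z][t] = p(y_t | x, z, y_<t): per-token likelihood under doc z.
p_tok = [[0.9, 0.2],
         [0.1, 0.8]]

# RAG-Sequence: pick the document once, marginalize at sequence level:
#   p(y|x) = sum_z p(z|x) * prod_t p(y_t | x, z, y_<t)
seq = sum(pz * math.prod(p_tok[z]) for z, pz in enumerate(p_doc))

# RAG-Token: marginalize over documents at every token position:
#   p(y|x) = prod_t sum_z p(z|x) * p(y_t | x, z, y_<t)
tok = math.prod(
    sum(pz * p_tok[z][t] for z, pz in enumerate(p_doc))
    for t in range(2)
)

print(seq)  # 0.14
print(tok)  # 0.2552
```

Note how RAG-Token assigns higher probability here: token 0 leans on document 0 and token 1 on document 1, which RAG-Sequence cannot do because it commits to a single document for the whole answer.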

Why RAG Matters

  • Fewer hallucinations: The model can reference real documents instead of guessing. Answers are verifiable and source-based.
  • Current knowledge: The knowledge base can be updated at any time without retraining the model. New documents are immediately available.
  • Data privacy: Company documents stay in your own infrastructure. The LLM does not need to be trained on sensitive data; it only sees the relevant documents at query time.
  • Cost efficiency: Instead of training a massive model with all knowledge, a smaller model plus good retriever is sufficient.

RAG in Practice Today

The RAG pattern has become the standard architecture for enterprise AI. In practice, the following components are commonly used:

  • Vector Databases: Chroma, Qdrant, Weaviate, pgvector
  • Embedding Models: sentence-transformers, OpenAI Embeddings, Nomic
  • Chunking Strategies: Semantic Chunking, Recursive Character Splitting
  • Hybrid Search: Combination of vector search and classic keyword search (BM25)
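Of the components above, chunking is the one most often hand-rolled. A minimal sketch of recursive character splitting in pure Python (retrieval frameworks ship tuned implementations; the separator hierarchy and size limit here are illustrative defaults):

```python
def recursive_split(text: str, max_len: int = 80,
                    seps: tuple = ("\n\n", "\n", ". ", " ")) -> list[str]:
    # Split with the coarsest separator first; recurse with finer
    # separators only for pieces that are still too long.
    if len(text) <= max_len or not seps:
        return [text]
    chunks = []
    for part in text.split(seps[0]):
        if not part:
            continue
        if len(part) <= max_len:
            chunks.append(part)
        else:
            chunks.extend(recursive_split(part, max_len, seps[1:]))
    return chunks

doc = ("RAG combines a retriever with a generator. "
       "The retriever finds relevant documents. "
       "The generator writes the answer.\n\n"
       "Chunk size controls the granularity of retrieval.")
chunks = recursive_split(doc, max_len=60)
print(chunks)
```

The idea is to keep semantically coherent units (paragraphs, then sentences) intact for as long as possible, falling back to finer splits only when a unit exceeds the chunk budget.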

Sources

  • Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." arXiv:2005.11401
