LoRA: Low-Rank Adaptation of Large Language Models

Hu et al., 2021 — How to efficiently adapt large models to new tasks with a fraction of the parameters.

Reading time: 10 min · Last updated: March 2026
At a Glance

LoRA (Low-Rank Adaptation) enables adapting large language models to specific tasks without modifying all model parameters. Instead of updating entire weight matrices, LoRA adds small, trainable matrices. This drastically reduces memory requirements and training time.

The Problem: Fine-Tuning Is Expensive

Adapting a pretrained LLM to a specific task (fine-tuning) normally means updating all model parameters. For a model with 7 billion parameters, that means 7 billion values need to be stored, computed, and optimized.

This requires enormous GPU resources. Full fine-tuning of a 70B model needs multiple high-end GPUs: the 16-bit weights alone occupy about 140 GB, and gradients plus optimizer states multiply that several times over. For most companies and developers, this is not practical.

The LoRA Idea: Low-Rank Decomposition

LoRA's central insight: the weight updates learned during fine-tuning have a low intrinsic rank. The actual adaptation can therefore be represented by the product of two much smaller matrices.

Instead of modifying a large weight matrix W directly, LoRA adds two small matrices A and B:

W' = W + BA

  • W: Original weight matrix (d x d), frozen, not trained
  • B: Small matrix (d x r), trainable
  • A: Small matrix (r x d), trainable
  • r: Rank (typically 4-64, much smaller than d)

The original weights W remain frozen. Only the small matrices A and B are trained. With rank r=16 and dimension d=4096, you train 131,072 parameters per layer instead of 16.7 million — a reduction by a factor of 128.
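The parameter arithmetic above can be checked in a few lines of plain Python, using the same example dimensions as in the text:

```python
# Parameter count for one LoRA-adapted square weight matrix.
d = 4096  # model dimension (example from the text)
r = 16    # LoRA rank

full = d * d          # parameters in the full weight matrix W
lora = d * r + r * d  # parameters in B (d x r) plus A (r x d)

print(full)         # 16777216  (~16.7 million)
print(lora)         # 131072
print(full // lora) # 128  -> the reduction factor from the text
```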


How LoRA Works Technically

LoRA is typically applied to the attention matrices (Q, K, V, O) of the Transformer. The process:

  1. Freeze: All original model parameters are frozen (requires_grad=False). They do not change during training.
  2. Insert LoRA modules: Small matrices A and B are inserted alongside each selected weight matrix. A is randomly initialized, B with zeros (so the starting state is identical to the original).
  3. Train: Only the LoRA matrices are trained. This requires a fraction of the memory and compute time.
  4. Merge: After training, the LoRA matrices can be merged into the original weights (W + BA). The result is a normal model with no additional latency.
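The four steps above can be sketched with a minimal NumPy implementation. This is illustrative only; real implementations (e.g. Hugging Face PEFT) wire the adapter into the model's linear layers and handle the training loop:

```python
import numpy as np

class LoRALinear:
    """Minimal sketch of a LoRA-adapted linear layer (no bias)."""

    def __init__(self, W, r=16, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                 # step 1: frozen original weights
        self.A = rng.normal(0, 0.01, (r, d_in))    # step 2: A random,
        self.B = np.zeros((d_out, r))              #         B zero -> B @ A = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        # Step 3: during training, only A and B would receive gradients.
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

    def merge(self):
        # Step 4: fold the adapter into the base weights; no extra latency.
        return self.W + self.scale * (self.B @ self.A)

W = np.random.default_rng(1).normal(size=(64, 64))
layer = LoRALinear(W)
x = np.ones((1, 64))
# Because B starts at zero, the initial output equals the base layer's output.
assert np.allclose(layer.forward(x), x @ W.T)
assert np.allclose(layer.merge(), W)
```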

Advantages of LoRA

  • Memory efficiency: Instead of duplicating the entire model, you store only the small LoRA adapters. A LoRA adapter for a 7B model is typically only 10-50 MB.
  • Fast training: Fewer trainable parameters mean faster training steps and less GPU memory required.
  • No quality loss: With properly chosen rank, LoRA achieves results comparable to full fine-tuning.
  • Adapter switching: Multiple LoRA adapters can be trained for different tasks and swapped at runtime — on the same base model.
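As a rough plausibility check for the adapter sizes mentioned above: the layer count and dimensions below are assumptions for a typical 7B model (32 Transformer layers, d = 4096, LoRA on the four attention matrices, rank 16, fp16 storage):

```python
layers = 32          # assumed Transformer layers in a ~7B model
matrices = 4         # Q, K, V, O
d, r = 4096, 16
bytes_per_param = 2  # fp16/bf16

params = layers * matrices * 2 * d * r  # A and B per adapted matrix
size_mb = params * bytes_per_param / 1024**2
print(round(size_mb, 1))  # 32.0 MB -- within the 10-50 MB range from the text
```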

Further Developments: QLoRA and DoRA

  • QLoRA (2023): Combines LoRA with 4-bit quantization. The base model is loaded in 4-bit, LoRA matrices trained in 16-bit. This allows fine-tuning a 70B model on a single GPU.
  • DoRA (2024): Weight-Decomposed Low-Rank Adaptation separates each weight matrix into a magnitude and a direction component and applies LoRA only to the direction. This improves training quality at a comparable parameter count.
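The DoRA decomposition can be illustrated in a few lines of NumPy. This is a sketch of the idea only, not the paper's exact training setup; the magnitude vector m and the direction update would be the trainable parts:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))   # base weights (frozen)
B = np.zeros((8, 2))          # LoRA matrices on the direction component
A = rng.normal(size=(2, 8))

# Magnitude: per-column norms of W (initialized from W, then trainable).
m = np.linalg.norm(W, axis=0, keepdims=True)
# Direction: base weights plus the low-rank update, normalized column-wise.
V = W + B @ A
W_dora = m * (V / np.linalg.norm(V, axis=0, keepdims=True))

# With B = 0, the decomposition reconstructs W exactly.
assert np.allclose(W_dora, W)
```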

Sources

  • Hu, E. J. et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685
