
Attention Is All You Need

Vaswani et al., 2017 — The paper that introduced the Transformer architecture and enabled all of modern AI.

Reading time: 10 min · Last updated: March 2026
At a Glance

"Attention Is All You Need" introduced the Transformer architecture in 2017 — a model based entirely on attention mechanisms, dispensing with recurrence (RNNs) and convolutions. Originally developed for machine translation, it is now the foundation for GPT, BERT, LLaMA, and virtually every modern LLM.

The Problem Before Transformers

Before 2017, Recurrent Neural Networks (RNNs) and LSTMs dominated natural language processing. These models process text sequentially — word by word, left to right. This had two major drawbacks:

  • Slow training: Because each word must wait for the previous one, computation cannot be parallelized. More GPUs barely help.
  • Forgetfulness: On long texts, the model "forgets" the beginning. Because gradients shrink as they propagate back through many time steps (the vanishing-gradient problem), information from 100+ tokens earlier is largely lost.

The Core Idea: Self-Attention

The Transformer solves both problems with a single mechanism: Self-Attention. Instead of processing text sequentially, every word simultaneously looks at all other words in the sentence and calculates how relevant each one is.

Technically, this works via three vectors per word: Query (Q), Key (K), and Value (V). The attention formula computes a weighted sum:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The scaling factor √d_k keeps the dot products from growing too large in high dimensions, which would push softmax into regions with vanishingly small gradients. Softmax then converts the scores into a probability distribution over the tokens.
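The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the shapes (4 tokens, 8 dimensions) and random inputs are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    # Similarity score between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Each row of weights sums to 1: how much each token attends to the others.
    weights = softmax(scores, axis=-1)
    # The output for each token is a weighted sum of the value vectors.
    return weights @ V

# 4 tokens, 8-dimensional Q/K/V vectors (illustrative sizes)
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note that every token's output depends on every other token, and nothing in the computation is sequential: the entire matrix product can run in parallel.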


Multi-Head Attention

A single attention computation captures only one type of relationship. The Transformer therefore uses Multi-Head Attention: The Q/K/V vectors are split into multiple "heads," each learning a different kind of relationship (e.g., syntactic proximity, semantic similarity, coreference).

The original paper uses 8 heads. The results of all heads are concatenated and combined through a linear projection.
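The split-attend-concatenate-project scheme can be sketched as follows. A minimal NumPy illustration, assuming d_model = 64 and 8 heads; the weight matrices here are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    seq_len, d_model = X.shape
    d_k = d_model // h  # each head works in a smaller subspace
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Reshape (seq_len, d_model) -> (h, seq_len, d_k): one slice per head.
    split = lambda M: M.reshape(seq_len, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Scaled dot-product attention, computed for all heads at once.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)
    heads = softmax(scores) @ Vh                      # (h, seq_len, d_k)
    # Concatenate heads back to (seq_len, d_model), then project with W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model = 64
X = rng.standard_normal((5, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (5, 64)
```

Because each head attends over its own d_k-dimensional slice, the total cost is comparable to a single full-dimension attention while letting the heads specialize.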

Encoder-Decoder Architecture

The original Transformer consists of two parts:

  • Encoder: Processes the input text and creates a context-rich representation. Consists of 6 identical layers with self-attention and feed-forward networks.
  • Decoder: Generates the output text word by word. In addition to self-attention, it has cross-attention to the encoder output. Also 6 layers.

Modern LLMs like GPT use only the decoder part (autoregressive models), while BERT uses only the encoder. This shows how flexible the architecture is.
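Decoder-style (autoregressive) attention differs from the encoder's in one detail: a causal mask prevents each position from attending to later positions, so generation can proceed left to right. A minimal NumPy sketch of that masking, with illustrative shapes:

```python
import numpy as np

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask out the upper triangle (future tokens): setting scores to -inf
    # makes their softmax weight exactly 0.
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, w = causal_attention(Q, K, V)
# Row i of w has nonzero weights only for columns 0..i.
print(np.round(w, 3))
```

The first token can attend only to itself, the second to tokens 0 and 1, and so on. Encoder-only models like BERT omit this mask, which is why they see the whole sentence at once but cannot generate text autoregressively.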


Why Was This Paper Revolutionary?

  • Parallelization: All attention computations can run simultaneously on GPUs. Training became orders of magnitude faster, making models with billions of parameters practical.
  • Long contexts: Self-attention connects every token directly to every other token. There is no information loss over distance.
  • Universal architecture: Transformers work not only for text. The same architecture is used today for images (Vision Transformer), audio (Whisper), code (Codex), and multimodal models.

Impact on Today's Models

Virtually every relevant AI model is based on the Transformer:

  • GPT Series (OpenAI): Decoder-only Transformer
  • BERT (Google): Encoder-only Transformer
  • LLaMA (Meta): Decoder-only with improvements like RMSNorm and SwiGLU
  • Claude (Anthropic): Transformer-based with Constitutional AI
  • Mistral, Qwen, Gemma: All Transformer variants

Sources

  • Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762
