
Attention Is All You Need

Vaswani et al., 2017 — The paper that introduced the Transformer architecture and enabled all of modern AI.

Reading time: 10 min · Last updated: March 2026
At a Glance

"Attention Is All You Need" introduced the Transformer architecture in 2017 — a model based entirely on attention mechanisms, dispensing with recurrence (RNNs) and convolutions. Originally developed for machine translation, it is now the foundation for GPT, BERT, LLaMA, and virtually every modern LLM.

The Problem Before Transformers

Before 2017, Recurrent Neural Networks (RNNs) and LSTMs dominated natural language processing. These models process text sequentially — word by word, left to right. This had two major drawbacks:

  • Slow training: Because each word must wait for the previous one, computation cannot be parallelized. More GPUs barely help.
  • Forgetfulness: On long texts, the model "forgets" the beginning. Because gradients shrink as they propagate back through many time steps (the vanishing-gradient problem), information from 100+ tokens earlier is largely lost.

The Core Idea: Self-Attention

The Transformer solves both problems with a single mechanism: Self-Attention. Instead of processing text sequentially, every word simultaneously looks at all other words in the sentence and calculates how relevant each one is.

Technically, this works via three vectors per word: Query (Q), Key (K), and Value (V). The attention formula computes a weighted sum:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The scaling factor √d_k keeps the dot products from growing too large in high dimensions, which would push softmax into regions with vanishingly small gradients. Softmax then converts the scores into a probability distribution over the tokens.
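The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the shapes (4 tokens, 8 dimensions) and random inputs are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    # Similarity score between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Each row of weights sums to 1: how much each token attends to the others.
    weights = softmax(scores, axis=-1)
    # The output for each token is a weighted sum of the value vectors.
    return weights @ V

# 4 tokens, 8-dimensional Q/K/V vectors (illustrative sizes)
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note that every token's output depends on every other token, and nothing in the computation is sequential: the entire matrix product can run in parallel.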


Multi-Head Attention

A single attention computation captures only one type of relationship. The Transformer therefore uses Multi-Head Attention: The Q/K/V vectors are split into multiple "heads," each learning a different kind of relationship (e.g., syntactic proximity, semantic similarity, coreference).

The original paper uses 8 heads. The results of all heads are concatenated and combined through a linear projection.
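The split-attend-concatenate-project scheme can be sketched as follows. A minimal NumPy illustration, assuming d_model = 64 and 8 heads; the weight matrices here are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    seq_len, d_model = X.shape
    d_k = d_model // h  # each head works in a smaller subspace
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Reshape (seq_len, d_model) -> (h, seq_len, d_k): one slice per head.
    split = lambda M: M.reshape(seq_len, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Scaled dot-product attention, computed for all heads at once.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)
    heads = softmax(scores) @ Vh                      # (h, seq_len, d_k)
    # Concatenate heads back to (seq_len, d_model), then project with W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model = 64
X = rng.standard_normal((5, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (5, 64)
```

Because each head attends over its own d_k-dimensional slice, the total cost is comparable to a single full-dimension attention while letting the heads specialize.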

Encoder-Decoder Architecture

The original Transformer consists of two parts:

  • Encoder: Processes the input text and creates a context-rich representation. Consists of 6 identical layers with self-attention and feed-forward networks.
  • Decoder: Generates the output text word by word. In addition to self-attention, it has cross-attention to the encoder output. Also 6 layers.

Modern LLMs like GPT use only the decoder part (autoregressive models), while BERT uses only the encoder. This shows how flexible the architecture is.
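Decoder-style (autoregressive) attention differs from the encoder's in one detail: a causal mask prevents each position from attending to later positions, so generation can proceed left to right. A minimal NumPy sketch of that masking, with illustrative shapes:

```python
import numpy as np

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask out the upper triangle (future tokens): setting scores to -inf
    # makes their softmax weight exactly 0.
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, w = causal_attention(Q, K, V)
# Row i of w has nonzero weights only for columns 0..i.
print(np.round(w, 3))
```

The first token can attend only to itself, the second to tokens 0 and 1, and so on. Encoder-only models like BERT omit this mask, which is why they see the whole sentence at once but cannot generate text autoregressively.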


Why Was This Paper Revolutionary?

  • Parallelization: All attention computations can run simultaneously on GPUs. Training became orders of magnitude faster, making models with billions of parameters practical.
  • Long contexts: Self-attention connects every token directly to every other token. There is no information loss over distance.
  • Universal architecture: Transformers work not only for text. The same architecture is used today for images (Vision Transformer), audio (Whisper), code (Codex), and multimodal models.

Impact on Today's Models

Virtually every relevant AI model is based on the Transformer:

  • GPT Series (OpenAI): Decoder-only Transformer
  • BERT (Google): Encoder-only Transformer
  • LLaMA (Meta): Decoder-only with improvements like RMSNorm and SwiGLU
  • Claude (Anthropic): Transformer-based with Constitutional AI
  • Mistral, Qwen, Gemma: All Transformer variants

Sources

  • Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762
