
Public paper breakdown

Attention Is All You Need Explained Step by Step

Understand self-attention, query/key/value vectors, the scaled dot-product attention equation, and why Transformers improved over RNNs.

Based on the original Transformer paper by Vaswani et al. (2017). · Read the original

What you’ll learn

  • Why RNNs made sequence modeling hard to parallelize
  • What attention vs self-attention actually means
  • How Q, K, and V are projected from token vectors
  • How attention scores are computed from dot products
  • How the scaled dot-product attention equation works step by step
  • Why positional encoding is needed
  • Why Transformers improved training speed and long-range interactions
  • What tradeoff comes from O(n²) attention

What problem is this paper solving?

Before Transformers, the standard way to process text was with recurrent neural networks (RNNs). An RNN reads a sequence one token at a time:

  1. Read one word.
  2. Update a hidden state.
  3. Move to the next word.
  4. Update the hidden state again.
  5. Continue until the end.

That works, but it creates two big problems.

Problem 1

It’s sequential

You can’t process all tokens at once during training because each step depends on the previous hidden state. That kills parallelism on modern hardware.

Problem 2

Long-range dependencies are hard

If one word needs information from a far-away word, that information has to pass through many recurrent steps before it arrives, and signal degrades along the way.

Can we model a sequence without recurrence, and instead let each token directly look at the other tokens it needs?

The paper’s fundamental question.

Start here

Core idea, in one sentence

For each token, the model looks at the other tokens, decides which ones matter most, and builds a new representation by combining information from them.

The Transformer paper replaces recurrence with self-attention. Instead of processing tokens one at a time and threading information through a hidden state, every token compares itself to every other token, computes attention weights, and combines information across the sequence in one shot.

That gives the model better parallelism and shorter paths between distant tokens, but it also introduces a quadratic cost as sequences get longer. The rest of this page walks through how the mechanism actually works — projections, scores, the equation, and the tradeoffs.

Attention vs. self-attention

Attention is the general idea. Self-attention is the specific case the Transformer uses. Mixing these up is one of the most common sources of confusion when reading the paper for the first time.

Attention (general)

One set looks at another set

One set of representations looks at another set of representations and decides what matters. In older encoder–decoder translation models, the decoder would attend to encoder states. That counts as attention — but the two sides are different sequences.

Self-attention (this paper)

A sequence looks at itself

Queries, keys, and values all come from the same sequence. Each token in the sentence can look at the other tokens in that same sentence. That’s why it’s called self-attention.

Attention already existed before this paper. What changed here is that self-attention became the main mechanism for building sequence representations, instead of recurrence.

Intuition

Why pronouns are the cleanest example

“The animal didn’t cross the street because it was tired.”

Suppose the model is updating the token it. To understand what it refers to, the model may need to look at:

  • animal
  • maybe tired
  • maybe cross

The point of attention is to let the model assign different importance to those words. Instead of only inheriting information step by step from earlier hidden states, the token it can directly ask: which other words in this sentence matter most for me right now?

How the architecture works at a high level

The Transformer doesn’t read the sequence one token at a time the way an RNN does. Here’s what it does instead.

  1. Start with representations for all tokens

    Each token gets a vector. At the first layer this is the token embedding plus positional information; in later layers it’s the hidden representation from the previous layer.

  2. Create three vectors per token

    From each token vector, the model produces a query, a key, and a value using three learned linear projections. Same input, three different views.

  3. Compare tokens to each other

    Every query is compared with every key using a dot product. This produces a score for each (query token, key token) pair — a measure of how relevant they are to each other.

  4. Compute attention weights

    Scores get scaled, then normalized with softmax so each token’s row of weights sums to 1. These are the attention weights.

  5. Mix information across the sequence

    Each token replaces its representation with a weighted blend of the value vectors from the tokens it attended to. That’s the new context-aware representation.

The model processes the whole sequence together rather than moving left to right through a recurrent hidden state. That’s the structural shift the paper introduces.

Key concepts

Four building blocks that make the rest of the paper readable.

Self-attention

Each token looks at the other tokens in the same sequence and decides which ones matter most for building its new representation.

Query, key, value

Queries ask what information a token is looking for. Keys describe what other tokens offer. Values carry the information that gets mixed in.

Scaled dot-product attention

The model compares queries and keys, scales the scores, normalizes them with softmax, and uses the result to combine value vectors.

Positional encoding

Self-attention alone does not know token order, so the model needs a separate way to inject position information.
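
To make that concrete, here is a small NumPy sketch of the sinusoidal positional encoding described in section 3.5 of the paper. The sizes are illustrative, and real implementations vary in details (for example, learned rather than fixed encodings):

    import numpy as np

    def sinusoidal_positional_encoding(max_len, d_model):
        """Build a (max_len, d_model) table of sine/cosine position signals."""
        positions = np.arange(max_len)[:, None]           # (max_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]          # even dimensions 0, 2, 4, ...
        angles = positions / np.power(10000.0, dims / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)                       # even indices get sine
        pe[:, 1::2] = np.cos(angles)                       # odd indices get cosine
        return pe

    # Added to the token embeddings before the first layer, for example:
    # x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)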

How Q, K, and V are actually built

Query, key, and value are not hand-designed. They’re three learned views of the same token vector.

For each token, the model starts with that token’s current vector representation. At the first layer, this is the token embedding plus positional information. In later layers, it’s the hidden representation coming from the previous layer. Call that token vector x.

The model then creates three new vectors from x using three different learned weight matrices:

q = x · W_Q,   k = x · W_K,   v = x · W_V

q (query)

What this token is looking for.

k (key)

What this token offers for matching.

v (value)

The information this token contributes if it’s attended to.

Why three different projections?

The same token plays three different roles inside the attention mechanism, and the model needs to express each role separately:

  • It needs a way to ask what information it wants — that’s the query.
  • It needs a way to signal what kind of information it contains — that’s the key.
  • It needs a way to provide content if another token attends to it — that’s the value.

So one token vector becomes three different learned views of that token, one per role.
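
In code, the three projections are just three matrix multiplies applied to the same token vectors. Here’s a minimal NumPy sketch with toy sizes and random weights standing in for the learned matrices (the paper itself uses d_model = 512 and d_k = 64):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d_model, d_k = 6, 16, 8                 # toy sizes, not the paper's

    X = rng.normal(size=(n, d_model))          # one row per token: embedding + position
    W_Q = rng.normal(size=(d_model, d_k))      # learned in practice; random here
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_k))

    Q = X @ W_Q   # what each token is looking for            -> (n, d_k)
    K = X @ W_K   # what each token offers for matching       -> (n, d_k)
    V = X @ W_V   # what each token contributes if attended to -> (n, d_k)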

How the model decides attention scores

For each pair of tokens, the model computes a single number: a compatibility score. It does this with a dot product between the query of one token and the key of another.

One pair at a time

score(i, j) = q_i · k_j

A larger score means the model thinks those two tokens are more relevant to each other for the current context. A smaller score means the match is weaker. So the score is a learned measure of compatibility between what token i is looking for and what token j offers.

Mental model: one row at a time

Take the sentence “The cat sat on the mat.” and focus on sat. It compares its query against every other token’s key. You can read the output as a row of scores:

  • sat → cat: score high
  • sat → mat: score medium
  • sat → the: score low

All pairs at once

In matrix form, every query is compared with every key in a single matrix multiply. The result is an n × n table of scores where each row tells you, for one token, how relevant every other token is.

S = Q · Kᵀ

From there, the model divides the scores by √(d_k), applies softmax, and gets weights that add up to 1 in each row. Those final weights are the attention weights.
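
As a tiny self-contained sketch (random Q and K standing in for real projections), the whole score table is one matrix multiply, and each row is one token’s view of all the others:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d_k = 6, 8
    Q = rng.normal(size=(n, d_k))      # queries, one row per token
    K = rng.normal(size=(n, d_k))      # keys, one row per token

    S = Q @ K.T                        # (n, n): S[i, j] = q_i · k_j
    print(S.shape)                     # (6, 6): one row of raw scores per token
    print(S[2])                        # token 2's compatibility with every token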

The main equation, step by step

At first glance the equation looks intimidating, but it’s a clean four-step recipe: similarity, scale, normalize, mix.

Attention(Q, K, V) = softmax( Q · Kᵀ / √(d_k) ) · V

  1. Compute similarity scores

    Q · Kᵀ

    Compare each query with each key. For a sequence of n tokens, you get an n × n matrix of scores. Each row says: for this token, how relevant is every other token?

  2. Scale by √(d_k)

    1 / √(d_k)

    Here d_k is the dimension of the key vectors. If the vectors are high-dimensional, dot products can get large. Large values make softmax too peaky, which can make training unstable. Dividing by √(d_k) keeps the scores in a reasonable range so gradients stay healthy.

  3. Apply softmax

    softmax( · )

    Softmax turns each row of scores into weights that add up to 1. These are the attention weights — they tell the model how much one token should use information from each other token.

  4. Multiply by V

    ( · ) · V

    Use those weights to combine the value vectors. The output for each token is a weighted combination of the value vectors from the other tokens — its new context-aware representation.
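
Putting the four steps together, here’s a minimal NumPy version of the equation (single head, no masking, toy random inputs rather than anything from the paper):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                   # steps 1-2: similarity, then scale
        scores -= scores.max(axis=-1, keepdims=True)      # shift rows for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)    # step 3: each row sums to 1
        return weights @ V, weights                       # step 4: blend the value vectors

    rng = np.random.default_rng(0)
    n, d_k = 6, 8
    Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
    output, weights = scaled_dot_product_attention(Q, K, V)
    print(output.shape, weights.shape)                    # (6, 8) (6, 6)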


Example: query, key, and value

Take the sentence “The cat sat on the mat.” Suppose we’re updating the token sat.

What “sat” is looking for

  • Query: who did the action, and where it happened.
  • Keys: each other token advertises what it can offer (subject, location, filler).
  • Values: the actual information that gets mixed in once the model has decided which tokens to pay attention to.

How attention picks them up

The attention score decides how much sat should pull information from each other token. Strengths shown here are illustrative — not literal values from the paper.

Compared to “sat”     Possible role             Attention
cat                   who did the action        high
mat                   where it happened         medium
the                   less useful by itself     low

In plain English

Strip out the math. For each token, the model does five things:

  1. Compare it to all other tokens.
  2. Decide which ones matter most.
  3. Turn that into weights.
  4. Combine information from those tokens.
  5. Produce a better representation of the original token.

That’s the core mechanism. Everything else in the paper — multi-head attention, encoder/decoder stacks, residual connections, layer norm — is built on top of it.
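
Multi-head attention is the most direct of those extensions: the paper runs the same mechanism several times in parallel on smaller projections of the input and concatenates the results (section 3.2.2). A rough sketch, with toy sizes instead of the paper’s h = 8 heads:

    import numpy as np

    def attention(Q, K, V):
        S = Q @ K.T / np.sqrt(K.shape[-1])
        W = np.exp(S - S.max(axis=-1, keepdims=True))
        return (W / W.sum(axis=-1, keepdims=True)) @ V

    def multi_head_attention(X, head_params):
        """One (W_Q, W_K, W_V) triple per head; head outputs are concatenated."""
        heads = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in head_params]
        return np.concatenate(heads, axis=-1)   # the paper adds a final output projection W_O

    rng = np.random.default_rng(0)
    n, d_model, h = 6, 16, 4
    d_k = d_model // h
    head_params = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
    X = rng.normal(size=(n, d_model))
    print(multi_head_attention(X, head_params).shape)     # (6, 16)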

Why this improved over RNNs

Better parallelism

RNNs process tokens step by step, which makes them hard to train in parallel. Transformers can process all tokens together during training because attention has no sequential dependency.

Shorter paths for long-range dependencies

In an RNN, distant information has to move through many recurrent updates to influence a target token. In self-attention, one token can directly attend to another in a single layer.

More flexible context building

RNNs maintain a single running hidden state. Self-attention lets each token choose which other tokens matter most for its current representation, instead of squeezing everything through one bottleneck.

The tradeoff

Full self-attention compares every token with every other token, so the cost grows roughly like O(n²) with sequence length. That's why long-context models invest heavily in approximations and memory tricks.
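
A back-of-envelope way to feel that growth, counting only the n × n score matrix for a single head in a single layer (illustrative numbers, assuming 4-byte floats):

    # Doubling the sequence length quadruples the number of attention scores.
    for n in (1_000, 8_000, 64_000):
        scores = n * n
        print(f"n = {n:>6}: {scores:>13,} scores, ~{scores * 4 / 1e6:,.0f} MB in float32")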

Common confusion points

Things that trip up most readers on a first pass through the Transformer paper.

Section-by-section breakdown

Below is the full Transformer paper, parsed into sections. Click any heading to expand the original prose and equations, then open the plain-English explanation card for a worked walkthrough.

Abstract

p.1

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include a...

1 Introduction

p.2

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been ...

2 Background

p.2

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ...

3 Model Architecture

p.2

Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder map...

3.1 Encoder and Decoder Stacks

p.3

Encoder: The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two sub-layers. The first is a m...

3.2.1 Scaled Dot-Product Attention

Math · p.4

We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of di...

3.2.2 Multi-Head Attention

Math · p.4

Instead of performing a single attention function with $d_{\text{model}}$-dimensional keys, values and queries, we found...

3.2.3 Applications of Attention in our Model

p.5

The Transformer uses multi-head attention in three different ways: • In "encoder-decoder attention" layers, the queries ...

3.3 Position-wise Feed-Forward Networks

Math · p.5

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forwa...

3.4 Embeddings and Softmax

p.5

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens...

3.5 Positional Encoding

Math · p.6

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequen...

4 Why Self-Attention

p.6

In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly u...

5.1 Training Data and Batching

p.7

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences wer...

5.2 Hardware and Schedule

p.7

We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described th...

5.3 Optimizer

Math · p.7

We used the Adam optimizer [20] with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$. We varied the learning ...

5.4 Regularization

p.7

We employ three types of regularization during training: Residual Dropout: We apply dropout [33] to the output of each ...

6.1 Machine Translation

p.8

On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms...

6.2 Model Variations

p.8

To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measu...

6.3 English Constituency Parsing

p.9

To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing. T...

7 Conclusion

p.10

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing...

Attention Visualizations

p.13

Have a paper you’re stuck on?

Upload it to Deconstructed and get this same style of section-by-section explanation, notation help, and equation breakdown.

More public breakdowns

We’re adding new public breakdowns over time. Each one walks through a foundational paper with the same step-by-step style.

  • BERT

    Coming soon

    Devlin et al., 2018

    Masked language modeling, bidirectional context, and the pre-train / fine-tune recipe that reset NLP.

  • LoRA

    Coming soon

    Hu et al., 2021

    Why low-rank adapters work, what gets frozen vs. trained, and the math behind parameter-efficient fine-tuning.

  • CLIP

    Coming soon

    Radford et al., 2021

    Contrastive image–text training, the joint embedding space, and how zero-shot transfer falls out of it.

  • Diffusion Models

    Coming soon

    Ho et al., 2020

    Forward noise, reverse denoising, and the variational objective that powers modern image synthesis.