Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
This abstract introduces one of the most influential papers in deep learning, one that changed how very deep neural networks are built. The core insight is deceptively simple: instead of training a stack of layers to learn the desired mapping directly, train it to learn the difference (the residual) between the desired output and its input. This seemingly small change makes it possible to train networks up to 152 layers deep, 8× deeper than VGG nets, while gaining accuracy rather than losing it.
Let me break down what makes this paper revolutionary and why it matters.
The abstract starts with a crucial observation:
"Deeper neural networks are more difficult to train."
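To understand this, we need to think about what happens during training. When we train a neural network, we adjust the weights using backpropagation. The key mathematical operation is computing gradients using the chain rule:
$$ \frac{\partial \mathcal{L}}{\partial W_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}_L} \cdot \frac{\partial \mathbf{a}_L}{\partial \mathbf{a}_{L-1}} \cdots \frac{\partial \mathbf{a}_{l+1}}{\partial \mathbf{a}_l} \cdot \frac{\partial \mathbf{a}_l}{\partial W_l} $$
where $\mathcal{L}$ is the loss (a measure of how wrong our predictions are), $W_l$ represents the weights in layer $l$, $\mathbf{a}_l$ is the output of layer $l$, and $L$ is the total number of layers.
The Vanishing Gradient Problem: the gradient for an early layer is a product with one factor per later layer. When many of those factors have magnitude below 1, the product shrinks exponentially with depth, and the early layers receive almost no learning signal.
This is one reason why, before ResNets, training networks beyond roughly 20-30 layers was impractical.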
Here's where the authors' key insight comes in:
"We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions."
Traditionally, a layer learns to map an input directly to an output. If we denote the input as $\mathbf{x}$ and the layer's function (the weights and computations) as $\mathcal{F}$, the output is simply:
$$ \mathbf{y} = \mathcal{F}(\mathbf{x}) $$
Instead, residual networks learn the difference (residual) between input and output:
$$ \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x} $$
This is called a "skip connection" or "identity connection": we literally add the original input $\mathbf{x}$ back to the learned transformation $\mathcal{F}(\mathbf{x})$.
Why does this matter mathematically?
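When we backpropagate through this layer, the gradient becomes:
$$ \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathcal{F}(\mathbf{x})}{\partial \mathbf{x}} + 1 $$
Notice the crucial "+1" term. Even if $\frac{\partial \mathcal{F}(\mathbf{x})}{\partial \mathbf{x}}$ is very small (vanishing), the gradient still has a baseline of 1 flowing directly through the skip connection. This keeps gradients from vanishing completely, even in very deep networks.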
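To make the "+1" term concrete, here is a minimal Python sketch. It assumes a toy scalar network in which every layer has the same small local slope (`0.01`, an illustrative value that does not come from the paper) and compares the gradient that survives 50 plain layers with the gradient that survives 50 residual layers. The function name `chain_gradient` is hypothetical.

```python
def chain_gradient(local_slopes, residual):
    """Multiply per-layer derivatives, as the chain rule does.

    Plain layer:    dy/dx = slope
    Residual layer: dy/dx = slope + 1   (the shortcut contributes the +1)
    """
    grad = 1.0
    for slope in local_slopes:
        grad *= (slope + 1.0) if residual else slope
    return grad


slopes = [0.01] * 50  # assumed toy values: 50 layers, each with a tiny slope

plain_grad = chain_gradient(slopes, residual=False)
residual_grad = chain_gradient(slopes, residual=True)

print(f"plain:    {plain_grad:.3e}")
print(f"residual: {residual_grad:.3f}")
```

In this toy setting the plain product collapses toward zero, while the residual product stays on the order of 1. Real networks have matrix-valued derivatives that vary per layer, so this only illustrates the mechanism.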
Intuitive analogy: Instead of asking "what should the output be?", we ask "what small adjustment should we make to the input?" This is often an easier learning problem—the network can just learn small refinements rather than learning everything from scratch.
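The abstract supports the theoretical benefits with experimental evidence: residual nets up to 152 layers deep on ImageNet, an ensemble error of 3.57% on the ImageNet test set (1st place in the ILSVRC 2015 classification task), an analysis on CIFAR-10 with 100 and 1000 layers, and a 28% relative improvement on COCO object detection.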
"The depth of representations is of central importance for many visual recognition tasks."
This statement encapsulates a fundamental principle: deeper networks can learn hierarchical features. Early layers learn simple patterns (edges, textures), middle layers combine these into shapes and object parts, and deep layers recognize complex objects. By making it possible to train very deep networks, ResNets enable much richer feature hierarchies.
"These residual networks are easier to optimize, and can gain accuracy from considerably increased depth."
Before residual learning, adding more layers would often decrease accuracy because the network became too hard to optimize. ResNets inverted this: more layers → better performance (up to a point). The table below contrasts the two regimes:
| Aspect | Before ResNets | With ResNets |
|---|---|---|
| Practical depth limit | ~20-30 layers | 100 to 1000+ layers |
| Training difficulty | Training error rises as plain networks get deeper | Deep networks remain easy to optimize |
| Gradient flow | Can vanish in deep networks | Maintained by skip connections |
| Effect of added depth | More layers do not guarantee better results | More layers generally improve accuracy |
The abstract summarizes a paradigm shift: by training networks to learn residuals (differences) instead of absolute functions, we overcome the gradient vanishing problem that plagued deep learning. This one architectural change—the skip connection—enabled training of networks 8× deeper while improving accuracy. The empirical validation across multiple datasets and tasks demonstrates that this isn't just a theoretical improvement; it's a practical breakthrough that fundamentally expanded what deep neural networks could do.
Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]....
This introduction tackles a fundamental problem that limited deep learning in 2015: deeper neural networks don't always work better, even though they should in theory. The authors present compelling evidence of this problem and then introduce their solution—residual learning with "shortcut connections"—that will allow training networks 8× deeper than previously possible while actually improving accuracy.
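The section has three key parts: why depth matters, the degradation problem that appears when plain networks get deeper, and the proposed solution, residual learning.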
Let me walk you through each carefully.
The authors begin with an established fact from computer vision:
"Deeper networks naturally integrate low/mid/high-level features and classifiers in an end-to-end multilayer fashion, and the 'levels' of features can be enriched by the number of stacked layers (depth)."
What does this mean in plain language?
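Think of deep networks as having layers that progressively understand more complex patterns: early layers respond to simple patterns such as edges and textures, middle layers combine these into shapes and object parts, and the deepest layers recognize whole objects.
Each successive layer builds on the previous one. So intuitively, more layers = better features = better accuracy.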
By 2015, this intuition was validated empirically. The state-of-the-art models on the ImageNet dataset (a massive image classification benchmark) were all "very deep"—around 16 to 30 layers. So the question seemed straightforward: Can we just keep stacking more layers?
Here's where things get interesting. The authors identify a surprising phenomenon that contradicts expectations:
[See Figure 1: The figure shows training and test error for 20-layer vs. 56-layer "plain" networks on CIFAR-10. Notice the deeper network has higher training error—this is the key surprise.]
Before diving into the main problem, the authors acknowledge a previous obstacle: vanishing/exploding gradients.
Quick background: In neural networks, we train using backpropagation, which computes gradients (rates of change) of the loss with respect to each parameter. Mathematically, we need to compute:
$$ \frac{\partial \mathcal{L}}{\partial w} $$
where $\mathcal{L}$ is the loss (error) and $w$ is a weight parameter.
When networks get deep, these gradients are computed by multiplying many partial derivatives together (via the chain rule). If each derivative is less than 1, multiplying many of them together can make the gradient exponentially small—essentially zero. This prevents the network from learning early layers.
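The exponential shrinkage is easy to see numerically. The sketch below assumes every chain-rule factor equals `0.5`, which is an illustrative simplification (real factors differ from layer to layer); `backprop_scale` is a hypothetical helper name.

```python
def backprop_scale(per_layer_derivative, depth):
    """Product of `depth` identical chain-rule factors (a toy stand-in)."""
    scale = 1.0
    for _ in range(depth):
        scale *= per_layer_derivative
    return scale


for depth in (5, 20, 56):
    # Each extra layer multiplies the gradient by another factor below 1.
    print(depth, backprop_scale(0.5, depth))
```

With a factor of 0.5, the gradient reaching the first layer of a 56-layer stack is already far below the precision that matters for learning.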
Good news: By 2015, normalized initialization (smart weight initialization) and batch normalization had largely solved this. So this isn't the problem anymore.
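Here comes the crucial insight. Despite solving vanishing gradients, deeper networks still fail, but not because they overfit (where training error is low but test error is high). Instead, as depth increases, accuracy saturates and then degrades, and the deeper network has higher training error as well as higher test error.
This is counterintuitive! The deeper network is worse even on the data it's training on.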
Why is this paradoxical?
The authors present a logical argument by construction. Consider a shallower network and a deeper counterpart that adds more layers onto it.
Now, here's the key: There exists a solution for the deeper network that should work just as well as the shallower one: Simply make the added layers perform identity mappings (outputs equal inputs), and copy the learned parameters from the shallower network.
Mathematically, if the shallower network computes $\mathcal{H}(\mathbf{x})$ (some function of input $\mathbf{x}$), then the deeper network can compute:
$$ \mathrm{id}\big(\cdots \mathrm{id}\big(\mathcal{H}(\mathbf{x})\big)\big) = \mathcal{H}(\mathbf{x}) $$
In other words, the extra layers could just be "pass-through" layers, leaving outputs unchanged while the earlier layers do all the work.
**The mystery**: If this solution is theoretically available, why can't standard optimization algorithms (like SGD with backpropagation) find it?
**The conclusion**: "Our current solvers are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time)." In other words, the optimization problem is hard enough that gradient descent fails to reach a solution as good as the constructed one.

---

## Part 3: The Solution: Residual Learning

This is where the paper's contribution enters. Rather than hoping layers will directly learn the target function, the authors propose learning the *difference* between the target and the input.

### The Core Mathematical Insight

Let's denote the input to a few stacked layers as $\mathbf{x}$ and the desired underlying mapping of those layers as $\mathcal{H}(\mathbf{x})$.
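Instead of learning $\mathcal{H}(\mathbf{x})$ directly, define the residual mapping as:
$$ \mathcal{F}(\mathbf{x}) := \mathcal{H}(\mathbf{x}) - \mathbf{x} $$
so that the original mapping is recast as $\mathcal{F}(\mathbf{x}) + \mathbf{x}$.
Here's the intuition: If the optimal mapping is close to the identity (i.e., $\mathcal{H}(\mathbf{x}) \approx \mathbf{x}$), then $\mathcal{F}(\mathbf{x}) \approx \mathbf{0}$, and it's easier to learn a mapping to zero than to learn an identity mapping!
Consider an extreme case: if the identity mapping were optimal, the solver could simply push the residual to zero instead of fitting an identity mapping with a stack of nonlinear layers.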
The authors' hypothesis: Even when identity mapping isn't optimal, learning the difference from identity is generally easier than learning the absolute mapping.
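The elegant part: this can be implemented simply with a "shortcut connection" (also called a skip connection) as shown in [Figure 2]:
$$ \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x} $$
where $\mathbf{x}$ and $\mathbf{y}$ are the input and output of the block, and $\mathcal{F}(\mathbf{x})$ is the residual mapping learned by the stacked layers.
Key advantages: the identity shortcut adds no extra parameters and negligible computation, and the whole network can still be trained end-to-end by SGD with backpropagation.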
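Mathematically, when we backpropagate the loss $\mathcal{L}$, the gradient reaching $\mathbf{x}$ is:
$$ \frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \left( \frac{\partial \mathcal{F}(\mathbf{x})}{\partial \mathbf{x}} + 1 \right) $$
The "1" term means that even if the gradient through $\mathcal{F}$ shrinks to zero, the shortcut still passes $\frac{\partial \mathcal{L}}{\partial \mathbf{y}}$ backward unchanged, which helps prevent vanishing gradients.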
The authors make four specific claims, then present evidence:
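Concrete achievements: residual nets up to 152 layers deep on ImageNet, an ensemble error of 3.57% on the ImageNet test set, and 1st place in the ILSVRC 2015 classification task.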
The section builds a compelling case for a simple but powerful idea:
Learning to modify inputs slightly (via residual functions) is easier than learning to transform inputs completely (via absolute functions).
This insight, implemented through shortcut connections, allows networks to be trained much deeper than before while maintaining (and improving) accuracy. The mathematical reformulation is elegant precisely because it's so simple, yet it unlocks the ability to train far deeper networks—the foundation of modern deep learning.
Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with ...
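Before the authors present their residual learning framework in detail, they need to establish context. This section does three important things: it shows that residual representations already work well in other areas (image retrieval and PDE solvers), it reviews earlier uses of shortcut connections, and it contrasts ResNets with the concurrent Highway Networks.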
The key insight: the authors are arguing that reformulating a problem in terms of residuals (differences from a baseline) makes optimization easier. This is a conceptual framework that appears in multiple domains, and they're now applying it to deep neural networks.
In previous work on image recognition, researchers discovered that encoding residual vectors (the differences between actual data and some reference point) works better than encoding the original data directly.
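VLAD (Vector of Locally Aggregated Descriptors) works like this: each local descriptor is assigned to its nearest entry in a dictionary, and the image is represented by the aggregated residual vectors, that is, the differences between the descriptors and their assigned dictionary entries.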
Fisher Vectors extend this probabilistically: they encode residuals with respect to a learned Gaussian Mixture Model (GMM) rather than a fixed dictionary.
Why does this work? Think about it intuitively: if you're describing an image, saying "this object is slightly rotated from the standard position" (a residual) is often more informative than saying "the pixel intensities are [long list of numbers]" (the original data). The residual captures the meaningful deviation from a pattern.
In vector quantization (a compression technique), encoding residuals is more effective than encoding original vectors. This shows a general principle: residual encoding is more efficient for optimization.
The authors now reference a completely different field: solving Partial Differential Equations (PDEs) using Multigrid methods.
The Problem: Solving PDEs directly is computationally expensive. For example, solving Laplace's equation over a large domain with fine discretization is slow.
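The Multigrid Solution:
Instead of solving the full system at once, Multigrid reformulates the problem as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale.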
Key insight from the paper: These multilevel solvers converge much faster than standard solvers because they reformulate the problem to explicitly target what still needs to be learned (the residuals), rather than trying to learn everything from scratch.
Why mention this? The authors are drawing a conceptual parallel: just as Multigrid solvers work better when reformulated around residuals, neural networks might optimize better if their layers learn residual functions rather than trying to directly learn the complete transformation.
This section reviews how shortcut connections (the mechanism that enables residual learning) have been explored before.
The authors trace the history of shortcut connections through several papers:
Early MLPs with direct input-to-output connections (citations 34, 49): In the 1990s, people sometimes added a linear layer directly from network input to output. This is a very simple form of a shortcut.
Intermediate auxiliary classifiers (citations 44, 24): To combat vanishing/exploding gradients, researchers directly connected hidden layers to the output, allowing error signals to flow backward through shorter paths. Remember from the Introduction: gradients get very small or very large as they propagate backward through many layers. Short paths help with this.
Layer response centering (citations 39, 38, 31, 47): Various papers proposed using shortcut connections to normalize layer responses and error propagation.
Inception layers (citation 44): Google's Inception architecture includes branches with shortcuts alongside deeper branches, allowing the network to learn both simple and complex transformations in parallel.
The most important comparison is to Highway Networks (by Srivastava et al., citations 42, 43), which were developed around the same time as ResNets.
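Highway Networks use gated shortcuts. Formally, instead of simply adding a residual like ResNet's:
$$ \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x} $$
Highway Networks compute:
$$ \mathbf{y} = T(\mathbf{x}) \odot \mathcal{F}(\mathbf{x}) + \big(1 - T(\mathbf{x})\big) \odot \mathbf{x} $$
Where $\mathcal{F}(\mathbf{x})$ is the learned transformation, $T(\mathbf{x}) \in (0, 1)$ is a learned, data-dependent gate with its own parameters, and $\odot$ denotes element-wise multiplication.
In plain language: The network learns how much to use the complex transformation versus how much to pass through the identity. If $T(\mathbf{x}) \approx 0$, mostly the identity passes through. If $T(\mathbf{x}) \approx 1$, mostly the transformation is used.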
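The gating behaviour can be sketched in a few lines of Python. This is an illustrative toy, not the Highway Networks implementation: the gate here is a single scalar `gate_logit` passed through a sigmoid (real highway layers compute a per-unit gate from the input), and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway_combine(x, transformed, gate_logit):
    """Gated mix: t * transformed + (1 - t) * x, with t = sigmoid(gate_logit)."""
    t = sigmoid(gate_logit)
    return [t * f + (1.0 - t) * xi for f, xi in zip(transformed, x)]


def residual_combine(x, transformed):
    """Ungated sum: transformed + x. The shortcut can never be switched off."""
    return [f + xi for f, xi in zip(transformed, x)]


x = [1.0, 2.0]
transformed = [10.0, 20.0]  # stands in for the output of the learned layers

gate_closed = highway_combine(x, transformed, gate_logit=-20.0)  # close to x
gate_open = highway_combine(x, transformed, gate_logit=20.0)     # close to transformed
summed = residual_combine(x, transformed)                        # x + transformed
```

The contrast to notice: the highway output slides between the two inputs depending on the gate, while the residual output always contains the full input.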
The authors argue that ResNets are better because:
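Simpler design: ResNet always learns residuals. There's no gate. The shortcut is parameter-free (no learnable gate parameters).
Stronger gradient flow: The identity connection is never "closed" (gates can't reduce it). This guarantees that all information is always passed through the shortcut, with the residual function learned on top of it.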
Better scaling to extreme depth: Most importantly, Highway Networks had not demonstrated success with very deep networks (>100 layers). ResNets do.
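Recall from the Introduction that the authors defined:
$$ \mathcal{F}(\mathbf{x}) := \mathcal{H}(\mathbf{x}) - \mathbf{x} $$
Where $\mathcal{H}(\mathbf{x})$ is the desired underlying mapping, $\mathbf{x}$ is the input to the stacked layers, and $\mathcal{F}(\mathbf{x})$ is the residual those layers actually learn.
Why this formulation? Suppose the optimal solution is close to the identity (i.e., the output should be nearly the same as the input). Then $\mathcal{H}(\mathbf{x}) \approx \mathbf{x}$, so $\mathcal{F}(\mathbf{x}) \approx \mathbf{0}$, and the layers only need to learn a small correction.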
This is the fundamental optimization advantage that the authors are building on.
| Domain | Key Insight |
|---|---|
| Image Retrieval | Encoding residuals from dictionary vectors is more efficient |
| Scientific Computing | Solving residual problems across multiple scales converges faster |
| Shortcut Connections | Existing work shows shortcuts help; Highway Networks show gated versions work, but ResNets' simpler parameter-free approach is more powerful |
The unified message: Residual reformulations work better because they target what still needs to be learned, rather than forcing the system to learn everything from scratch. ResNets apply this principle elegantly to deep neural networks using identity shortcuts.
Let us consider $\mathcal{H}(\mathbf{x})$ as an underlying mapping to be fit by a few stacked layers (not necessarily th...
This section tackles a fundamental problem: Why do deeper neural networks perform worse than shallower ones, even on the training data? The authors propose a conceptual shift in how we think about what layers in a neural network should learn. Instead of asking layers to learn the full desired transformation, they suggest asking layers to learn only the difference (residual) from the input. This simple reframing turns out to be remarkably powerful.
Let me start with the foundational claim:
"If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions."
What does this mean?
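Suppose you have a desired function that you want your neural network to learn, denoted as $\mathcal{H}(\mathbf{x})$, where $\mathbf{x}$ is the input to the first of the stacked layers.
The authors make a mathematical argument: if your neural network layers can asymptotically approximate any complicated function (a hypothesis in the spirit of the Universal Approximation Theorem), then they can equally approximate the difference (assuming the input and output have the same dimensions):
$$ \mathcal{F}(\mathbf{x}) := \mathcal{H}(\mathbf{x}) - \mathbf{x} $$
This difference is called the residual function. Notice that the original function is recovered as $\mathcal{H}(\mathbf{x}) = \mathcal{F}(\mathbf{x}) + \mathbf{x}$, so nothing is lost by the reformulation.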
The key insight: Since both representations are theoretically equivalent in terms of what they can express, the question becomes: Which is easier for an optimizer to actually learn?
The authors ground this idea in an empirical puzzle shown in Figure 1:
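The Problem: When you add more layers to a plain network, training error goes up rather than down, even though a solution at least as good as the shallower network's must exist.
Why should a solution exist? Consider a 20-layer network that works well. You could always create a 56-layer network by copying the 20 trained layers and making the 36 added layers identity mappings.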
This constructed solution should have the same training error as the 20-layer network. But optimizers can't find it.
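The construction argument can be checked on a toy example. The sketch below assumes a tiny fully-connected ReLU network with made-up weights (none of these values come from the paper) and appends identity-weight layers to it. Because ReLU outputs are already non-negative, an identity-weight layer followed by ReLU leaves them unchanged.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def plain_forward(layers, x):
    """Plain network: every layer is a matrix multiply followed by ReLU."""
    h = x
    for W in layers:
        h = relu(matvec(W, h))
    return h


def identity_matrix(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical "trained" shallow network with two layers.
shallow = [
    [[0.5, -1.0], [2.0, 0.25]],
    [[1.0, 0.5], [-0.5, 1.5]],
]
# Deeper counterpart: the same layers plus three identity layers.
deeper = shallow + [identity_matrix(2) for _ in range(3)]

x = [1.0, 2.0]
shallow_out = plain_forward(shallow, x)
deeper_out = plain_forward(deeper, x)
print(shallow_out, deeper_out)
```

Both networks produce the same output, so the deeper one has a solution with the same training error. The difficulty the paper points to is that the optimizer does not find such a solution on its own.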
The hypothesis: The problem is that optimizers struggle to learn identity mappings using multiple nonlinear layers. Think about it: if you stack ReLU activation functions and matrix multiplications, it's quite difficult to arrange them so the output exactly equals the input.
Here's where the reformulation saves the day:
Instead of asking layers to learn $\mathcal{H}(\mathbf{x})$, ask them to learn $\mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x}$.
Why is this better?
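If identity mappings are optimal (or close to optimal), the solver only has to drive the weights of the stacked layers toward zero.
Mathematically:
$$ \mathcal{F}(\mathbf{x}) \to \mathbf{0} \quad \Longrightarrow \quad \mathcal{H}(\mathbf{x}) = \mathcal{F}(\mathbf{x}) + \mathbf{x} \to \mathbf{x} $$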
Driving weights toward zero is a much more natural behavior for gradient descent than arranging nonlinear layers to reproduce the input exactly.
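A small sketch of that point, under stated assumptions: a two-layer block with all weights set to zero, no biases, and no activation after the addition. With zero weights, the plain block outputs zeros, whereas the residual block outputs its input, so "weights at zero" already is the identity mapping. The function names are hypothetical.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def plain_block(x, W1, W2):
    """Two stacked layers with a ReLU in between: W2 * relu(W1 * x)."""
    return matvec(W2, relu(matvec(W1, x)))


def residual_block(x, W1, W2):
    """The same two layers, plus the identity shortcut."""
    return [f + xi for f, xi in zip(plain_block(x, W1, W2), x)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [3.0, -1.5]

plain_out = plain_block(x, zeros, zeros)        # all zeros: the input is lost
residual_out = residual_block(x, zeros, zeros)  # equals x: identity for free
print(plain_out, residual_out)
```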
But what if identity mappings aren't optimal? The authors offer a more nuanced view:
"If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping."
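Intuition:
Imagine you're trying to describe a complicated function to someone. One approach is to specify the whole function from scratch. A second approach is to say "start from the identity, then apply these small changes."
The second approach is easier because the identity is already a good reference point, so only the deviation from it has to be described.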
The term "preconditioning" comes from optimization theory. A preconditioned problem is one where you've reformulated it to be easier to solve, often by providing a good initial guess or structure.
The section concludes by noting (referencing Figure 7, which we don't see here) that:
"The learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning."
What does this mean?
When the authors trained networks using the residual learning framework, they observed that $\mathcal{F}(\mathbf{x})$ (the learned residual) tends to have small magnitude. This supports their hypothesis: the optimal functions are close to identity mappings, so framing the problem as learning small perturbations from the identity is a good fit.
| Level | Claim |
|---|---|
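| Mathematical equivalence | Both $\mathcal{H}(\mathbf{x})$ and $\mathcal{F}(\mathbf{x}) + \mathbf{x}$ can theoretically express the same functions |
| Optimization difficulty | Learning $\mathcal{F}(\mathbf{x})$ (small perturbations) is easier than learning $\mathcal{H}(\mathbf{x})$ (a function from scratch) |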
| Empirical reality | Deep networks actually learn small residuals, confirming that this formulation aligns with what networks naturally want to do |
Figure 2 shows the fundamental building block: a few weight layers compute $\mathcal{F}(\mathbf{x})$, an identity shortcut carries $\mathbf{x}$ around them, and the two are added to give $\mathcal{F}(\mathbf{x}) + \mathbf{x}$.
This is the architectural instantiation of the conceptual reformulation explained in this section.
We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper we ...
After establishing in Section 3.1 that it's easier for neural networks to learn residual functions (the difference between desired output and input) rather than learning functions from scratch, this section shows us how to actually implement this idea in practice.
The key question: How do we physically build a neural network that learns residual functions? The answer is shortcut connections: skip connections that bypass some layers and add the block's input to the output of those layers. This section formalizes exactly how to do this and addresses practical implementation details.
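Let's start with the fundamental equation:
$$ \mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x} \qquad (1) $$
Variables: $\mathbf{x}$ and $\mathbf{y}$ are the input and output vectors of the layers considered, and $\mathcal{F}(\mathbf{x}, \{W_i\})$ is the residual mapping to be learned, with weights $\{W_i\}$.
Think of this equation as a recipe: run the input through the stacked layers to get $\mathcal{F}(\mathbf{x}, \{W_i\})$, then add the untouched input $\mathbf{x}$ element by element.
Geometric intuition: If $\mathcal{F}$ represents small adjustments (perturbations) to $\mathbf{x}$, then $\mathbf{y}$ is just $\mathbf{x}$ plus those adjustments. This is much easier for optimization because the block starts from a sensible default (pass the input through) and only has to learn the correction.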
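The paper gives a specific example. For a block with two layers:
$$ \mathcal{F} = W_2 \, \sigma(W_1 \mathbf{x}) $$
Let's unpack this (biases are omitted to simplify notation, and $\sigma$ denotes ReLU):
First layer: $\mathbf{z}_1 = W_1 \mathbf{x}$
Nonlinearity: $\mathbf{a}_1 = \sigma(\mathbf{z}_1)$
Second layer: $\mathcal{F} = W_2 \mathbf{a}_1$
Complete picture: $\mathbf{y} = W_2 \, \sigma(W_1 \mathbf{x}) + \mathbf{x}$
Then the paper mentions applying another nonlinearity after the addition: $\sigma(\mathbf{y})$. This gives the final output of the building block, which feeds into the next block.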
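The steps above can be traced in a minimal pure-Python sketch. It assumes fully-connected layers, no biases, and small hand-picked weights chosen only so the arithmetic is easy to follow; `residual_block` is a hypothetical name, not the authors' code.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def residual_block(x, W1, W2):
    """Compute relu(W2 * relu(W1 * x) + x) for a two-layer residual block."""
    z1 = matvec(W1, x)                       # first layer
    a1 = relu(z1)                            # nonlinearity
    f = matvec(W2, a1)                       # second layer: the residual F
    y = [fi + xi for fi, xi in zip(f, x)]    # shortcut: F + x
    return relu(y)                           # second nonlinearity after the addition


W1 = [[1.0, 0.0], [0.0, 1.0]]   # assumed toy weights
W2 = [[0.5, 0.0], [0.0, 0.5]]
x = [1.0, -2.0]

out = residual_block(x, W1, W2)
print(out)
```

Tracing by hand: `W1 * x = [1, -2]`, ReLU gives `[1, 0]`, `W2` gives `[0.5, 0]`, adding `x` gives `[1.5, -2]`, and the final ReLU gives `[1.5, 0]`.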
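Here's a practical issue that arises: What if $\mathbf{x}$ and $\mathcal{F}$ don't have the same number of dimensions?
For the addition in Equation (1) to work, both vectors must be the same size. A mismatch happens when the block changes the number of input/output channels.
Solution: Use a projection shortcut (Equation 2)
$$ \mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s \mathbf{x} \qquad (2) $$
Where $W_s$ is a linear projection, applied on the shortcut, that maps $\mathbf{x}$ to the dimensions of $\mathcal{F}$.
The authors emphasize that $\mathcal{F}$ can take various forms: the experiments in the paper use two or three layers, and more are possible. With only a single layer, however, Equation (1) becomes $\mathbf{y} = W_1 \mathbf{x} + \mathbf{x} = (W_1 + I)\mathbf{x}$.
This reduces to a simple linear transformation (where $I$ is the identity matrix), providing no nonlinear benefit. You need at least two layers to get the advantage of learning a nonlinear residual.
In principle, a square $W_s$ could also be used when the dimensions already match, but the paper reports that the identity mapping is sufficient and more economical, so $W_s$ is used only to match dimensions.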
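The mathematical notation might suggest fully-connected (dense) layers, but the framework applies equally to convolutional neural networks: $\mathcal{F}(\mathbf{x}, \{W_i\})$ can represent multiple convolutional layers, and the element-wise addition is performed on two feature maps, channel by channel.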
This generalization is crucial because the paper's main applications use convolutional architectures for image processing.
| Aspect | Benefit |
|---|---|
| Simple formulation | Easy to implement, understand, and analyze |
| Parameter efficiency | Identity shortcuts add zero parameters; projections only when needed |
| Computational efficiency | Element-wise addition is negligible cost |
| Fair comparison | Residual and plain networks can have identical parameter counts |
| Optimization advantage | Networks learn residual perturbations rather than functions from scratch |
| Extensibility | Works with different block sizes and architectural styles |
The elegance of this section lies in showing that a small, mathematically simple modification—adding a single skip connection—provides substantial practical and theoretical benefits for training very deep networks.
The final output is:
$$ \mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x} = W_2 \, \sigma(W_1 \mathbf{x}) + \mathbf{x} $$
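The skip connection is crucial for deep networks. Let me explain why through the lens of backpropagation. If we take the derivative of $\mathbf{y}$ with respect to $\mathbf{x}$:
$$ \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathcal{F}}{\partial \mathbf{x}} + I $$
where $I$ is the identity matrix. This is crucial: even when $\frac{\partial \mathcal{F}}{\partial \mathbf{x}}$ is close to zero, the $I$ term carries the upstream gradient back to earlier layers unchanged.
This mitigates the vanishing gradient problem that hampered training of very deep networks.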
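The derivative identity can be checked numerically with finite differences. The sketch below is a scalar toy under stated assumptions: the residual function uses `tanh` as a smooth stand-in for ReLU so that the numerical derivative is well behaved, and the weights `0.3` and `-0.2` are arbitrary illustrative values.

```python
import math


def residual_fn(x, w1=0.3, w2=-0.2):
    """Toy scalar residual function F(x) = w2 * tanh(w1 * x)."""
    return w2 * math.tanh(w1 * x)


def block_output(x):
    """Residual block output y = F(x) + x."""
    return residual_fn(x) + x


def numeric_derivative(fn, x, h=1e-6):
    """Central finite-difference estimate of d fn / d x."""
    return (fn(x + h) - fn(x - h)) / (2.0 * h)


x0 = 0.7
dF_dx = numeric_derivative(residual_fn, x0)
dy_dx = numeric_derivative(block_output, x0)

# dy/dx should equal dF/dx + 1 up to numerical error.
print(dF_dx, dy_dx, dy_dx - dF_dx)
```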
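The paper notes:
"if $\mathcal{F}$ has only a single layer, Eqn.(1) is similar to a linear layer: $\mathbf{y} = W_1 \mathbf{x} + \mathbf{x}$, for which we have not observed advantages."
Let's understand why. With a single layer:
$$ \mathbf{y} = W_1 \mathbf{x} + \mathbf{x} = (W_1 + I)\mathbf{x} $$
This is just a linear transformation with a modified weight matrix. It doesn't provide any advantage over a standard layer because an ordinary linear layer with weights $W_1 + I$ computes exactly the same function.
With two or more layers and nonlinearity:
$$ \mathbf{y} = W_2 \, \sigma(W_1 \mathbf{x}) + \mathbf{x} $$
Now the residual path computes a nonlinear transformation, and the skip connection provides gradient shortcuts in deeper blocks.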
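The single-layer case can be verified directly: adding the shortcut gives the same result as folding the identity into the weight matrix. The weights below are arbitrary illustrative values, and the helper names are hypothetical.

```python
def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def add_identity(W):
    """Return W + I for a square matrix W."""
    return [[w + (1.0 if i == j else 0.0) for j, w in enumerate(row)]
            for i, row in enumerate(W)]


W1 = [[0.5, -1.0], [2.0, 0.25]]
x = [1.0, 2.0]

# Single-layer "residual" block: W1 * x + x
shortcut_form = [f + xi for f, xi in zip(matvec(W1, x), x)]
# Ordinary linear layer with weights W1 + I
folded_form = matvec(add_identity(W1), x)

print(shortcut_form, folded_form)
```

Both forms give the same vector, which is why a single linear layer gains nothing from the shortcut in terms of what it can represent.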
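The paper mentions Equation (2):
$$ \mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s \mathbf{x} $$
This is needed when the dimensions of $\mathbf{x}$ and $\mathcal{F}$ don't match, for example when changing the number of channels in a convolutional layer. The learnable projection matrix $W_s$ reshapes $\mathbf{x}$ to match $\mathcal{F}$'s output dimension, enabling element-wise addition.
However, the paper emphasizes that identity mapping is sufficient in most cases and is preferred because it is economical: it adds no parameters and no meaningful computation.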
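A minimal sketch of a projection shortcut, assuming a 2-dimensional input and a residual branch that outputs 3 dimensions. The matrix `Ws` and the vector `fx` are made-up illustrative values, and `projection_shortcut` is a hypothetical name.

```python
def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def projection_shortcut(fx, x, Ws):
    """Compute F(x) + Ws * x when F(x) and x have different sizes."""
    projected = matvec(Ws, x)
    if len(projected) != len(fx):
        raise ValueError("Ws must map x to the same size as F(x)")
    return [f + p for f, p in zip(fx, projected)]


x = [1.0, 2.0]                  # 2-dimensional block input
fx = [0.5, 0.25, -1.0]          # stands in for a 3-dimensional F(x)
Ws = [[1.0, 0.0],               # 3x2 projection (assumed values)
      [0.0, 1.0],
      [1.0, 1.0]]

y = projection_shortcut(fx, x, Ws)
print(y)
```

Without `Ws`, the addition would be undefined because the two vectors have different lengths.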
| Aspect | Impact |
|---|---|
| What is learned | The residual $\mathcal{F}(\mathbf{x})$, not the full output |
| Skip connection | Adds the input directly to the output |
| Gradient flow | Identity component ensures gradients always flow through skip path |
| Depth enabler | Makes training very deep networks possible (ResNet-152+) |
| Nonlinearity requirement | $\mathcal{F}$ must have ≥2 layers + activation to be beneficial |
| Parameter efficiency | Skip adds zero extra parameters |
This seemingly simple equation, $\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$, fundamentally changed deep learning by making it practical to train networks with 100+ layers. It's an elegant answer to a critical problem: how to train very deep architectures without the optimization difficulties that plain networks run into.
[Figures: the ReLU activation function used in the residual mapping; a worked example that computes $W_1 \mathbf{x}$ for the first layer, applies ReLU to the result, multiplies the activation by each row of $W_2$, and forms the final output $\mathbf{y} = \mathcal{F} + \mathbf{x}$ by element-wise addition.]
The bottleneck design achieves parameter reduction by first reducing channels (1×1 conv), then doing computation in a smaller space (3×3 conv), then expanding back (1×1 conv). This is more efficient than direct convolutions on high-channel-count data.
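A quick parameter count illustrates the saving. The channel widths below (256 wide, 64 in the bottleneck) are assumed illustrative values that are not stated in this section, biases are ignored, and `conv_params` is a hypothetical helper.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weight count of one convolution (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


wide, narrow = 256, 64  # assumed channel widths, for illustration only

# Bottleneck: 1x1 reduce, 3x3 in the narrow space, 1x1 expand.
bottleneck = (conv_params(1, wide, narrow)
              + conv_params(3, narrow, narrow)
              + conv_params(1, narrow, wide))

# Two 3x3 convolutions applied directly at the wide channel count.
direct = 2 * conv_params(3, wide, wide)

print(bottleneck, direct, direct / bottleneck)
```

Under these assumptions the bottleneck uses 69,632 weights against 1,179,648 for the direct design, roughly a 17× reduction.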
The residual equation embodies several breakthrough ideas:
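Instead of learning the full mapping $\mathcal{H}(\mathbf{x})$, the network learns the residual $\mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x}$. This reshapes the optimization problem to make learning easier.
The derivative $\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathcal{F}}{\partial \mathbf{x}} + I$ ensures that the gradient always has a direct path back through the shortcut, whatever happens inside $\mathcal{F}$.
The projection matrix $W_s$ handles cases where dimensions change, making skip connections applicable throughout the network.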
The skip connection adds only element-wise addition (negligible cost) while providing massive architectural benefits.
By enabling much deeper networks (152 layers on ImageNet, and more than 1000 layers in the CIFAR-10 experiments) without the degradation problem, residual connections demonstrated that depth, when properly architected, improves performance. This opened the way to the very deep models we see today.
This single equation became the foundation for modern deep learning practice!
[Figures: computing $W_1 \mathbf{x}$ (first layer output); applying ReLU after the first layer (keeping positive values); how the residual connection affects gradients during backpropagation; the ReLU activation function used in the residual block; an example of projecting a 3D input down to 2D via a linear transformation (like $W_s$ in the residual equation).]
We have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion,...
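This section is crucial because it translates the theoretical ideas about residual learning (introduced in 3.1 and 3.2) into concrete network designs. The authors need to show a plain baseline network, a residual counterpart built from the same layers, and how the two can be compared fairly.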
This matters because the entire paper's contribution depends on comparing equally-resourced networks (same parameters, same computational cost) where the only difference is the presence of residual connections. If the residual network had more parameters, we wouldn't know if improvements came from residual learning or just from having a bigger model.
The authors base their plain network design on VGG nets [41], which were the state-of-the-art reference architecture at the time. Let me break down the design rules:
Design Rule (i): For layers producing the same output feature map size, use the same number of filters.
Design Rule (ii): When halving the feature map size, double the number of filters.
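To understand rule (ii), think about the cost of one convolutional layer:
$$ \text{cost} \propto H \times W \times C_{\text{in}} \times C_{\text{out}} \times k^2 $$
where $H \times W$ is the output feature map size, $C_{\text{in}}$ and $C_{\text{out}}$ are the numbers of input and output channels, and $k$ is the kernel size.
When you reduce spatial dimensions by half (stride-2 downsampling), you go from $H \times W$ to $\frac{H}{2} \times \frac{W}{2}$, reducing the spatial cost by a factor of 4.
By doubling the number of filters, you double $C_{\text{out}}$, and the layers that follow also see twice as many input channels, so the channel cost $C_{\text{in}} \times C_{\text{out}}$ grows by 4×.
Result: Net effect is $\frac{1}{4} \times 4 = 1$, so you keep computational cost roughly constant across layers at different resolutions.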
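The arithmetic can be confirmed with a short multiply-add count. The sketch assumes 3×3 convolutions and compares a layer at 56×56 resolution with 64 channels to a layer at 28×28 resolution with 128 channels; `conv_multiply_adds` is a hypothetical helper that ignores biases and any other operations.

```python
def conv_multiply_adds(height, width, in_channels, out_channels, kernel_size=3):
    """Multiply-adds of one convolution producing a height x width output."""
    return height * width * in_channels * out_channels * kernel_size * kernel_size


# A layer inside the 56x56 stage (64 channels in, 64 channels out).
before = conv_multiply_adds(56, 56, 64, 64)
# A layer inside the next stage: half the resolution, double the channels.
after = conv_multiply_adds(28, 28, 128, 128)

print(before, after)
```

Both layers cost the same number of multiply-adds, which is exactly what design rule (ii) is meant to achieve.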
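The authors build a 34-layer plain network with these rules. Key facts: the convolutional layers mostly use 3×3 filters, downsampling is done by convolutional layers with a stride of 2, and the network ends with global average pooling and a 1000-way fully-connected layer with softmax.
Computational comparison: the 34-layer baseline needs 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs).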
This is important because it shows the plain baseline is actually efficient, not just a strawman architecture.
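The residual network takes the plain network and inserts shortcut connections as described in Section 3.2. The key innovation is the building block:
$$ \mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x} $$
where $\mathbf{x}$ and $\mathbf{y}$ are the input and output of the block and $\mathcal{F}(\mathbf{x}, \{W_i\})$ is the residual mapping computed by the stacked convolutional layers.
When the input and residual function output have the same dimensions, you directly add them, exactly as in the equation above.
In Figure 3 (right), these are shown as solid line shortcuts. This is the simplest case: no extra parameters and no extra computation beyond the element-wise addition.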
When spatial dimensions or channel counts change, direct addition isn't possible. Consider this scenario:
If $\mathbf{x}$ has shape $56 \times 56 \times 64$ (56×56 spatial, 64 channels) but $\mathcal{F}(\mathbf{x})$ has shape $28 \times 28 \times 128$ (half spatial resolution, double channels), you cannot perform element-wise addition.
The authors present two options:
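Option A: Zero-Padding (Identity with Padding)
The shortcut still performs identity mapping, and the extra channels are filled with zeros. This adds no parameters.
Option B: Projection Shortcut (with Learned Parameters)
$$ \mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s \mathbf{x} $$
where $W_s$ is typically a $1 \times 1$ convolution that:
Changes spatial dimensions via stride-2 (for downsampling)
Changes channel count via different output filters
Advantage: Learned transformation can adapt to the dimension change
Disadvantage: Adds parameters and computation (though $1 \times 1$ convolutions are cheap)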
"When the shortcuts go across feature maps of two sizes, they are performed with a stride of 2."
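This is important: if your residual block halves the spatial resolution, the shortcut must also halve it:
$$ \mathbf{x}: \; 56 \times 56 \times 64 \;\xrightarrow{\text{shortcut, stride 2}}\; 28 \times 28 \times 128 $$
Now both can be added: $\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s \mathbf{x}$.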
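For the parameter-free alternative (Option A), the shape handling can be sketched as follows. This is an assumption-laden toy: feature maps are nested Python lists indexed as `[channel][row][column]`, the stride-2 step is implemented as simple subsampling, and the sizes are tiny so the result is easy to inspect. `zero_pad_shortcut` is a hypothetical name.

```python
def zero_pad_shortcut(x, out_channels, stride=2):
    """Subsample spatially with the given stride, then pad channels with zeros."""
    subsampled = [[row[::stride] for row in channel[::stride]] for channel in x]
    height = len(subsampled[0])
    width = len(subsampled[0][0])
    extra = out_channels - len(subsampled)
    padding = [[[0.0] * width for _ in range(height)] for _ in range(extra)]
    return subsampled + padding


# Toy input: 2 channels of 4x4, with values that encode their position.
x = [[[float(c * 100 + r * 10 + col) for col in range(4)] for r in range(4)]
     for c in range(2)]

shortcut = zero_pad_shortcut(x, out_channels=4)

# Result: 4 channels of 2x2, where the last two channels are all zeros.
print(len(shortcut), len(shortcut[0]), len(shortcut[0][0]))
```

The output now has half the spatial size and twice the channels, so it can be added element-wise to the residual branch without any learned weights.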
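The critical point: Both the plain and residual networks have the same depth, the same width, and (with parameter-free identity shortcuts) the same number of parameters and the same computational cost, apart from the negligible element-wise addition.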
This means any performance difference comes purely from the residual learning framework, not from having more capacity.
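The authors emphasize that the residual function $\mathcal{F}$ can be flexible: the blocks in the paper use two or three layers, and more are possible.
But a single-layer residual block doesn't help: $\mathbf{y} = W_1 \mathbf{x} + \mathbf{x}$ is essentially just a linear layer, and the experiments show no advantage.
The notation uses fully-connected layer notation for simplicity, but everything applies to convolutional layers: $\mathcal{F}$ can represent multiple convolutional layers, with the addition performed on feature maps channel by channel.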
| Aspect | Plain Network | Residual Network |
|---|---|---|
| Layers | 34 | 34 |
| Parameters | Same | Same |
| FLOPs | 3.6B | 3.6B (shortcuts add negligible computation) |
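| Building block | $\mathbf{y} = \mathcal{F}(\mathbf{x})$ | $\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$ |
| Shortcut strategy | None | Identity (same dim) or projection (different dim) |
| Dimension handling | N/A | Zero-padding or learned $1 \times 1$ projection |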
The authors can now run controlled experiments: plain vs. residual networks with identical computational resources. Any performance improvement cannot be attributed to more parameters, greater depth, or extra computation.
It must come from how the information flows through the network—specifically, from having gradient highways (shortcut connections) that make optimization easier. This experimental setup is what makes ResNets' dramatic improvements so convincing.
Our implementation for ImageNet follows the practice in [21, 41]. The image is resized with its shorter side randomly sa...
After spending sections 3.1–3.3 explaining the conceptual innovation of residual learning and describing what the networks look like architecturally, this section tells us the practical details of how the authors actually trained these networks.
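This is crucial because the results can only be reproduced, and the plain and residual networks compared fairly, if every network is trained with the same recipe.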
Think of it like a recipe: sections 3.1–3.3 describe the dish you're making, and section 3.4 gives you the exact ingredients, cooking temperature, and timing.
Let me work through the image preparation step-by-step:
Step 1: Scale Augmentation
"The image is resized with its shorter side randomly sampled in [256, 480]"
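What does this mean? For each training image, an integer size is drawn at random from 256 to 480, and the image is rescaled so that its shorter side has that length.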
Why do this? This creates scale variation in the training data. Networks trained on multiple scales generalize better to images of different sizes.
Step 2: Cropping and Flipping
"A 224×224 crop is randomly sampled from an image or its horizontal flip"
After resizing: a random 224×224 window is cut out of either the resized image or its mirror image.
Why? Random cropping and flipping create data augmentation—the network sees different versions of the same image, which reduces overfitting.
Step 3: Normalization
"with the per-pixel mean subtracted"
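For each pixel position across all training images, the authors compute the mean value at that position over the training set and subtract it from every input image.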
Why? This centers the data around zero, which helps the optimization algorithm (SGD) work more efficiently. This is a standard preprocessing technique in machine learning.
Step 4: Color Augmentation
"The standard color augmentation in [21] is used"
This refers to the color augmentation of [21] (AlexNet), which perturbs the RGB intensities of each training image along the principal components of the pixel colors. The authors don't detail it here but reference their source.
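The geometric part of these steps can be sketched without any image library by computing only sizes and crop coordinates. The function name `sample_training_view`, the uniform integer sampling of the shorter side, and the 50% flip probability are assumptions for illustration; the paper specifies only the range [256, 480], the 224×224 crop, and the use of horizontal flips.

```python
import random


def sample_training_view(width, height, rng, crop=224, lo=256, hi=480):
    """Pick a random scale, crop window, and flip flag for one training image."""
    s = rng.randint(lo, hi)                  # target length of the shorter side
    scale = s / min(width, height)
    new_w = max(crop, round(width * scale))  # resized size, aspect ratio preserved
    new_h = max(crop, round(height * scale))
    left = rng.randint(0, new_w - crop)      # random 224x224 window
    top = rng.randint(0, new_h - crop)
    flip = rng.random() < 0.5                # assumed 50% chance of mirroring
    return {
        "resized": (new_w, new_h),
        "crop_box": (left, top, left + crop, top + crop),
        "flip": flip,
    }


view = sample_training_view(500, 375, random.Random(0))
```

Mean subtraction and color augmentation would then be applied to the pixels inside `crop_box`; they are omitted here because they need actual image data.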
"We adopt batch normalization (BN) [16] right after each convolution and before activation"
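What is Batch Normalization? Within each training mini-batch, the activations are shifted to zero mean and scaled to unit variance, then passed through a learned scale and shift.
Mathematically, for a mini-batch, if $x_i$ is the output of a convolution for the $i$-th sample:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
where $\mu_B$ is the mean and $\sigma_B^2$ is the variance across the mini-batch, and $\epsilon$ is a small constant for numerical stability.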
Why this placement? By normalizing before the ReLU activation (rather than after), the internal distributions stay stable during training, which speeds up convergence and allows higher learning rates.
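As a small illustration of the conv, then BN, then ReLU ordering, the sketch below normalizes a list of scalar activations standing in for one channel across a mini-batch. The function names and the default `eps` are illustrative assumptions.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of scalars to zero mean and unit variance,
    then apply the learned scale (gamma) and shift (beta)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # biased variance over the batch
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]


def conv_bn_relu(conv_outputs):
    """BN right after the convolution output and before the activation."""
    return [max(0.0, y) for y in batch_norm(conv_outputs)]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
activated = conv_bn_relu([1.0, 2.0, 3.0, 4.0])
```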
"We initialize the weights as in [13]"
This refers to He initialization, a specific method for setting initial weight values that accounts for the number of input neurons, helping prevent vanishing/exploding gradients at the start of training.
Let me explain the optimization setup:
Learning Procedure: SGD with a mini-batch size of 256; the models are trained for up to $60 \times 10^4$ iterations.
Learning Rate Schedule:
"The learning rate starts from 0.1 and is divided by 10 when the error plateaus"
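The learning rate controls step size in SGD. Mathematically, a weight update looks like:
$$w_{t+1} = w_t - \eta \nabla L(w_t)$$
where $w_t$ are weights at iteration $t$, $\nabla L(w_t)$ is the gradient of the loss, and $\eta$ is the learning rate.
The schedule works as: $\eta$ starts at 0.1 and drops to 0.01, then 0.001, each time the error stops improving.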
Why reduce learning rate? Early in training, large steps help escape local minima. Later, smaller steps allow fine-tuning near the optimum.
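Regularization: weight decay of 0.0001 and momentum of 0.9; dropout is not used.
The momentum update rule is:
$$v_{t+1} = \mu v_t - \eta \nabla L(w_t), \qquad w_{t+1} = w_t + v_{t+1}$$
where $v_t$ is the velocity/momentum term, $\eta$ is the learning rate, $w_t$ are the weights, and $\mu = 0.9$ in this case. Intuitively, this makes the optimizer accelerate in consistent directions and dampen oscillations.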
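The update rule and the "divide by 10 on plateau" schedule can be sketched as follows. The paper gives the momentum (0.9), weight decay (0.0001), and starting rate (0.1), but not how a plateau is detected, so the `patience` parameter and the class name `PlateauSchedule` are assumptions made for illustration.

```python
def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=1e-4):
    """One update: v <- momentum * v - lr * (grad + wd * w); w <- w + v."""
    g = grad + weight_decay * w  # L2 weight decay folded into the gradient
    v = momentum * v - lr * g
    return w + v, v


class PlateauSchedule:
    """Divide the learning rate by `factor` when the error stops improving.

    `patience` (evaluations without improvement) is an illustrative choice.
    """

    def __init__(self, lr=0.1, factor=10.0, patience=2):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.stale = float("inf"), 0

    def update(self, error):
        if error < self.best:
            self.best, self.stale = error, 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr, self.stale = self.lr / self.factor, 0
        return self.lr
```

A training loop would call `sgd_momentum_step` once per mini-batch and `PlateauSchedule.update` once per validation pass, feeding the returned rate back into the next steps.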
"For comparison studies we adopt the standard 10-crop testing"
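During testing (evaluation): ten 224×224 crops are taken from each image (the four corners and the center, plus the horizontal flip of each), and the network's predictions over the ten crops are averaged.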
Why? This reduces variance in predictions and is more robust than a single evaluation.
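A minimal sketch of standard 10-crop testing, computing crop coordinates and averaging per-view class scores. The function names are illustrative, and the crops are described only by their boxes and a flip flag rather than by pixel data.

```python
def ten_crop_views(width, height, crop=224):
    """Four corner crops and one center crop, each with and without a flip."""
    right, bottom = width - crop, height - crop
    corners = [(0, 0), (right, 0), (0, bottom), (right, bottom),
               (right // 2, bottom // 2)]
    views = []
    for flip in (False, True):
        for left, top in corners:
            views.append((left, top, left + crop, top + crop, flip))
    return views


def average_scores(per_view_scores):
    """Average class scores over all views (one list of scores per view)."""
    n = len(per_view_scores)
    return [sum(col) / n for col in zip(*per_view_scores)]


views = ten_crop_views(256, 256)
```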
"For best results, we adopt the fully-convolutional form as in [41, 13], and average the scores at multiple scales"
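For their best reported numbers: the network is applied fully-convolutionally to the whole image, and the scores are averaged over several scales, with the image resized so that its shorter side is in $\{224, 256, 384, 480, 640\}$.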
| Component | Choice | Purpose |
|---|---|---|
| Scale augmentation | Shorter side in [256, 480] | Learn features at multiple scales |
| Cropping | Random 224×224 crops | Prevent overfitting to specific image regions |
| Batch normalization | Applied before activation | Stabilize training, enable higher learning rates |
| Learning rate | 0.1, divided by 10 on plateau | Start with exploration, finish with refinement |
| Momentum | 0.9 | Smooth optimization trajectory |
| Weight decay | 0.0001 | Prevent overfitting through regularization |
| Test augmentation | 10-crop averaging | Reduce prediction variance |
The key insight: these are standard practices in deep learning, applied consistently to both plain and residual networks. This fairness is essential for validating the authors' claim that ResNets are fundamentally better architectures, not just better-trained networks.
We evaluate our method on the ImageNet 2012 classification dataset [36] that consists of 1000 classes. The models are tr...
This section is the empirical heart of the ResNet paper. The authors are demonstrating that their residual learning framework actually solves the degradation problem—the phenomenon where deeper networks perform worse than shallower ones, even on training data. This is crucial because:
Let me walk you through the key experiments and findings.
The authors train two plain convolutional networks: an 18-layer network and a 34-layer network.
Key expectation: The 34-layer network should perform at least as well as the 18-layer network, since it has more representational capacity. Mathematically, the solution space of the 18-layer network is a subset of the 34-layer network's solution space—anything the 18-layer network can learn, the 34-layer one could theoretically learn too.
Looking at Table 2 and Figure 4 (left), the 34-layer plain net performs worse than the 18-layer net. Even more concerning: the 34-layer net has higher training error throughout training, not just validation error. This is the degradation problem in action.
The authors make an important argument here. In very deep networks trained without batch normalization, gradients can "vanish" (become exponentially small) as they propagate backward through many layers. This would prevent learning because the weight updates would be negligible:
$$\frac{\partial L}{\partial w} \approx 0$$
where $L$ is the loss function and $w$ is a weight in an early layer.
But here's the key: These networks use batch normalization (BN), introduced in section 3.4. Batch normalization ensures that the activations (outputs of each layer) maintain reasonable statistics throughout training, which prevents gradient vanishing. The authors verify this by checking that gradients have "healthy norms"—they don't shrink to zero.
Instead, the authors conjecture that the optimization problem itself is fundamentally harder. The 34-layer network might have an exponentially low convergence rate, meaning:
$$\text{error}(t) \propto e^{-\lambda t}$$
where $t$ is the number of iterations and $\lambda$ is the convergence rate. For very deep nets, $\lambda$ might be so small that even after training for millions of iterations, the error barely decreases. This is subtly different from vanishing gradients: the gradients aren't zero, but the optimization landscape is shaped in a way that makes it extremely slow to navigate.
Now the authors add shortcut connections to the same 18-layer and 34-layer networks, creating ResNets. Recall from equation (1) in section 3.2:
$$y = \mathcal{F}(x, \{W_i\}) + x$$
This simple addition fundamentally changes optimization. Instead of learning $\mathcal{H}(x)$ directly, layers learn the residual $\mathcal{F}(x) = \mathcal{H}(x) - x$, which represents the difference from what would happen if the layer did nothing.
Observation 1: Deeper is Better Again
With ResNets, the 34-layer network is 2.8% better than the 18-layer version (Table 2). The degradation problem is gone! Looking at Figure 4 (right), both training and validation error decrease smoothly with depth—the relationship is now monotonic in the right direction.
Observation 2: Massive Improvement Over Plain Nets
The 34-layer ResNet reduces top-1 error by 3.5% compared to its plain counterpart (Table 2). This is a huge improvement in validation accuracy. Importantly, Figure 4 shows that the training error is "considerably lower"—the ResNet actually learns the training data better, which then translates to better generalization.
Observation 3: Shallow Networks Benefit Too
Even the 18-layer ResNet (where degradation isn't a problem) converges faster than its plain equivalent. This tells us that residual learning helps even when we're not fighting the degradation problem—it just makes optimization easier in general.
In section 3.3, the authors mentioned that shortcuts need special handling when input and output have different dimensions. Let me explain the three options tested:
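Option A: Identity shortcuts with zero-padding. Zero-padding is used where dimensions increase, so all shortcuts are parameter-free.
Option B: Projection shortcuts. Projections are used where dimensions increase; all other shortcuts are identity.
Option C: All-projection. Every shortcut is a projection.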
Table 3 shows something surprising: all three options significantly outperform the plain baseline. The differences between A, B, and C are small—B is slightly better than A, and C is marginally better than B.
The key insight is: learning the shortcut connection isn't essential. The identity mapping (or zero-padded identity) is sufficient to fix the degradation problem. The authors hypothesize that option C performs marginally better only because it adds more capacity (parameters), not because projection shortcuts are fundamentally necessary.
This is important for practical reasons: identity shortcuts add no parameters and only a negligible element-wise addition, making the ResNets more efficient.
Training a 34-layer ResNet takes significant computational time. To go even deeper without excessive training time, the authors modify the basic building block.
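Instead of the 2-layer block (two 3×3 convolutions), they use a 3-layer bottleneck block: a 1×1, a 3×3, and a 1×1 convolution.
Why this design? Here's the intuition: the 1×1 layers first reduce and then restore the number of channels, so the 3×3 convolution operates on a narrower (bottleneck) representation.
Looking at Figure 5, a concrete example: suppose the input has 256 channels. The bottleneck reduces it to 64 channels with a 1×1 convolution, applies a 3×3 convolution at 64 channels, and restores 256 channels with a final 1×1 convolution.
Computational benefit: Both designs have similar time complexity, but the bottleneck design spends its 3×3 computation on the low-dimensional representation while keeping high-dimensional inputs and outputs.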
Here's a mathematical argument. Consider a bottleneck block with:
If we used projection shortcuts instead of identity shortcuts, we'd need a projection at:
For identity shortcuts, we need nothing—zero parameters.
The asymmetry is crucial: "the shortcut is connected to the two high-dimensional ends." This means if you project, you pay a parameter cost at both ends, effectively doubling the model size and computation compared to identity shortcuts.
This is why parameter-free identity shortcuts are particularly important for bottleneck architectures—they keep the models efficient while still getting the optimization benefits of residual learning.
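The "doubling" claim can be checked with a quick weight count. The sketch below counts convolution weights only (biases and BN parameters are ignored) for the 256-channel bottleneck with a 64-channel middle layer shown in Figure 5; the function names are illustrative.

```python
def conv_params(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def bottleneck_params(c_wide=256, c_narrow=64):
    """1x1 reduce, 3x3 at the narrow width, 1x1 restore."""
    return (conv_params(1, c_wide, c_narrow)
            + conv_params(3, c_narrow, c_narrow)
            + conv_params(1, c_narrow, c_wide))


def projection_shortcut_params(c_in=256, c_out=256):
    """A 1x1 projection connecting the two high-dimensional ends."""
    return conv_params(1, c_in, c_out)


block = bottleneck_params()              # 16384 + 36864 + 16384 = 69632
shortcut = projection_shortcut_params()  # 256 * 256 = 65536
ratio = (block + shortcut) / block       # roughly 1.94
```

A single projection shortcut costs almost as many weights as the entire bottleneck block, which is why replacing the identity with a projection roughly doubles the block's size.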
The authors build three very deep networks using the bottleneck design:
ResNet-50: Replace each 2-layer block in the 34-layer net with a 3-layer bottleneck → 50 total layers
ResNet-101: More bottleneck blocks added → 101 layers
ResNet-152: Even more bottleneck blocks → 152 layers!
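The layer counts follow directly from the number of blocks per stage. The sketch below uses the per-stage block counts from the paper's Table 1; the dictionary layout and function name are illustrative.

```python
# Blocks per stage (conv2_x .. conv5_x), as listed in Table 1 of the paper.
CONFIGS = {
    18: ("basic", [2, 2, 2, 2]),
    34: ("basic", [3, 4, 6, 3]),
    50: ("bottleneck", [3, 4, 6, 3]),
    101: ("bottleneck", [3, 4, 23, 3]),
    152: ("bottleneck", [3, 8, 36, 3]),
}


def total_depth(kind, blocks):
    """Weighted layers: layers per block times blocks, plus conv1 and the fc."""
    layers_per_block = 2 if kind == "basic" else 3
    return layers_per_block * sum(blocks) + 2


depths = {name: total_depth(kind, blocks) for name, (kind, blocks) in CONFIGS.items()}
```

ResNet-34 and ResNet-50 share the same block counts; only the block type changes, which is how swapping each 2-layer block for a 3-layer bottleneck turns 34 layers into 50.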
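Even the 152-layer ResNet has lower complexity than the VGG-16/19 nets, the previous state-of-the-art deep networks: 11.3 billion FLOPs for ResNet-152 versus 15.3 billion for VGG-16 and 19.6 billion for VGG-19.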
Yet it performs much better! And crucially: no degradation problem. Deeper networks continue to improve.
Looking at Tables 3 and 4, the 50/101/152-layer ResNets show "considerable margins" of improvement over the 34-layer versions. Every deeper network improves on every metric.
Table 4 compares ResNets with previous best methods. The 152-layer ResNet achieves a single-model top-5 validation error of 4.49%.
This beats all previous ensemble results (methods that combined multiple models) using just a single model!
The authors combine six models of different depths (including two 152-layer ResNets) to form an ensemble. This achieves 3.57% top-5 error on the ImageNet test set.
This wins 1st place in ILSVRC 2015, one of the most prestigious computer vision competitions at the time.
| Concept | Insight |
|---|---|
| The Problem | Plain networks suffer from degradation: deeper nets have higher training error, suggesting optimization difficulty (not capacity or gradient vanishing) |
| The Solution | Residual shortcuts enable much deeper networks by learning residuals rather than absolute functions |
| Shortcut Type | Identity shortcuts (free!) work as well as learned projections, especially important for bottleneck designs |
| Scalability | Bottleneck design allows extreme depths (152 layers) with less computation than shallower VGG networks |
| Empirical Result | ResNets achieve state-of-the-art results, winning ILSVRC 2015 with depths 8× greater than previous methods |
The fundamental insight: making the optimization problem easier matters more than making the network deeper. Shortcuts make optimization easier, which allows us to use the additional representational capacity of deeper networks effectively.
We conducted more studies on the CIFAR-10 dataset [20], which consists of 50k training images and 10k testing images in ...
In the previous ImageNet section, the authors demonstrated that residual networks (ResNets) solve the "degradation problem" (where deeper networks train worse than shallower ones) on a large-scale dataset. However, a critical question remains: Is this a general phenomenon, or specific to ImageNet?
Section 4.2 answers this by testing on CIFAR-10, a smaller, simpler dataset. More importantly, it goes beyond just reporting numbers—it provides mechanistic analysis of why ResNets work. The authors investigate:
This section is crucial because it suggests the benefits of residual learning are fundamental and generalizable, not just lucky artifacts of ImageNet.
CIFAR-10 is a much smaller dataset than ImageNet: 50k training images and 10k testing images in 10 classes, each only 32×32 pixels.
The authors deliberately use simple architectures on this dataset because they want to study optimization behavior rather than achieve state-of-the-art performance. This is a smart choice: simpler architectures make it easier to isolate the effect of depth from other architectural choices.
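The network architecture is parameterized by a single variable $n$, which controls depth:
Architecture formula: a first 3×3 convolution, then a stack of $6n$ layers of 3×3 convolutions ($2n$ layers for each of three feature map sizes), then global average pooling, a 10-way fully-connected layer, and softmax.
Feature map sizes and filters: feature maps of size $\{32, 16, 8\}$ with $\{16, 32, 64\}$ filters respectively.
Total depth: $6n + 2$ weighted layers
So for $n = 3$: we get $6 \cdot 3 + 2 = 20$ layers; for $n = 18$: we get $6 \cdot 18 + 2 = 110$ layers.
The authors test $n \in \{3, 5, 7, 9\}$, yielding networks of 20, 32, 44, and 56 layers, then push much further with $n = 18$ (110 layers) and $n = 200$ (1202 layers).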
Parameterization benefit: By varying only , the authors ensure:
This is methodologically superior to just arbitrarily building different-sized networks, because it isolates the effect of depth.
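The depth rule is easy to verify numerically. This sketch assumes the $6n + 2$ layout described in the paper (three stages of $2n$ layers with feature map sizes 32, 16, 8 and 16, 32, 64 filters); the function names are illustrative.

```python
def cifar_resnet_depth(n):
    """Weighted layers: first conv + 6n stacked 3x3 convs + final fc."""
    return 6 * n + 2


def cifar_stage_layout(n):
    """(feature map size, number of filters, number of 3x3 layers) per stage."""
    return [(32, 16, 2 * n), (16, 32, 2 * n), (8, 64, 2 * n)]


depths = [cifar_resnet_depth(n) for n in (3, 5, 7, 9, 18, 200)]
```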
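For ResNets on CIFAR-10: shortcut connections are attached to each pair of 3×3 layers, and identity shortcuts are used in all cases (option A), so the residual networks have exactly the same depth, width, and number of parameters as their plain counterparts.
The identity shortcut means: if input $x$ goes through two 3×3 convolutions (the residual function $\mathcal{F}$), the output is:
$$y = \mathcal{F}(x) + x$$
where both $\mathcal{F}(x)$ and $x$ have identical dimensions within a stage; where the number of filters changes between stages, option A pads the extra channels with zeros.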
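The training setup is critical for understanding the results:
Regularization: weight decay of 0.0001 and momentum of 0.9, with the weight initialization of [13] and batch normalization, but no dropout.
Optimization: mini-batch size of 128 on two GPUs; the learning rate starts at 0.1 and is divided by 10 at 32k and 48k iterations, and training stops at 64k iterations.
Data augmentation (for training only): 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. Testing uses the single original 32×32 view.
Special case for 110-layer network: an initial learning rate of 0.1 is slightly too large to start converging, so training warms up at 0.01 until the training error drops below 80% (about 400 iterations) and then returns to 0.1.
This warm-up strategy is pragmatic: a very deep network can fail to start converging at the full learning rate, so a short period at a smaller rate gets optimization going before the normal schedule resumes.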
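The CIFAR-10 schedule reported in the paper (0.1, divided by 10 at 32k and 48k iterations, with a 0.01 warm-up for the 110-layer net until training error falls below 80%) can be written as a small lookup. The function names and the way the warm-up state is passed in are assumptions for illustration.

```python
def still_warming_up(train_error, threshold=0.80):
    """Warm-up lasts until the training error drops below the threshold."""
    return train_error >= threshold


def cifar_learning_rate(iteration, warming_up=False):
    """Step schedule: 0.1, then 0.01 from 32k, then 0.001 from 48k iterations."""
    if warming_up:
        return 0.01  # reduced rate used only for the 110-layer network
    if iteration < 32_000:
        return 0.1
    if iteration < 48_000:
        return 0.01
    return 0.001


lr_start = cifar_learning_rate(0, warming_up=still_warming_up(0.90))
lr_after = cifar_learning_rate(400, warming_up=still_warming_up(0.75))
```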
Figure 6 (left) shows training and testing curves for plain networks of varying depths (20, 32, 44, 56 layers).
Key observation: As networks get deeper, both training and testing error increase—exactly like on ImageNet:
Why this matters: This is the degradation problem in action. Notice it's not just overfitting (testing error being worse than training error)—the problem is that training error itself gets worse. This rules out the simple explanation "we're overfitting because the model is too big."
The paper notes that plain-110 has an error above 60%, which is so bad it's not even plotted.
Figure 6 (middle) shows the same networks converted to ResNets (by adding identity shortcuts).
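The transformation is striking: the ordering reverses, and deeper ResNets now reach lower error than shallower ones.
The 110-layer ResNet converges well and reaches 6.43% test error.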
This directly parallels the ImageNet findings: residual connections enable optimization of very deep networks.
This is where the section becomes particularly insightful. Rather than just showing that ResNets work, the authors investigate how they work by analyzing the magnitudes of layer responses.
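For each 3×3 convolutional layer in the network, take its output after batch normalization and before the nonlinearity (ReLU or addition).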
Call this post-BN, pre-activation output the "response" of that layer.
Why measure this? The response magnitude tells us how much each layer is contributing to the computation. A small response means the layer is doing minimal modification to the input signal; a large response means substantial transformation.
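Figure 7 plots the standard deviation (std) of layer responses:
$$\text{std}(r_l) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(r_{l,i} - \bar{r}_l\right)^2}$$
where $r_l$ is the response of layer $l$ (with $N$ entries $r_{l,i}$) and $\bar{r}_l$ is its mean. The standard deviation measures the typical magnitude of activations in that layer.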
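Computing this statistic per layer is a one-liner with the standard library. The sketch assumes each layer's responses have already been collected into a flat list of numbers; the function names and the toy values are illustrative.

```python
import statistics


def response_std(responses):
    """Population standard deviation of one layer's post-BN responses."""
    return statistics.pstdev(responses)


def std_per_layer(layer_responses):
    """One std value per layer, in network order."""
    return [response_std(r) for r in layer_responses]


# Toy responses for two hypothetical layers; the second varies less.
stds = std_per_layer([[1.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4]])
```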
Main empirical observation: ResNets generally have smaller responses than their plain counterparts.
Why this matters: Recall from Section 3.1 of the paper that the core insight of residual learning is:
Instead of learning: $\mathcal{H}(x)$ (some arbitrary transformation)
Learn: $\mathcal{F}(x) + x$ (an identity plus a small modification)
The layer response analysis empirically supports this motivation. In ResNets, the residual function $\mathcal{F}(x)$ tends to be small, meaning:
$$\|\mathcal{F}(x)\| \ll \|x\|$$
(The residual correction is much smaller in magnitude than the signal itself)
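The paper observes an interesting trend: deeper ResNets have smaller response magnitudes, as seen by comparing ResNet-20, ResNet-56, and ResNet-110 in Figure 7.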
Interpretation: As networks get deeper, each individual layer modifies the signal less. This makes intuitive sense: with many layers available, no single layer needs to make a drastic change. The accumulated effect of many small modifications compounds to useful feature learning.
This is reminiscent of biological neural systems, where individual neurons make small contributions that combine into complex computations.
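The authors set $n = 200$, creating a 1202-layer network:
Training behavior: there is no optimization difficulty, and the training error falls below 0.1%.
Testing behavior: the test error is 7.93%, still fairly good but worse than the 110-layer network's 6.43%.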
This is a fascinating turn. For the first time, we see residual learning solve one problem (optimization) but reveal another (overfitting):
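The degradation problem (solved by ResNets): training error rises as the network gets deeper.
The overfitting problem (revealed at 1202 layers): training error is very low, but test error is higher than for the shallower 110-layer network.
Why this matters: The 1202-layer network (19.4M parameters) may be unnecessarily large for such a small dataset. It has enough capacity to fit the training set almost perfectly, but that fit does not carry over to the test set.
Rather than trying to patch this with ad-hoc tricks, they explicitly state that they use no maxout or dropout, to avoid distracting from the focus on optimization difficulties, and that combining ResNets with stronger regularization may improve results, which they leave for future study.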
This intellectual honesty is valuable: they're demonstrating that their core contribution (residual learning for optimization) doesn't automatically solve all problems. Generalization (preventing overfitting) is a separate challenge requiring additional techniques.
By moving from ImageNet to CIFAR-10, the authors show:
The degradation problem is fundamental, not dataset-specific
Residual learning is a universal solution (for optimization)
Layer response analysis provides mechanistic insight
These questions point toward open research directions, which is honest science.
| Depth | Plain Net | ResNet | Improvement |
|---|---|---|---|
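| 20 layers | Not tabulated (see Figure 6, left) | 8.75% test error | Both train without difficulty |
| 56 layers | Not tabulated; higher than plain-20 (Figure 6, left) | 6.97% test error | ResNet improves with depth, plain net degrades |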
| 110 layers | >60% train error (fails) | 6.43% test error | Succeeds where plain fails |
| 1202 layers | N/A | 7.93% test error | Works but overfits |
Residual connections solve the optimization problem that prevents training of very deep networks
The mechanism: Enabling each layer to learn small residual modifications rather than full transformations
Empirical validation: Layer response magnitudes confirm residual functions remain small
Limitations: Optimization is one problem; generalization is another. The section honestly addresses where residual learning ends and where other techniques are needed
Generalizability: The phenomenon appears across different datasets and scales, suggesting it's a fundamental property of deep neural network training
Our method has good generalization performance on other recognition tasks. Table 7 and 8 show the object detection basel...
Up to this point in the paper, the authors have demonstrated that ResNets work remarkably well for image classification—specifically on ImageNet. But a natural question arises: Does this improvement only apply to classification, or do the better representations learned by ResNets transfer to other vision tasks?
This section answers that question by showing that ResNets provide significant improvements on object detection—a fundamentally different problem from classification. Instead of just asking "what is in this image?", object detection asks "what objects are in this image, and where are they?" This is much harder computationally and requires richer learned representations.
The key insight: the 28% relative improvement on the COCO dataset (mentioned in the abstract) comes solely from using better feature representations. Everything else about the detection pipeline stays the same.
The authors use Faster R-CNN as their detection framework. You don't need to understand all details, but the key idea is:
The crucial point: Steps 2 and 3 depend entirely on the quality of the learned representations from the deep network used in step 2.
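The authors make a controlled comparison: the same Faster R-CNN pipeline is run twice, once with VGG-16 as the backbone and once with ResNet-101.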
Critical detail: Everything else is kept identical. The detection pipeline, hyperparameters, training procedure—all the same. This means any performance difference comes purely from the network architecture and the representations it learns.
The authors evaluate on three benchmark datasets: PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO.
The standard metric for object detection is mean Average Precision (mAP). While we won't derive it in detail, here's the intuition:
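The notation mAP@[.5, .95] means mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05, while mAP@.5 uses the single threshold IoU = 0.5.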
The paper states:
"we obtain a 6.0% increase in COCO's standard metric (mAP@[.5, .95]), which is a 28% relative improvement"
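Let's parse this mathematically. If the baseline (VGG-16) achieves mAP = $m_{\text{VGG}}$ and ResNet-101 achieves mAP = $m_{\text{ResNet}}$:
$$\text{Absolute improvement} = m_{\text{ResNet}} - m_{\text{VGG}} = 6.0\%$$
$$\text{Relative improvement} = \frac{m_{\text{ResNet}} - m_{\text{VGG}}}{m_{\text{VGG}}} = 28\%$$
From these two equations, we can solve for the baseline:
$$m_{\text{VGG}} = \frac{6.0\%}{0.28} \approx 21.4\%, \qquad m_{\text{ResNet}} \approx 21.4\% + 6.0\% = 27.4\%$$
So the VGG-16 baseline achieved roughly 21.4% mAP, and ResNet-101 improved it to roughly 27.4% mAP on COCO. (The paper's Table 8 reports 21.2% and 27.2%; the small gap comes from the rounded 28% figure.)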
Why is the relative improvement so large (28%) while the absolute improvement is modest (6%)? Because object detection is harder than classification—the baseline accuracies are lower. A 6-point improvement on a 21% baseline is proportionally much larger than a similar 6-point improvement would be on, say, a 90% classification accuracy.
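The arithmetic above is simple enough to check in a few lines. The function names are illustrative; the 21.2% and 27.2% figures used for the cross-check are the COCO mAP@[.5, .95] values reported in the paper's Table 8.

```python
def baseline_from_gains(absolute_gain, relative_gain):
    """Solve (improved - baseline) / baseline = relative_gain for the baseline."""
    return absolute_gain / relative_gain


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


estimated_vgg = baseline_from_gains(6.0, 0.28)  # about 21.4
reported_relative = relative_gain(21.2, 27.2)   # about 0.283
```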
The learned representations from ResNets are universally better, not just for classification. Here's why this is significant:
These are fundamentally different tasks with different computational requirements. Yet the same backbone network (the feature extractor) helps both.
This suggests that ResNets learn more robust, general-purpose visual features that capture object structure, textures, spatial relationships, etc. at multiple scales—the same features useful for detecting objects as for classifying them.
This demonstration of transfer learning is crucial for the field:
Recall from Section 4.1 and 4.2:
These properties—learning through small, controlled modifications—apparently produce features that are particularly valuable for detecting where objects are and what they are.
| Aspect | Details |
|---|---|
| Main Question | Do better ImageNet networks help with object detection? |
| Experiment | Same Faster R-CNN pipeline with different backbones (VGG-16 vs ResNet-101) |
| Result | 6.0% absolute improvement (28% relative) on COCO mAP@[.5, .95] |
| Interpretation | ResNet representations are more useful for downstream vision tasks |
| Implication | Better backbone networks → better performance on diverse tasks |
This section provides empirical validation that ResNets don't just win ImageNet—they provide more useful features for the entire computer vision ecosystem.
In this section we introduce our detection method based on the baseline Faster R-CNN [32] system. The models are initial...
The core innovation of the ResNet paper is showing that residual networks can be trained much deeper than traditional CNNs. However, the authors need to demonstrate that this benefit transfers beyond image classification to other important computer vision tasks. This section shows how ResNets can be adapted for object detection—a much harder problem than classification.
Object detection requires not just identifying what is in an image, but also where it is. The key insight here is that better feature representations (learned by deeper ResNets) can dramatically improve detection performance.
VGG-16 (the previous state-of-the-art backbone network) has hidden fully connected (fc) layers at the end, which process global image information. ResNet-101 has no hidden fc layers. This creates an architectural mismatch that the authors must address for object detection.
Key Idea: Instead of converting conv features to fully connected layers, keep everything convolutional and let subsequent detection networks adapt to this structure.
Implementation Details:
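The authors use layers with stride ≤ 16 pixels on the original image to generate shared feature maps: conv1, conv2_x, conv3_x, and conv4_x, 91 conv layers in total in ResNet-101.
This matches VGG-16's approach of using 13 conv layers to produce 16-pixel stride feature maps, allowing a fair comparison.
For each proposed object region: RoI pooling is performed before conv5_1, and all layers of conv5_x and up are applied to the pooled feature, playing the role of VGG-16's fc layers.
The final output has two parallel branches instead of one: the final classification layer is replaced by two sibling layers, one for classification and one for box regression.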
When you train on ImageNet (a large, diverse dataset) and then fine-tune on detection data (typically smaller datasets like COCO), the statistics that Batch Normalization learned might not be appropriate anymore.
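Batch Normalization recap: For each layer $l$ with input $x$, BN computes:
$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
where:
$\mu_B$ and $\sigma_B^2$ are the mean and variance of $x$, and $\epsilon$ is a small constant for numerical stability.
Then it scales and shifts:
$$y = \gamma \hat{x} + \beta$$
where $\gamma$ and $\beta$ are learned parameters.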
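Rather than continue updating BN statistics during fine-tuning, they compute the BN statistics (means and variances) of each layer on the ImageNet training set and then keep the BN layers fixed during fine-tuning, so each BN layer becomes a linear activation with a constant offset and scale.
Why? This reduces memory consumption during Faster R-CNN training, which is computationally expensive. It might seem suboptimal not to update BN statistics, but the detection results reported in the paper were all obtained with the BN layers frozen in this way.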
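Once the statistics are fixed, a BN layer is just a per-channel affine map, which the following sketch makes explicit by folding the four constants into one scale and one shift. The function names and the default `eps` are illustrative assumptions.

```python
import math


def fold_frozen_bn(mean, var, gamma, beta, eps=1e-5):
    """Collapse frozen BN into y = scale * x + shift."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - mean * scale
    return scale, shift


def frozen_bn(x, mean, var, gamma, beta, eps=1e-5):
    """Reference form: normalize with fixed statistics, then scale and shift."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


scale, shift = fold_frozen_bn(mean=1.0, var=4.0, gamma=2.0, beta=0.5)
```

Because `scale` and `shift` are constants, no batch statistics need to be computed or stored while fine-tuning, which is consistent with the memory argument above.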
Dataset Setup:
Key Metric: Mean Average Precision (mAP) — a standard detection metric that averages precision across all object categories
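Results from Table 7: ResNet-101 improves mAP by more than 3 points over VGG-16 on both the PASCAL VOC 2007 and 2012 test sets.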
Interpretation: Deeper, residual-learning-based features are more discriminative for localization and classification tasks.
MS COCO is significantly more challenging than PASCAL VOC:
The authors report two evaluation metrics:
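mAP@IoU=0.5: The PASCAL VOC metric. A detection counts as correct if its Intersection-over-Union (IoU) with a ground-truth box is at least 0.5, a relatively lenient threshold
mAP@[0.5:0.05:0.95]: The "standard COCO metric", averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which rewards precise localization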
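To make the two metrics concrete, here is a minimal IoU function and the list of ten COCO thresholds. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and `thresholds_passed` is an illustrative helper showing how a single detection fares at each threshold; a full mAP computation also involves ranking detections and averaging precision over recall.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# 0.50, 0.55, ..., 0.95 (rounded to avoid floating-point drift)
COCO_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, gt):
    """IoU thresholds at which `pred` would count as matching `gt`."""
    overlap = iou(pred, gt)
    return [t for t in COCO_THRESHOLDS if overlap >= t]


passed = thresholds_passed((0, 0, 10, 10), (2, 0, 12, 10))  # IoU = 80 / 120
```

A box that is slightly offset still counts at IoU 0.5 but fails the tighter thresholds, which is why the averaged metric is the stricter of the two.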
| Metric | ResNet-101 | VGG-16 | Improvement |
|---|---|---|---|
| mAP@0.5 | 48.4% | 41.5% | +6.9% |
| mAP@[0.5:0.95] | 27.2% | 21.2% | +6.0% |
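The absolute improvements are nearly identical across both metrics: +6.9 points at mAP@0.5 and +6.0 points at mAP@[0.5:0.95].
Why is this significant?
Usually when you improve detection, you get bigger gains in the lenient metric than the strict metric. Getting nearly equal improvements across strict and lenient metrics means the better features help place boxes precisely, not just find the right objects.
This is crucial evidence that deeper networks learn representations that help with both recognition (what the object is) and localization (where it is).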
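To handle the computational demands of training on 80k images: an 8-GPU implementation is used, so the RPN step has a mini-batch of 8 images (1 per GPU) and the Fast R-CNN step a mini-batch of 16 images. Both steps are trained for 240k iterations at a learning rate of 0.001 and then for 80k iterations at 0.0001.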
This section proves a crucial principle in deep learning:
Good features learned on one task (ImageNet classification) transfer well to another task (object detection), provided the backbone network is powerful enough.
The 28% relative improvement in the strictest COCO metric demonstrates that ResNets:
This discovery was transformative for computer vision—ResNets became the backbone of choice for nearly every downstream task.
For completeness, we report the improvements made for the competitions. These improvements are based on deep features an...
This section describes additional techniques the authors applied to improve their object detection results for competition submissions. These are practical engineering improvements built on top of the ResNet-based detection framework described in Appendix A. Think of this as the "secret sauce" — while the base ResNet architecture provides better features, these techniques show how to squeeze even more performance out of those features through clever post-processing and testing strategies.
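The improvements fall into several categories: box refinement, global context, multi-scale testing, using validation data for training, and ensembling.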
Let me walk through each one.
In object detection, you need two things: what is the object (classification) and where is the object (localization via bounding boxes). The initial detection system produces a "regressed box" — the model's best guess at where an object is located.
The key insight: If you're wrong about where an object is, use that wrong answer to get a better answer.
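Let's say the model produces a regressed box and a classification score for each region proposal.
Box refinement applies this process iteratively: a new feature is pooled from the regressed box, which yields a new classification score and a new regressed box, and these 300 new predictions are combined with the original 300.
Non-Maximum Suppression removes duplicate detections: it is applied to the union of predicted boxes with an IoU threshold of 0.3, followed by box voting.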
Result: ~2 percentage points improvement in mAP (mean Average Precision).
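The duplicate-removal step can be sketched as greedy non-maximum suppression. This is the textbook greedy variant, shown under the assumption of `(x1, y1, x2, y2)` boxes and the 0.3 IoU threshold mentioned in the paper; box voting and the re-pooling step are not included.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Keep the highest-scoring boxes, dropping any box that overlaps an
    already kept box by more than the threshold. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Original and refined predictions would be concatenated before this call.
kept = nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)], [0.9, 0.8, 0.7])
```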
Each region proposal is processed independently to extract features. But objects exist in context — what objects are nearby? What does the whole scene look like? This technique adds a "global view" to each local decision.
In the Fast R-CNN module (which processes each region):
Step 1: Extract global feature
What does this mean? Rather than pooling features from a small region, you pool using the entire image's bounding box as the region of interest. This gives you a "summary" feature vector of the whole scene.
Step 2: Combine local and global
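For each region proposal, you now have a local feature $f_{\text{RoI}}$ pooled from the region and a global feature $f_{\text{global}}$ pooled from the whole image.
Concatenate them:
$$f = [f_{\text{RoI}}, f_{\text{global}}]$$
where $[\cdot, \cdot]$ denotes concatenation (stacking vectors end-to-end).
Step 3: Make predictions
Feed $f$ through the classification and box regression heads to get final predictions.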
The entire system is trained end-to-end, so the network learns what global context matters.
Result: ~1 percentage point improvement in mAP@0.5 (the more lenient metric).
Objects appear at different sizes in images. A person might be 50 pixels tall in one image and 500 pixels in another. By testing at multiple image scales, you have a better chance of catching objects at their natural size.
In the standard approach (from Appendix A), the image's shorter side is rescaled to exactly $s = 600$ pixels.
Multi-scale testing instead uses an image pyramid: the shorter side is resized to $s \in \{200, 400, 600, 800, 1000\}$.
Each version has the same aspect ratio, just different sizes.
Compute feature maps using the ResNet backbone for each scale
Select adjacent pairs of scales (e.g., 600 and 800), following a technique from prior work
Pool RoI features from both scales
Merge the features from both scales using maxout: For each feature dimension $j$, take the maximum value across the two scales $s_1$ and $s_2$:
$$f_j = \max\left(f_j^{(s_1)}, f_j^{(s_2)}\right)$$
This is intuitive: at different scales, different features become prominent, so taking the max captures useful information from both.
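A minimal sketch of the merge step, treating each pooled RoI feature as a flat list of numbers. The pyramid scales are those given in the paper; the function names and the toy feature values are illustrative, and how the adjacent pair is chosen for a given region is left out.

```python
PYRAMID_SCALES = (200, 400, 600, 800, 1000)  # shorter-side lengths in pixels


def adjacent_scale_pairs(scales=PYRAMID_SCALES):
    """All pairs of neighbouring scales in the pyramid."""
    return list(zip(scales, scales[1:]))


def maxout_merge(feat_a, feat_b):
    """Element-wise maximum of two feature vectors of equal length."""
    return [max(a, b) for a, b in zip(feat_a, feat_b)]


pairs = adjacent_scale_pairs()
merged = maxout_merge([1.0, 5.0, 2.0], [3.0, 4.0, 2.0])
```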
Why not multi-scale training? The authors state that they did not perform it because of limited time. They also applied multi-scale testing only to the Fast R-CNN step, not yet to the RPN step.
Result: ~2 percentage points improvement in mAP.
When you have two datasets with similar tasks:
Instead of only using the 80k train set, combine it with the 40k validation set (80k + 40k = 120k total) for training, then evaluate on test-dev.
This is valid because:
Results with this approach:
These become the baseline for further improvements (box refinement, context, multi-scale).
Different neural networks, even with the same architecture, will converge to different local optima due to random initialization and stochastic training. Their errors are partially independent, so averaging them reduces error.
Faster R-CNN has two stages that can each be ensembled:
Stage 1: Region Proposal Network (RPN)
Stage 2: Detection on Regions
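Mathematically, for region $r$:
$$s(r) = \frac{1}{K} \sum_{k=1}^{K} s_k(r)$$
where $s_k(r)$ is the confidence score from network $k$ and $K$ is the number of networks.
Box coordinates are similarly averaged:
$$b(r) = \frac{1}{K} \sum_{k=1}^{K} b_k(r)$$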
Results with ensemble of 3 networks: 59.0% mAP@0.5 and 37.4% mAP@[0.5:0.95] on the test-dev set.
This ensemble won 1st place in COCO 2015 detection.
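The averaging described above can be sketched for a single region as follows. The function name and the toy scores and boxes are illustrative; each network is assumed to output one confidence score and one `(x1, y1, x2, y2)` box for the same region.

```python
def ensemble_region(scores, boxes):
    """Average the per-network scores and box coordinates for one region."""
    k = len(scores)
    avg_score = sum(scores) / k
    avg_box = tuple(sum(coords) / k for coords in zip(*boxes))
    return avg_score, avg_box


score, box = ensemble_region(
    scores=[0.9, 0.7, 0.8],
    boxes=[(10, 10, 50, 50), (12, 8, 52, 48), (11, 12, 51, 52)],
)
```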
The authors take the trained COCO model (which saw 80k+40k images) and fine-tune it on other datasets:
PASCAL VOC:
ImageNet Detection (200 object categories):
The key insight: Better base features (from ResNet) + well-chosen inference-time techniques (ensemble, multi-scale testing, box refinement) = state-of-the-art across multiple datasets.
| Technique | Improvement | Mechanism |
|---|---|---|
| Box refinement | ~2 points | Iterate: use regressed box to get better prediction |
| Global context | ~1 point | Concatenate full-image features with region features |
| Multi-scale testing | ~2 points | Test at 5 scales, merge features using maxout |
| Ensemble (3 models) | ~3-4 points | Average scores and boxes from 3 networks |
| Combined effect | ~8 points total | All techniques stacked together |
The cascading improvements show that good representation learning (ResNet) combined with smart inference techniques produces exceptional results.
The ImageNet Localization (LOC) task [36] requires to classify and localize the objects. Following [40, 41], we assume t...
The ImageNet Localization task is different from pure classification. It's not just asking "what object is in this image?" but also "where exactly is it?" The task requires the network to classify the object in the image and to localize it with a bounding box.
This section explains how the ResNet framework adapted from object detection methods can dramatically outperform previous approaches (like VGG) on this combined classification-localization task. The key innovation is using a per-class RPN (Region Proposal Network) that learns class-specific bounding box regressors rather than using a single generic proposal mechanism.
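The ImageNet Localization task assumes a two-stage pipeline: image-level classifiers first predict the class labels of an image, and the localization algorithm then only predicts bounding boxes for the predicted classes.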
This is formalized by the per-class regression (PCR) strategy: for each of the 1000 classes, train a separate bounding box regressor. This makes sense intuitively—a cat might typically appear in different image positions or sizes than a car, so learning class-specific patterns helps.
In the previous object detection section, the RPN was category-agnostic—it generated proposals without knowing what object class they contained. Here, we make it category-aware by making the RPN per-class.
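Before explaining the modification, recall that a standard RPN ends with two sibling 1×1 convolutional layers: one for binary classification (cls) and one for box regression (reg).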
The authors replace these with per-class versions:
Classification layer output dimension: Instead of binary (object/no-object), we have a 1000-dimensional output vector:
$$\text{cls} \in \mathbb{R}^{1000}$$
where each dimension performs binary logistic regression to predict "is this object class or not?"
Regression layer output dimension: Instead of 4 coordinates per anchor, we have:
$$\text{reg} \in \mathbb{R}^{1000 \times 4}$$
This contains $1000 \times 4 = 4000$ values total. For each of the 1000 classes, we learn a separate 4-dimensional bounding box regressor (change in x-position, y-position, width, and height relative to an anchor box).
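The output shapes described above can be illustrated with a minimal NumPy sketch. It treats the two heads as plain matrix multiplications on a single anchor's feature vector. The feature width `C`, the random weights, and the function name `per_class_heads` are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np

NUM_CLASSES = 1000  # ImageNet LOC classes, as stated in the text
BOX_DIMS = 4        # one (x, y, width, height) offset vector per class


def per_class_heads(features, w_cls, w_reg):
    """Apply per-class cls and reg heads to one anchor's features.

    features: (C,) feature vector for one anchor (C is hypothetical)
    w_cls:    (C, NUM_CLASSES) classification weights
    w_reg:    (C, NUM_CLASSES * BOX_DIMS) regression weights
    """
    logits = features @ w_cls
    # Independent sigmoid per class: 1000 binary logistic regressions,
    # not a softmax over classes.
    probs = 1.0 / (1.0 + np.exp(-logits))
    # One 4-d box regressor per class.
    deltas = (features @ w_reg).reshape(NUM_CLASSES, BOX_DIMS)
    return probs, deltas


rng = np.random.default_rng(0)
C = 16  # hypothetical feature width
features = rng.normal(size=C)
probs, deltas = per_class_heads(
    features,
    rng.normal(size=(C, NUM_CLASSES)),
    rng.normal(size=(C, NUM_CLASSES * BOX_DIMS)),
)
print(probs.shape, deltas.shape, deltas.size)
```

Because each class gets its own sigmoid, the class scores do not need to sum to one, which is what distinguishes this head from an ordinary softmax classifier.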
The bounding box regression is defined relative to anchor boxes. At each spatial location in the feature map, there are multiple predefined "anchor" boxes of different aspect ratios and scales. These are translation-invariant templates. For each anchor, the regression outputs predict four offsets (shifts of the box center in x and y, and changes in width and height), which are learned so that the adjusted anchor matches the ground truth.
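As an illustration of anchor-relative regression, the sketch below uses the parameterization from Faster R-CNN (center shifts scaled by anchor size, log-scale width and height). This section does not spell out the exact formula, so treat that choice as an assumption; the function names `encode_box` and `decode_box` are hypothetical.

```python
import numpy as np


def encode_box(anchor, box):
    """Compute regression targets for `box` relative to `anchor`.

    Both boxes are (x_center, y_center, width, height).
    Assumes the Faster R-CNN parameterization.
    """
    xa, ya, wa, ha = anchor
    x, y, w, h = box
    return np.array([
        (x - xa) / wa,   # horizontal shift in units of anchor width
        (y - ya) / ha,   # vertical shift in units of anchor height
        np.log(w / wa),  # log-scale change in width
        np.log(h / ha),  # log-scale change in height
    ])


def decode_box(anchor, deltas):
    """Apply predicted offsets to an anchor to recover a box."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = deltas
    return np.array([
        xa + tx * wa,
        ya + ty * ha,
        wa * np.exp(tw),
        ha * np.exp(th),
    ])


anchor = (100.0, 100.0, 64.0, 128.0)
gt_box = (110.0, 90.0, 80.0, 100.0)
targets = encode_box(anchor, gt_box)
recovered = decode_box(anchor, targets)
```

Encoding and decoding are inverses, so a regressor that predicts the targets exactly recovers the ground-truth box, and all-zero offsets leave the anchor unchanged.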
As in ImageNet classification (Section 3.4), the network is trained with random 224×224 crops for data augmentation, using a mini-batch size of 256 images for fine-tuning.
A critical practical issue: negative samples (non-objects) vastly outnumber positive samples (actual objects).
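The balancing strategy: 8 anchors are randomly sampled for each image, with positive and negative anchors in a 1:1 ratio.
This means each image contributes 4 positive and 4 negative anchors to the loss, which prevents the loss from being dominated by easy negative examples.
At test time the network is applied fully convolutionally across the entire image (not just at the center), so predictions are produced at every spatial position of the feature map rather than for a single crop.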
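The balanced sampling can be sketched as follows. The helper name `sample_anchors` and the label encoding (1 for positive, 0 for negative) are hypothetical. Filling the remaining slots with negatives when an image has fewer than 4 positives is an assumption borrowed from common RPN practice, not something this section states.

```python
import numpy as np


def sample_anchors(labels, num_samples=8, rng=None):
    """Sample anchors for one image with a 1:1 positive/negative ratio.

    labels: 1-D array, 1 = positive anchor, 0 = negative anchor.
    Returns (positive_indices, negative_indices).
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)

    # Aim for half positives; if too few exist, fill with negatives
    # (an assumption, see the lead-in).
    n_pos = min(num_samples // 2, len(pos))
    n_neg = min(num_samples - n_pos, len(neg))

    pos_idx = rng.choice(pos, size=n_pos, replace=False)
    neg_idx = rng.choice(neg, size=n_neg, replace=False)
    return pos_idx, neg_idx


# Example: 10 positives among 1000 anchors.
labels = np.zeros(1000, dtype=int)
labels[:10] = 1
pos_idx, neg_idx = sample_anchors(labels, rng=np.random.default_rng(0))
```

Without this step, a uniform random draw of 8 anchors from the example above would almost always contain only negatives.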
The paper compares methods when given the true class label beforehand:
VGG-16 (center-crop evaluation): 33.1% error. ResNet-101 (center-crop evaluation): 13.3% error.
This is a dramatic 19.8 percentage point improvement! The reduction from 33.1% to 13.3% is a relative improvement of:
$$\frac{33.1 - 13.3}{33.1} \approx 59.8\%$$
Why is ResNet so much better here? The deeper residual architecture learns better spatial features for precise localization. The skip connections allow gradients to flow more effectively for fine-grained bounding box prediction.
With dense and multi-scale testing: 11.7% error (still using ground-truth classes).
When the network must predict its own class (with 4.6% top-5 classification error from Table 4), the top-5 localization error rises to 14.4%. The small increase (2.7 points) shows the architecture is robust to classification errors.
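A subtle but important observation: ImageNet Localization images typically contain one dominant object (unlike general detection datasets). This creates a problem for Fast R-CNN: the proposal regions overlap heavily and therefore have very similar RoI-pooled features, so its image-centric training produces samples with little variation, which is undesirable for stochastic training.
Instead, they use R-CNN, which is RoI-centric: each proposal is cropped from the image, warped to 224×224 pixels, and fed through the classification network on its own. The network ends in two sibling fully connected layers for cls and reg, again in per-class form.
Training procedure: the per-class RPN is run on the training images to predict boxes for the ground-truth class. These boxes serve as class-dependent proposals, and the 200 highest-scoring proposals per image are used as training samples. The R-CNN network is fine-tuned with a mini-batch size of 256 in RoI-centric fashion.
Testing procedure: the RPN generates the 200 highest-scoring proposals for each predicted class, and the R-CNN network updates the scores and box positions of these proposals.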
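The proposal-selection step of this refinement stage (the paper keeps the 200 highest-scoring RPN proposals) can be sketched with NumPy. The function name `top_k_proposals` and the `(x1, y1, x2, y2)` box layout are hypothetical, and the sketch covers only the selection, not the cropping, warping, or R-CNN rescoring.

```python
import numpy as np


def top_k_proposals(boxes, scores, k=200):
    """Keep the k highest-scoring proposals, sorted by descending score.

    boxes:  (N, 4) array of proposal boxes
    scores: (N,) array of proposal scores for one class
    """
    boxes = np.asarray(boxes)
    scores = np.asarray(scores)
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # indices of the k largest scores
    return boxes[order], scores[order]


rng = np.random.default_rng(0)
all_boxes = rng.uniform(0, 224, size=(500, 4))
all_scores = rng.uniform(size=500)
kept_boxes, kept_scores = top_k_proposals(all_boxes, all_scores)
```

Each kept box would then be cropped, warped to 224×224, and passed to the R-CNN network, which outputs an updated score and box position.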
This additional refinement stage reduces top-5 localization error from 14.4% to 10.6% (single model on validation). An ensemble of networks for both classification and localization reaches 9.0% on the test set.
| Configuration | Top-5 Error |
|---|---|
| ResNet-101 (validation, single model) | 10.6% |
| ResNet ensemble (test set) | 9.0% |
| ILSVRC 2014 best (VGG, test set) | 25.3% |
Relative improvement:
$$\frac{25.3 - 9.0}{25.3} \approx 64\%$$
This matches the 64% relative reduction that the paper reports against the ILSVRC 2014 results. The point is: this is a massive jump in performance.
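The ResNet localization success demonstrates that the deep residual representations learned for classification transfer well to localization, not just to classification and detection.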
This versatility helped ResNets dominate ILSVRC 2015 across multiple competition tracks.
| Concept | Explanation |
|---|---|
| Per-class RPN | Instead of generic proposals, learn 1000 class-specific bounding box regressors |
| Anchor boxes | Use predefined templates; learn offsets relative to these templates |
| Anchor sampling | Maintain 1:1 positive/negative ratio to avoid imbalance |
| Fully convolutional testing | Process entire image; don't crop at test time |
| R-CNN refinement | Additional stage that handles single-object ImageNet setting better than Fast R-CNN |
| Why ResNet wins | Deeper architecture + skip connections = better spatial features for localization |
The key insight is that localization is fundamentally a feature learning problem—better features (from deeper networks) lead to better bounding box predictions.