Deep Residual Learning for Image Recognition

Kaiming He; Xiangyu Zhang; Shaoqing Ren; Jian Sun

Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

Deep Residual Learning: Understanding the Abstract

The Big Picture

This abstract introduces one of the most impactful papers in deep learning—a paper that fundamentally changed how we think about building very deep neural networks. The core insight is deceptively simple: instead of training networks to learn functions directly, train them to learn the difference (residual) between the input and desired output. This seemingly small change makes it possible to train networks that are 8× deeper than what was previously feasible, without getting worse performance.

Let me break down what makes this paper revolutionary and why it matters.

The Problem: Why Deep Networks Are Hard to Train

The abstract starts with a crucial observation:

"Deeper neural networks are more difficult to train."

To understand this, we need to think about what happens during training. When we train a neural network, we adjust the weights using backpropagation. The key mathematical operation is computing gradients using the chain rule:

\frac{\partial \mathcal{L}}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial a_n} \cdot \frac{\partial a_n}{\partial a_{n-1}} \cdots \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial w_1}

where $\mathcal{L}$ is the loss (a measure of how wrong our predictions are), $w_i$ represents the weights in layer $i$, $a_i$ is the output (activation) of layer $i$, and $n$ is the number of layers.

The Vanishing Gradient Problem:

When networks are very deep, we multiply many of these partial derivatives together
If each factor has magnitude less than 1, multiplying many of them together drives the product toward zero: $(0.5) \times (0.5) \times (0.5) \times \cdots \approx 0$
When gradients become extremely small (vanish), early layers barely update, and training stalls
Conversely, if gradients are greater than 1, they can explode, causing instability

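Vanishing and exploding gradients were a long-standing obstacle to training deep networks. As the paper's Introduction notes, normalized initialization and batch normalization had largely addressed this by 2015, yet plain networks deeper than a few tens of layers still trained poorly. That remaining obstacle is the degradation problem that residual learning targets.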
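To make the multiplication effect above concrete, here is a minimal sketch that treats every layer's local derivative as the same scalar. The function name `chained_gradient` and the factors 0.5 and 1.5 are illustrative assumptions, not quantities from the paper.

```python
def chained_gradient(factor: float, depth: int) -> float:
    """Multiply `depth` identical local derivatives (a toy scalar chain rule)."""
    grad = 1.0
    for _ in range(depth):
        grad *= factor
    return grad


# Factors below 1 shrink the product; factors above 1 blow it up.
for depth in (5, 20, 50):
    print(depth, chained_gradient(0.5, depth), chained_gradient(1.5, depth))
```

Real networks have a different local derivative at every layer, so this only shows the trend rather than predicting actual gradient magnitudes.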

The Solution: Residual Learning Framework

Here's where the authors' key insight comes in:

"We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions."

Traditional Neural Network Layer

Traditionally, a layer learns to map an input directly to an output. If we denote the input as $x$ and the layer's function (the weights and computations) as $F(x)$ , the output is simply:

y = F(x)

Residual Learning: A Small but Powerful Change

Instead, residual networks learn the difference (residual) between input and output:

y = x + F(x)

This is called a "skip connection" or "identity connection"—we literally add the original input $x$ back to the learned transformation $F(x)$ .

Why does this matter mathematically?

When we backpropagate through this layer, the gradient becomes:

\frac{\partial y}{\partial x} = \frac{\partial (x + F(x))}{\partial x} = 1 + \frac{\partial F(x)}{\partial x}

Notice the crucial "+1" term. Even if $\frac{\partial F(x)}{\partial x}$ is very small (vanishing), the identity term lets the upstream gradient pass straight through the skip connection. This gives gradients a direct path to early layers, which makes them far less likely to vanish in very deep networks.
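The effect of the "+1" term can be sketched with a toy scalar stack. The example assumes a hypothetical residual function $F(x) = w \tanh(x)$ with a small positive weight $w$; the name `stack_gradient` and the values used are illustrative only.

```python
import math


def stack_gradient(x: float, w: float, depth: int, residual: bool) -> float:
    """Derivative of the stack output with respect to its input, via the chain rule.

    Each layer applies F(x) = w * tanh(x); with residual=True the layer
    output is x + F(x) instead of F(x).
    """
    grad = 1.0
    for _ in range(depth):
        local = w * (1.0 - math.tanh(x) ** 2)  # dF/dx for F(x) = w * tanh(x)
        if residual:
            grad *= 1.0 + local  # identity path contributes the "+1"
            x = x + w * math.tanh(x)
        else:
            grad *= local
            x = w * math.tanh(x)
    return grad


print("plain   :", stack_gradient(0.5, 0.1, 20, residual=False))
print("residual:", stack_gradient(0.5, 0.1, 20, residual=True))
```

With a positive `w`, every residual factor is at least 1, so the product cannot shrink. That is a property of this toy setup. In general $\partial F / \partial x$ can be negative, so the shortcut provides a direct path rather than a strict lower bound.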

Intuitive analogy: Instead of asking "what should the output be?", we ask "what small adjustment should we make to the input?" This is often an easier learning problem—the network can just learn small refinements rather than learning everything from scratch.

Empirical Results and Validation

The abstract supports the theoretical benefits with extensive experimental evidence:

ImageNet Results

152-layer ResNet: The authors trained networks with 152 layers—8× deeper than VGG networks (which used ~19 layers)
Lower complexity: Despite being much deeper, the 152-layer ResNet still has lower computational complexity than VGG nets
3.57% top-5 error: On the ImageNet test set (which has 1000 categories), an ensemble of residual nets achieved this state-of-the-art result
Won 1st place in the ILSVRC 2015 classification competition

CIFAR-10 Results

They tested even more extreme depths: networks with 100 and 1000 layers
These ultra-deep networks remained trainable, which is evidence that residual learning eases optimization even at extreme depth

Other Tasks

COCO object detection: A 28% relative improvement just from using deeper representations
Multiple 1st place finishes: ImageNet detection, localization, COCO detection, and segmentation

Key Technical Insights Presented

1. Depth Enables Better Representations

"The depth of representations is of central importance for many visual recognition tasks."

This statement encapsulates a fundamental principle: deeper networks can learn hierarchical features. Early layers learn simple patterns (edges, textures), middle layers combine these into shapes and object parts, and deep layers recognize complex objects. By making it possible to train very deep networks, ResNets enable much richer feature hierarchies.

2. Easier Optimization

"These residual networks are easier to optimize, and can gain accuracy from considerably increased depth."

Before residual learning, adding more layers would often decrease accuracy—the network was too hard to train. ResNets inverted this: more layers → better performance (up to a point). This is because:

Gradients flow more reliably through skip connections
The network can always choose to learn identity (by setting $F(x) = 0$ ) if that layer isn't useful
The learning problem is decomposed into "what small change should I make?" rather than "what should the output be?"

Why This Matters (Conceptual Summary)

Aspect	Before ResNets	With ResNets
Practical depth	Roughly 16-30 layers	100+ layers on ImageNet; 1000+ explored on CIFAR-10
Training difficulty	Training error rises as plain networks get deeper	Very deep networks remain easy to optimize
Gradient flow	Must pass through every stacked layer	Also has a direct path through skip connections
Effect of added depth	More layers can hurt accuracy	More layers can improve accuracy

Mathematical Notation Reference

$x$ : Input to a layer
$F(x)$ : The learned transformation (the neural network weights applied to input)
$y = x + F(x)$ : The residual block output
$\mathcal{L}$ : Loss function (measures prediction error)
$\frac{\partial}{\partial x}$ : Partial derivative with respect to $x$ (used in backpropagation)
3.57% error: This is a top-5 error on the 1000-category ImageNet test set, meaning the correct label is missing from the ensemble's five highest-scoring predictions for only 3.57% of test images
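The 3.57% figure is a top-5 error (the Introduction reports it as such). As a rough sketch of how such a metric is computed, the function below counts an example as wrong only when the true label is absent from the k highest-scoring classes. The name `top_k_error` and the tiny score lists are made up for illustration.

```python
def top_k_error(scores_per_example, labels, k=5):
    """Fraction of examples whose true label is not among the k highest scores."""
    errors = 0
    for scores, label in zip(scores_per_example, labels):
        ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


# Two examples over four classes; the true class is 0 in both.
toy_scores = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
toy_labels = [0, 0]
print(top_k_error(toy_scores, toy_labels, k=2))
```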

Conclusion

The abstract summarizes a paradigm shift: by training networks to learn residuals (differences) instead of unreferenced functions, the authors ease the optimization difficulties that limited the depth of earlier networks. This one architectural change, the skip connection, enabled training of networks 8× deeper than VGG nets while improving accuracy. The empirical validation across multiple datasets and tasks demonstrates that this isn't just a theoretical improvement; it's a practical breakthrough that fundamentally expanded what deep neural networks could do.

1. Introduction

p.1

Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]....

Introduction to Residual Learning: A Deep Dive

Big Picture: What's This Section About?

This introduction tackles a fundamental problem that limited deep learning in 2015: deeper neural networks don't always work better, even though they should in theory. The authors present compelling evidence of this problem and then introduce their solution—residual learning with "shortcut connections"—that will allow training networks 8× deeper than previously possible while actually improving accuracy.

The section has three key parts:

The context: Why deep networks matter and what we expected
The problem: Why adding layers sometimes makes things worse
The solution: A clever reformulation that fixes this problem

Let me walk you through each carefully.

Part 1: Setting the Stage—Why Depth Matters

The authors begin with an established fact from computer vision:

"Deeper networks naturally integrate low/mid/high-level features and classifiers in an end-to-end multilayer fashion, and the 'levels' of features can be enriched by the number of stacked layers (depth)."

What does this mean in plain language?

Think of deep networks as having layers that progressively understand more complex patterns:

Layer 1-2: Detect edges and simple textures
Middle layers: Detect corners, simple shapes (e.g., "is this a wheel?")
Deep layers: Detect complex objects (e.g., "this is a car")

Each successive layer builds on the previous one. So intuitively, more layers = better features = better accuracy.

By 2015, this intuition was validated empirically. The state-of-the-art models on the ImageNet dataset (a massive image classification benchmark) were all "very deep"—around 16 to 30 layers. So the question seemed straightforward: Can we just keep stacking more layers?

Part 2: The Degradation Problem—Something's Wrong

Here's where things get interesting. The authors identify a surprising phenomenon that contradicts expectations:

[See Figure 1: The figure shows training and test error for 20-layer vs. 56-layer "plain" networks on CIFAR-10. Notice the deeper network has higher training error—this is the key surprise.]

The Vanishing Gradient Problem (Already Solved)

Before diving into the main problem, the authors acknowledge a previous obstacle: vanishing/exploding gradients.

Quick background: In neural networks, we train using backpropagation, which computes gradients (rates of change) of the loss with respect to each parameter. Mathematically, we need to compute:

\frac{\partial \mathcal{L}}{\partial w_i}

where $\mathcal{L}$ is the loss (error) and $w_i$ is a weight parameter.

When networks get deep, these gradients are computed by multiplying many partial derivatives together (via the chain rule). If each derivative is less than 1, multiplying many of them together can make the gradient exponentially small—essentially zero. This prevents the network from learning early layers.

Good news: By 2015, normalized initialization (smart weight initialization) and batch normalization had largely solved this. So this isn't the problem anymore.

The Real Problem: Degradation (Not Overfitting!)

Here comes the crucial insight. Despite solving vanishing gradients, deeper networks still fail—but not because they overfit (where training error is low but test error is high). Instead:

$\text{Training error (deeper)} > \text{Training error (shallower)}$

This is counterintuitive! The deeper network is worse even on the data it's training on.

Why is this paradoxical?

The authors present a logical argument by construction. Consider:

A shallower network that has learned parameters well
A deeper network made by adding extra layers to the shallower one

Now, here's the key: There exists a solution for the deeper network that should work just as well as the shallower one: Simply make the added layers perform identity mappings (outputs equal inputs), and copy the learned parameters from the shallower network.

Mathematically, if the shallower network computes $\mathcal{H}(\mathbf{x})$ (some function of input $\mathbf{x}$), then the deeper network with identity layers appended computes:

$\text{id}(\cdots \text{id}(\mathcal{H}(\mathbf{x})) \cdots) = \mathcal{H}(\mathbf{x})$

In other words, the extra layers could just be "pass-through" layers, leaving outputs unchanged while the earlier layers do all the work.

The mystery: If this solution is theoretically available, why can't standard optimization algorithms (like SGD with backpropagation) find it?

The conclusion: "Our current solvers are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time)."

In other words, the optimization problem is hard enough that gradient descent fails to reach a solution as good as the constructed one.

Part 3: The Solution – Residual Learning

This is where the paper's contribution enters. Rather than hoping layers will directly learn the target function, the authors propose learning the difference between the target and the input.

The Core Mathematical Insight

Let's denote:

$\mathbf{x}$ : the input to a stack of layers
$\mathcal{H}(\mathbf{x})$ : the desired/target mapping we want to learn
$\mathcal{F}(\mathbf{x})$ : a new function we'll actually learn (the "residual")

Instead of learning $\mathcal{H}(\mathbf{x})$ directly, define the residual mapping as:

$\mathcal{F}(\mathbf{x}) := \mathcal{H}(\mathbf{x}) - \mathbf{x}$

Then, the original mapping is rewritten as:

$\mathcal{H}(\mathbf{x}) = \mathcal{F}(\mathbf{x}) + \mathbf{x}$

Why does this help?

Here's the intuition: If the optimal mapping is close to the identity (i.e., $\mathcal{H}(\mathbf{x}) \approx \mathbf{x}$ ), then $\mathcal{F}(\mathbf{x}) \approx 0$ —it's easier to learn a mapping to zero than to learn an identity mapping!

Consider an extreme case:

Hard version: "Make these 10 nonlinear layers output exactly what came in" (identity mapping)—this requires precise calibration of nonlinearities to cancel out
Easy version: "Make these 10 nonlinear layers output zero" (residual mapping)—often just means learning to suppress signals, which is more straightforward

The authors' hypothesis: Even when identity mapping isn't optimal, learning the difference from identity is generally easier than learning the absolute mapping.
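A small numerical sketch can illustrate why a near-identity target leaves only a small residual to fit. The target $\mathcal{H}(x) = x + 0.05 \sin(x)$ is a hypothetical choice made for this example, not a function from the paper.

```python
import math


def target_mapping(x: float) -> float:
    """Hypothetical near-identity target H(x) = x + 0.05 * sin(x)."""
    return x + 0.05 * math.sin(x)


def residual_target(x: float) -> float:
    """Residual F(x) = H(x) - x that a residual block would need to fit."""
    return target_mapping(x) - x


samples = [i / 10.0 for i in range(-50, 51)]
max_abs_target = max(abs(target_mapping(x)) for x in samples)
max_abs_residual = max(abs(residual_target(x)) for x in samples)
print("largest |H(x)|:", max_abs_target)
print("largest |F(x)|:", max_abs_residual)
```

The residual stays within 0.05 of zero while the full target spans the whole input range, which is the sense in which the residual is a "small adjustment".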

Implementation: Shortcut Connections

The elegant part: this can be implemented simply with a "shortcut connection" (also called a skip connection) as shown in [Figure 2]:

$\text{Output} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$

where:

The top path (nonlinear layers) learns $\mathcal{F}(\mathbf{x})$
The bottom path (shortcut) is just $\mathbf{x}$ itself (identity)
These are added together

Key advantages:

No extra parameters: The shortcut connection doesn't learn anything; it's pure addition
Negligible extra computation: Just an element-wise addition of the two paths
Easy to implement: Standard deep learning libraries (like Caffe) support this without modifying solvers
Backpropagation works naturally: The gradient splits into two paths:
- Direct path through the shortcut: gradient = 1
- Path through the layers: gradient computed normally

Mathematically, when we backpropagate the loss, the gradient reaching $\mathbf{x}$ is:

$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \left(1 + \frac{\partial \mathcal{F}(\mathbf{x})}{\partial \mathbf{x}}\right), \quad \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$

The "1" term (the identity matrix for vector inputs) means that even if the gradient through $\mathcal{F}$ shrinks to zero, the upstream gradient $\partial \mathcal{L} / \partial \mathbf{y}$ still reaches $\mathbf{x}$ unchanged through the shortcut. The total is not guaranteed to exceed 1 in magnitude, since the $\mathcal{F}$ term can be negative, but the direct path helps prevent vanishing gradients.
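The two gradient paths can be sketched by hand for a block whose residual function is linear, $\mathcal{F}(\mathbf{x}) = W\mathbf{x}$. This is a simplifying assumption; the paper's blocks use several layers with nonlinearities. The helper names are hypothetical.

```python
def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def residual_backward(W, upstream):
    """dL/dx for y = x + W x, given upstream = dL/dy.

    The shortcut passes `upstream` through unchanged; the layer path
    contributes W^T @ upstream. The two are summed.
    """
    through_layers = matvec(transpose(W), upstream)
    return [g + t for g, t in zip(upstream, through_layers)]


upstream = [1.0, 2.0]
print(residual_backward([[0.0, 0.0], [0.0, 0.0]], upstream))   # shortcut only
print(residual_backward([[0.5, 0.0], [0.0, -0.5]], upstream))  # both paths
```

With zero weights the upstream gradient arrives intact. With the second matrix, one component grows and the other shrinks, which shows that the sum of the two paths is not bounded below by the shortcut term alone.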

Part 4: Claims and Results

The authors make four specific claims, then present evidence:

ResNets are easy to optimize: Residual networks can be trained successfully to very deep depths
Plain networks degrade: Standard stacked networks show higher training error with increased depth
ResNets improve with depth: Unlike plain networks, deeper ResNets continue improving, not degrading
Results generalize: These benefits appear across different datasets (ImageNet, CIFAR-10) and different tasks (detection, localization, segmentation)

Concrete achievements:

152-layer ResNet on ImageNet (compared to previous ~30 layers)
3.57% top-5 error on ImageNet (won 1st place in ILSVRC 2015)
Successfully trained models with 100+ layers on CIFAR-10
28% relative improvement on COCO detection (a downstream task)

Key Takeaway

The section builds a compelling case for a simple but powerful idea:

Learning to modify inputs slightly (via residual functions) is easier than learning to transform inputs completely (via absolute functions).

This insight, implemented through shortcut connections, allows networks to be trained much deeper than before while maintaining (and improving) accuracy. The mathematical reformulation $\mathcal{H}(\mathbf{x}) = \mathcal{F}(\mathbf{x}) + \mathbf{x}$ is elegant precisely because it's so simple, yet it unlocks the ability to train far deeper networks—the foundation of modern deep learning.

2. Related Work

p.2

Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with ...

Section 2: Related Work – Deep Dive Explanation

Big Picture: Why This Section Matters

Before the authors present their residual learning framework in detail, they need to establish context. This section does three important things:

Shows that residual representations aren't entirely new – they've been used successfully in other domains (image retrieval, computer graphics)
Explains why residual thinking works – by reviewing successful applications in other fields
Positions their contribution relative to existing work – especially comparing to "highway networks," a concurrent approach

The key insight: the authors are arguing that reformulating a problem in terms of residuals (differences from a baseline) makes optimization easier. This is a conceptual framework that appears in multiple domains, and they're now applying it to deep neural networks.

Part 1: Residual Representations in Image Recognition

The Core Idea

In previous work on image recognition, researchers discovered that encoding residual vectors (the differences between actual data and some reference point) works better than encoding the original data directly.

VLAD and Fisher Vectors

VLAD (Vector of Locally Aggregated Descriptors) works like this:

You have a dictionary $\mathcal{D} = \{d_1, d_2, \ldots, d_k\}$ of reference vectors (learned patterns)
For each input vector $\mathbf{x}$ , find the closest dictionary entry $d^*$
Instead of encoding $\mathbf{x}$ itself, encode the residual: $\mathbf{r} = \mathbf{x} - d^*$
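The per-vector step listed above can be sketched as follows. This covers only the nearest-entry lookup and the residual; a full VLAD pipeline would go on to aggregate these residuals, which is omitted here. The function names and the toy dictionary are assumptions for illustration.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_codeword(x, dictionary):
    """Return the dictionary entry d* closest to x in Euclidean distance."""
    return min(dictionary, key=lambda d: squared_distance(x, d))


def residual_encode(x, dictionary):
    """Encode x as its residual r = x - d* relative to the nearest entry."""
    d_star = nearest_codeword(x, dictionary)
    return [a - b for a, b in zip(x, d_star)]


toy_dictionary = [[0.0, 0.0], [10.0, 10.0]]
print(residual_encode([9.0, 11.0], toy_dictionary))
```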

Fisher Vectors extend this probabilistically: they encode residuals with respect to a learned Gaussian Mixture Model (GMM) rather than a fixed dictionary.

Why does this work? Think about it intuitively: if you're describing an image, saying "this object is slightly rotated from the standard position" (a residual) is often more informative than saying "the pixel intensities are [long list of numbers]" (the original data). The residual captures the meaningful deviation from a pattern.

Vector Quantization

In vector quantization (a compression technique), encoding residual vectors has been shown to be more effective than encoding the original vectors. This hints at a general principle: a residual formulation can be easier to work with than the original one.

Part 2: Residual Solutions in Scientific Computing

Multigrid Methods for PDEs

The authors now reference a completely different field: solving Partial Differential Equations (PDEs) using Multigrid methods.

The Problem: Solving PDEs directly is computationally expensive. For example, solving Poisson's equation $\nabla^2 u = f$ over a large domain with fine discretization is slow.

The Multigrid Solution:

Instead of solving the full system at once, Multigrid reformulates the problem at multiple scales:

Solve the PDE on a coarse grid (fewer unknowns) → get approximate solution $u_{\text{coarse}}$
On a fine grid, compute the residual of the coarse solution: $\mathbf{r} = \mathbf{f} - \mathcal{A}u_{\text{coarse}}$ (where $\mathcal{A}$ is the discretized differential operator)
Solve for how to correct the coarse solution by solving the residual problem on intermediate scales
Each level is responsible for capturing residuals at that particular scale
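The residual idea behind these steps can be sketched on a tiny linear system. This is a single residual-correction step with an exact 2×2 solve, not a multigrid solver: in Multigrid the correction would be computed on a coarser grid. The matrix, right-hand side, and starting guess are made-up values.

```python
def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def residual(A, u, f):
    """r = f - A u: what the current guess u still fails to explain."""
    return [fi - ai for fi, ai in zip(f, matvec(A, u))]


def solve_2x2(A, b):
    """Solve a 2x2 system exactly with Cramer's rule."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return [(b[0] * a22 - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det]


A = [[2.0, -1.0], [-1.0, 2.0]]
f = [1.0, 0.0]
u_approx = [0.5, 0.5]                   # rough initial guess

r = residual(A, u_approx, f)            # residual of the guess
e = solve_2x2(A, r)                     # correction solves A e = r
u_corrected = [u + de for u, de in zip(u_approx, e)]
print(u_corrected, residual(A, u_corrected, f))
```

Solving for the correction `e` rather than for `u` directly is the same move ResNets make: keep the current estimate and learn only what is missing.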

Key insight from the paper: These multilevel solvers converge much faster than standard solvers because they reformulate the problem to explicitly target what still needs to be learned (the residuals), rather than trying to learn everything from scratch.

Why mention this? The authors are drawing a conceptual parallel: just as Multigrid solvers work better when reformulated around residuals, neural networks might optimize better if their layers learn residual functions rather than trying to directly learn the complete transformation.

Part 3: Shortcut Connections – The Technical Implementation

This section reviews how shortcut connections (the mechanism that enables residual learning) have been explored before.

Historical Context

The authors trace the history of shortcut connections through several papers:

Early MLPs with direct input-to-output connections (citations 34, 49): In the 1990s, people sometimes added a linear layer directly from network input to output. This is a very simple form of a shortcut.
Intermediate auxiliary classifiers (citations 44, 24): To combat vanishing/exploding gradients, researchers connected a few intermediate layers directly to auxiliary classifiers, allowing error signals to reach those layers through shorter paths. Remember from the Introduction: gradients can become very small ( $\partial \mathcal{L} / \partial w \approx 0$ ) or very large as they propagate backward through many layers. Short paths help with this.
Layer response centering (citations 39, 38, 31, 47): Various papers proposed methods for centering layer responses, gradients, and propagated errors, implemented with shortcut connections.
Inception layers (citation 44): Google's Inception architecture includes branches with shortcuts alongside deeper branches, allowing the network to learn both simple and complex transformations in parallel.

Highway Networks: The Key Comparison

The most important comparison is to Highway Networks (by Srivastava et al., citations 42, 43), which were developed around the same time as ResNets.

How Highway Networks Work

Highway Networks use gated shortcuts. Formally, instead of simply adding a residual like ResNet's:

\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}

Highway Networks compute:

\mathbf{y} = \mathcal{F}(\mathbf{x}) \odot \mathbf{g}(\mathbf{x}) + \mathbf{x} \odot (1 - \mathbf{g}(\mathbf{x}))

Where:

$\mathcal{F}(\mathbf{x})$ is the "transform" branch (learnable layers, similar to ResNet)
$\mathbf{g}(\mathbf{x})$ is the gating function (a learned neural network that outputs values between 0 and 1)
$\odot$ denotes element-wise multiplication
$1 - \mathbf{g}(\mathbf{x})$ is the carry gate (the identity shortcut, but weighted)

In plain language: The network learns how much to use the complex transformation versus how much to pass through the identity. If $\mathbf{g}(\mathbf{x}) \approx 0$ , mostly the identity passes through. If $\mathbf{g}(\mathbf{x}) \approx 1$ , mostly the transformation is used.
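The two formulas can be compared side by side in a short sketch. The transform and gate functions below are hypothetical stand-ins (a fixed doubling and constant gates) chosen so the limiting cases are easy to read; in a real Highway Network both are learned.

```python
def highway_block(x, transform, gate):
    """y = F(x) * g(x) + x * (1 - g(x)), element-wise."""
    fx, g = transform(x), gate(x)
    return [f * gi + xi * (1.0 - gi) for f, gi, xi in zip(fx, g, x)]


def residual_block(x, transform):
    """y = F(x) + x, with a parameter-free identity shortcut."""
    return [f + xi for f, xi in zip(transform(x), x)]


def double(x):          # hypothetical transform F(x) = 2x
    return [2.0 * xi for xi in x]


def gate_closed(x):     # g(x) = 0 everywhere: only the identity passes
    return [0.0 for _ in x]


def gate_open(x):       # g(x) = 1 everywhere: only the transform passes
    return [1.0 for _ in x]


x = [1.0, -2.0]
print(highway_block(x, double, gate_closed))  # equals x
print(highway_block(x, double, gate_open))    # equals F(x), no shortcut left
print(residual_block(x, double))              # always F(x) + x
```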

ResNet's Simpler Choice

The authors contrast ResNets with Highway Networks on three points:

Simpler design: ResNet always learns residuals. There's no gate. The identity shortcut is parameter-free (no learnable $\mathbf{g}$ ).
Shortcuts that never close: The identity connection cannot be scaled down by a gate. As a result:
- Gradients flow back through the shortcut unimpeded
- All information always passes to the next layer (plus learned residuals)
- The block always learns a residual function, whereas a highway block whose carry gate $1 - \mathbf{g}(\mathbf{x})$ approaches zero reduces to a plain, non-residual layer
Scaling to extreme depth: Highway Networks had not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers). ResNets do.

Mathematical Clarity: The Residual Principle

Recall from the Introduction that the authors defined:

\mathcal{H}(\mathbf{x}) = \mathcal{F}(\mathbf{x}) + \mathbf{x}

Where:

$\mathcal{H}(\mathbf{x})$ is the desired output function we want to learn
$\mathcal{F}(\mathbf{x})$ is the residual function – what the layers actually learn
$\mathbf{x}$ is the identity shortcut (input passes through unchanged)

Why this formulation? Suppose the optimal solution is close to the identity (i.e., the output should be nearly the same as the input). Then:

In a plain network, layers must learn $\mathcal{H}(\mathbf{x}) \approx \mathbf{x}$ , which requires many nonlinear activations to approximate an identity mapping
In a residual network, layers must learn $\mathcal{F}(\mathbf{x}) \approx 0$ , which is much easier (the solver can simply drive the layer weights toward zero)

This is the fundamental optimization advantage that the authors are building on.

Summary: Connecting All Three Parts

Domain	Key Insight
Image Retrieval	Encoding residuals from dictionary vectors is more efficient
Scientific Computing	Solving residual problems across multiple scales converges faster
Shortcut Connections	Existing work shows shortcuts help; Highway Networks show gated versions work, but ResNets' simpler parameter-free approach is more powerful

The unified message: Residual reformulations work better because they target what still needs to be learned, rather than forcing the system to learn everything from scratch. ResNets apply this principle elegantly to deep neural networks using identity shortcuts.

3.1. Residual Learning

p.3

Let us consider $\mathcal{H}(\mathbf{x})$ as an underlying mapping to be fit by a few stacked layers (not necessarily th...

Section 3.1: Residual Learning - A Deep Dive

The Big Picture

This section tackles a fundamental problem: Why do deeper neural networks perform worse than shallower ones, even on the training data? The authors propose a conceptual shift in how we think about what layers in a neural network should learn. Instead of asking layers to learn the full desired transformation, they suggest asking layers to learn only the difference (residual) from the input. This simple reframing turns out to be remarkably powerful.

Part 1: The Core Hypothesis

Let me start with the foundational claim:

"If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions."

What does this mean?

Suppose you have a desired function that you want your neural network to learn, denoted as $\mathcal{H}(\mathbf{x})$ , where:

$\mathcal{H}$ is the "ideal" transformation you want
$\mathbf{x}$ is the input to a stack of layers
The output should be $\mathcal{H}(\mathbf{x})$

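The authors' argument runs as follows: if stacked nonlinear layers can asymptotically approximate complicated functions (a hypothesis in the spirit of universal approximation results, which the paper assumes rather than proves), then they can equally approximate the difference, provided the input and output have the same dimensions: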

$\mathcal{F}(\mathbf{x}) := \mathcal{H}(\mathbf{x}) - \mathbf{x}$

This difference is called the residual function. Notice:

$\mathcal{F}(\mathbf{x})$ represents how much $\mathcal{H}(\mathbf{x})$ deviates from the identity (just returning the input unchanged)
If we can learn $\mathcal{F}(\mathbf{x})$ , we can always recover the original function: $\mathcal{H}(\mathbf{x}) = \mathcal{F}(\mathbf{x}) + \mathbf{x}$

The key insight: Since both representations are theoretically equivalent in terms of what they can express, the question becomes: Which is easier for an optimizer to actually learn?

Part 2: The Motivation from the Degradation Problem

The authors ground this idea in an empirical puzzle shown in Figure 1:

The Problem: When you add more layers to a network:

Training error gets worse, not better
This happens even though a solution should exist (by construction)
This isn't caused by overfitting: the deeper network has higher training error, not just higher test error

Why should a solution exist? Consider a 20-layer network that works well. You could always create a 56-layer network by:

Copying all the weights from the 20-layer network into the first 20 layers
Making the added 36 layers into identity mappings (they just return their input unchanged)

This constructed solution should have the same training error as the 20-layer network. But optimizers can't find it.
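The construction argument is easy to check in a toy setting. The sketch below models a network as a list of scalar functions. The three "trained" layers and the number of appended identity layers are hypothetical placeholders, not the paper's 20- and 56-layer models.

```python
def run_network(layers, x):
    """Apply each layer in order."""
    for layer in layers:
        x = layer(x)
    return x


def affine(x):
    return 2.0 * x - 1.0


def relu(x):
    return max(0.0, x)


def scale(x):
    return 0.5 * x


def identity(x):
    return x


shallow = [affine, relu, scale]        # stands in for the trained shallow net
deeper = shallow + [identity] * 3      # copied layers plus identity layers

inputs = [-2.0, 0.0, 0.75, 3.0]
print([run_network(shallow, x) for x in inputs])
print([run_network(deeper, x) for x in inputs])
```

Both networks produce identical outputs, so the deeper one can match the shallower one's training error by construction. The puzzle is that an optimizer starting from random weights does not reach such a solution.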

The hypothesis: The problem is that optimizers struggle to learn identity mappings using multiple nonlinear layers. Think about it: if you stack ReLU activation functions and matrix multiplications, it's quite difficult to arrange them so the output exactly equals the input.

Part 3: The Residual Learning Solution

Here's where the reformulation saves the day:

Instead of asking layers to learn $\mathcal{H}(\mathbf{x})$ , ask them to learn $\mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x}$ .

Why is this better?

If identity mappings are optimal (or close to optimal):

In the standard formulation, you'd need the nonlinear layers to arrange themselves to output exactly $\mathbf{x}$
In the residual formulation, you just need the layers to output something close to $\mathbf{0}$ (zero)

Mathematically:

$\text{Standard: learn } \mathcal{H}(\mathbf{x}) = \mathbf{x}$

$\text{Residual: learn } \mathcal{F}(\mathbf{x}) = \mathbf{x} - \mathbf{x} = \mathbf{0}$

Driving weights toward zero is a much more natural behavior for gradient descent than arranging nonlinear layers to reproduce the input exactly.
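A minimal sketch of this point: with all weights at zero, a plain two-layer block outputs zero, while the same block with a shortcut is exactly the identity. The two-layer ReLU form and the helper names are assumptions made for this example.

```python
def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def stacked_layers(x, W1, W2):
    """F(x) = W2 @ relu(W1 @ x): two weight layers with a ReLU in between."""
    return matvec(W2, relu(matvec(W1, x)))


def plain_block(x, W1, W2):
    return stacked_layers(x, W1, W2)


def residual_block(x, W1, W2):
    return [f + xi for f, xi in zip(stacked_layers(x, W1, W2), x)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.5, -2.0]
print(plain_block(x, zeros, zeros))     # the input is lost
print(residual_block(x, zeros, zeros))  # the input is reproduced exactly
```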

Part 4: Preconditioning for Non-Identity Cases

But what if identity mappings aren't optimal? The authors offer a more nuanced view:

"If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping."

Intuition:

Imagine you're trying to describe a complicated function to someone:

Hard way: "Here's the complete new function you need to learn"
Easy way: "Take the identity mapping and tweak it slightly here and there"

The second approach is easier because:

You start with a reasonable baseline (the identity, which does something sensible)
You only need to learn the small deviations from that baseline
This gives the optimizer better "scaffolding" to work with

The term "preconditioning" comes from numerical optimization. A preconditioned problem is one that has been transformed into an equivalent form that solvers handle more easily. Here, building the identity into each block plays that role.

Part 5: Empirical Support

The section concludes by noting (referencing Fig. 7 of the paper) that:

"The learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning."

What does this mean?

When the authors trained networks using the residual learning framework, they observed that $\mathcal{F}(\mathbf{x})$ (the learned residual) tends to have small magnitude. This is consistent with their hypothesis: the optimal functions appear to be close to identity mappings, so reformulating the problem to learn small perturbations from identity is a sensible way to frame it.

Summary: The Three-Layer Logic

Level	Claim
Mathematical equivalence	Both $\mathcal{H}(\mathbf{x})$ and $\mathcal{F}(\mathbf{x}) + \mathbf{x}$ can theoretically express the same functions
Optimization difficulty	Learning $\mathcal{F}(\mathbf{x})$ (small perturbations) is easier than learning $\mathcal{H}(\mathbf{x})$ (a function from scratch)
Empirical reality	Deep networks actually learn small residuals, confirming that this formulation aligns with what networks naturally want to do

Connection to Figure 2

Figure 2 in the paper shows the fundamental building block: a shortcut connection where:

Input $\mathbf{x}$ branches off
One path: $\mathbf{x}$ goes through stacked nonlinear layers to produce $\mathcal{F}(\mathbf{x})$
The other path: $\mathbf{x}$ passes through unchanged
They're added together: $\mathcal{F}(\mathbf{x}) + \mathbf{x}$

This is the architectural instantiation of the conceptual reformulation explained in this section.

3.2. Identity Mapping by Shortcuts


We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper we ...

Section 3.2: Identity Mapping by Shortcuts – Detailed Explanation

Big Picture: What Are We Doing Here?

After establishing in Section 3.1 that it's easier for neural networks to learn residual functions (the difference between desired output and input) rather than learning functions from scratch, this section shows us how to actually implement this idea in practice.

The key question: How do we physically build a neural network that learns residual functions? The answer is shortcut connections—skip connections that bypass some layers and add their output back in. This section formalizes exactly how to do this and addresses practical implementation details.

The Core Building Block: Equation (1)

Let's start with the fundamental equation:

$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$

Breaking Down the Notation

Variables:

$\mathbf{x}$ = input vector to this block (a column vector of features)
$\mathbf{y}$ = output vector from this block
$\mathcal{F}(\mathbf{x}, \{W_i\})$ = the residual function we want to learn
- Takes input $\mathbf{x}$ and learned parameters $\{W_i\}$ (a set of weight matrices)
- Outputs a vector of the same dimension as $\mathbf{x}$
The " $+$ " operation = element-wise addition (adding two vectors component by component)

What's Actually Happening?

Think of this equation as a recipe:

Compute the residual mapping: Pass $\mathbf{x}$ through several nonlinear layers (represented by $\mathcal{F}$ ) to get some output
Add the shortcut: Add the original input $\mathbf{x}$ directly to that output
Final result: Equation (1) gives you the total output $\mathbf{y}$

Geometric intuition: If $\mathcal{F}(\mathbf{x}, \{W_i\})$ represents small adjustments (perturbations) to $\mathbf{x}$ , then $\mathbf{y}$ is just $\mathbf{x}$ plus those adjustments. This is much easier for optimization because:

If the identity function is best, the network just needs to drive $\mathcal{F}$ toward zero
If we need small modifications to identity, the network learns those small adjustments
Compare this to learning the full transformation from scratch with no reference point

Concrete Example: Two-Layer Residual Block

The paper gives a specific example. For a block with two layers:

$\mathcal{F} = W_2\sigma(W_1\mathbf{x})$

Let's unpack this:

First layer: $W_1\mathbf{x}$

$W_1$ is a weight matrix (dimensions: $d_{\text{hidden}} \times d_{\text{input}}$ )
Multiplying by $\mathbf{x}$ produces an intermediate hidden vector
This is a linear transformation

Nonlinearity: $\sigma(\cdot)$

$\sigma$ denotes the ReLU (Rectified Linear Unit) activation function: $\sigma(z) = \max(0, z)$
Applied element-wise to the output of the first layer
This introduces the crucial nonlinearity that lets networks learn complex patterns

Second layer: $W_2(\cdot)$

$W_2$ is another weight matrix (dimensions: $d_{\text{output}} \times d_{\text{hidden}}$ )
Transforms the nonlinear hidden representation back to the output dimension

Complete picture:

$\mathbf{y} = W_2\sigma(W_1\mathbf{x}) + \mathbf{x}$

Then the paper mentions applying another nonlinearity after the addition: $\sigma(\mathbf{y})$ . This gives the final output of the building block, which feeds into the next block.
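The two-layer block can be traced by hand with a small sketch in plain Python. The weight values and the helper names (`relu`, `matvec`, `residual_block`) are hypothetical and chosen only so the arithmetic is easy to follow; biases are omitted, matching the notation above.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    hidden = relu(matvec(W1, x))           # sigma(W1 x)
    f = matvec(W2, hidden)                 # F(x) = W2 sigma(W1 x)
    y = [fi + xi for fi, xi in zip(f, x)]  # Equation (1): y = F(x) + x
    return f, y, relu(y)                   # second nonlinearity after the addition

x = [1.0, 2.0]
W1 = [[0.5, 0.0], [0.0, -1.0]]   # hypothetical weights
W2 = [[2.0, 1.0], [-6.0, 1.0]]   # hypothetical weights

f, y, out = residual_block(x, W1, W2)
print(f, y, out)  # [1.0, -3.0] [2.0, -1.0] [2.0, 0.0]
```

Note that the shortcut adds $\mathbf{x}$ before the final ReLU, so the second nonlinearity acts on the sum rather than on $\mathcal{F}(\mathbf{x})$ alone.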

Critical Constraint: Dimension Matching

Here's a practical issue that arises: What if $\mathbf{x}$ and $\mathcal{F}(\mathbf{x}, \{W_i\})$ don't have the same number of dimensions?

For the addition in Equation (1) to work, both vectors must be the same size. A mismatch arises when:

The network changes the number of channels (feature maps) in convolutional layers
The spatial dimensions change (e.g., downsampling)

Solution: Use a projection shortcut (Equation 2)

$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s\mathbf{x}$

Where:

$W_s$ is a linear projection matrix (the subscript "s" stands for "shortcut")
It transforms $\mathbf{x}$ from dimension $d_{\text{input}}$ to dimension $d_{\text{output}}$
Matrix dimensions: $d_{\text{output}} \times d_{\text{input}}$
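As a rough illustration of Equation (2), the sketch below maps a 3-dimensional input to a 2-dimensional output, so an identity shortcut is impossible and a projection is required. All weights, and the `residual_block` helper with its optional `Ws` argument, are hypothetical choices made for this example.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def residual_block(x, W1, W2, Ws=None):
    f = matvec(W2, relu(matvec(W1, x)))
    # Identity shortcut when Ws is None (Equation 1), projection otherwise (Equation 2).
    shortcut = x if Ws is None else matvec(Ws, x)
    if len(shortcut) != len(f):
        raise ValueError("shortcut and residual dimensions differ; a projection Ws is needed")
    return [fi + si for fi, si in zip(f, shortcut)]

x = [2.0, 3.0, 1.0]                          # d_input = 3
W1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]     # 2 x 3, hypothetical
W2 = [[0.5, 0.0], [0.0, -1.0]]               # 2 x 2, hypothetical
Ws = [[1.0, 0.5, -0.5], [0.25, 1.0, 0.5]]    # d_output x d_input = 2 x 3, hypothetical

y = residual_block(x, W1, W2, Ws)
print(y)  # [3.5, 1.0]
```

Calling `residual_block(x, W1, W2)` without `Ws` raises the `ValueError`, which mirrors why the identity shortcut only applies when input and output dimensions agree.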

Key design decision: The authors use projection only when necessary for dimension matching. Here's why:

Projection adds parameters and slight computational cost
They experimentally verified that identity shortcuts (no projection) are sufficient when dimensions match
When dimensions don't match, they use the minimal solution: a linear projection
This keeps the architecture lean and the comparison fair between plain and residual networks

Why No Extra Computational Cost?

This is a crucial practical advantage:

Identity shortcut (Equation 1):

Zero learnable parameters: the identity operation has no weights
Negligible computation: just element-wise addition, which is cheap compared to matrix multiplications
Can directly compare residual vs. plain networks with the same parameter count and computational budget

Projection shortcut (Equation 2):

Only adds parameters when mathematically necessary (dimension mismatch)
Still much cheaper than adding more layers to the residual function itself

This design allows fair comparison: when researchers compare a 56-layer residual network to a 56-layer plain network, they're comparing networks with nearly identical computational cost and parameter counts. The only difference is the architecture: how information flows.

Flexibility of the Residual Function

The authors emphasize that $\mathcal{F}$ can take various forms:

In this paper:

Two-layer blocks (as shown in Figure 2)
Three-layer blocks (mentioned, shown in Figure 5 later)

Why not one layer?

$\mathbf{y} = W_1\mathbf{x} + \mathbf{x} = (W_1 + I)\mathbf{x}$

This reduces to a simple linear transformation (where $I$ is the identity matrix), providing no nonlinear benefit. You need at least two layers to get the advantage of learning a nonlinear residual.

In principle:

Could use four, five, or many more layers in $\mathcal{F}$
The "skip distance" (how many layers are skipped) can vary
This architectural flexibility is part of the framework's generality

Extension to Convolutional Layers

The mathematical notation might suggest fully-connected (dense) layers, but the framework applies equally to convolutional neural networks:

$\mathcal{F}(\mathbf{x}, \{W_i\})$ can represent multiple convolutional layers
Each $W_i$ becomes a set of convolutional filters instead of a dense weight matrix
The element-wise addition is performed channel by channel across feature maps
- If you have a spatial 3D tensor (height × width × channels), you add corresponding channels together

This generalization is crucial because the paper's main applications use convolutional architectures for image processing.

Summary: Why This Design?

Aspect	Benefit
Simple formulation	Easy to implement, understand, and analyze
Parameter efficiency	Identity shortcuts add zero parameters; projections only when needed
Computational efficiency	Element-wise addition is negligible cost
Fair comparison	Residual and plain networks can have identical parameter counts
Optimization advantage	Networks learn residual perturbations rather than functions from scratch
Extensibility	Works with different block sizes and architectural styles

The elegance of this section lies in showing that a small, mathematically simple modification—adding a single skip connection—provides substantial practical and theoretical benefits for training very deep networks.

$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$

The final output, with $\mathcal{F}(\mathbf{x}) = (0.76, 0.72)^\top$ and $\mathbf{x} = (1, 2)^\top$, is:

$\mathbf{y} = \begin{pmatrix} 0.76 \\ 0.72 \end{pmatrix} + \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 1.76 \\ 2.72 \end{pmatrix}$

Why This Design Is Revolutionary: The Gradient Flow Perspective

The skip connection is crucial for deep networks. Let me explain why through the lens of backpropagation. If we take the derivative of $\mathbf{y}$ with respect to $\mathbf{x}$ :

$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathcal{F}(\mathbf{x}, \{W_i\})}{\partial \mathbf{x}} + \frac{\partial \mathbf{x}}{\partial \mathbf{x}} = \frac{\partial \mathcal{F}}{\partial \mathbf{x}} + \mathbf{I}$

where $\mathbf{I}$ is the identity matrix. This is crucial:

Without the skip connection: Gradients must flow through all layers of $\mathcal{F}$ . In deep networks, this can cause gradients to vanish or explode.
With the skip connection: The gradient gets an additive shortcut through the identity term $\mathbf{I}$ , which always contributes 1 to the gradient flow regardless of how deep $\mathcal{F}$ is.

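This gives gradients a direct path through every block, which mitigates the vanishing gradient problem in very deep networks. Note that the paper itself argues in Section 4.1 that the degradation problem is unlikely to be caused by vanishing gradients, since its plain networks use batch normalization; it attributes the benefit of shortcuts to easier optimization.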
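The identity term in the Jacobian can be checked numerically. The following sketch, in plain Python with hypothetical weights, estimates the Jacobians of $\mathcal{F}$ and of the full block by central finite differences and subtracts them; the helper names (`F`, `block`, `jacobian`) are assumptions made for this example, and the input is chosen away from the ReLU kink so the estimate is reliable.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

W1 = [[0.5, -1.0], [1.5, 0.25]]   # hypothetical weights
W2 = [[1.0, 2.0], [-0.5, 0.75]]   # hypothetical weights

def F(x):
    return matvec(W2, relu(matvec(W1, x)))

def block(x):
    return [fi + xi for fi, xi in zip(F(x), x)]

def jacobian(fn, x, eps=1e-6):
    # Central finite differences; entry [i][j] approximates d fn_i / d x_j.
    n = len(x)
    cols = []
    for j in range(n):
        xp = list(x)
        xm = list(x)
        xp[j] += eps
        xm[j] -= eps
        fp, fm = fn(xp), fn(xm)
        cols.append([(a - b) / (2 * eps) for a, b in zip(fp, fm)])
    return [[cols[j][i] for j in range(n)] for i in range(len(cols[0]))]

x = [1.0, 2.0]
J_F = jacobian(F, x)
J_block = jacobian(block, x)

# The difference should be (numerically) the identity matrix.
diff = [[J_block[i][j] - J_F[i][j] for j in range(2)] for i in range(2)]
```

Whatever $\partial \mathcal{F} / \partial \mathbf{x}$ turns out to be, the block's Jacobian differs from it by exactly the identity, which is the additive shortcut described above.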

Why $\mathcal{F}$ Must Have At Least Two Layers

The paper notes:

"if $\mathcal{F}$ has only a single layer, Eqn.(1) is similar to a linear layer: $\mathbf{y} = W_1\mathbf{x} + \mathbf{x}$ , for which we have not observed advantages."

Let's understand why. With a single layer:

$\mathbf{y} = W_1\mathbf{x} + \mathbf{x} = (W_1 + \mathbf{I})\mathbf{x}$

This is just a linear transformation with a modified weight matrix. It doesn't provide any advantage over a standard layer because:

There's no nonlinearity $\sigma$ in the residual path
The skip connection doesn't help with vanishing gradients when there's only one layer anyway

With two or more layers and nonlinearity:

$\mathbf{y} = W_2\sigma(W_1\mathbf{x}) + \mathbf{x}$

Now the residual path computes a nonlinear transformation, and the skip connection provides gradient shortcuts in deeper blocks.
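The single-layer collapse is easy to verify numerically. In this small sketch (plain Python, hypothetical weights), a one-layer "residual" block and an ordinary linear layer with weights $W_1 + \mathbf{I}$ produce the same output.

```python
def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

W1 = [[0.5, -1.0], [2.0, 0.25]]   # hypothetical weights
x = [4.0, 8.0]

# Single-layer "residual" block: y = W1 x + x
y_residual = [a + b for a, b in zip(matvec(W1, x), x)]

# Ordinary linear layer with weights (W1 + I)
n = len(x)
W1_plus_I = [[W1[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
y_linear = matvec(W1_plus_I, x)

print(y_residual, y_linear)  # [-2.0, 18.0] [-2.0, 18.0]
```

Since a plain linear layer can already learn any matrix, including $W_1 + \mathbf{I}$, the shortcut adds no expressive power in this case.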

The Case of Dimension Mismatch

The paper mentions Equation (2):

$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s\mathbf{x}$

This is needed when the dimensions of $\mathcal{F}(\mathbf{x}, \{W_i\})$ and $\mathbf{x}$ don't match — for example, when changing the number of channels in a convolutional layer. The learnable projection matrix $W_s$ reshapes $\mathbf{x}$ to match $\mathcal{F}$ 's output dimension, enabling element-wise addition.

However, the paper emphasizes that identity mapping is sufficient in most cases and is preferred because:

It adds no extra parameters or computation
It keeps the model more economical
Experiments show it effectively addresses the degradation problem

Conceptual Summary

Aspect	Impact
What is learned	The residual $\mathcal{F}(\mathbf{x}, \{W_i\})$ , not the full output
Skip connection	Adds the input $\mathbf{x}$ directly to the output
Gradient flow	Identity component ensures gradients always flow through skip path
Depth enabler	Makes training very deep networks possible (ResNet-152+)
Nonlinearity requirement	$\mathcal{F}$ must have ≥2 layers + activation to be beneficial
Parameter efficiency	Skip adds zero extra parameters

This seemingly simple equation, $\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$, fundamentally changed deep learning by making it practical to train networks with 100+ layers. It's an elegant solution to a critical problem: how to optimize very deep architectures without the accuracy degradation seen in plain networks.


$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s\mathbf{x}$

The bottleneck design achieves parameter reduction by first reducing channels (1×1 conv), then doing computation in a smaller space (3×3 conv), then expanding back (1×1 conv). This is more efficient than direct convolutions on high-channel-count data.

Summary: The Key Insights

The residual equation $\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s\mathbf{x}$ embodies several breakthrough ideas:

1. Residual Learning Principle

Instead of learning the full mapping $\mathbf{x} \to \mathbf{y}$ , the network learns the residual $\mathcal{F}(\mathbf{x}) = \mathbf{y} - \mathbf{x}$ . This shifts the optimization landscape to make learning easier.

2. Gradient Preservation

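For an identity shortcut, the derivative $\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathcal{F}}{\partial \mathbf{x}} + \mathbf{I}$ (with a projection shortcut, $\mathbf{I}$ is replaced by $W_s$) ensures that:

Gradients always have a direct path back (the $\mathbf{I}$ term)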
Very deep networks can still learn effectively
The vanishing gradient problem is significantly mitigated

3. Flexible Dimension Matching

The projection matrix $W_s$ handles cases where dimensions change, making skip connections applicable throughout the network.

4. Minimal Computational Overhead

The skip connection adds only element-wise addition (negligible cost) while providing massive architectural benefits.

5. Empirical Success

By enabling much deeper networks (ResNet-152 on ImageNet, and networks with over 1000 layers on CIFAR-10) without degradation, residual connections demonstrated that depth, when properly architected, improves performance. This unlocked the entire era of very deep learning models we see today.

This single equation became the foundation for modern deep learning practice!


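Example: projecting a 3D input down to 2D via a linear transformation (like $W_s$ in the residual equation)

$W_s\mathbf{x} = \begin{pmatrix} 1 & 0.2 & -0.1 \\ 0.3 & 0.8 & 0.5 \end{pmatrix} \begin{pmatrix} 2 \\ 3 \\ 1 \end{pmatrix} = \begin{pmatrix} 2.5 \\ 3.5 \end{pmatrix}$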

3.3. Network Architectures


We have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion,...

Section 3.3: Network Architectures — A Detailed Explanation

Big Picture

This section is crucial because it translates the theoretical ideas about residual learning (introduced in 3.1 and 3.2) into concrete network designs. The authors need to show:

How to build plain networks (traditional deep networks without shortcuts) for a fair comparison
How to add shortcut connections to create residual networks
How to handle dimension mismatches when adding shortcuts

This matters because the entire paper's contribution depends on comparing equally-resourced networks (same parameters, same computational cost) where the only difference is the presence of residual connections. If the residual network had more parameters, we wouldn't know if improvements came from residual learning or just from having a bigger model.

Part 1: Plain Network Design

The Philosophy

The authors base their plain network design on VGG nets [41], which were the state-of-the-art reference architecture at the time. Let me break down the design rules:

Design Rule (i): For layers producing the same output feature map size, use the same number of filters.

Design Rule (ii): When halving the feature map size, double the number of filters.

Why These Rules Make Sense

To understand rule (ii), think about what happens to your data:

Feature map size describes the spatial resolution: a $56 \times 56$ feature map has $56^2 = 3,136$ spatial locations
Number of filters represents the feature dimensionality at each location
Computational cost per convolutional layer is roughly: output height × output width × input channels × output channels × kernel area

$\text{Rough FLOPs per layer} \propto H \times W \times C_{\text{in}} \times C_{\text{out}} \times K^2$

where:

$H, W$ = height and width of the output feature map
$C_{\text{in}}, C_{\text{out}}$ = number of input and output channels (filters)
$K$ = filter size (typically 3)

When you reduce spatial dimensions by half (stride-2 downsampling), you go from $H \times W$ to $\frac{H}{2} \times \frac{W}{2}$ , reducing the spatial factor by 4.

Doubling the number of filters doubles both $C_{\text{in}}$ and $C_{\text{out}}$ for layers inside the new stage, which increases the channel factor by 4.

Result: Net effect is $\frac{1}{4} \times 4 = 1$ , so the computational cost per layer stays roughly constant across resolutions. This is the paper's stated purpose for rule (ii): preserving the time complexity per layer.
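A quick count makes rule (ii) concrete. The sketch below tallies multiply-accumulate operations for one 3×3 convolution, counting both input and output channels; the function name `conv_macs` is a hypothetical helper, and the layer sizes are the 56×56×64 and 28×28×128 stages used as examples elsewhere in this walkthrough.

```python
def conv_macs(h_out, w_out, c_in, c_out, k=3):
    # Multiply-accumulates for one k x k convolution layer (bias ignored).
    return h_out * w_out * c_in * c_out * k * k

before = conv_macs(56, 56, 64, 64)     # 56x56 maps, 64 filters
after = conv_macs(28, 28, 128, 128)    # halved map size, doubled filters
print(before, after, after / before)   # 115605504 115605504 1.0
```

The first layer of a new stage, which takes 64 input channels and produces 128, costs half as much by the same formula, so this equality applies to layers whose input and output widths are both doubled.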

The 34-Layer Plain Network

The authors build a 34-layer plain network with these rules. Key facts:

Mostly 3×3 convolutional filters (inspired by VGG's success); the first layer is a 7×7 convolution
Stride-2 downsampling to reduce spatial dimensions
Global average pooling at the end (compressing spatial dimensions to 1×1)
1000-way fully-connected softmax for ImageNet classification (1000 classes)

Computational comparison:

Their 34-layer plain net: 3.6 billion FLOPs
VGG-19: 19.6 billion FLOPs
Ratio: Only 18% of VGG's complexity!

This is important because it shows the plain baseline is actually efficient, not just a strawman architecture.

Part 2: Residual Network Design

Building from the Plain Network

The residual network takes the plain network and inserts shortcut connections as described in Section 3.2. The key innovation is the building block:

$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x} \quad \text{(Equation 1, from Section 3.2)}$

where:

$\mathbf{x}$ = input to the residual block
$\mathcal{F}(\mathbf{x}, \{W_i\})$ = the residual function (stacked convolutional layers)
$\mathbf{y}$ = output after adding the shortcut

The Two Cases: Same vs. Different Dimensions

Case 1: Identity Shortcuts (Same Dimensions)

When the input $\mathbf{x}$ and residual function output $\mathcal{F}(\mathbf{x})$ have the same dimensions, you directly add them:

$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$

In Figure 3 (right), these are shown as solid line shortcuts. This is the simplest case:

No extra parameters
No extra computation
Gradients flow directly through the skip connection during backpropagation

Case 2: Dimension Mismatch (Different Dimensions)

When spatial dimensions or channel counts change, direct addition isn't possible. Consider this scenario:

If $\mathbf{x}$ has shape $56 \times 56 \times 64$ (56×56 spatial, 64 channels) but $\mathcal{F}(\mathbf{x})$ has shape $28 \times 28 \times 128$ (half spatial resolution, double channels), you cannot perform element-wise addition.

The authors present two options:

Option A: Zero-Padding (Identity with Padding)

Keep the shortcut as identity mapping (no learned parameters)
Pad missing dimensions with zeros
When going from 56×56 to 28×28: use stride-2 in the shortcut
When going from 64 channels to 128: pad with 64 all-zero channels to get 128 total
Advantage: No extra parameters and negligible extra computation
Disadvantage: The zero-padded channels receive nothing from the shortcut, so they get no residual learning

Option B: Projection Shortcut (with Learned Parameters)

Use the learned projection from Equation 2:

$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s\mathbf{x}$

where $W_s$ is typically a $1 \times 1$ convolution that:

Changes spatial dimensions via stride-2 (for downsampling)
Changes channel count via different output filters
Advantage: Learned transformation can adapt to the dimension change
Disadvantage: Adds parameters and computation (though $1 \times 1$ convolutions are cheap)

Implementation Detail: Stride in Shortcuts

"When the shortcuts go across feature maps of two sizes, they are performed with a stride of 2."

This is important: if your residual block halves the spatial resolution, the shortcut must also halve it:

Input: $56 \times 56$
Residual path: stride-2 convolutions → $28 \times 28$
Shortcut path: stride-2 convolution (for option B) or stride-2 operation (for option A) → $28 \times 28$

Now both can be added: $(28 \times 28 \times 128) + (28 \times 28 \times 128)$ .
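A parameter-free shortcut in the spirit of option A might look like the sketch below, written in plain Python on nested lists laid out as channels × height × width. The paper only states that the shortcut uses a stride of 2 and zero-pads the extra dimensions; implementing the stride as "keep every second row and column" is an assumption here, as is the function name `shortcut_option_a`.

```python
def shortcut_option_a(x, out_channels):
    """x is a list of channels, each an H x W list of rows."""
    # Stride 2: keep every second row and column (assumed subsampling scheme).
    strided = [[row[::2] for row in channel[::2]] for channel in x]
    h, w = len(strided[0]), len(strided[0][0])
    # Zero-pad the channel dimension up to out_channels; no parameters involved.
    extra = out_channels - len(strided)
    padding = [[[0.0] * w for _ in range(h)] for _ in range(extra)]
    return strided + padding

# Toy input: 2 channels of 4 x 4, standing in for 56 x 56 x 64.
x = [[[float(c * 100 + r * 10 + col) for col in range(4)] for r in range(4)]
     for c in range(2)]
s = shortcut_option_a(x, out_channels=4)
shape = (len(s), len(s[0]), len(s[0][0]))   # (4, 2, 2)
```

The output has half the spatial size and twice the channels, so it can be added element-wise to the residual path; the last two channels are all zeros and carry no shortcut signal.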

Part 3: Why This Design Matters

Fair Comparison

The critical point: Both the plain and residual networks have:

Same number of layers (34 layers)
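Same number of parameters when the shortcuts are parameter-free (identity, with zero-padding where dimensions increase); projection shortcuts add a small number of extra parameters
Same computational cost (3.6 billion FLOPs; the element-wise additions are negligible)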
Same architecture, except for shortcuts

This means any performance difference comes purely from the residual learning framework, not from having more capacity.

Flexibility Within the Framework

The authors emphasize that the residual function $\mathcal{F}$ can be flexible:

Can have 2 layers (most common in their experiments)
Can have 3 layers
Could theoretically have more

But a single-layer residual block doesn't help: $\mathbf{y} = W_1\mathbf{x} + \mathbf{x}$ is essentially just a linear layer with an identity shortcut—experiments show no advantage.

Extension to Convolutional Layers

The notation uses fully-connected layer notation for simplicity, but everything applies to convolutional layers:

$\mathcal{F}(\mathbf{x}, \{W_i\})$ can represent multiple convolutional layers
Element-wise addition happens channel by channel on 3D feature maps
If each feature map is shape $H \times W \times C$ , addition produces output of the same shape

Summary Table: Plain vs. Residual

Aspect	Plain Network	Residual Network
Layers	34	34
Parameters	Same	Same
FLOPs	3.6B	3.6B (shortcuts add negligible computation)
Building block	$\mathbf{y} = \mathcal{F}(\mathbf{x})$	$\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$
Shortcut strategy	None	Identity (same dim) or projection (different dim)
Dimension handling	N/A	Zero-padding or learned $W_s$

Why This Section Is Crucial for the Paper

The authors can now run controlled experiments: plain vs. residual networks with identical computational resources. Any performance improvement cannot be attributed to:

❌ More parameters
❌ More computation
❌ Larger model capacity

It must come from how the information flows through the network—specifically, from having gradient highways (shortcut connections) that make optimization easier. This experimental setup is what makes ResNets' dramatic improvements so convincing.

3.4. Implementation


Our implementation for ImageNet follows the practice in [21, 41]. The image is resized with its shorter side randomly sa...

Understanding Section 3.4: Implementation

Big Picture: Why This Section Matters

After spending sections 3.1–3.3 explaining the conceptual innovation of residual learning and describing what the networks look like architecturally, this section tells us the practical details of how the authors actually trained these networks.

This is crucial because:

Reproducibility: Other researchers need to know exact hyperparameters and procedures to reproduce the results
Fair comparison: The paper claims residual networks are better than plain networks, so readers need to know that both were trained identically (except for the skip connections)
Best practices documentation: This became a reference implementation for the field

Think of it like a recipe: sections 3.1–3.3 describe the dish you're making, and section 3.4 gives you the exact ingredients, cooking temperature, and timing.

Breaking Down the Implementation Details

Data Preprocessing (Training)

Let me work through the image preparation step-by-step:

Step 1: Scale Augmentation

"The image is resized with its shorter side randomly sampled in [256, 480]"

What does this mean?

Each training image has two dimensions: height and width
The shorter side refers to whichever of these two is smaller (e.g., if an image is 400×600 pixels, the shorter side is 400)
During training, this shorter side is randomly resized to some value between 256 and 480 pixels (inclusive)
The aspect ratio is preserved, so if the shorter side is 256, the longer side is scaled proportionally

Why do this? This creates scale variation in the training data. Networks trained on multiple scales generalize better to images of different sizes.

Step 2: Cropping and Flipping

"A 224×224 crop is randomly sampled from an image or its horizontal flip"

After resizing:

A random 224×224 pixel square is extracted from anywhere in the resized image (not always centered)
The network then sees this 224×224 crop as input
The crop is taken either from the image or from its horizontal flip (commonly with probability 0.5 each; the paper does not state the probability)

Why? Random cropping and flipping create data augmentation—the network sees different versions of the same image, which reduces overfitting.
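The geometry of steps 1 and 2 can be sketched without any image library. The code below only computes sizes and crop coordinates; the function names, the returned dictionary layout, and the flip probability of 0.5 are assumptions made for illustration and are not specified by the paper.

```python
import random

def resized_shape(height, width, shorter_side):
    # Scale so the shorter side equals `shorter_side`, preserving aspect ratio.
    scale = shorter_side / min(height, width)
    return round(height * scale), round(width * scale)

def sample_training_view(height, width, rng, crop=224):
    shorter = rng.randint(256, 480)            # scale augmentation, inclusive range
    h, w = resized_shape(height, width, shorter)
    top = rng.randint(0, h - crop)             # random crop position
    left = rng.randint(0, w - crop)
    flip = rng.random() < 0.5                  # flip probability 0.5 is an assumption
    return {"resized": (h, w), "box": (top, left, top + crop, left + crop), "flip": flip}

rng = random.Random(0)
view = sample_training_view(400, 600, rng)
print(resized_shape(400, 600, 256))  # (256, 384)
```

Because the shorter side is always at least 256, a 224×224 crop always fits inside the resized image.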

Step 3: Normalization

"with the per-pixel mean subtracted"

For each pixel position across all training images, the authors:

Calculate the mean intensity value (averaged across all training images)
Subtract this mean from every pixel in every image

Why? This centers the data around zero, which helps the optimization algorithm (SGD) work more efficiently. This is a standard preprocessing technique in machine learning.

Step 4: Color Augmentation

"The standard color augmentation in [21] is used"

This refers to the color augmentation of [21] (the AlexNet paper), which perturbs RGB intensities along the principal components of the training set's pixel colors. The authors don't detail it here but reference their source.

Network Architecture Details

"We adopt batch normalization (BN) [16] right after each convolution and before activation"

What is Batch Normalization? Within each training mini-batch:

The outputs of each convolutional layer are normalized to have mean 0 and variance 1
Then the network learns to scale and shift these normalized values

Mathematically, for a mini-batch, if $z_i$ is the output of a convolution for the $i$ -th sample:

$\hat{z}_i = \frac{z_i - \mathbb{E}[z]}{\sqrt{\text{Var}(z) + \epsilon}}$

where $\mathbb{E}[z]$ is the mean and $\text{Var}(z)$ is the variance across the mini-batch, and $\epsilon$ is a small constant for numerical stability.

Why this placement? By normalizing before the ReLU activation (rather than after), the internal distributions stay stable during training, which speeds up convergence and allows higher learning rates.
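The normalization formula and the conv, BN, ReLU ordering can be sketched for a single channel in plain Python. This is a simplified training-time illustration: the name `batch_norm` and the toy values are assumptions, and a real convolutional BN layer would compute the statistics per channel over both the batch and the spatial positions.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize one channel's activations across a mini-batch, then scale and shift.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)   # biased (batch) variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

def relu(v):
    return [max(0.0, a) for a in v]

conv_out = [2.0, 4.0, 6.0, 8.0]        # toy conv outputs for one channel
normalized = batch_norm(conv_out)       # mean ~0, variance ~1
activated = relu(normalized)            # conv -> BN -> ReLU ordering
```

Here `gamma` and `beta` stand for the learned scale and shift; they are left at 1 and 0 so the normalization itself is visible.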

"We initialize the weights as in [13]"

This refers to He initialization, a specific method for setting initial weight values that accounts for the number of input neurons, helping prevent vanishing/exploding gradients at the start of training.
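As a rough sketch of that initialization, the code below draws weights from a zero-mean Gaussian with standard deviation $\sqrt{2 / n}$, where $n$ is the fan-in (kernel area times input channels). The helper names and the flat list-of-rows weight layout are assumptions made for this example.

```python
import math
import random

def he_std(fan_in):
    # Standard deviation sqrt(2 / fan_in), as proposed for ReLU networks in [13].
    return math.sqrt(2.0 / fan_in)

def init_conv_weights(c_out, c_in, k, rng):
    fan_in = c_in * k * k                      # inputs feeding one output unit
    std = he_std(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(c_out)]

rng = random.Random(0)
weights = init_conv_weights(c_out=64, c_in=64, k=3, rng=rng)
```

For a 3×3 convolution with 64 input channels the fan-in is 576, giving a standard deviation of about 0.059.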

Training Hyperparameters

Let me explain the optimization setup:

Learning Procedure:

Algorithm: SGD (Stochastic Gradient Descent)
Mini-batch size: 256 (the network processes 256 images at a time before updating weights)
Iterations: Up to $60 \times 10^4 = 600,000$ total updates

Learning Rate Schedule:

"The learning rate starts from 0.1 and is divided by 10 when the error plateaus"

The learning rate $\alpha$ controls step size in SGD. Mathematically, a weight update looks like:

$W_{t+1} = W_t - \alpha \nabla \mathcal{L}(W_t)$

where $W_t$ are weights at iteration $t$ , $\nabla \mathcal{L}(W_t)$ is the gradient of the loss, and $\alpha$ is the learning rate.

The schedule works as:

Start with $\alpha = 0.1$ (relatively large steps)
When validation error stops improving (plateaus), reduce $\alpha$ to $0.01$
If it plateaus again, reduce to $0.001$ , etc.

Why reduce learning rate? Early in training, large steps help escape local minima. Later, smaller steps allow fine-tuning near the optimum.
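One way to express a divide-by-10-on-plateau schedule is sketched below. The paper does not say how a plateau is detected, so the `patience` and `min_delta` heuristic, the function name, and the error values are all assumptions made for illustration.

```python
def step_on_plateau(errors, lr=0.1, patience=2, min_delta=1e-3):
    """Return the learning rate used at each evaluation step."""
    best = float("inf")
    stale = 0
    history = []
    for err in errors:
        history.append(lr)
        if err < best - min_delta:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:      # error has plateaued
                lr /= 10.0
                stale = 0
    return history

errors = [0.60, 0.45, 0.40, 0.40, 0.40, 0.33, 0.31, 0.31, 0.31]
rates = step_on_plateau(errors)
```

With these toy errors the rate stays at 0.1 for the first five evaluations and drops to 0.01 once the error has failed to improve twice in a row.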

Regularization:

Weight decay: 0.0001 (adds a penalty $\lambda \sum_i W_i^2$ to the loss, preventing weights from becoming too large)
Momentum: 0.9 (accumulates gradients over iterations for smoother updates)
No dropout: The authors don't use dropout (a technique that randomly zeroes activations during training), following the practice of the batch normalization paper [16]

The momentum update rule is:

$v_{t+1} = \beta v_t + \nabla \mathcal{L}(W_t)$

$W_{t+1} = W_t - \alpha v_{t+1}$

where $v_t$ is the velocity/momentum term and $\beta = 0.9$ in this case. Intuitively, this makes the optimizer accelerate in consistent directions and dampen oscillations.
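The two update equations, together with weight decay, can be sketched on a toy problem in plain Python. Folding weight decay into the gradient as `weight_decay * w` is a common convention assumed here, not something the paper specifies, and the quadratic loss is chosen only because its gradient is trivial.

```python
def sgd_momentum_step(w, v, grad, lr=0.1, momentum=0.9, weight_decay=1e-4):
    # Weight decay folded into the gradient as weight_decay * w (an assumed, common convention).
    g = [gi + weight_decay * wi for gi, wi in zip(grad, w)]
    v_new = [momentum * vi + gi for vi, gi in zip(v, g)]       # v_{t+1} = beta v_t + grad
    w_new = [wi - lr * vi for wi, vi in zip(w, v_new)]         # W_{t+1} = W_t - alpha v_{t+1}
    return w_new, v_new

# Toy loss L(w) = 0.5 * ||w||^2, whose gradient is simply w.
w, v = [1.0, -2.0], [0.0, 0.0]
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=w)
```

After 200 steps the weights have spiralled in close to the minimum at zero; the oscillation on the way there is the characteristic behavior of momentum on a simple quadratic.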

Testing Procedure

"For comparison studies we adopt the standard 10-crop testing"

During testing (evaluation):

Instead of a single random crop, the authors extract 10 fixed crops from each image (four corners + center of the image, and four corners + center of the horizontally flipped version)
The network predicts on all 10 crops
Predictions are averaged to give a final answer

Why? This reduces variance in predictions and is more robust than a single evaluation.
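The crop layout and the averaging step can be sketched as follows. The function names and the `(top, left, flipped)` tuple layout are hypothetical, and the image size is arbitrary; flipped crops are understood as the same positions taken from the horizontally flipped image.

```python
def ten_crop_boxes(height, width, crop=224):
    """Return 10 (top, left, flipped) crop specs: 4 corners + center, plain and flipped."""
    bottom, right = height - crop, width - crop
    positions = [
        (0, 0), (0, right), (bottom, 0), (bottom, right),   # four corners
        (bottom // 2, right // 2),                           # center
    ]
    return [(top, left, flipped) for flipped in (False, True) for top, left in positions]

def average_scores(per_crop_scores):
    # Average class scores over the crops to get one prediction per image.
    n = len(per_crop_scores)
    return [sum(col) / n for col in zip(*per_crop_scores)]

boxes = ten_crop_boxes(256, 340)
```

In use, the network would be run on each of the ten crops and `average_scores` applied to the ten resulting score vectors.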

"For best results, we adopt the fully-convolutional form as in [41, 13], and average the scores at multiple scales"

For their best reported numbers:

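The network is applied in fully-convolutional form, following [41, 13]: the final fully-connected layer is treated as a convolution, so the network can be run densely over a whole image rather than on fixed 224×224 crops (the paper does not spell out further details)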
This allows the network to accept images of different sizes
They evaluate at multiple resolutions: shorter side in $\{224, 256, 384, 480, 640\}$ pixels
All predictions are averaged

Summary: Why These Choices Matter

Component	Choice	Purpose
Scale augmentation	Shorter side in [256, 480]	Learn features at multiple scales
Cropping	Random 224×224 crops	Prevent overfitting to specific image regions
Batch normalization	Applied before activation	Stabilize training, enable higher learning rates
Learning rate	0.1, divided by 10 on plateau	Start with exploration, finish with refinement
Momentum	0.9	Smooth optimization trajectory
Weight decay	0.0001	Prevent overfitting through regularization
Test augmentation	10-crop averaging	Reduce prediction variance

The key insight: these are standard practices in deep learning, applied consistently to both plain and residual networks. This fairness is essential for validating the authors' claim that ResNets are fundamentally better architectures, not just better-trained networks.

4.1. ImageNet Classification


We evaluate our method on the ImageNet 2012 classification dataset [36] that consists of 1000 classes. The models are tr...

Deep Residual Learning: ImageNet Classification Results (Section 4.1)

The Big Picture

This section is the empirical heart of the ResNet paper. The authors are demonstrating that their residual learning framework actually solves the degradation problem—the phenomenon where deeper networks perform worse than shallower ones, even on training data. This is crucial because:

Motivation: If deeper networks always performed better (or at least as well), we wouldn't need ResNets. But they don't—plain deep networks get stuck during training.
Solution validation: ResNets fix this by adding shortcuts that allow information to flow directly through layers, making optimization easier.
Impact: The results are spectacular—they achieve state-of-the-art performance on ImageNet, the most prestigious benchmark at the time.

Let me walk you through the key experiments and findings.

Part 1: The Degradation Problem in Plain Networks

What They're Testing

The authors train two plain convolutional networks:

18-layer plain net: baseline architecture
34-layer plain net: same design philosophy, just deeper

Key expectation: The 34-layer network should perform at least as well as the 18-layer network, since it has more representational capacity. Mathematically, the solution space of the 18-layer network is a subset of the 34-layer network's solution space—anything the 18-layer network can learn, the 34-layer one could theoretically learn too.

The Shocking Result

Looking at Table 2 and Figure 4 (left), the 34-layer plain net performs worse than the 18-layer net. Even more concerning: the 34-layer net has higher training error throughout training, not just validation error. This is the degradation problem in action.

Why Not Vanishing Gradients?

The authors make an important argument here. In very deep networks trained without batch normalization, gradients can "vanish"—become exponentially small—as they propagate backward through many layers. This would prevent learning because the weight updates would be negligible:

$\frac{\partial \mathcal{L}}{\partial W_1} \approx 0 \quad \text{(extremely small)}$

where $\mathcal{L}$ is the loss function and $W_1$ is a weight in an early layer.

But here's the key: These networks use batch normalization (BN), introduced in section 3.4. Batch normalization keeps the forward-propagated signals at non-zero variance, and the authors also verify that the backward-propagated gradients exhibit "healthy norms". Neither forward nor backward signals vanish, so vanishing gradients are an unlikely cause.

The Real Problem: Slow Convergence

Instead, the authors conjecture that the optimization problem itself is harder: deep plain nets may have exponentially low convergence rates. The paper gives no formula for this and leaves the cause to future study. As a purely illustrative model (not taken from the paper), suppose the training error decays as:

$\text{Training Error}(t) = \text{Error}_0 \cdot e^{-ct}$

where $t$ is the number of iterations and $c$ is the convergence rate. For very deep plain nets, $c$ might be so small that the error barely decreases within a practical training budget. This is subtly different from vanishing gradients: the gradients aren't zero, but the optimization landscape is shaped in a way that makes it extremely slow to navigate.

Part 2: Residual Networks Fix the Problem

The Architecture Change

Now the authors add shortcut connections to the same 18-layer and 34-layer networks, creating ResNets. Recall from equation (1) in section 3.2:

$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$

This simple addition fundamentally changes optimization. Instead of fitting the desired mapping $\mathcal{H}(\mathbf{x})$ directly, the stacked layers fit the residual $\mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x}$ , which represents the difference from what would happen if the layers did nothing.
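The residual formulation can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the residual branch is an arbitrary function on a flat list of numbers, and the names `residual_block`, `zero_branch`, and `small_branch` are hypothetical, not from the paper. It shows that a branch whose output is zero turns the whole block into the identity mapping.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, branch: Callable[[Vector], Vector]) -> Vector:
    """Return y = F(x) + x, where `branch` plays the role of F."""
    fx = branch(x)
    if len(fx) != len(x):
        raise ValueError("an identity shortcut needs matching dimensions")
    return [f + xi for f, xi in zip(fx, x)]


def zero_branch(x: Vector) -> Vector:
    """A residual branch whose weights have all been driven to zero."""
    return [0.0 for _ in x]


def small_branch(x: Vector) -> Vector:
    """A hypothetical branch that applies a small correction to its input."""
    return [0.1 * xi for xi in x]
```

With `zero_branch`, the block returns its input unchanged, which is the sense in which a residual block can "do nothing" without having to learn an identity mapping through its weights.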

The Results: Three Major Observations

Observation 1: Deeper is Better Again

With ResNets, the 34-layer network is 2.8% better than the 18-layer version (Table 2). The degradation problem is gone! Looking at Figure 4 (right), both training and validation error decrease smoothly with depth—the relationship is now monotonic in the right direction.

Observation 2: Massive Improvement Over Plain Nets

The 34-layer ResNet reduces top-1 error by 3.5% compared to its plain counterpart (Table 2). This is a huge improvement in validation accuracy. Importantly, Figure 4 shows that the training error is "considerably lower"—the ResNet actually learns the training data better, which then translates to better generalization.

Observation 3: Shallow Networks Benefit Too

Even the 18-layer ResNet (where degradation isn't a problem) converges faster than its plain equivalent. This tells us that residual learning helps even when we're not fighting the degradation problem—it just makes optimization easier in general.

Part 3: Does the Type of Shortcut Connection Matter?

Three Options for Dimension Matching

In section 3.3, the authors mentioned that shortcuts need special handling when input and output have different dimensions. Let me explain the three options tested:

Option A: Identity shortcuts with zero-padding

When dimensions increase (e.g., feature map size halves but channel count doubles), pad the shortcut with zeros
No extra parameters
Mathematical form: $\mathbf{y} = \mathcal{F}(\mathbf{x}) + [\mathbf{x} \; \mathbf{0}]$

Option B: Projection shortcuts

Use a learned $W_s$ (from equation 2) to project the shortcut to match dimensions
Adds parameters, but only for dimension-increase layers
Mathematical form: $\mathbf{y} = \mathcal{F}(\mathbf{x}) + W_s\mathbf{x}$ where $W_s$ is typically a $1 \times 1$ convolution

Option C: All-projection

Use learned projections for every shortcut, even when dimensions match
Most parameters, but theoretically most flexible
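The three options differ only in how the shortcut path is computed. The sketch below models a single spatial position as a list of channel values, so it ignores the stride-2 subsampling that accompanies a change in feature map size. The function names are illustrative assumptions, and the weight matrix passed to `shortcut_projection` stands in for a learned $W_s$.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def shortcut_zero_pad(x: Vector, out_channels: int) -> Vector:
    """Option A: keep x as it is and append zeros for the new channels."""
    if out_channels < len(x):
        raise ValueError("zero-padding can only increase the channel count")
    return x + [0.0] * (out_channels - len(x))


def shortcut_projection(x: Vector, w_s: Matrix) -> Vector:
    """Options B and C: multiply by W_s (a 1x1 convolution at one position)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in w_s]


def extra_parameters(in_channels: int, out_channels: int, option: str) -> int:
    """Weights added by one shortcut, ignoring biases."""
    return 0 if option == "A" else in_channels * out_channels
```

For a 64-to-128 channel transition, `extra_parameters(64, 128, "A")` is 0, while a projection costs 8,192 weights for that one shortcut.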

The Finding: Option A Works Fine

Table 3 shows something surprising: all three options significantly outperform the plain baseline. The differences between A, B, and C are small—B is slightly better than A, and C is marginally better than B.

The key insight is: learning the shortcut connection isn't essential. The identity mapping (or zero-padded identity) is sufficient to fix the degradation problem. The authors hypothesize that option C performs marginally better only because it adds more capacity (parameters), not because projection shortcuts are fundamentally necessary.

This is important for practical reasons: identity shortcuts require no computation or memory overhead, making the ResNets more efficient.

Part 4: Going Much Deeper with Bottleneck Blocks

Why We Need a New Design

Training a 34-layer ResNet takes significant computational time. To go even deeper without excessive training time, the authors modify the basic building block.

The Bottleneck Architecture

Instead of the 2-layer block ( $\text{conv3×3, conv3×3}$ ), they use a 3-layer bottleneck block:

$\text{[conv1×1, conv3×3, conv1×1]}$

Why this design? Here's the intuition:

The first $1 \times 1$ convolution reduces the number of channels (dimension reduction)
The middle $3 \times 3$ convolution operates on fewer channels (the "bottleneck")
The final $1 \times 1$ convolution increases the number of channels back (dimension restoration)

Looking at Figure 5, a concrete example: suppose input has 256 channels. The bottleneck:

Reduces to 64 channels (via first $1 \times 1$ )
Processes with $3 \times 3$ on 64 channels (much cheaper than 256!)
Expands back to 256 channels (via final $1 \times 1$ )

Computational benefit: Both designs have similar time complexity, but the bottleneck design lets the $1 \times 1$ layers handle the reduction and restoration of dimensions, leaving the expensive $3 \times 3$ layer to operate on smaller input and output dimensions.
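A rough weight count makes the "similar complexity" claim concrete. The sketch below compares the two blocks of Figure 5: a basic block at 64 channels and a bottleneck block at 256 channels with a 64-channel middle. Biases and batch normalization parameters are ignored, and the helper names are assumptions made for illustration. At a fixed spatial resolution, multiply-adds per position equal these weight counts.

```python
def conv_weights(kernel: int, in_ch: int, out_ch: int) -> int:
    """Weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_ch * out_ch


def basic_block_weights(channels: int) -> int:
    """Two stacked 3x3 convolutions at a fixed width."""
    return 2 * conv_weights(3, channels, channels)


def bottleneck_weights(outer: int, inner: int) -> int:
    """1x1 reduce, 3x3 at the narrow width, 1x1 restore."""
    return (conv_weights(1, outer, inner)
            + conv_weights(3, inner, inner)
            + conv_weights(1, inner, outer))


basic = basic_block_weights(64)            # 2 * 9 * 64 * 64 = 73,728
bottleneck = bottleneck_weights(256, 64)   # 16,384 + 36,864 + 16,384 = 69,632
```

The bottleneck block handles four times as many input and output channels for slightly fewer weights than the basic block.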

Why Identity Shortcuts Are Critical for Bottleneck Blocks

Here's a mathematical argument. Consider a bottleneck block with:

Input: $c$ channels, spatial dimensions $h \times w$
Bottleneck: $c'$ channels (where $c' \ll c$ )

If the identity shortcut were replaced by a projection, it would be a single $1 \times 1$ convolution connecting the block's input directly to its output, both of which have $c$ channels:

Projection shortcut: $c \times c$ weights, applied at each of the $h \times w$ positions
Bottleneck branch: $c \cdot c' + 9c'^2 + c' \cdot c$ weights (the two $1 \times 1$ layers plus the $3 \times 3$ layer)

For identity shortcuts, we need nothing: zero parameters and no extra multiply-adds.

With the Figure 5 values ( $c = 256$ , $c' = 64$ ), the projection would add 65,536 weights to a branch of 69,632. This is what the paper means when it says the time complexity and model size are doubled, "as the shortcut is connected to the two high-dimensional ends": the shortcut spans the wide $c$-channel ends of the block, not the cheap narrow middle.

This is why parameter-free identity shortcuts are particularly important for bottleneck architectures—they keep the models efficient while still getting the optimization benefits of residual learning.
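The doubling can be checked with a short calculation. The sketch below uses the Figure 5 bottleneck (256 outer channels, 64 inner channels) and assumes the hypothetical projection shortcut is a $1 \times 1$ convolution from 256 to 256 channels. Biases are ignored and the function names are illustrative.

```python
def bottleneck_branch_weights(outer: int, inner: int) -> int:
    """Weights in the 1x1 -> 3x3 -> 1x1 residual branch."""
    return outer * inner + 9 * inner * inner + inner * outer


def projection_shortcut_weights(outer: int) -> int:
    """Weights in a 1x1 projection joining the two high-dimensional ends."""
    return outer * outer


branch = bottleneck_branch_weights(256, 64)        # 69,632
projection = projection_shortcut_weights(256)      # 65,536
growth = (branch + projection) / branch            # roughly 1.94
```

Because the same ratio applies to multiply-adds at every spatial position, both model size and time complexity come out at nearly twice the identity-shortcut version.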

Part 5: Going to Extreme Depths (50, 101, 152 Layers)

The Networks Tested

The authors build three very deep networks using the bottleneck design:

ResNet-50: Replace each 2-layer block in the 34-layer net with a 3-layer bottleneck → 50 total layers
- 3.8 billion FLOPs
ResNet-101: More bottleneck blocks added → 101 layers
- 7.6 billion FLOPs
ResNet-152: Even more bottleneck blocks → 152 layers!
- 11.3 billion FLOPs

The Stunning Result

Even the 152-layer ResNet has lower complexity than VGG-19, the previous state-of-the-art deep network:

$\text{ResNet-152: } 11.3 \text{ billion FLOPs} < \text{VGG-19: } 19.6 \text{ billion FLOPs}$

Yet it performs much better! And crucially: no degradation problem. Deeper networks continue to improve.

Looking at Tables 3 and 4, the 50/101/152-layer ResNets show "considerable margins" of improvement over the 34-layer versions. Every deeper network improves on every metric.

Part 6: Comparison with State-of-the-Art

Single Model Results

Table 4 compares ResNets with previous best methods. The 152-layer ResNet achieves:

Single-model top-5 validation error: 4.49%

This beats all previous ensemble results—methods that combined multiple models—using just a single model!

Ensemble Results

The authors combine six models (including two 152-layer ResNets) to form an ensemble. This achieves:

3.57% top-5 error on test set

This wins 1st place in ILSVRC 2015, one of the most prestigious computer vision competitions at the time.

Key Takeaways

Concept	Insight
The Problem	Plain networks suffer from degradation: deeper nets have higher training error, suggesting optimization difficulty (not capacity or gradient vanishing)
The Solution	Residual shortcuts enable much deeper networks by learning residuals rather than absolute functions
Shortcut Type	Identity shortcuts (free!) work as well as learned projections, especially important for bottleneck designs
Scalability	Bottleneck design allows extreme depths (152 layers) with less computation than shallower VGG networks
Empirical Result	ResNets achieve state-of-the-art results, winning ILSVRC 2015 with networks up to 8× deeper than VGG nets

The fundamental insight: making the optimization problem easier matters more than making the network deeper. Shortcuts make optimization easier, which allows us to use the additional representational capacity of deeper networks effectively.

4.2. CIFAR-10 and Analysis

p.7

We conducted more studies on the CIFAR-10 dataset [20], which consists of 50k training images and 10k testing images in ...

Section 4.2: CIFAR-10 and Analysis — Detailed Explanation

Big Picture: Why This Section Matters

In the previous ImageNet section, the authors demonstrated that residual networks (ResNets) solve the "degradation problem" (where deeper networks train worse than shallower ones) on a large-scale dataset. However, a critical question remains: Is this a general phenomenon, or specific to ImageNet?

Section 4.2 answers this by testing on CIFAR-10, a smaller, simpler dataset. More importantly, it goes beyond just reporting numbers—it provides mechanistic analysis of why ResNets work. The authors investigate:

Whether the degradation problem appears on different datasets
How deep networks can actually get (up to 1202 layers!)
What the residual functions are actually learning by analyzing layer response magnitudes
The limits of ultra-deep networks and the overfitting problem

This section is crucial because it suggests the benefits of residual learning are fundamental and generalizable, not just lucky artifacts of ImageNet.

Part 1: CIFAR-10 Experimental Setup

Dataset and Architecture Overview

CIFAR-10 is a much smaller dataset than ImageNet:

Training images: 50,000 (vs. 1.28 million for ImageNet)
Test images: 10,000 (vs. 50,000 for ImageNet)
Classes: 10 (vs. 1,000 for ImageNet)
Image size: 32×32 pixels (vs. 224×224 for ImageNet)

The authors deliberately use simple architectures on this dataset because they want to study optimization behavior rather than achieve state-of-the-art performance. This is a smart choice: simpler architectures make it easier to isolate the effect of depth from other architectural choices.

Architecture Details: The Parameter $n$

The network architecture is parameterized by a single variable $n$ , which controls depth:

Architecture formula:

First layer: a single 3×3 convolution
Then: $6n$ layers of 3×3 convolutions organized in three stages
Final layers: global average pooling → 10-way fully-connected layer → softmax

Feature map sizes and filters:

Stage 1: $2n$ layers operating on 32×32 feature maps with 16 filters
Stage 2: $2n$ layers operating on 16×16 feature maps with 32 filters
Stage 3: $2n$ layers operating on 8×8 feature maps with 64 filters

Total depth: $6n + 2$ weighted layers

So for $n = 3$ : we get $6(3) + 2 = 20$ layers; for $n = 9$ : we get $6(9) + 2 = 56$ layers.

The authors test $n \in \{3, 5, 7, 9\}$ , yielding networks of 20, 32, 44, and 56 layers—then push much further with $n = 18$ (110 layers) and $n = 200$ (1202 layers).
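The depth formula and stage layout can be written as a small helper. This is a sketch of the bookkeeping only, not a network definition; the function names and the tuple layout `(feature_map_size, filters, layers)` are assumptions made for illustration.

```python
from typing import List, Tuple


def cifar_resnet_depth(n: int) -> int:
    """Weighted layers: first conv + 6n stacked convs + final fc layer."""
    return 6 * n + 2


def cifar_stage_plan(n: int) -> List[Tuple[int, int, int]]:
    """Return (feature_map_size, filters, layers) for the three stages."""
    return [(32, 16, 2 * n), (16, 32, 2 * n), (8, 64, 2 * n)]


# Depths for the values of n mentioned in the text.
depths = {n: cifar_resnet_depth(n) for n in (3, 5, 7, 9, 18, 200)}
```

Evaluating `depths` reproduces the depths listed above: 20, 32, 44, 56, 110, and 1202 layers.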

Why This Design is Smart

Parameterization benefit: By varying only $n$ , the authors ensure:

Network width (number of filters per layer) stays constant across different depths
Computational complexity per layer stays similar
The only variable is depth—exactly what they want to study

This is methodologically superior to just arbitrarily building different-sized networks, because it isolates the effect of depth.

Shortcut Connection Placement

For ResNets on CIFAR-10:

Shortcuts connect pairs of 3×3 convolutional layers (not individual layers)
Total number of shortcuts: $3n$
Type: Identity shortcuts only (option A from earlier discussion)
- This means no extra parameters compared to plain networks

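The identity shortcut means: if input $x$ goes through two 3×3 convolutions (the residual function $\mathcal{F}(x)$ ), the output is:

$y = \mathcal{F}(x) + x$

Within a stage, $x$ and $\mathcal{F}(x)$ have identical dimensions, so the addition is element-wise. At the two stage transitions, where the feature map halves and the filter count doubles, option A matches dimensions by zero-padding the extra channels, which still adds no parameters.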

Part 2: Training Procedure and Hyperparameters

The training setup is critical for understanding the results:

Regularization:

Weight decay: $\lambda = 0.0001$ (the regularization parameter that penalizes large weights)
No dropout used (they're relying on architecture, not stochastic regularization)
Batch normalization (BN) applied after each convolution

Optimization:

Optimizer: Stochastic Gradient Descent (SGD) with momentum 0.9
Mini-batch size: 128 (distributed across 2 GPUs)
Initial learning rate: $\alpha_0 = 0.1$
Learning rate schedule:
- Divide by 10 at 32k iterations: $\alpha = 0.01$
- Divide by 10 at 48k iterations: $\alpha = 0.001$
- Stop at 64k iterations
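The step schedule above can be expressed as a function of the iteration count. This is a minimal sketch: the function name is hypothetical, and treating the drop as taking effect exactly at iterations 32,000 and 48,000 is an assumption, since the paper does not specify the boundary convention.

```python
TOTAL_ITERATIONS = 64_000  # training stops here


def learning_rate(iteration: int, base_lr: float = 0.1) -> float:
    """Step schedule: divide the rate by 10 at 32k and again at 48k."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    if iteration < 32_000:
        return base_lr
    if iteration < 48_000:
        return base_lr / 10
    return base_lr / 100
```

For example, `learning_rate(40_000)` falls in the middle phase and returns one tenth of the base rate.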

Data augmentation (for training only):

Pad each 32×32 image with 4 pixels on all sides → 40×40 image
Randomly crop back to 32×32
Randomly flip horizontally
Testing: Use original 32×32 images with no augmentation
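The training-time augmentation can be sketched on a single-channel image stored as a list of rows. Several details here are assumptions: the padding value is taken to be zero, the image is assumed square, the flip probability is taken to be 0.5, and the names `pad` and `augment` are illustrative.

```python
import random
from typing import List

Image = List[List[float]]  # one channel, stored as rows of pixels


def pad(image: Image, border: int = 4) -> Image:
    """Surround the image with `border` zero pixels on every side."""
    width = len(image[0])
    blank = [0.0] * (width + 2 * border)
    rows = [[0.0] * border + list(row) + [0.0] * border for row in image]
    top = [list(blank) for _ in range(border)]
    bottom = [list(blank) for _ in range(border)]
    return top + rows + bottom


def augment(image: Image, rng: random.Random, border: int = 4) -> Image:
    """Pad, take a random crop of the original size, then maybe flip."""
    size = len(image)  # assumes a square image, e.g. 32x32
    padded = pad(image, border)
    top = rng.randint(0, 2 * border)   # inclusive range of valid offsets
    left = rng.randint(0, 2 * border)
    crop = [row[left:left + size] for row in padded[top:top + size]]
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]  # horizontal flip
    return crop
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible. At test time the function is simply not called.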

Special case for 110-layer network:

Initial learning rate of 0.1 is too large (network doesn't converge initially)
Solution: "Warm-up" with $\alpha = 0.01$ for ~400 iterations (until training error < 80%)
Then switch back to $\alpha = 0.1$ and continue with standard schedule

This warm-up strategy is pragmatic: the paper only reports that 0.1 is slightly too large for the 110-layer net to start converging, so a brief period at a smaller learning rate gets training moving before the standard schedule takes over.
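The warm-up rule is driven by the training error rather than by a fixed iteration count. A minimal sketch, assuming the training error is observed once per step and that the switch to the base rate is permanent once the threshold is crossed; the function name is hypothetical, and the later step decay at 32k and 48k is not modeled here.

```python
from typing import Iterable, List


def warmup_schedule(train_errors: Iterable[float],
                    warmup_lr: float = 0.01,
                    base_lr: float = 0.1,
                    threshold: float = 0.80) -> List[float]:
    """Use warmup_lr until the training error first drops below threshold."""
    rates = []
    warmed_up = False
    for error in train_errors:
        if error < threshold:
            warmed_up = True  # latch: never return to the warm-up rate
        rates.append(base_lr if warmed_up else warmup_lr)
    return rates
```

For the error sequence `[0.90, 0.85, 0.79, 0.82]`, the rates are `[0.01, 0.01, 0.1, 0.1]`: a later rise above 80% does not bring back the warm-up rate.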

Part 3: The Degradation Problem on CIFAR-10 — Empirical Evidence

Plain Networks: The Problem Recurs

Figure 6 (left) shows training and testing curves for plain networks of varying depths (20, 32, 44, 56 layers).

Key observation: As networks get deeper, both training and testing error increase—exactly like on ImageNet:

20-layer plain net: low training error
56-layer plain net: noticeably higher training error

Why this matters: This is the degradation problem in action. Notice it's not just overfitting (testing error being worse than training error)—the problem is that training error itself gets worse. This rules out the simple explanation "we're overfitting because the model is too big."

The paper notes that plain-110 has an error above 60%, which is so high that its curve is not displayed.

ResNets: The Solution

Figure 6 (middle) shows the same networks converted to ResNets (by adding identity shortcuts).

The transformation is striking:

20-layer ResNet vs. 56-layer ResNet: The 56-layer version reaches lower training and testing error
Training error decreases with depth, not increases
Testing error improves with increased depth (with diminishing returns)

The 110-layer ResNet:

Despite being 110 layers deep, it converges well with the warm-up learning rate strategy
Achieves 6.43% test error with fewer parameters than competing methods like FitNet and Highway networks

This directly parallels the ImageNet findings: residual connections enable optimization of very deep networks.

Part 4: Analysis of Layer Responses — Understanding What ResNets Learn

This is where the section becomes particularly insightful. Rather than just showing that ResNets work, the authors investigate how they work by analyzing the magnitudes of layer responses.

What Are "Layer Responses"?

For each layer in the network:

Compute the output of a 3×3 convolutional layer
Apply batch normalization (BN)
Before applying the activation function (ReLU), measure the output

Call this post-BN, pre-activation output the "response" of that layer.

Why measure this? The response magnitude tells us how much each layer is contributing to the computation. A small response means the layer is doing minimal modification to the input signal; a large response means substantial transformation.

The Key Finding: ResNet Responses Are Smaller

Figure 7 plots the standard deviation (std) of layer responses:

$\sigma_i = \sqrt{\mathbb{E}[(h_i - \mu_i)^2]}$

where $h_i$ is the response of layer $i$ and $\mu_i$ is its mean. The standard deviation measures the typical magnitude of activations in that layer.
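The statistic itself is simple to compute once the responses of a layer have been collected. The sketch below assumes the post-BN, pre-ReLU outputs of one layer are available as a flat sequence of numbers and uses the population form of the standard deviation, matching the expectation in the formula; the function name is illustrative.

```python
import math
from typing import Sequence


def response_std(responses: Sequence[float]) -> float:
    """Population standard deviation of one layer's responses."""
    count = len(responses)
    if count == 0:
        raise ValueError("need at least one response value")
    mean = sum(responses) / count
    variance = sum((h - mean) ** 2 for h in responses) / count
    return math.sqrt(variance)
```

Computing this value for every 3×3 layer and plotting it against layer index gives a curve of the kind shown in Figure 7.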

Main empirical observation:

Plain networks: Layer responses have larger magnitudes
ResNets: Layer responses have substantially smaller magnitudes

Why this matters: Recall the core idea of residual learning from Section 3.1 of the paper:

Instead of fitting the desired mapping directly: $y = \mathcal{H}(x)$ (some arbitrary transformation)

Learn: $y = x + \mathcal{F}(x)$ (an identity plus a residual $\mathcal{F}(x) = \mathcal{H}(x) - x$ )

The layer response analysis supports this motivation. In ResNets, the residual function $\mathcal{F}(x)$ tends to be small; in the paper's words, the residual functions "might be generally closer to zero than the non-residual functions".

(This is an observed trend in response magnitudes, not a guarantee that $\|\mathcal{F}(x)\| \ll \|x\|$ holds for every layer.)

Deeper Networks Have Even Smaller Responses

The paper observes an interesting trend:

ResNet-20: moderate response magnitudes
ResNet-56: smaller magnitudes
ResNet-110: even smaller magnitudes

Interpretation: As networks get deeper, each individual layer modifies the signal less. This makes intuitive sense: with many layers available, no single layer needs to make a drastic change. The accumulated effect of many small modifications compounds to useful feature learning.

This is reminiscent of biological neural systems, where individual neurons make small contributions that combine into complex computations.

Part 5: Pushing to Extremes — 1202-Layer Networks

Experimental Setup

The authors set $n = 200$ , creating a 1202-layer network:

Total parameters: 19.4 million
Architecture otherwise identical to the smaller networks

What They Found

Training behavior:

The network trains successfully with no optimization difficulties
Training error drops below 0.1%—essentially memorizing the training set
The optimization path is smooth (Figure 6, right)

Testing behavior:

Test error: 7.93% (reasonable, but worse than the 110-layer net's 6.43%)

The Overfitting Problem: A New Challenge

This is a fascinating turn. For the first time, we see residual learning solve one problem (optimization) but reveal another (overfitting):

The degradation problem (solved by ResNets):

Symptom: training error increases with depth
Cause: optimization difficulty
Solution: residual connections ease optimization

The overfitting problem (revealed at 1202 layers):

Symptom: test error increases despite (or because of) improved training error
Cause: model becomes unnecessarily large relative to the dataset size
The model has 19.4M parameters for a 50k-image dataset
No dropout or strong regularization being used

Why this matters: The 1202-layer network has enough capacity to memorize the training examples, but CIFAR-10 is too small for that memorization to transfer to test data. The authors argue that the network may be unnecessarily large for this small dataset, so it fits training-specific noise rather than generalizable features.

The Authors' Honest Assessment

Rather than trying to patch this with ad-hoc tricks, they explicitly state:

The likely problem is overfitting, not optimization
Stronger regularization (maxout, dropout) may improve results
They use neither here, relying on deep and thin architectures by design, "without distracting from the focus on the difficulties of optimization"
Combining residual nets with stronger regularization is left as future work

This intellectual honesty is valuable: they're demonstrating that their core contribution (residual learning for optimization) doesn't automatically solve all problems. Generalization (preventing overfitting) is a separate challenge requiring additional techniques.

Part 6: Synthesis — What This Section Establishes

Evidence for Generalizability

By moving from ImageNet to CIFAR-10, the authors show:

The degradation problem is fundamental, not dataset-specific
- Plain networks fail similarly on both large and small datasets
- It's a core optimization challenge of deep learning, not an artifact of ImageNet
Residual learning addresses the optimization difficulty in both settings
- Works across different dataset sizes, image dimensions, and network widths
- The mechanism appears to be the same everywhere
Layer response analysis provides mechanistic insight
- Validates the theoretical motivation
- Explains why residual learning works (enables learning small modifications)

The Practical Implications

Residual networks enable training of unprecedentedly deep models (1202 layers tested here)
The core optimization problem is solved—networks train successfully without degradation
But deep learning practitioners must still manage overfitting through other means (regularization, architecture design, dataset size)

Remaining Questions (Acknowledged)

How do we prevent overfitting in extremely deep networks on small datasets?
Is there a fundamental limit to network depth on a given dataset size?
Can we design networks that are both optimizable and well-regularized?

These questions point toward open research directions, which is honest science.

Summary Table: Plain vs. ResNet Performance

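Depth	Plain Net	ResNet (Table 6)	Improvement
20 layers	Not tabulated; see Fig. 6 (left)	8.75% test error	Shallowest ResNet, reference point
56 layers	Not tabulated; higher error than plain-20 in Fig. 6 (left)	6.97% test error	Depth helps the ResNet and hurts the plain net
110 layers	>60% error (not displayed)	6.43% test error	Succeeds where plain fails
1202 layers	N/A	7.93% test error	Works but overfits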

Key Takeaways

Residual connections solve the optimization problem that prevents training of very deep networks
The mechanism: Enabling each layer to learn small residual modifications rather than full transformations
Empirical validation: Layer response magnitudes confirm residual functions remain small
Limitations: Optimization is one problem; generalization is another. The section honestly addresses where residual learning ends and where other techniques are needed
Generalizability: The phenomenon appears across different datasets and scales, suggesting it's a fundamental property of deep neural network training

4.3. Object Detection on PASCAL and MS COCO

p.8

Our method has good generalization performance on other recognition tasks. Table 7 and 8 show the object detection basel...

Section 4.3: Object Detection on PASCAL and MS COCO

Big Picture: Why This Section Matters

Up to this point in the paper, the authors have demonstrated that ResNets work remarkably well for image classification—specifically on ImageNet. But a natural question arises: Does this improvement only apply to classification, or do the better representations learned by ResNets transfer to other vision tasks?

This section answers that question by showing that ResNets provide significant improvements on object detection—a fundamentally different problem from classification. Instead of just asking "what is in this image?", object detection asks "what objects are in this image, and where are they?" This is much harder computationally and requires richer learned representations.

The key insight: the 28% relative improvement on the COCO dataset (mentioned in the abstract) comes solely from using better feature representations. Everything else about the detection pipeline stays the same.

The Experimental Setup

What is Faster R-CNN?

The authors use Faster R-CNN as their detection framework. You don't need to understand all details, but the key idea is:

Region Proposal Network (RPN): Generates candidate regions that might contain objects
Feature Extraction: Extracts learned features from these regions using a deep network
Classification & Bounding Box Regression: Classifies what object is in each region and refines the box coordinates

The crucial point: Steps 2 and 3 depend entirely on the quality of the learned representations from the deep network used in step 2.

The Comparison: VGG-16 vs ResNet-101

The authors make a controlled comparison:

VGG-16: The previous state-of-the-art backbone network, with 16 weight layers (cited as [41] in the paper)
ResNet-101: The new residual network with 101 layers

Critical detail: Everything else is kept identical. The detection pipeline, hyperparameters, training procedure—all the same. This means any performance difference comes purely from the network architecture and the representations it learns.

The Results and What They Mean

Datasets and Metrics

The authors evaluate on three benchmark datasets:

PASCAL VOC 2007 and 2012: Classical object detection benchmarks with 20 object classes
MS COCO: A much more challenging dataset with 80 object categories and images that typically contain multiple objects

The Key Metric: mAP

The standard metric for object detection is mean Average Precision (mAP). While we won't derive it in detail, here's the intuition:

Average Precision (AP) for a single object class measures: "How well does the detector find objects of this class at different confidence thresholds?"
mAP is the average of AP across all object classes

The notation $\text{mAP}@[.5, .95]$ means:

The detection is considered "correct" if the Intersection over Union (IoU) between predicted and ground-truth bounding boxes exceeds certain thresholds
We compute mAP at IoU thresholds of 0.5, 0.55, 0.6, ..., 0.95 and average them
This is more rigorous than older metrics that only used IoU = 0.5
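The IoU test and the threshold sweep behind the $[.5, .95]$ notation can be sketched as follows. This covers only the box-matching criterion, not the full average-precision computation, and it assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates and that a match requires IoU greater than or equal to the threshold. The names are illustrative.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Ten thresholds: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS: List[float] = [round(0.50 + 0.05 * k, 2) for k in range(10)]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def thresholds_passed(pred: Box, truth: Box) -> List[float]:
    """Thresholds at which this prediction would count as correct."""
    score = iou(pred, truth)
    return [t for t in IOU_THRESHOLDS if score >= t]
```

A prediction with IoU 0.8 counts as correct at seven of the ten thresholds, which is why tight localization matters much more under this metric than under an IoU = 0.5 criterion alone.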

The Numbers: Understanding the Improvement

The paper states:

"we obtain a 6.0% increase in COCO's standard metric (mAP@[.5, .95]), which is a 28% relative improvement"

Let's parse this mathematically. If the baseline (VGG-16) achieves mAP = $m_{\text{VGG}}$ and ResNet-101 achieves mAP = $m_{\text{ResNet}}$ :

$\text{Absolute improvement} = m_{\text{ResNet}} - m_{\text{VGG}} = 6.0\%$

$\text{Relative improvement} = \frac{m_{\text{ResNet}} - m_{\text{VGG}}}{m_{\text{VGG}}} = 28\%$

From these two equations, we can solve for the baseline:

$\frac{6.0\%}{m_{\text{VGG}}} = 0.28 \implies m_{\text{VGG}} \approx 21.4\%$

So the rounded figures imply a VGG-16 baseline of roughly 21.4% mAP and a ResNet-101 result of roughly 27.4% on COCO. Table 8 of the paper reports the actual values as 21.2% and 27.2%, and $6.0 / 21.2 \approx 28\%$ , so the back-of-the-envelope estimate agrees up to rounding.

Why is the relative improvement so large (28%) while the absolute improvement is modest (6%)? Because object detection is harder than classification—the baseline accuracies are lower. A 6-point improvement on a 21% baseline is proportionally much larger than a similar 6-point improvement would be on, say, a 90% classification accuracy.

Why This Result Matters: Generalization

What the authors are demonstrating:

The learned representations from ResNets are better for tasks beyond classification as well. Here's why this is significant:

Classification (ImageNet task): Summarize an entire image into one label
Object Detection (PASCAL/COCO tasks): Locate and classify multiple objects

These are fundamentally different tasks with different computational requirements. Yet the same backbone network (the feature extractor) helps both.

This suggests that ResNets learn more robust, general-purpose visual features that capture object structure, textures, spatial relationships, etc. at multiple scales—the same features useful for detecting objects as for classifying them.

The Transfer Learning Principle

This demonstration of transfer learning is crucial for the field:

A network trained on ImageNet classification can be used as a feature extractor for other tasks
Better ImageNet networks lead to better feature extractors for downstream tasks
The gap between VGG (2014) and ResNet (2015) was large enough to benefit even specialized architectures like Faster R-CNN

Connecting to Earlier Sections

Recall from Section 4.1 and 4.2:

ResNet-101 was built using the bottleneck architecture (1×1-3×3-1×1 convolutions) described in Section 4.1
The identity shortcuts (option A/B from the earlier ablations) enabled these very deep networks to train effectively
On CIFAR-10, the residual branches show small response magnitudes (as shown in Fig. 7), suggesting that each layer modifies the signal conservatively

These properties are what make such deep networks trainable, and the paper attributes the detection gains to the resulting learned representations. It does not claim that small residuals in themselves make features better for detection.

Summary of Key Points

Aspect	Details
Main Question	Do better ImageNet networks help with object detection?
Experiment	Same Faster R-CNN pipeline with different backbones (VGG-16 vs ResNet-101)
Result	6.0% absolute improvement (28% relative) on COCO mAP@[.5, .95]
Interpretation	ResNet representations are more useful for downstream vision tasks
Implication	Better backbone networks → better performance on diverse tasks

This section provides empirical validation that ResNets don't just win ImageNet—they provide more useful features for the entire computer vision ecosystem.

A. Object Detection Baselines

p.10

In this section we introduce our detection method based on the baseline Faster R-CNN [32] system. The models are initial...

Object Detection Baselines - Detailed Explanation

Big Picture: Why This Matters

The core innovation of the ResNet paper is showing that residual networks can be trained much deeper than traditional CNNs. However, the authors need to demonstrate that this benefit transfers beyond image classification to other important computer vision tasks. This section shows how ResNets can be adapted for object detection—a much harder problem than classification.

Object detection requires not just identifying what is in an image, but also where it is. The key insight here is that better feature representations (learned by deeper ResNets) can dramatically improve detection performance.

Section Overview: Architecture Adaptation

The Core Challenge

VGG-16 (the previous state-of-the-art backbone network) has hidden fully connected (fc) layers at the end, which process each region's features. ResNet-101 has no hidden fc layers: apart from its final classifier, it is purely convolutional. This creates an architectural mismatch that the authors must address for object detection.

The Solution: Networks on Conv Feature Maps (NoC)

Key Idea: Instead of converting conv features to fully connected layers, keep everything convolutional and let subsequent detection networks adapt to this structure.

Implementation Details:

The authors use layers with stride ≤ 16 pixels on the original image to generate shared feature maps:

conv1: First convolutional layer
conv2_x, conv3_x, conv4_x: Residual blocks (91 total convolutional layers in ResNet-101)
These produce feature maps with a total stride of 16 pixels (meaning adjacent positions in the feature map are 16 pixels apart in the original image)

This matches VGG-16's approach of using 13 conv layers to produce 16-pixel stride feature maps, allowing a fair comparison.

What happens after this?

For each proposed object region:

RoI Pooling is performed on the shared feature maps (this extracts a fixed-size feature for each region proposal)
conv5_x layers (and any subsequent layers) process these region features
These play the role that VGG-16's fc layers played—refining region-specific features

The final output has two parallel branches instead of one:

Classification branch: Determines object category
Box regression branch: Refines the bounding box coordinates
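The RoI pooling step described above turns a region of any size into a fixed-size grid. The sketch below does this for a single-channel feature map stored as a list of rows, using max pooling over bins with floor and ceiling boundaries in the style of Fast R-CNN. The binning rule, the integer region coordinates, and the function names are assumptions made for illustration; the paper does not spell out these details.

```python
from typing import List

FeatureMap = List[List[float]]


def _bin(start: int, length: int, bins: int, index: int) -> range:
    """Rows or columns covered by bin `index` (floor start, ceiling end)."""
    lo = start + (index * length) // bins
    hi = start + -(-((index + 1) * length) // bins)  # ceiling division
    return range(lo, hi)


def roi_max_pool(features: FeatureMap, top: int, left: int,
                 height: int, width: int, out_size: int) -> FeatureMap:
    """Max-pool the region into an out_size x out_size grid."""
    if height < 1 or width < 1 or out_size < 1:
        raise ValueError("region and output size must be positive")
    return [
        [max(features[r][c]
             for r in _bin(top, height, out_size, i)
             for c in _bin(left, width, out_size, j))
         for j in range(out_size)]
        for i in range(out_size)
    ]
```

Whatever the size of the proposed region, the output grid has the same shape, which is what lets the conv5_x layers process every region with the same weights.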

Batch Normalization (BN) Handling During Fine-tuning

The Technical Issue

When you train on ImageNet (a large, diverse dataset) and then fine-tune on detection data (typically smaller datasets like COCO), the statistics that Batch Normalization learned might not be appropriate anymore.

Batch Normalization recap: For each layer $i$ , BN computes:

$\hat{x}^{(i)} = \frac{x^{(i)} - \mu_{\text{batch}}^{(i)}}{\sqrt{\sigma_{\text{batch}}^{(i)^2} + \epsilon}}$

where:

$x^{(i)}$ = the activation value at layer $i$
$\mu_{\text{batch}}^{(i)}$ = mean of activations in the current mini-batch
$\sigma_{\text{batch}}^{(i)^2}$ = variance of activations in the current mini-batch
$\epsilon$ = small constant for numerical stability

Then it scales and shifts:

y^{(i)} = \gamma^{(i)} \hat{x}^{(i)} + \beta^{(i)}

where $\gamma^{(i)}$ and $\beta^{(i)}$ are learned parameters.

The Authors' Solution: Freeze BN Statistics

Rather than continue updating BN statistics during fine-tuning, they:

Compute the BN statistics (means and variances) for each layer on the ImageNet training set after pre-training
Fix the BN layers during detection fine-tuning (the statistics $\mu$ and $\sigma$ are not updated)
As a result, each BN layer acts as a linear activation with a constant scale and offset

Why? The authors state that they fix the BN layers mainly to reduce memory consumption during Faster R-CNN training, which is computationally expensive. The paper does not analyze why this works well; plausible reasons include:

The ImageNet statistics are estimated from far more images than a detection mini-batch of one or two images per GPU can provide
The detection datasets consist of natural images with broadly similar statistics to ImageNet
The convolutional weights and the new classification/regression layers remain trainable, so the network can still adapt to detection
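Once the statistics are fixed, normalization followed by scale-and-shift collapses into a single per-channel linear map $y = a x + b$. The sketch below checks this numerically; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math

def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    """Normalize one activation with the given statistics, then scale and shift."""
    x_hat = (x - mean) / math.sqrt(var + eps)
    return gamma * x_hat + beta

def fold_frozen_bn(mean, var, gamma, beta, eps=1e-5):
    """With fixed statistics, BN is equivalent to y = a * x + b."""
    a = gamma / math.sqrt(var + eps)
    b = beta - a * mean
    return a, b

# Hypothetical per-channel values, for illustration only.
mean, var, gamma, beta = 0.4, 2.25, 1.5, -0.2
a, b = fold_frozen_bn(mean, var, gamma, beta)
max_gap = max(abs(batch_norm(x, mean, var, gamma, beta) - (a * x + b))
              for x in (-1.0, 0.0, 3.0))
print(max_gap)  # numerically zero: both forms give the same output
```

Because no batch statistics need to be computed or stored for the backward pass, this form is cheaper in memory than a BN layer in training mode.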

Results: PASCAL VOC Dataset

Dataset Setup:

PASCAL VOC 2007 test: Train on the 5k trainval images from VOC 2007 + the 16k trainval images from VOC 2012 ("07+12")
PASCAL VOC 2012 test: Train on the 10k trainval+test images from VOC 2007 + the 16k trainval images from VOC 2012 ("07++12")

Key Metric: Mean Average Precision (mAP), a standard detection metric: the mean, over all object categories, of each category's average precision (AP)

Results from Table 7:

ResNet-101 improves mAP by more than 3 percentage points over VGG-16
This improvement comes purely from better learned features, since the detection system itself is identical

Interpretation: Deeper, residual-learning-based features are more discriminative for localization and classification tasks.

Results: MS COCO Dataset

Dataset Overview

MS COCO is significantly more challenging than PASCAL VOC:

80 object categories (vs. 20 in VOC)
80k training images, 40k validation images
Objects are smaller and more diverse

Metrics

The authors report two evaluation metrics:

mAP@IoU=0.5: Measures detection accuracy at a single Intersection-over-Union (IoU) threshold of 0.5 (the PASCAL VOC metric)
- IoU = $\frac{\text{Area of Overlap}}{\text{Area of Union}}$ between predicted and ground-truth boxes
- This is more lenient; boxes just need rough overlap
mAP@[0.5:0.05:0.95]: The "standard COCO metric"
- Averages mAP across 10 different IoU thresholds: 0.50, 0.55, 0.60, ..., 0.95
- This is much stricter; boxes must be very precisely localized
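The two metrics differ only in which IoU thresholds a detection must clear. The sketch below computes IoU for one hypothetical pair of boxes and lists the COCO thresholds at which that detection would still count as a match. The boxes and names are assumptions for illustration; a full mAP computation also involves ranking detections and integrating precision over recall, which is omitted here.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# The 10 thresholds 0.50, 0.55, ..., 0.95 used by the COCO metric.
COCO_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]

predicted = (10, 10, 110, 110)      # hypothetical predicted box
ground_truth = (20, 20, 120, 120)   # hypothetical ground-truth box
overlap = iou(predicted, ground_truth)
matched_at = [t for t in COCO_THRESHOLDS if overlap >= t]
print(round(overlap, 3), matched_at)
```

A box shifted by only 10 pixels in each direction already fails the six strictest thresholds, which is why the averaged metric rewards precise localization.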

Key Results from Table 8

Metric	ResNet-101	VGG-16	Improvement
mAP@0.5	48.4%	41.5%	+6.9 points
mAP@[0.5:0.95]	27.2%	21.2%	+6.0 points

Why This Matters: A Critical Observation

The absolute improvements are nearly identical across both metrics:

mAP@0.5: +6.9%
mAP@[0.5:0.95]: +6.0%

Why is this significant?

Better classification alone would mostly help the lenient metric, because the strict metric also demands tight boxes. Getting nearly equal improvements across strict and lenient metrics therefore suggests that:

ResNet-101 improves both recognition quality and spatial localization accuracy.

This is crucial evidence that deeper networks learn representations that help with both:

Semantic understanding (what object is present)
Geometric precision (where exactly it is located)

Training Details for COCO

To handle the computational demands of training on 80k images:

8-GPU training (parallel processing on 8 GPUs)
RPN mini-batch: 8 images total (1 per GPU)
Fast R-CNN mini-batch: 16 images total (2 per GPU)
Learning rate schedule (used for both the RPN step and the Fast R-CNN step):
- $0.001$ for 240k iterations
- then $0.0001$ for 80k iterations
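The schedule above is a simple two-stage step function. A minimal sketch, where the function name `learning_rate` is hypothetical and the iteration counts are the ones listed above:

```python
def learning_rate(iteration):
    """Step schedule: 0.001 for the first 240k iterations, then 0.0001 for 80k more."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    if iteration < 240_000:
        return 0.001
    if iteration < 320_000:
        return 0.0001
    raise ValueError("the schedule ends after 320k iterations")

print(learning_rate(0), learning_rate(240_000))
```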

The Bigger Picture: Transfer Learning Success

This section provides strong evidence for a crucial principle in deep learning:

Good features learned on one task (ImageNet classification) transfer well to another task (object detection), provided the backbone network is powerful enough.

The 28% relative improvement in the strictest COCO metric demonstrates that ResNets:

Learn more discriminative feature hierarchies
Capture both low-level details (for precise localization) and high-level semantics (for classification)
Generalize across different computer vision tasks
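For reference, the 28% figure is the relative gain on the COCO metric mAP@[.5, .95], computed from the values the paper reports in Table 8 (21.2% for VGG-16 and 27.2% for ResNet-101):

```latex
\frac{27.2 - 21.2}{21.2} \times 100\% \approx 28\%
```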

This discovery was transformative for computer vision—ResNets became the backbone of choice for nearly every downstream task.

B. Object Detection Improvements

Object Detection Improvements: Breaking Down Section B

The Big Picture

This section describes additional techniques the authors applied to improve their object detection results for competition submissions. These are practical engineering improvements built on top of the ResNet-based detection framework described in Appendix A. Think of this as the "secret sauce" — while the base ResNet architecture provides better features, these techniques show how to squeeze even more performance out of those features through clever post-processing and testing strategies.

The improvements fall into several categories:

Box refinement - making bounding box predictions more accurate
Global context - using image-wide information to improve local decisions
Multi-scale testing - evaluating at different image sizes
Ensemble methods - combining multiple models

Let me walk through each one.

1. Box Refinement: Iterative Localization

The Core Idea

In object detection, you need two things: what is the object (classification) and where is the object (localization via bounding boxes). The initial detection system produces a "regressed box" — the model's best guess at where an object is located.

The key insight: If you're wrong about where an object is, use that wrong answer to get a better answer.

How It Works Mathematically

Let's say the model produces:

Original proposal box: call this the Region of Interest (RoI)
Classification score for that box: $s_1$
Regressed bounding box: call this $\mathbf{b}_1$ (a 4-dimensional vector with coordinates)

Box refinement adds one more pass at inference time:

Pool features from the regressed box $\mathbf{b}_1$ (not the original proposal)
Run the classifier on these new features → get new score $s_2$ and new regressed box $\mathbf{b}_2$
Combine both predictions: the union of $\{\mathbf{b}_1, s_1\}$ and $\{\mathbf{b}_2, s_2\}$ gives you 300 original + 300 new predictions (600 total)
Apply Non-Maximum Suppression (NMS) with IoU threshold $\tau = 0.3$

What is NMS? A Brief Detour

Non-Maximum Suppression removes duplicate detections:

IoU (Intersection over Union) measures overlap between two boxes: $\text{IoU}(\mathbf{b}_i, \mathbf{b}_j) = \frac{\text{Area}(\mathbf{b}_i \cap \mathbf{b}_j)}{\text{Area}(\mathbf{b}_i \cup \mathbf{b}_j)}$
If two boxes have IoU > 0.3, keep only the higher-confidence one
This reduces overlapping predictions

Finally, apply box voting: For remaining boxes, average the coordinates of nearby high-confidence boxes to get a better final prediction.
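The NMS and box voting steps can be sketched as follows. This is an illustrative version, not the authors' implementation: the paper only cites prior work for box voting, so the voting rule here (a score-weighted average of all boxes with IoU of at least 0.5 with a kept box) is an assumption, as are the function names and example boxes.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(detections, threshold=0.3):
    """Greedy NMS over (box, score) pairs, highest score first."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept

def box_voting(kept, detections, vote_iou=0.5):
    """Replace each kept box by the score-weighted mean of the boxes overlapping it."""
    voted = []
    for box, score in kept:
        voters = [(b, s) for b, s in detections if iou(box, b) >= vote_iou]
        total = sum(s for _, s in voters)
        merged = tuple(sum(b[i] * s for b, s in voters) / total for i in range(4))
        voted.append((merged, score))
    return voted

# Hypothetical first-pass and refined predictions for two objects.
original = [((10, 10, 110, 110), 0.9), ((200, 200, 300, 300), 0.8)]
refined = [((14, 12, 114, 112), 0.7), ((198, 202, 298, 302), 0.85)]
union_set = original + refined   # 300 + 300 predictions in the paper, 2 + 2 here
kept = nms(union_set)            # one surviving box per object
final = box_voting(kept, union_set)
print(final)
```

Note that for the second object the refined box survives NMS because its score is higher, while for the first object the original box survives; voting then blends both versions of each box.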

Result: ~2 percentage points improvement in mAP (mean Average Precision).

2. Global Context: Combining Local and Holistic Information

The Core Idea

Each region proposal is processed independently to extract features. But objects exist in context — what objects are nearby? What does the whole scene look like? This technique adds a "global view" to each local decision.

Mathematical Description

In the Fast R-CNN module (which processes each region):

Step 1: Extract global feature

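Take the full-image convolutional feature map (the shared conv feature map computed by the backbone)
Apply global Spatial Pyramid Pooling (SPP) with a single-level pyramid, implemented as RoI pooling with the entire image's bounding box as the RoI
Feed the pooled feature through the post-RoI layers to obtain the global context feature

What does this mean? Rather than pooling features from a small region, you pool from the bounding box of the entire image. This gives you a "summary" feature vector of the whole scene.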

Step 2: Combine local and global

For each region proposal, you now have:

Local feature: $\mathbf{f}_{\text{local}} \in \mathbb{R}^d$ (from RoI pooling of that specific region)
Global feature: $\mathbf{f}_{\text{global}} \in \mathbb{R}^d$ (summary of entire image)

Concatenate them: $\mathbf{f}_{\text{combined}} = [\mathbf{f}_{\text{local}}, \mathbf{f}_{\text{global}}] \in \mathbb{R}^{2d}$

where $[\cdot, \cdot]$ denotes concatenation (stacking vectors end-to-end).

Step 3: Make predictions

Feed $\mathbf{f}_{\text{combined}}$ through the classification and box regression heads to get final predictions.

The entire system is trained end-to-end, so the network learns what global context matters.
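A toy sketch of the pooling and concatenation steps. It assumes plain max pooling to a single value per channel and skips the post-RoI layers that the real system applies before concatenation; the feature map values and the helper name `max_pool_region` are hypothetical.

```python
def max_pool_region(feature_map, y1, x1, y2, x2):
    """Max-pool each channel over rows y1..y2-1 and columns x1..x2-1."""
    return [max(channel[y][x] for y in range(y1, y2) for x in range(x1, x2))
            for channel in feature_map]

# Hypothetical feature map with 2 channels, height 3, width 4.
feature_map = [
    [[1, 2, 0, 1], [0, 5, 1, 0], [3, 1, 2, 9]],
    [[4, 0, 1, 1], [2, 2, 7, 0], [0, 1, 1, 3]],
]
height, width = 3, 4

f_local = max_pool_region(feature_map, 0, 0, 2, 2)            # one region proposal
f_global = max_pool_region(feature_map, 0, 0, height, width)  # whole image as the RoI
f_combined = f_local + f_global                               # concatenation: d + d = 2d
print(f_combined)
```

The global feature is the same for every proposal in an image, so it only needs to be computed once per image.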

Result: ~1 percentage point improvement in mAP@0.5 (the more lenient metric).

3. Multi-Scale Testing: Testing at Different Zoom Levels

The Core Idea

Objects appear at different sizes in images. A person might be 50 pixels tall in one image and 500 pixels in another. By testing at multiple image scales, you have a better chance of catching objects at their natural size.

How It Works

In the standard approach (from Appendix A), the image's shorter side is rescaled to exactly $s = 600$ pixels.

Multi-scale testing instead uses an image pyramid:

Create multiple versions of the input image at different scales: $s \in \{200, 400, 600, 800, 1000\} \text{ pixels}$

Each version has the same aspect ratio, just different sizes.

Compute feature maps using the ResNet backbone for each scale
Select adjacent pairs of scales (e.g., 600 and 800), following a technique from prior work
Pool RoI features from both scales
Merge predictions from both scales using maxout: For each feature dimension, take the maximum value across the two scales: $f_{\text{merged}}(i) = \max(f_{\text{scale1}}(i), f_{\text{scale2}}(i))$

This is intuitive: at different scales, different features become prominent, so taking the max captures useful information from both.

Make final predictions from the merged features
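The pyramid construction and the maxout merge can be sketched as below. The image size and feature values are hypothetical, and the rule for choosing which two adjacent scales to use for a given RoI follows prior work cited by the paper and is not reproduced here.

```python
SCALES = [200, 400, 600, 800, 1000]

def rescale(height, width, shorter_side):
    """Resize so the shorter side equals shorter_side, keeping the aspect ratio."""
    factor = shorter_side / min(height, width)
    return round(height * factor), round(width * factor)

def maxout(features_a, features_b):
    """Element-wise maximum of two feature vectors of equal length."""
    if len(features_a) != len(features_b):
        raise ValueError("feature vectors must have the same length")
    return [max(a, b) for a, b in zip(features_a, features_b)]

# Hypothetical 480 x 640 input image resized to each pyramid level.
pyramid = [rescale(480, 640, s) for s in SCALES]

# Hypothetical RoI features pooled from two adjacent scales (e.g. 600 and 800).
merged = maxout([0.2, 0.9, 0.1], [0.5, 0.3, 0.1])
print(pyramid, merged)
```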

Why not multi-scale training? The authors state that they did not perform multi-scale training because of limited time. Multi-scale testing alone (without retraining the model) still helps, which suggests the ResNet features are reasonably robust across scales.

Result: ~2 percentage points improvement in mAP.

4. Using Validation Data for Training

The Simple But Effective Trick

COCO provides three splits that matter here:

Train set: 80k images
Validation set: 40k images
Test-dev set: 20k images (no public labels)

Instead of only using the 80k train set, combine it with the 40k validation set (80k + 40k = 120k total) for training, then evaluate on test-dev.

This is valid because:

The test-dev ground truth is hidden, so you're not "cheating"
You're using more data, so the model learns better
You're still honestly evaluating on held-out data

Results with this approach:

Single model (with box refinement, global context, and multi-scale testing): 55.7% mAP@0.5, 34.9% mAP@[0.5, 0.95] on test-dev

This single-model result is the starting point for the ensemble described next.

5. Ensemble: Combining Multiple Models

The Core Idea

Different neural networks, even with the same architecture, will converge to different local optima due to random initialization and stochastic training. Their errors are partially independent, so averaging them reduces error.

How Ensemble Works in Faster R-CNN

Faster R-CNN has two stages that can each be ensembled:

Stage 1: Region Proposal Network (RPN)

Uses 3 different trained networks to generate region proposals independently
Collects all proposals into a union set $P = P_1 \cup P_2 \cup P_3$

Stage 2: Detection on Regions

For each proposal in $P$, run an ensemble of 3 per-region classifiers
Combine the outputs of the 3 networks (the paper does not spell out how; averaging the classification scores is a natural choice)

Under that assumption, for region $r$: $\hat{s}(r) = \frac{1}{3}\sum_{i=1}^{3} s_i(r)$

where $s_i(r)$ is the confidence score from network $i$.

Box coordinates can be averaged in the same way: $\hat{\mathbf{b}}(r) = \frac{1}{3}\sum_{i=1}^{3} \mathbf{b}_i(r)$
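A minimal sketch of the two ensemble stages, assuming simple averaging of scores and box coordinates as in the formulas above. The paper itself does not give the exact combination rule, so treat this as one plausible reading; all names and numbers are hypothetical.

```python
def union_of_proposals(*proposal_sets):
    """Stage 1: merge proposals from several RPNs, dropping exact duplicates."""
    merged = set()
    for proposals in proposal_sets:
        merged.update(proposals)
    return merged

def average_predictions(scores, boxes):
    """Stage 2: average per-network scores and (x1, y1, x2, y2) boxes for one region."""
    n = len(scores)
    mean_score = sum(scores) / n
    mean_box = tuple(sum(box[i] for box in boxes) / n for i in range(4))
    return mean_score, mean_box

# Hypothetical proposals from 3 RPNs; one proposal is shared by two networks.
p1 = [(10, 10, 110, 110), (50, 60, 90, 120)]
p2 = [(10, 10, 110, 110)]
p3 = [(200, 200, 300, 300)]
proposals = union_of_proposals(p1, p2, p3)

# Hypothetical outputs of 3 networks for the same region.
mean_score, mean_box = average_predictions(
    [0.9, 0.8, 0.7],
    [(10, 10, 110, 110), (12, 8, 108, 112), (8, 12, 112, 108)],
)
print(len(proposals), mean_score, mean_box)
```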

Why This Helps

Error reduction: Individual model errors are only partially correlated, so averaging reduces variance
Coverage: Proposals from 3 models together find more objects than any single model
Consensus: A prediction trusted by 3 models is more reliable than one from a single model

Results with ensemble of 3 networks:

mAP@0.5: 59.0% (up from 55.7%)
mAP@[0.5, 0.95]: 37.4% (up from 34.9%)

This ensemble won 1st place in COCO 2015 detection.

Applying to PASCAL VOC and ImageNet Detection

Transfer Learning Strategy

The authors take the trained COCO model (which saw 80k+40k images) and fine-tune it on other datasets:

PASCAL VOC:

Fine-tune COCO model on VOC training data
Apply box refinement, context, multi-scale techniques
Results:
- 85.6% mAP on VOC 2007
- 83.8% mAP on VOC 2012
- The VOC 2012 result is 10 percentage points higher than the previous state of the art

ImageNet Detection (200 object categories):

Pre-train on 1000-class ImageNet classification
Fine-tune on the ImageNet DET training set plus val1 (the validation set is split into two parts, val1 and val2)
Use val2 for validation
Single model: 58.8% mAP
Ensemble of 3: 62.1% mAP
Wins ILSVRC 2015 detection by 8.5 points over the second place!

The key insight: Better base features (from ResNet) + inference-time techniques (box refinement, global context, multi-scale testing) + ensembling = state-of-the-art across multiple datasets.

Summary Table: Contribution of Each Technique

Technique	Improvement	Mechanism
Box refinement	~2 points	Re-pool features from the regressed box, then apply NMS and box voting to the union of predictions
Global context	~1 point (mAP@0.5)	Concatenate full-image features with region features
Multi-scale testing	~2 points	Build a 5-scale pyramid, merge features from two adjacent scales using maxout
Ensemble (3 models)	~3 points (mAP@0.5)	Union of proposals, combined per-region predictions from 3 networks
Combined effect	~5 points (single model) + ~3 points (ensemble)	All techniques stacked together

The cascading improvements show that good representation learning (ResNet) combined with smart inference techniques produces exceptional results.

C. ImageNet Localization

Deep Dive: ImageNet Localization (Section C)

Big Picture: What's This Section About?

The ImageNet Localization task is different from pure classification. It's not just asking "what object is in this image?" but also "where exactly is it?" The task requires the network to:

Classify objects (predict their category from 1000 classes)
Localize objects (predict their bounding boxes)

This section explains how the ResNet framework adapted from object detection methods can dramatically outperform previous approaches (like VGG) on this combined classification-localization task. The key innovation is using a per-class RPN (Region Proposal Network) that learns class-specific bounding box regressors rather than using a single generic proposal mechanism.

The Problem Setup

The Task Structure

Following prior work, the authors assume a two-stage pipeline for the ImageNet Localization task:

Classification stage: Predict the object class $c$ where $c \in \{1, 2, \ldots, 1000\}$
Localization stage: Given the predicted class, predict the bounding box coordinates

This is formalized by the per-class regression (PCR) strategy: for each of the 1000 classes, train a separate bounding box regressor. This makes sense intuitively—a cat might typically appear in different image positions or sizes than a car, so learning class-specific patterns helps.

Key Difference from Detection

In the previous object detection section, the RPN was category-agnostic—it generated proposals without knowing what object class they contained. Here, we make it category-aware by making the RPN per-class.

The Architecture: Per-Class RPN

Standard RPN (from Faster R-CNN)

Before explaining the modification, recall that a standard RPN ends with:

A classification layer: predicts "object" vs "background"
A regression layer: predicts bounding box coordinates

The Per-Class Modification

The authors replace these with per-class versions:

Classification layer output dimension: Instead of binary (object/no-object), we have a 1000-dimensional output vector:

\text{cls}_{\text{output}} \in \mathbb{R}^{1000}

where each dimension $i$ performs binary logistic regression to predict "is this object class $i$ or not?"

Regression layer output dimension: Instead of 4 coordinates per anchor, we have:

\text{reg}_{\text{output}} \in \mathbb{R}^{1000 \times 4}

This contains $1000 \times 4 = 4000$ values total. For each of the 1000 classes, we learn a separate 4-dimensional bounding box regressor $[dx, dy, dw, dh]$ (change in x-position, y-position, width, and height relative to an anchor box).
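A small sketch of how the per-class outputs at one anchor might be read. The flat, class-major layout of the regression vector (4 consecutive values per class) is an assumption made for illustration; the paper specifies only the output dimensions. All names and values are hypothetical.

```python
import math

NUM_CLASSES = 1000

def sigmoid(z):
    """Logistic function used for each independent per-class binary classifier."""
    return 1.0 / (1.0 + math.exp(-z))

def per_class_outputs(cls_logits, reg_output, class_index):
    """Return (probability, [dx, dy, dw, dh]) for one class at one anchor."""
    if len(cls_logits) != NUM_CLASSES or len(reg_output) != 4 * NUM_CLASSES:
        raise ValueError("unexpected output dimensions")
    probability = sigmoid(cls_logits[class_index])
    start = 4 * class_index
    return probability, reg_output[start:start + 4]

# Hypothetical network outputs for a single anchor.
cls_logits = [0.0] * NUM_CLASSES
cls_logits[7] = 2.0
reg_output = [0.0] * (4 * NUM_CLASSES)
reg_output[28:32] = [0.1, -0.2, 0.05, 0.0]

probability, offsets = per_class_outputs(cls_logits, reg_output, class_index=7)
print(round(probability, 3), offsets)
```

Because each class has its own sigmoid, the 1000 probabilities are independent and need not sum to one, unlike a softmax classifier.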

Why Anchor Boxes Matter

The bounding box regression is defined relative to anchor boxes. At each spatial location in the feature map, there are multiple predefined "anchor" boxes of different aspect ratios and scales. These are translation-invariant templates. Following the Faster R-CNN parameterization, for an anchor with center $(x_a, y_a)$, width $w_a$, and height $h_a$, the regression outputs predict:

x = x_a + w_a \, dx, \quad y = y_a + h_a \, dy, \quad w = w_a \, e^{dw}, \quad h = h_a \, e^{dh}

where the offsets $[dx, dy, dw, dh]$ are learned to adjust the anchor to match the ground truth.
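A minimal sketch of how such offsets turn an anchor into a predicted box, assuming the Faster R-CNN parameterization that the paper follows for anchor regression: center shifts are scaled by the anchor size, and width and height are adjusted in log space. The anchor and offset values are hypothetical.

```python
import math

def decode_box(anchor, offsets):
    """Apply (dx, dy, dw, dh) to an anchor given as (center_x, center_y, width, height)."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = offsets
    return (ax + aw * dx, ay + ah * dy, aw * math.exp(dw), ah * math.exp(dh))

anchor = (100.0, 100.0, 64.0, 128.0)
# Shift right by a quarter of the width, up by an eighth of the height, halve the height.
box = decode_box(anchor, (0.25, -0.125, 0.0, math.log(0.5)))
print(box)
```

Normalizing by the anchor size makes the same offset values meaningful for small and large anchors alike, and all-zero offsets return the anchor unchanged.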

Training Details

Data Augmentation

As in ImageNet classification (referenced from Section 3.4), the network is trained with:

Random crops of size $224 \times 224$ pixels
This forces the network to learn objects at different scales and positions

Handling Class Imbalance

A critical practical issue: negative samples (non-objects) vastly outnumber positive samples (actual objects).

The balancing strategy:

Sample 8 anchors per image (not all thousands available)
Maintain a 1:1 ratio of positive to negative anchors
- Positive anchors: those with high IoU overlap with ground truth boxes
- Negative anchors: those with low IoU overlap

This means the actual mini-batch contains 4 positive and 4 negative anchors per image. This prevents the loss from being dominated by easy negative examples.
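The sampling rule can be sketched as follows. Filling the remaining slots with negatives when an image has fewer than 4 positive anchors is an assumption borrowed from the usual Faster R-CNN practice; the function name and anchor ids are hypothetical.

```python
import random

def sample_anchors(positive_ids, negative_ids, num_samples=8, rng=random):
    """Sample anchors at a 1:1 positive-to-negative ratio where possible."""
    num_pos = min(num_samples // 2, len(positive_ids))
    num_neg = min(num_samples - num_pos, len(negative_ids))
    return rng.sample(positive_ids, num_pos), rng.sample(negative_ids, num_neg)

# Hypothetical image with 20 positive anchors and 1980 negative anchors.
positives = list(range(20))
negatives = list(range(20, 2000))
pos, neg = sample_anchors(positives, negatives, rng=random.Random(0))
print(len(pos), len(neg))  # 4 positives and 4 negatives

# With a single positive anchor, the other 7 slots go to negatives (assumed behavior).
few_pos, many_neg = sample_anchors([5], negatives, rng=random.Random(0))
```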

Testing Procedure

The network is applied fully-convolutionally across the entire image (not just at the center). This means:

No need to crop to $224 \times 224$ at test time
The network processes the full resolution and produces predictions at multiple spatial locations
This provides denser coverage of potential object locations

Performance Analysis: Why ResNet Wins

Oracle Evaluation (Ground Truth Classes)

The paper compares methods when given the true class label beforehand:

VGG-16 (center-crop evaluation): $33.1\%$ error
ResNet-101 (center-crop evaluation): $13.3\%$ error

This is a drop of nearly $20$ percentage points. The reduction from $33.1\%$ to $13.3\%$ is a relative improvement of:

\frac{33.1 - 13.3}{33.1} \times 100\% \approx 59.8\%

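Why is ResNet so much better here? The paper presents this comparison as evidence for its overall framework and does not separate the contribution of the RPN-based localization method from that of the deeper features. A plausible reading is that both matter: the per-class RPN provides anchor-based regression, and the deeper residual network supplies better spatial features for precise localization.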

With dense and multi-scale testing: $11.7\%$ error

"Dense" = fully convolutional (mentioned above)
"Multi-scale" = evaluating on images at different resolutions and combining results

Practical Evaluation (Predicted Classes)

When the network must predict its own class (with $4.6\%$ top-5 classification error from Table 4), the top-5 localization error rises to $14.4\%$. The modest increase ($11.7\% \to 14.4\%$) suggests that localization quality degrades only moderately when the class labels are predicted rather than given.

The R-CNN Refinement Stage

Why Not Use Fast R-CNN?

A subtle but important observation: ImageNet Localization images typically contain one dominant object (unlike general detection datasets). This creates a problem for Fast R-CNN:

Fast R-CNN uses image-centric training: each mini-batch draws many regions from a small number of images
When the image has one object and many proposal regions, all those proposals have highly overlapping RoI-pooled features because they all capture mostly the same object
This creates small sample variations in the training data, which hurts stochastic gradient descent (SGD)

The Original R-CNN Alternative

Instead, they use R-CNN, which is RoI-centric:

Training procedure:

Apply the pre-trained per-class RPN on training images
For each training image, extract the top-200 highest-scoring proposals
Crop and warp each proposal region to $224 \times 224$ pixels
Train two sibling fully-connected (fc) layers:
- Classification: predicts class (per-class form)
- Regression: predicts bounding box refinements (per-class form)

Testing procedure:

RPN generates top-200 proposals for each predicted class
R-CNN updates proposal scores and box positions

Performance Boost

This additional refinement stage reduces the single-model top-5 localization error on the validation set from $14.4\%$ to $10.6\%$. An ensemble of networks for both classification and localization then reaches $9.0\%$ on the test set.

Final Results and Historical Context

The Numbers

Configuration	Top-5 Error
ResNet-101 (validation, single model)	10.6%
ResNet ensemble (test set)	9.0%
ILSVRC 2014 best result, VGG (test set)	25.3%

Relative improvement:

\frac{25.3 - 9.0}{25.3} \times 100\% \approx 64\%

This matches the 64% relative error reduction claimed in the paper. The point is: this is a massive jump in performance.

Why This Matters

The ResNet localization success demonstrates that:

Deeper networks learn better representations for fine-grained spatial reasoning
The residual learning framework makes it possible to train networks deep enough to capture subtle localization patterns
The same core architecture (ResNet backbone) excels across multiple tasks: classification, detection, and localization

This versatility helped ResNets dominate ILSVRC 2015 across multiple competition tracks.

Summary: Key Takeaways

Concept	Explanation
Per-class RPN	Instead of generic proposals, learn 1000 class-specific bounding box regressors
Anchor boxes	Use predefined templates; learn offsets relative to these templates
Anchor sampling	Maintain 1:1 positive/negative ratio to avoid imbalance
Fully convolutional testing	Process entire image; don't crop at test time
R-CNN refinement	Additional stage that handles single-object ImageNet setting better than Fast R-CNN
Why ResNet wins	Deeper architecture + skip connections = better spatial features for localization

The key insight is that localization is fundamentally a feature learning problem—better features (from deeper networks) lead to better bounding box predictions.

Deep Residual Learning for Image Recognition

Abstract

Abstract

Deep Residual Learning: Understanding the Abstract

The Big Picture

The Problem: Why Deep Networks Are Hard to Train

The Solution: Residual Learning Framework

Traditional Neural Network Layer

Residual Learning: A Small but Powerful Change

Empirical Results and Validation

ImageNet Results

CIFAR-10 Results

Other Tasks

Key Technical Insights Presented

1. Depth Enables Better Representations

2. Easier Optimization

Why This Matters (Conceptual Summary)

Mathematical Notation Reference

Conclusion

1. Introduction

Introduction to Residual Learning: A Deep Dive

Big Picture: What's This Section About?

Part 1: Setting the Stage—Why Depth Matters

Part 2: The Degradation Problem—Something's Wrong

The Vanishing Gradient Problem (Already Solved)

The Real Problem: Degradation (Not Overfitting!)

Implementation: Shortcut Connections

Part 4: Claims and Results

Key Takeaway

2. Related Work

Section 2: Related Work – Deep Dive Explanation

Big Picture: Why This Section Matters

Part 1: Residual Representations in Image Recognition

The Core Idea

VLAD and Fisher Vectors

Vector Quantization

Part 2: Residual Solutions in Scientific Computing

Multigrid Methods for PDEs

Part 3: Shortcut Connections – The Technical Implementation

Historical Context

Highway Networks: The Key Comparison

How Highway Networks Work

ResNet's Simpler Choice

Mathematical Clarity: The Residual Principle

Summary: Connecting All Three Parts

3.1. Residual Learning

Section 3.1: Residual Learning - A Deep Dive

The Big Picture

Part 1: The Core Hypothesis

Part 2: The Motivation from the Degradation Problem

Part 3: The Residual Learning Solution

Part 4: Preconditioning for Non-Identity Cases

Part 5: Empirical Support

Summary: The Three-Layer Logic

Connection to Figure 2

3.2. Identity Mapping by Shortcuts

Section 3.2: Identity Mapping by Shortcuts – Detailed Explanation

Big Picture: What Are We Doing Here?

The Core Building Block: Equation (1)

Breaking Down the Notation

What's Actually Happening?

Concrete Example: Two-Layer Residual Block

Critical Constraint: Dimension Matching

Extension to Convolutional Layers

Summary: Why This Design?

Why This Design Is Revolutionary: The Gradient Flow Perspective

Why $\mathcal{F}$ Must Have At Least Two Layers

The Case of Dimension Mismatch

Conceptual Summary

Summary: The Key Insights

1. Residual Learning Principle

2. Gradient Preservation

3. Flexible Dimension Matching

4. Minimal Computational Overhead

5. Empirical Success

3.3. Network Architectures

Section 3.3: Network Architectures — A Detailed Explanation

Big Picture

Part 1: Plain Network Design

The Philosophy

Why $\mathcal{F}$ Must Have At Least Two Layers

Architecture Details: The Parameter $n$