ImageNet Classification with Deep Convolutional Neural Networks

Krizhevsky, Sutskever & Hinton · NeurIPS 2012

Reading map

These notes restate this specific paper in plain language so you can grasp the ideas before you wrestle with the authors’ formal wording.

Problem statement & goal

Before this work, winning on ImageNet meant hand-crafted features and shallow models. The team wanted to show that a single deep neural net, trained end-to-end on raw pixels, could beat the best classical pipelines on a huge, messy real-world image set.

Methodology & architecture

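They built a CNN that was very deep for 2012: five convolutional layers followed by three fully connected layers, about 60 million parameters in total. It uses ReLU activations, dropout in the fully connected layers to reduce overfitting, and a split of the network across two GPUs so that it fits in memory. Data is augmented (random crops, horizontal flips, and RGB intensity shifts) so the model sees varied views of each image.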
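
The crop-and-flip augmentation can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the image is a plain list of pixel rows, and the function name `random_crop_and_flip` is made up for this example. The 256-to-224 sizes follow the paper's description of cropping 224×224 patches from 256×256 images.

```python
import random

def random_crop_and_flip(image, crop=224, rng=random):
    """Return a random crop x crop patch, mirrored left-right half the time.

    `image` is a list of rows; each row is a list of pixels (assumed layout).
    """
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop)    # randint bounds are inclusive
    left = rng.randint(0, width - crop)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal reflection
    return patch

# Toy 256x256 "image" whose pixels record their own (row, col) position.
image = [[(r, c) for c in range(256)] for r in range(256)]
patch = random_crop_and_flip(image, rng=random.Random(0))
print(len(patch), len(patch[0]))  # 224 224
```

Because every call picks a new offset and a new flip decision, the network rarely sees exactly the same pixels twice, which is the point of the augmentation.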

Datasets & benchmarks

Training and testing use the ILSVRC subset of ImageNet: roughly 1.2M training images, 50,000 validation images, and 150,000 test images across 1000 categories. Success means low error on the official benchmark, the same one everyone else reports, so you can compare fairly to older methods.

Results & evaluation metrics

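They report top-1 and top-5 error on ImageNet. On the ILSVRC-2010 test set the network reaches 37.5% top-1 and 17.0% top-5 error, and in the ILSVRC-2012 competition a variant of the model reaches 15.3% top-5 test error against 26.2% for the second-best entry. The takeaway for students: depth + data + compute + simple tricks (ReLU, dropout) can unlock a breakthrough.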

Limitations & future work

The network is huge for its time: heavy memory use, two GPUs, and five to six days of training on two GTX 580 3GB cards. It’s tuned for ImageNet-scale data; smaller datasets might still overfit without care. Not every idea here transfers one-to-one to today’s transformers or tiny devices.

Reproducibility

The paper gives architecture layout, training details, and augmentations in enough detail that teams reproduced and extended it. The authors made their GPU convolution code publicly available, though there was no public GitHub repository in today’s sense, and the written description was enough to anchor a decade of follow-on work.

What to focus on

Eight highlights for this paper, each explaining why a part matters before you read the dense technical sections.

Historical shift

ImageNet forced models to handle real photos, clutter, and 1000 classes—not toy digits. AlexNet showed deep nets could win that game end-to-end, not just on small curated sets.

End-to-end learning

Features are learned from pixels instead of hand-crafted SIFT/HOG pipelines. That single idea unlocked scaling: more data and compute directly improve the representation.

Depth and capacity

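Five convolutional layers followed by three fully connected layers (with pooling in between) was unusually deep in 2012. The paper reports that removing any single convolutional layer hurt accuracy, and argues that depth, width, and data together beat shallow alternatives that were saturating on the same benchmark.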
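
To see how depth shrinks the image, you can trace the spatial size through the convolution and pooling stack. This is a rough sketch under stated assumptions: the 227-pixel input and the padding values come from common reimplementations rather than from the paper, which states a 224×224 input and does not list padding. The 3×3 stride-2 pooling matches the overlapping pooling the paper describes.

```python
def out_size(n, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling window."""
    return (n + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, pad). Padding values are assumptions taken from
# common reimplementations, not from the paper itself.
LAYERS = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]

def trace_sizes(input_size=227):
    """Map each layer name to its output height/width."""
    sizes = {}
    n = input_size
    for name, kernel, stride, pad in LAYERS:
        n = out_size(n, kernel, stride, pad)
        sizes[name] = n
    return sizes

sizes = trace_sizes()
for name, n in sizes.items():
    print(f"{name}: {n}x{n}")
```

Under these assumptions the maps go from 55×55 after the first convolution down to 6×6 before the fully connected layers.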

ReLU & regularization

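ReLU speeds training compared with saturating activations such as tanh, whose gradients shrink toward zero for large inputs; dropout (probability 0.5 in the first two fully connected layers) fights overfitting on the huge parameter count. Pair that with aggressive data augmentation (crops, flips) so the net generalizes beyond memorized crops.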
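
A small sketch makes both ideas concrete. It works on plain Python floats and lists, which is an assumption for readability, and the function names are illustrative. The dropout variant shown is the one the paper describes: units are zeroed with probability 0.5 during training, and at test time all units are kept with outputs multiplied by 0.5.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: constant 1 for any positive input."""
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    """Gradient of tanh: shrinks toward 0 as |x| grows (saturation)."""
    return 1.0 - math.tanh(x) ** 2

def dropout_train(activations, p_drop=0.5, rng=random):
    """Zero each activation independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]

def dropout_test(activations, p_drop=0.5):
    """Keep every unit but scale its output by the keep probability."""
    return [a * (1.0 - p_drop) for a in activations]

# Compare gradients as the input grows: ReLU stays at 1, tanh decays.
for x in (0.5, 3.0, 6.0):
    print(x, relu_grad(x), tanh_grad(x))
```

Many modern libraries instead rescale the surviving units during training (often called inverted dropout), which leaves the test-time pass unscaled; the expected activation is the same either way.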

Two-GPU layout

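The architecture splits filters across two GPUs because one GTX 580 with 3GB of memory could not hold the whole network; read it as early model parallelism. The GPUs exchange activations only at certain layers: layer 3 reads from both halves of layer 2, while layers 2, 4, and 5 read only from feature maps on the same GPU. Understanding the split helps you map the diagram to actual tensor shapes.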
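
One way to check your reading of the diagram is to count parameters. The sketch below treats the same-GPU-only layers as grouped convolutions with two groups, which is how later frameworks usually express the split; the helper names are made up for this example, and the 6×6 input to the first fully connected layer is an assumption that depends on the input size and padding you use.

```python
def conv_params(kernel, c_in, c_out, groups=1):
    """Weights plus biases for a kernel x kernel convolution split into groups."""
    return kernel * kernel * (c_in // groups) * c_out + c_out

def fc_params(n_in, n_out):
    """Weights plus biases for a fully connected layer."""
    return n_in * n_out + n_out

PARAMS = {
    "conv1": conv_params(11, 3, 96),
    "conv2": conv_params(5, 96, 256, groups=2),   # same-GPU maps only
    "conv3": conv_params(3, 256, 384),            # reads from both GPUs
    "conv4": conv_params(3, 384, 384, groups=2),  # same-GPU maps only
    "conv5": conv_params(3, 384, 256, groups=2),  # same-GPU maps only
    "fc6": fc_params(6 * 6 * 256, 4096),          # assumes 6x6 final maps
    "fc7": fc_params(4096, 4096),
    "fc8": fc_params(4096, 1000),
}

total = sum(PARAMS.values())
for name, count in PARAMS.items():
    print(f"{name}: {count:,}")
print(f"total: {total:,}")
```

The total comes out near 61 million, in line with the roughly 60 million parameters the paper quotes, and most of them sit in the fully connected layers rather than the convolutions.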

Metrics that matter

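Top-1 and top-5 error are the headline numbers, reported on the ILSVRC-2010 test set and on the ILSVRC-2012 validation and test sets. Top-5 error counts a prediction as wrong only when the true label is missing from the model’s five highest-scoring classes. Compare the numbers to prior competition winners to see the margin of victory in absolute points, not just relative percent gains.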
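
If the metric itself is unfamiliar, here is a minimal sketch of top-k error on a toy three-class problem. The scores and labels are invented for illustration, and `top_k_error` is a hypothetical helper, not something from the paper.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)

scores = [
    [0.1, 0.6, 0.3],  # true class 1 is ranked first
    [0.5, 0.2, 0.3],  # true class 2 is ranked second
    [0.7, 0.2, 0.1],  # true class 2 is ranked last
]
labels = [1, 2, 2]

print(top_k_error(scores, labels, k=1))  # 2 of 3 examples missed
print(top_k_error(scores, labels, k=2))  # 1 of 3 examples missed
```

Top-k error can only fall as k grows, which is why top-5 numbers always look better than top-1 numbers for the same model.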

Training recipe

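Learning-rate schedule, batch size, LRN placement, and preprocessing (e.g., downscaling, mean subtraction) are as important as the macro-architecture. The paper trains with SGD at batch size 128, momentum 0.9, and weight decay 0.0005, starting from a learning rate of 0.01 that is divided by 10 when validation error stops improving. Reproducing the reported error rates means matching these details.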
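
The update rule in the paper combines momentum 0.9 with weight decay 0.0005, and it can be written for a single scalar weight as below. The scalar form is a simplification for readability. The plateau helper is an assumption: the authors lowered the learning rate by hand when validation error stopped improving, so `maybe_decay_lr` and its `patience` parameter are a hypothetical automation of that judgment.

```python
def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update: v <- m*v - wd*lr*w - lr*grad, then w <- w + v."""
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    return w + v_next, v_next

def maybe_decay_lr(lr, val_errors, patience=3):
    """Hypothetical rule: divide lr by 10 if the last `patience`
    validation errors did not beat the best earlier one."""
    if len(val_errors) > patience:
        recent_best = min(val_errors[-patience:])
        earlier_best = min(val_errors[:-patience])
        if recent_best >= earlier_best:
            return lr / 10.0
    return lr

w, v = 1.0, 0.0
w, v = sgd_step(w, v, grad=0.5, lr=0.01)
print(w, v)
```

Note that the weight decay term is scaled by the learning rate, so it shrinks along with the step size each time the rate is divided by 10.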

What changed after

AlexNet made CNNs the default approach for vision; later work added batch norm, better blocks (ResNet), and different datasets, but the lesson remains: scale the model with data and compute when labels exist.

Research literacy notes

Capture how you read this paper: the claims, the brittle assumptions, and what you’d rerun.

Main claim (one tight paragraph)

Fragile assumption

Experiment I’d rerun or inspect
