reimplementing-ml-papers

AlexNet

In this directory, we aim to implement AlexNet, a Convolutional Neural Network (CNN) architecture for image classification, and to test it on the ImageNet dataset.

CIFAR-10 implementations

| Description | Library | Notebook |
|---|---|---|
| Using Pylearn2/Keras LRN | Keras | View on GitHub, Open In Colab, Open in Binder |
| Using TF.NN.LRN | Keras | View on GitHub, Open In Colab, Open in Binder |

ImageNet implementations

| Version | Description | Library | Notebook |
|---|---|---|---|
| v1 | Basic impl | Keras | View on GitHub, Open In Colab, Open in Binder |

Implementation notes for ImageNet v1:

  1. We haven’t yet trained or tested this network (work in progress).

References

Our implementation is based on the following paper:

The paper is available via:

See also:

Local Response Normalization

Per the AlexNet paper, the Local Response Normalization layer computes the following function:

\[b_{x,y}^i = a_{x,y}^i / \left( k + \alpha \sum_{j = \max(0, i-n/2)}^{\min(N-1, i+n/2)} \left(a_{x,y}^j \right)^2 \right) ^ \beta\]

The paper authors chose $k = 2, n = 5, \alpha = 10^{-4}, \beta = 0.75$.
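To make the formula concrete, here is a minimal NumPy sketch of it. It is not one of the notebook implementations in this directory; the function name, the channels-last layout, and the use of integer division for $n/2$ are assumptions made for illustration.

```python
import numpy as np


def local_response_normalization(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize each channel by the squared activity of its neighbouring channels.

    `a` is assumed to have shape (..., channels), i.e. channels-last.
    Defaults follow the values chosen in the AlexNet paper.
    """
    a = np.asarray(a, dtype=np.float64)
    num_channels = a.shape[-1]       # N in the formula
    half = n // 2                    # assumption: n/2 is taken as integer division
    squared = a ** 2
    b = np.empty_like(a)
    for i in range(num_channels):
        lo = max(0, i - half)                 # lower summation bound
        hi = min(num_channels - 1, i + half)  # upper summation bound (inclusive)
        denom = (k + alpha * squared[..., lo:hi + 1].sum(axis=-1)) ** beta
        b[..., i] = a[..., i] / denom
    return b


# Usage: one spatial position with 8 channels, all activations equal to 1.
activations = np.ones((1, 1, 8))
normalized = local_response_normalization(activations)
```

The explicit loop over channels mirrors the summation bounds in the formula one to one; a vectorized version would be faster but harder to compare against the equation.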

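We provide 2 implementations of the Local Response Normalization layer, corresponding to the two CIFAR-10 notebooks listed above: one using the Pylearn2/Keras LRN and one using TF.NN.LRN.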

Input size discrepancy

Note that the original AlexNet paper describes the inputs as $224 \times 224$ images; however, that size does not produce the $55 \times 55$ output expected from the first convolutional layer. With that layer's $11 \times 11$ kernels and stride of 4, an unpadded $224 \times 224$ input gives $(224 - 11)/4 + 1 = 54.25$, which is not an integer, whereas $(227 - 11)/4 + 1 = 55$. The consensus is that the numbers only work out with input images of size $227 \times 227$, or equivalently with $224 \times 224$ images plus 3 extra pixels of zero-padding.
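The arithmetic behind this discrepancy can be checked with a short sketch. The helper name and its `total_padding` parameter (padding summed over both sides of one dimension) are hypothetical, introduced only for this example; the $11 \times 11$ kernel and stride of 4 are the first-layer values from the paper.

```python
def conv_output_size(input_size, kernel_size=11, stride=4, total_padding=0):
    """Return the exact (possibly non-integer) output side length of a convolution.

    `total_padding` is the padding summed over both sides of one dimension.
    """
    return (input_size + total_padding - kernel_size) / stride + 1


size_224 = conv_output_size(224)                          # 54.25, not an integer
size_227 = conv_output_size(227)                          # 55.0
size_224_padded = conv_output_size(224, total_padding=3)  # 55.0
```

Returning a float rather than rounding down makes the mismatch visible: only the $227 \times 227$ input, or the $224 \times 224$ input with 3 extra padding pixels, gives exactly 55.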

Here are several references which agree on this analysis of input shape:

  1. Classic Networks lecture by Andrew Ng as part of the Deep Learning specialization

    If you read the paper, the paper refers to $224 \times 224 \times 3$ images, but if you look at the numbers, the numbers only make sense if they are $227 \times 227 \times 3$.

  2. Stanford CS231n

    As a fun aside, if you read the actual paper it claims that the input images were 224x224, which is surely incorrect because (224 - 11)/4 + 1 is quite clearly not an integer. This has confused many people in the history of ConvNets and little is known about what happened. My own best guess is that Alex used zero-padding of 3 extra pixels that he does not mention in the paper.

  3. Data Science SE

    One answer also quotes Andrew Ng’s lecture in (1) above. Another answer demonstrates via calculation that the $224 \times 224$ input shape would result in a non-integral output shape and hence must be an error, the same argument made in the Stanford CS231n class notes in (2) above.