This directory implements the AlexNet Convolutional Neural Network (CNN) architecture for image classification, to be tested with the ImageNet dataset.
Description | Library | Notebook
---|---|---
Using Pylearn2/Keras LRN | Keras |
Using `tf.nn` LRN | Keras |
Version | Description | Library | Notebook
---|---|---|---
v1 | Basic impl | Keras |
Implementation notes for ImageNet v1:
Our implementation is based on the following paper: Krizhevsky, Sutskever, and Hinton, "ImageNet Classification with Deep Convolutional Neural Networks" (NIPS 2012).
The paper is available via:
See also:
Per the AlexNet paper, the Local Response Normalization layer computes the following function:
\[b_{x,y}^i = a_{x,y}^i / \left( k + \alpha \sum_{j = \max(0, i-n/2)}^{\min(N-1, i+n/2)} \left(a_{x,y}^j \right)^2 \right)^{\beta}\]

Here $a_{x,y}^i$ is the activity of kernel map $i$ at position $(x, y)$, $N$ is the total number of kernel maps in the layer, and the sum runs over $n$ "adjacent" kernel maps at the same spatial position. The paper authors chose $k = 2$, $n = 5$, $\alpha = 10^{-4}$, and $\beta = 0.75$.
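As a concrete illustration, here is a minimal NumPy sketch of this formula with the paper's constants. The function name and the channels-last `(H, W, N)` layout are assumptions made for the example, and $n/2$ is interpreted as $\lfloor n/2 \rfloor$, the common reading:

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Local Response Normalization per the AlexNet paper.

    a: activations of shape (H, W, N), where N is the number of kernel maps.
    Returns b of the same shape. n/2 is taken as floor(n/2).
    """
    H, W, N = a.shape
    b = np.empty_like(a, dtype=np.float64)
    half = n // 2
    for i in range(N):
        lo = max(0, i - half)
        hi = min(N - 1, i + half)
        # Sum of squared activities over adjacent kernel maps j = lo..hi
        # at the same spatial position (x, y).
        s = np.sum(a[:, :, lo:hi + 1] ** 2, axis=2)
        b[:, :, i] = a[:, :, i] / (k + alpha * s) ** beta
    return b

# Illustrative input: random activations for a 4x4 patch with 8 kernel maps.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4, 8))
b = local_response_norm(a)
print(b.shape)  # (4, 4, 8)
```

With the small $\alpha$ the denominator stays close to $k^\beta$, so the normalization only bites when neighboring kernel maps have large activities, which is the "lateral inhibition" effect the paper describes.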
We provide two implementations of the Local Response Normalization layer:

1. `local_response_normalization.py`, a very light wrapper around the `tensorflow.nn.local_response_normalization` layer, which was written as a result of this paper.
2. `third_party/pylearn2/local_response_normalization.py`, which is based on the Pylearn2 implementation and was adapted in the Keras project.

Note that the original AlexNet paper refers to its inputs as $224 \times 224$ images; however, that size does not work out to $55 \times 55$ outputs from the first convolutional layer. There is consensus that the only way to achieve that is to use input images of size $227 \times 227$, or images of size $224 \times 224$ with 3 pixels of zero-padding, which makes the arithmetic work.
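The arithmetic behind this consensus can be checked directly with the standard convolution output-size formula; a small illustrative script (the helper name is ours):

```python
def conv_output_size(input_size, kernel, stride, pad=0):
    """Spatial output size of a convolution: (W - K + 2P) / S + 1."""
    return (input_size - kernel + 2 * pad) / stride + 1

# AlexNet's first convolutional layer uses 11x11 kernels with stride 4.
print(conv_output_size(224, 11, 4))  # 54.25 -- not an integer
print(conv_output_size(227, 11, 4))  # 55.0  -- matches the paper's 55x55

# A 224x224 input with 3 pixels of total zero-padding is 227 wide, so it
# also yields the 55x55 output the paper reports:
print((224 + 3 - 11) / 4 + 1)  # 55.0
```

This is exactly the calculation Karpathy alludes to in the CS231n quote below: $(224 - 11)/4 + 1$ is not an integer, while $(227 - 11)/4 + 1 = 55$.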
Here are several references that agree on this analysis of the input shape:

1. The "Classic Networks" lecture by Andrew Ng, part of the Deep Learning specialization:

   > If you read the paper, the paper refers to $224 \times 224 \times 3$ images, but if you look at the numbers, the numbers only make sense if they are $227 \times 227 \times 3$.

2. The Stanford CS231n class notes:

   > As a fun aside, if you read the actual paper it claims that the input images were 224x224, which is surely incorrect because (224 - 11)/4 + 1 is quite clearly not an integer. This has confused many people in the history of ConvNets and little is known about what happened. My own best guess is that Alex used zero-padding of 3 extra pixels that he does not mention in the paper.

3. A Q&A discussion of the question. One answer quotes Andrew Ng's lecture in (1) above; another answer demonstrates via calculation that a $224 \times 224$ input would result in a non-integral output shape and hence must be an error, in line with the Stanford CS231n class notes in (2) above.