This directory implements the AlexNet Convolutional Neural Network (CNN) architecture for image classification, to be tested with the ImageNet dataset.
Description | Library | Notebook
---|---|---
Using Pylearn2/Keras LRN | Keras |
Using `tf.nn` LRN | Keras |
Version | Description | Library | Notebook
---|---|---|---
v1 | Basic impl | Keras |
Implementation notes for ImageNet v1:
Our implementation is based on the following paper: Krizhevsky, Sutskever, and Hinton, "ImageNet Classification with Deep Convolutional Neural Networks" (NIPS 2012).
The paper is available via:
See also:
Per the AlexNet paper, the Local Response Normalization layer computes the following function:
\[b_{x,y}^i = a_{x,y}^i / \left( k + \alpha \sum_{j = \max(0, i-n/2)}^{\min(N-1, i+n/2)} \left(a_{x,y}^j \right)^2 \right)^{\beta}\]

Here $a_{x,y}^i$ is the activity of kernel map $i$ at position $(x, y)$, $N$ is the total number of kernel maps in the layer, and the sum runs over $n$ "adjacent" kernel maps at the same spatial position. The paper authors chose $k = 2$, $n = 5$, $\alpha = 10^{-4}$, and $\beta = 0.75$.
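As a concrete illustration, here is a minimal NumPy sketch of this formula with the paper's constants. The function name and the channels-last `(H, W, N)` layout are assumptions made for the example, and $n/2$ is interpreted as $\lfloor n/2 \rfloor$, the common reading:

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Local Response Normalization per the AlexNet paper.

    a: activations of shape (H, W, N), where N is the number of kernel maps.
    Returns b of the same shape. n/2 is taken as floor(n/2).
    """
    H, W, N = a.shape
    b = np.empty_like(a, dtype=np.float64)
    half = n // 2
    for i in range(N):
        lo = max(0, i - half)
        hi = min(N - 1, i + half)
        # Sum of squared activities over adjacent kernel maps j = lo..hi
        # at the same spatial position (x, y).
        s = np.sum(a[:, :, lo:hi + 1] ** 2, axis=2)
        b[:, :, i] = a[:, :, i] / (k + alpha * s) ** beta
    return b

# Illustrative input: random activations for a 4x4 patch with 8 kernel maps.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4, 8))
b = local_response_norm(a)
print(b.shape)  # (4, 4, 8)
```

With the small $\alpha$ the denominator stays close to $k^\beta$, so the normalization only bites when neighboring kernel maps have large activities, which is the "lateral inhibition" effect the paper describes.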
We provide two implementations of the Local Response Normalization layer:

1. `local_response_normalization.py`, a very light wrapper around the `tensorflow.nn.local_response_normalization` layer, which was written as a result of this paper.
2. `third_party/pylearn2/local_response_normalization.py`, which is based on the Pylearn2 implementation and was adapted in the Keras project.

Note that the original AlexNet paper refers to its inputs as $224 \times 224$ images; however, that size does not work out to $55 \times 55$ outputs from the first convolutional layer. There is consensus that the only way to achieve that is to use input images of size $227 \times 227$, or images of size $224 \times 224$ with 3 pixels of zero-padding, which makes the arithmetic work.
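The arithmetic behind this consensus can be checked directly with the standard convolution output-size formula; a small illustrative script (the helper name is ours):

```python
def conv_output_size(input_size, kernel, stride, pad=0):
    """Spatial output size of a convolution: (W - K + 2P) / S + 1."""
    return (input_size - kernel + 2 * pad) / stride + 1

# AlexNet's first convolutional layer uses 11x11 kernels with stride 4.
print(conv_output_size(224, 11, 4))  # 54.25 -- not an integer
print(conv_output_size(227, 11, 4))  # 55.0  -- matches the paper's 55x55

# A 224x224 input with 3 pixels of total zero-padding is 227 wide, so it
# also yields the 55x55 output the paper reports:
print((224 + 3 - 11) / 4 + 1)  # 55.0
```

This is exactly the calculation Karpathy alludes to in the CS231n quote below: $(224 - 11)/4 + 1$ is not an integer, while $(227 - 11)/4 + 1 = 55$.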
Here are several references that agree on this analysis of the input shape:

1. The "Classic Networks" lecture by Andrew Ng, part of the Deep Learning specialization:

   > If you read the paper, the paper refers to $224 \times 224 \times 3$ images, but if you look at the numbers, the numbers only make sense if they are $227 \times 227 \times 3$.

2. The Stanford CS231n class notes:

   > As a fun aside, if you read the actual paper it claims that the input images were 224x224, which is surely incorrect because (224 - 11)/4 + 1 is quite clearly not an integer. This has confused many people in the history of ConvNets and little is known about what happened. My own best guess is that Alex used zero-padding of 3 extra pixels that he does not mention in the paper.

3. A Q&A discussion of the question. One answer quotes Andrew Ng's lecture in (1) above; another answer demonstrates via calculation that a $224 \times 224$ input would result in a non-integral output shape and hence must be an error, in line with the Stanford CS231n class notes in (2) above.