CNN Output Size Formula - Bonus Neural Network Debugging Session
CNN Output Size Formula - Tensor Transformations
Welcome to this neural network programming series with PyTorch. In this episode, we are going to see how an input tensor is transformed as it flows through a CNN.
Without further ado, let's get started.
High-level Overview of Our Process
 Prepare the data
 Build the model
 Understanding forward pass transformations
 Train the model
 Analyze the model's results
Overview of Our Network
The CNN we will use is the six-layer network that we have been working with over the last few posts.
 Input layer
 Hidden conv layer
 Hidden conv layer
 Hidden linear layer
 Hidden linear layer
 Output layer
We built this network using PyTorch's nn.Module class, and the Network class definition is as follows:
import torch.nn as nn
import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)
        self.fc1 = nn.Linear(in_features=12*4*4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)

    def forward(self, t):
        # (1) input layer
        t = t

        # (2) hidden conv layer
        t = self.conv1(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)

        # (3) hidden conv layer
        t = self.conv2(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)

        # (4) hidden linear layer
        # The batch size is hard-coded to 1 here, matching this episode.
        # Passing -1 as the first argument would accept any batch size.
        t = t.reshape(1, 12 * 4 * 4)
        t = self.fc1(t)
        t = F.relu(t)

        # (5) hidden linear layer
        t = self.fc2(t)
        t = F.relu(t)

        # (6) output layer
        t = self.out(t)
        #t = F.softmax(t, dim=1)
        return t
Passing a batch of size one (a single image)
In a previous episode, we saw how we can pass a single image by adding a batch dimension using PyTorch's unsqueeze() method. We'll pass this tensor to the network again, but this time we will step through the forward() method using the debugger. This will allow us to inspect our tensor as transformations are performed.
Let's begin:
> network = Network()
> network(image.unsqueeze(0))
#1 Input layer
When the tensor comes into the input layer, we have:
> t.shape
torch.Size([1, 1, 28, 28])
The values in these dimensions represent, in order: (batch size, number of color channels, image height, image width).
Since the input layer is just the identity function, the output shape doesn't change.
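To make the batch dimension concrete, here is a minimal sketch of what unsqueeze(0) does to the shape. The image tensor below is a random stand-in (an assumption for illustration), not the actual image from the training set.

```python
import torch

# Stand-in for a single 28x28 grayscale image (random values, not real data).
image = torch.rand(1, 28, 28)

# unsqueeze(0) inserts a new axis of length 1 at position 0: the batch axis.
batch = image.unsqueeze(0)

print(image.shape)  # torch.Size([1, 28, 28])
print(batch.shape)  # torch.Size([1, 1, 28, 28])
```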
#2 Convolutional layer (1)
When the tensor comes into this layer, we have:
> t.shape
torch.Size([1, 1, 28, 28])
After the first convolution operation, self.conv1, we have:
> t.shape
torch.Size([1, 6, 24, 24])
The batch size is still 1. This makes sense because we wouldn't expect our batch size to change, and this is going to be the case through the entire forward pass.
batch_size is fixed as we move through the forward pass.
The number of color channels has increased from 1 to 6. After we move forward beyond the first convolutional layer, we don't think of the channels as color channels any longer. We just think of them as output channels. The reason we have 6 output channels is due to the number of out_channels that we specified when self.conv1 was created.
Convolution Operations use Filters
Like we have seen, this number 6 is arbitrary. The out_channels parameter instructs the nn.Conv2d layer class to generate six filters, also known as kernels, with shape 5 by 5 and randomly initialized values. These filters are used to generate the six output channels.
The out_channels parameter determines how many filters will be created.
The filters are tensors, and they are used to convolve the input tensor when the tensor is passed to the layer instance, self.conv1. The random values inside the filter tensors are the weights of the convolutional layer. Remember though, we don't actually have six distinct tensors. All six of the filters are packaged into a single weight tensor that has a height and width of five.
After the weight tensors (filters) are used to convolve the input tensor, the result is the output channels.
Another name for output channels is feature maps, and the terms are interchangeable. The name reflects the fact that the patterns detected as the weights are updated represent features such as edges and other more sophisticated patterns.
The algorithm:
 Color channels are passed in.
 Convolutions are performed using the weight tensor (filters).
 Feature maps are produced and passed forward.
Conceptually, we can think of the weight tensors as being distinct. However, what we really have in code is a single weight tensor that has an out_channels (filters) dimension. We can see this by checking the shape of the weight tensor:
> self.conv1.weight.shape
torch.Size([6, 1, 5, 5])
This tensor's shape is given by: (number of filters, number of input channels, filter height, filter width).
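To see a weight tensor of this shape produce the six feature maps, here is a minimal sketch that applies the convolution by hand with F.conv2d. The input and the weights are random stand-ins (assumptions for illustration), and the bias term is left out for simplicity.

```python
import torch
import torch.nn.functional as F

t = torch.rand(1, 1, 28, 28)     # one single-channel 28x28 input (random stand-in)
weight = torch.rand(6, 1, 5, 5)  # six 5x5 filters, each spanning the one input channel

# Defaults are used here: no padding and a stride of 1.
feature_maps = F.conv2d(t, weight)

print(feature_maps.shape)  # torch.Size([1, 6, 24, 24])
```

Each of the six filters slides over the input and yields one 24x24 feature map, which is why the channel axis of the output has length six.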
The relu() activation function
The call to the relu() function removes any negative values and replaces them with zeros. We can verify this by checking the min() of the tensor before and after the call.
> t.min().item()
-1.1849982738494873
> t = F.relu(t)
> t.min().item()
0.0
The relu() function can be expressed mathematically as
\[f(x) = \max(0, x)\]
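In code, relu() amounts to taking the element-wise maximum of each value and zero. A minimal sketch on a small hand-made tensor (the values are chosen only for illustration):

```python
import torch
import torch.nn.functional as F

t = torch.tensor([-1.5, -0.25, 0.0, 0.5, 2.0])

by_relu = F.relu(t)
by_max = torch.clamp(t, min=0)  # element-wise max(0, x)

print(by_relu.tolist())              # [0.0, 0.0, 0.0, 0.5, 2.0]
print(torch.equal(by_relu, by_max))  # True
print(by_relu.min().item())          # 0.0
```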
The max pooling operation
The pooling operation reduces the shape of our tensor further by extracting the maximum value from each 2x2 region within our tensor.
> t.shape
torch.Size([1, 6, 24, 24])
> t = F.max_pool2d(t, kernel_size=2, stride=2)
> t.shape
torch.Size([1, 6, 12, 12])
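The effect is easier to see on a tensor small enough to read by eye. The sketch below uses a hypothetical 4x4 input holding the values 0 through 15, so each 2x2 window's maximum can be checked by hand.

```python
import torch
import torch.nn.functional as F

# One image, one channel, 4x4 values 0..15 laid out row by row.
t = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

pooled = F.max_pool2d(t, kernel_size=2, stride=2)

print(pooled.shape)     # torch.Size([1, 1, 2, 2])
print(pooled.tolist())  # [[[[5.0, 7.0], [13.0, 15.0]]]]
```

With a stride equal to the window size, the windows do not overlap, so the height and width are each cut in half.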
Convolution layer summary
The shapes of the tensors input to and output from the convolutional layer are given by:
 Input shape: [1, 1, 28, 28]
 Output shape: [1, 6, 12, 12]
Summary of each operation that occurs:
 The convolution layer convolves the input tensor using six randomly initialized 5x5 filters. This reduces the height and width dimensions by four.
 The relu activation function operation maps all negative values to 0. This means that all the values in the tensor are now non-negative.
 The max pooling operation extracts the max value from each 2x2 section of the six feature maps that were created by the convolutions. This halves the height and width dimensions, from 24 to 12.
CNN Output Size Formula
Let's have a look at the formula for computing the output size of the tensor after performing convolutional and pooling operations.
CNN Output Size Formula (Square)
 Suppose we have an \(n \times n\) input.
 Suppose we have an \(f \times f\) filter.
 Suppose we have a padding of \(p\) and a stride of \(s\).
The output size \(O\) is given by this formula:
\[O = \frac{n - f + 2p}{s} + 1\]
This value will be the height and width of the output. However, if the input or the filter isn't a square, this formula needs to be applied twice, once for the width and once for the height.
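As a worked example, consider the first convolutional layer of this network. It uses no padding and a stride of 1 (the nn.Conv2d defaults), and its pooling step uses a 2x2 window with a stride of 2. Treating the pooling window as the filter, the formula gives:
\[O_{\text{conv}} = \frac{28 - 5 + 2 \cdot 0}{1} + 1 = 24\]
\[O_{\text{pool}} = \frac{24 - 2 + 2 \cdot 0}{2} + 1 = 12\]
These match the 24x24 and 12x12 shapes observed in the debugger.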
CNN Output Size Formula (Non-Square)
 Suppose we have an \(n_{h} \times n_{w}\) input.
 Suppose we have an \(f_{h} \times f_{w}\) filter.
 Suppose we have a padding of \(p\) and a stride of \(s\).
The height of the output size \(O_{h}\) is given by this formula:
\[O_{h} = \frac{n_{h} - f_{h} + 2p}{s} + 1\]
The width of the output size \(O_{w}\) is given by this formula:
\[O_{w} = \frac{n_{w} - f_{w} + 2p}{s} + 1\]
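The formulas translate directly into a small helper. This is a sketch only: the function names conv_output_size and conv_output_shape are hypothetical, and the use of floor division is an assumption added here to cover the case where the numerator is not an exact multiple of the stride, which is how PyTorch rounds by default.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output length along one axis for input n, filter f, padding p, stride s."""
    # Floor division is an added assumption for strides that do not divide evenly.
    return (n - f + 2 * p) // s + 1


def conv_output_shape(n_h, n_w, f_h, f_w, p=0, s=1):
    """Apply the one-axis formula twice: once for height, once for width."""
    return conv_output_size(n_h, f_h, p, s), conv_output_size(n_w, f_w, p, s)


print(conv_output_size(28, 5))          # 24  (first convolution)
print(conv_output_size(24, 2, s=2))     # 12  (first max pooling)
print(conv_output_shape(28, 20, 5, 3))  # (24, 18)  (a non-square example)
```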
#3 Convolutional layer (2)
The second hidden convolutional layer, self.conv2, transforms the tensor in the same way as self.conv1 and reduces the height and width dimensions further. Before we run through these transformations, let's check the shape of the weight tensor for self.conv2:
> self.conv2.weight.shape
torch.Size([12, 6, 5, 5])
This time our weight tensor has twelve filters, each with a height and width of five. Instead of a single input channel, there are six channels coming in, which gives each filter a depth of six. This accounts for the six output channels from the first convolutional layer. The resulting output will have twelve channels.
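The depth of each filter can be checked by indexing into the weight tensor. The sketch below builds a standalone layer with the same arguments as self.conv2 rather than using the network instance, so the weight values themselves are just a fresh random initialization.

```python
import torch.nn as nn

conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)

print(conv2.weight.shape)     # torch.Size([12, 6, 5, 5])
print(conv2.weight[0].shape)  # torch.Size([6, 5, 5]), one filter that is six channels deep
print(conv2.weight.numel())   # 1800, which is 12 * 6 * 5 * 5
print(conv2.bias.shape)       # torch.Size([12]), one bias per output channel
```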
Let's run these operations now.
> t.shape
torch.Size([1, 6, 12, 12])
> t = self.conv2(t)
> t.shape
torch.Size([1, 12, 8, 8])
> t.min().item()
-0.39324113726615906
> t = F.relu(t)
> t.min().item()
0.0
> t = F.max_pool2d(t, kernel_size=2, stride=2)
> t.shape
torch.Size([1, 12, 4, 4])
The shape of the resulting output of self.conv2 allows us to see why we reshape the tensor using 12*4*4 before passing the tensor to the first linear layer, self.fc1.
As we have seen in the past, this particular reshaping is called flattening the tensor. The flatten operation puts all of the tensor's elements into a single dimension.
> t = t.reshape(1, 12*4*4)
> t.shape
torch.Size([1, 192])
The resulting shape is 1x192. The 1 in this case represents the batch size, and the 192 is the number of elements in the tensor that are now in the same dimension.
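For comparison, here is a sketch of two other ways to write the same flattening step that do not hard-code the batch size. The input is a random stand-in with the shape produced by the second pooling operation.

```python
import torch

t = torch.rand(1, 12, 4, 4)

a = t.reshape(1, 12 * 4 * 4)   # explicit batch size of 1, as in the text
b = t.reshape(-1, 12 * 4 * 4)  # -1 lets PyTorch infer the batch size
c = t.flatten(start_dim=1)     # flatten every axis after the batch axis

print(a.shape)                                  # torch.Size([1, 192])
print(torch.equal(a, b) and torch.equal(a, c))  # True
```

All three give the same result for a batch of one. The last two would also work unchanged for larger batches.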
#4 #5 #6 Linear Layers
Now, we just have a series of linear layers followed by nonlinear activation functions until we reach the output layer. The relu() calls between the layers are omitted below because they do not change the shape.
> t = self.fc1(t)
> t.shape
torch.Size([1, 120])
> t = self.fc2(t)
> t.shape
torch.Size([1, 60])
> t = self.out(t)
> t.shape
torch.Size([1, 10])
> t
tensor([[ 0.1009, 0.0842, 0.0349, 0.0640, 0.0754, 0.0057, 0.0878, 0.0296, 0.0345, 0.0236]])
This table summarizes the shape changing operations and the resulting shape of each:
Operation                 Output Shape
Identity function         torch.Size([1, 1, 28, 28])
Convolution (5 x 5)       torch.Size([1, 6, 24, 24])
Max pooling (2 x 2)       torch.Size([1, 6, 12, 12])
Convolution (5 x 5)       torch.Size([1, 12, 8, 8])
Max pooling (2 x 2)       torch.Size([1, 12, 4, 4])
Flatten (reshape)         torch.Size([1, 192])
Linear transformation     torch.Size([1, 120])
Linear transformation     torch.Size([1, 60])
Linear transformation     torch.Size([1, 10])
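The same sequence of shapes can be reproduced without a debugger by applying the operations one at a time and recording the shape after each. This is a sketch under a few assumptions: the layers are built standalone with the same arguments as in the Network class, the input is random, and the relu() calls are folded into the preceding step because they leave the shape unchanged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)
fc1 = nn.Linear(in_features=12 * 4 * 4, out_features=120)
fc2 = nn.Linear(in_features=120, out_features=60)
out = nn.Linear(in_features=60, out_features=10)

# Each entry pairs a label with the operation applied at that step.
steps = [
    ("Identity function", lambda t: t),
    ("Convolution (5 x 5)", lambda t: F.relu(conv1(t))),
    ("Max pooling (2 x 2)", lambda t: F.max_pool2d(t, kernel_size=2, stride=2)),
    ("Convolution (5 x 5)", lambda t: F.relu(conv2(t))),
    ("Max pooling (2 x 2)", lambda t: F.max_pool2d(t, kernel_size=2, stride=2)),
    ("Flatten (reshape)", lambda t: t.reshape(1, 12 * 4 * 4)),
    ("Linear transformation", lambda t: F.relu(fc1(t))),
    ("Linear transformation", lambda t: F.relu(fc2(t))),
    ("Linear transformation", out),
]

t = torch.rand(1, 1, 28, 28)  # random stand-in for a single image batch
shapes = []
for name, op in steps:
    t = op(t)
    shapes.append(tuple(t.shape))
    print(f"{name:<24}{tuple(t.shape)}")
```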

Training the CNN is next
We should now have a good understanding of how input tensors are transformed by convolutional neural networks, how to debug neural networks in PyTorch, and how to inspect the weight tensors of all of the layers.
In the next episode, we will begin training our network. Training updates the values in our weight tensors so that the forward method of our network maps the inputs to the correct output classes. I'll see you in the next one!