🏞️ CNN Basics
⿻ ConvNet
Convolutional Neural Network (CNN or ConvNet) is a special type of Deep Neural Network used primarily for image-related tasks like Image Classification, Object Detection, Image Segmentation, etc.
🤷🏻♂️ Why Are ConvNets Special?
Local connectivity/Sparsity of Connection : Focuses only on a small part of the input image at a time
Parameter Sharing/Shared Weights : Uses the same Filter in a Layer across the entire image
Translation Invariance : Learns to detect patterns (like edges, nose, even Cat Face) regardless of where they appear in an image
🧱 Building Blocks of a ConvNet
Note : Here we consider the "input" to be an image for understanding purposes, but ConvNets can be applied to other relevant data as well. Mostly, though, it will be images
Let us see the building blocks of ConvNet
1. Input Layer 🖼️
For Image Data, the input is usually a 3D tensor of dimension Height × Width × Channels, where,
H : Height of the image,
W : Width of the image,
C : Number of channels; for example, in RGB the number of channels is 3
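As a quick sketch (using NumPy, with a made-up 32 × 32 RGB image), the input tensor looks like this:

```python
import numpy as np

# A hypothetical 32 x 32 RGB image as an H x W x C tensor (values in [0, 1])
image = np.random.rand(32, 32, 3)

print(image.shape)  # (32, 32, 3) -> H=32, W=32, C=3
```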
2. Convolutional Layer ⿻
Performs a Convolution Operation using Filters (also called Kernels), which is a Linear Operation, similar to Z[l] = W[l]A[l-1] + b[l] in Neural Networks
Note : Generally, in Deep Learning literature, researchers refer to this as the Convolution Operation for historical reasons and simplicity, although what they actually perform is Cross-Correlation
And guess what? It turns out to be more powerful in practice!
Here is a simple illustration of a single step of the Filter Operation
For simplicity, let us consider the Input Image dimension H x W x C as 4 x 4 x 1
Let the Filter dimension as Fh x Fw x Fc as 3 x 3 x 1
Note : C in Input Image and Fc in Filter dimension should be SAME
Generally we also add another dimension to the Filter Operation, i.e. the Number of Filters, but let us see that later, maybe in another post!
Filter Operation
We place the 3 x 3 x 1 Filter over the top-left corner of the Input Image, do an Element-Wise Multiplication, and sum all the results (only over that Local Filter Field)
Simple Example
Input Image
+----+----+----+----+
| 1 | 2 | 3 | 4 |
+----+----+----+----+
| 5 | 6 | 7 | 8 |
+----+----+----+----+
| 9 | 10 | 11 | 12 |
+----+----+----+----+
| 13 | 14 | 15 | 16 |
+----+----+----+----+
Filter(Sobel)
+----+----+----+
| -1 | 0 | +1 |
+----+----+----+
| -2 | 0 | +2 |
+----+----+----+
| -1 | 0 | +1 |
+----+----+----+
Result : After One Step of Filter Operation
+----+----+
| 8 | 8 |
+----+----+
| 8 | 8 |
+----+----+
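To double-check those numbers, here is a minimal NumPy sketch of a valid, stride-1 Filter Operation (cross-correlation, as noted earlier), assuming single-channel input; the function name is my own:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid cross-correlation (what DL libraries call 'convolution'), stride 1."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise multiply the local filter field by the filter, then sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(1, 17).reshape(4, 4).astype(float)  # the 4 x 4 input above
sobel = np.array([[-1, 0, 1],
                  [-2, 0, 2],
                  [-1, 0, 1]], dtype=float)

print(cross_correlate2d(image, sobel))
# [[8. 8.]
#  [8. 8.]]
```

Every output is 8 because each row of this particular input increases by 1 from left to right, so the Sobel filter responds identically everywhere.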
But how did the dimension change to 2 x 2?
It is due to the formula below :-
nh_out = ⌊ (nh + 2p − f) / s ⌋ + 1
nw_out = ⌊ (nw + 2p − f) / s ⌋ + 1
The resulting dimension is "nh_out x nw_out"
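A tiny helper (plain Python, the function name is my own) to compute this output size along one dimension:

```python
from math import floor

def conv_output_size(n, f, p=0, s=1):
    """Output size along one dimension: floor((n + 2p - f) / s) + 1."""
    return floor((n + 2 * p - f) / s) + 1

# 4x4 input, 3x3 filter, no padding, stride 1 -> 2x2 output
print(conv_output_size(4, 3))        # 2
# 5x5 input, 3x3 filter, padding 1, stride 1 -> 5x5 (size preserved)
print(conv_output_size(5, 3, p=1))   # 5
```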
Stride : s
Stride tells how many pixels to move the Filter over the Input Image
Padding : p
Padding means adding extra pixels (usually zeros) around the border of the Input Image before applying the Convolution
For example, if p = 1, we surround the Input Image with a border of "0"s, which looks like below for our case
+----+----+----+----+----+----+
| 0 | 0 | 0 | 0 | 0 | 0 |
+----+----+----+----+----+----+
| 0 | 1 | 2 | 3 | 4 | 0 |
+----+----+----+----+----+----+
| 0 | 5 | 6 | 7 | 8 | 0 |
+----+----+----+----+----+----+
| 0 | 9 | 10 | 11 | 12 | 0 |
+----+----+----+----+----+----+
| 0 | 13 | 14 | 15 | 16 | 0 |
+----+----+----+----+----+----+
| 0 | 0 | 0 | 0 | 0 | 0 |
+----+----+----+----+----+----+
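In NumPy this zero-padding is just `np.pad`; a small sketch for our 4 x 4 example:

```python
import numpy as np

image = np.arange(1, 17).reshape(4, 4)
# p = 1: surround the image with one ring of zeros
padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)

print(padded.shape)  # (6, 6) -> each spatial dimension grows by 2p
print(padded)
```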
The formula for the resulting dimension after padding (with stride 1) is,
(n + 2p − f + 1) × (n + 2p − f + 1)
Generally, after one layer of Convolution, the resulting dimension changes.
Now we need to talk about two kinds of Convolution,
Valid Convolution
Valid Convolution means p = 0, where no padding is applied and the resulting dimension is reduced.
Also, if no padding is applied, consider the top-left corner of the image: the Filter covers that corner far fewer times than the central pixels, so fewer features are learned there. Padding helps!
Same Convolution
When "Same Convolution" is applied, we need to add padding so that the following holds,
Input Dimension = Resulting Dimension after Convolution Operation
How to calculate that specific Padding “p”?
Here is how we derive “p”,
Let Output Dimension = Input Dimension (with stride s = 1), so

n + 2p − f + 1 = n
2p − f + 1 = 0
2p = f − 1

p = (f − 1) / 2
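A quick sanity check of p = (f − 1)/2 in plain Python (helper name is my own); note it assumes stride 1 and an odd filter size:

```python
def same_padding(f):
    """Padding that preserves the input size for stride-1 convolution."""
    assert f % 2 == 1, "same padding needs an odd filter size"
    return (f - 1) // 2

n = 8  # any input size
for f in (3, 5, 7):
    p = same_padding(f)
    out = (n + 2 * p - f) + 1  # stride-1 output size
    print(f"f={f} -> p={p}, output {out} == input {n}")
```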
Then we add a bias and apply an Activation Function like ReLU to the result of the Filter Operation, i.e. A = ReLU(Z + b)
Then we will apply Pooling Layer
Pooling
Pooling helps speed up the Computation and also makes the detected features more robust
Two common types are :
Max Pooling
Unlike the Filter Operation, there is no Element-Wise Multiplication here; we simply take the maximum number in the Filter Region of the Input Image and place it in the respective position in the output
Example,
Here f = 2, s = 2
Input
+----+----+----+----+
| 1 | 2 | 3 | 4 |
+----+----+----+----+
| 5 | 6 | 7 | 8 |
+----+----+----+----+
| 9 | 10 | 11 | 12 |
+----+----+----+----+
| 13 | 14 | 15 | 16 |
+----+----+----+----+
Output
+----+----+
| 6 | 8 |
+----+----+
| 14 | 16 |
+----+----+
Average Pooling
Here we take the average of the numbers in the Filter Region of the Input Image and place it in the respective position in the output
Example,
Here f = 2, s = 2
Input
+----+----+----+----+
| 1 | 2 | 3 | 4 |
+----+----+----+----+
| 5 | 6 | 7 | 8 |
+----+----+----+----+
| 9 | 10 | 11 | 12 |
+----+----+----+----+
| 13 | 14 | 15 | 16 |
+----+----+----+----+
Output
+-------+-------+
| 3.5 | 5.5 |
+-------+-------+
| 11.5 | 13.5 |
+-------+-------+
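Both pooling types can be sketched in a few lines of NumPy (the function name is my own); the outputs match the tables above:

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Pooling with window size f and stride s; no learned parameters."""
    oh = (x.shape[0] - f) // s + 1
    ow = (x.shape[1] - f) // s + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = x[i * s:i * s + f, j * s:j * s + f]
            # Max pooling keeps the strongest activation; average pooling smooths
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

x = np.arange(1, 17).reshape(4, 4).astype(float)
print(pool2d(x, mode="max"))   # [[ 6.  8.] [14. 16.]]
print(pool2d(x, mode="avg"))   # [[ 3.5  5.5] [11.5 13.5]]
```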
This (Convolution + Activation + Pooling) is commonly considered one Layer; we repeat it as many times as required, then flatten the result and feed it to a Dense Neural Network (for the Image Classification case)
Typically, most ConvNets look like below.
Going from Left to Right :
- Height and Width of the image will be reduced
- Number of Channels will be increased
🆚 The Theoretical vs Practical Gap
Theoretical Expectation: Theoretically, a Deeper Network should perform at least as well as a Shallower one
Practical Reality: While performing experiments on Networks,
- A 34-layer plain Network showed a higher Validation Error than an 18-layer plain Network
- Training Error for the 34-layer plain Network was also found to be higher than for the 18-layer plain Network
So, we add a Residual Connection, also called a Skip Connection, to pass the current Layer's Activations to a later Layer, skipping some intermediate Layers. This allows us to train much Deeper Networks by mitigating the Vanishing Gradient Problem and fixing the Degradation Problem.
The ResNet authors found that adding Residual Connections allows the Network to be Deep
ResNet Paper Link : https://arxiv.org/abs/1512.03385
📌 Fact : The ResNet Paper popularized the Residual Connection, or Skip Connection, which is even used in the Transformer Architecture!
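A toy NumPy sketch of the idea (weights, shapes, and function names are made up): the skip connection adds the input back to the block's output, so when the learned transformation F(x) is zero, the block reduces to a plain ReLU of the input, which is why the identity is easy for a residual block to represent:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    """y = ReLU(F(x) + x): the skip connection adds the input back,
    so gradients can flow even if F learns (close to) nothing."""
    out = relu(x @ W1)      # first transformation
    out = out @ W2          # second transformation (pre-activation)
    return relu(out + x)    # skip connection: add the input, then activate

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
# If both weight matrices are zero, F(x) = 0 and the block is just ReLU(x):
W_zero = np.zeros((4, 4))
print(residual_block(x, W_zero, W_zero))  # equals relu(x)
```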
Tasks we can do with ConvNet 💪
With ConvNet we can do below tasks,
☝️Image Classification
Where we classify an image, i.e. identify what the image contains
Some Models are,
- AlexNet
  - Fact : AlexNet proved that "Scale + Depth" works (with Large Datasets and Deep Networks) and in general revolutionized Deep Learning
- VGGNet (16, 19)
- Inception (a very different approach)
  - Also called GoogLeNet, a tribute to LeNet by Yann LeCun
  - Inception Network Paper Link : https://arxiv.org/abs/1409.4842
- ResNet : https://arxiv.org/abs/1512.03385
✌️Object Detection
Detect and draw Bounding Boxes around the objects in the image, like a Car or a Pedestrian
Some Models are,
- R-CNN
- YOLO
  - YOLO is such an amazing Paper, covering cool topics like:
    - Intersection Over Union (IoU)
    - Non-Max Suppression
    - Anchor Boxes
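As a taste, IoU is easy to sketch in plain Python (the box format here is assumed to be (x1, y1, x2, y2) corners; the function name is my own):

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap)
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 2x2 boxes overlapping on a 1x2 strip: intersection 2, union 4 + 4 - 2 = 6
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))  # 0.333...
```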
🤟Image Segmentation
We draw a careful outline around each detected object, so we can see which pixels belong to an object and which do not.
It is like painting a color over every possible Object in the image
Some Models are,
FCN
U‑Net 😍
✌️✌️And many more CNN Models
Let us discuss more in another Post!
🤔 What Deep ConvNet Learns?
But the crucial question is: a CNN has all these Layers, so what do these Layers actually learn?
Let us Consider this same image and let us consider that this CNN has 4 Layers
Each Layer Learns Features like,
First Layer : “Straight Line”, “Cross Lines”, etc
Second Layer : “Curves”, “ Common Pattern”, etc
Third Layer : “Cat Nose”, “Dog Ears” etc
Fourth Layer : “Cat Face”, “Dog Face” etc
Earlier Layers learn small patterns like Lines and Shapes, while
Deeper Layers learn Complex Features like Noses, Ears, even whole Faces
💡 ViT vs CNN
CNN is very powerful, yes, even now, and is still competitive with ViT (Vision Transformer)
| CNN | ViT |
|---|---|
| Local receptive fields | Global Attention |
| Good for small datasets | Needs Large Datasets |
| Still competitive! 🥊 | Gaining popularity fast 🚀 |
Let us discuss more!
Thanks for reading