🏞️ CNN Basics
⿻ ConvNet
Convolutional Neural Network (CNN or ConvNet) is a special type of Deep Neural Network used primarily for image-related tasks like Image Classification, Object Detection, Image Segmentation, etc.
🤷🏻♂️ Why Are ConvNets Special?
Local connectivity/Sparsity of Connection : Focuses only on a small part of the input image at a time
Parameter Sharing/Shared Weights : Uses the same Filter in a Layer across the entire image
Translation Invariance : Learns to detect patterns (like edges, nose, even Cat Face) regardless of where they appear in an image
🧱 Building Blocks of a ConvNet
Note : Here we consider the "input" to be an image for understanding purposes, but ConvNets can be applied to other relevant data as well. Mostly, though, it will be images
Let us see the building blocks of ConvNet
1. Input Layer 🖼️
For Image Data, the input is usually a 3D tensor of dimension Height × Width × Channels, where,
H : Height of the image,
W : Width of the image,
C : Number of channels; for example, in RGB the number of channels is 3
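As a quick sketch (using NumPy, with a made-up 32 × 32 RGB image), the input tensor looks like this:

```python
import numpy as np

# A hypothetical 32 x 32 RGB image as an H x W x C tensor (values in [0, 1])
image = np.random.rand(32, 32, 3)

print(image.shape)  # (32, 32, 3) -> H=32, W=32, C=3
```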
2. Convolutional Layer ⿻
Performs a Convolution Operation using Filters (also called Kernels), which is a Linear Operation, similar to Z[l] = W[l]A[l-1] + b[l] in Neural Networks
Note : Generally, in Deep Learning literature, researchers refer to this as the Convolution Operation for historical reasons and simplicity, although what they actually perform is Cross-Correlation
And guess what? It turns out to be more powerful in practice!
Here is a simple illustration of a single step of the Filter Operation
For simplicity, let us consider the Input Image dimension H x W x C as 4 x 4 x 1
Let the Filter dimension as Fh x Fw x Fc as 3 x 3 x 1
Note : C in Input Image and Fc in Filter dimension should be SAME
Generally we also add another dimension to the Filter Operation, i.e. the Number of Filters, but let us see that later, maybe in another post!
Filter Operation
We place the 3 x 3 x 1 Filter over the top-left corner of the Input Image, do an Element-Wise Multiplication, and sum all the results (only over that Local Filter Field)
Simple Example
Input Image
+----+----+----+----+
| 1 | 2 | 3 | 4 |
+----+----+----+----+
| 5 | 6 | 7 | 8 |
+----+----+----+----+
| 9 | 10 | 11 | 12 |
+----+----+----+----+
| 13 | 14 | 15 | 16 |
+----+----+----+----+
Filter(Sobel)
+----+----+----+
| -1 | 0 | +1 |
+----+----+----+
| -2 | 0 | +2 |
+----+----+----+
| -1 | 0 | +1 |
+----+----+----+
Result : After One Step of Filter Operation
+----+----+
| 8 | 8 |
+----+----+
| 8 | 8 |
+----+----+
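To double-check those numbers, here is a minimal NumPy sketch of a valid, stride-1 Filter Operation (cross-correlation, as noted earlier), assuming single-channel input; the function name is my own:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid cross-correlation (what DL libraries call 'convolution'), stride 1."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise multiply the local filter field by the filter, then sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(1, 17).reshape(4, 4).astype(float)  # the 4 x 4 input above
sobel = np.array([[-1, 0, 1],
                  [-2, 0, 2],
                  [-1, 0, 1]], dtype=float)

print(cross_correlate2d(image, sobel))
# [[8. 8.]
#  [8. 8.]]
```

Every output is 8 because each row of this particular input increases by 1 from left to right, so the Sobel filter responds identically everywhere.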
But how did the dimension change to 2 x 2?
It is due to the formula below :-
nh_out = ⌊ (nh + 2p − f) / s ⌋ + 1
nw_out = ⌊ (nw + 2p − f) / s ⌋ + 1
The resulting dimension is "nh_out x nw_out"
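A tiny helper (plain Python, the function name is my own) to compute this output size along one dimension:

```python
from math import floor

def conv_output_size(n, f, p=0, s=1):
    """Output size along one dimension: floor((n + 2p - f) / s) + 1."""
    return floor((n + 2 * p - f) / s) + 1

# 4x4 input, 3x3 filter, no padding, stride 1 -> 2x2 output
print(conv_output_size(4, 3))        # 2
# 5x5 input, 3x3 filter, padding 1, stride 1 -> 5x5 (size preserved)
print(conv_output_size(5, 3, p=1))   # 5
```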
Stride : s
Stride tells how many pixels to move the Filter over the Input Image
Padding : p
Padding means adding extra pixels (usually zeros) around the border of the Input Image before applying the Convolution
For example, if p = 1, we surround the Input Image with a border of "0"s, which looks like below for our case
+----+----+----+----+----+----+
| 0 | 0 | 0 | 0 | 0 | 0 |
+----+----+----+----+----+----+
| 0 | 1 | 2 | 3 | 4 | 0 |
+----+----+----+----+----+----+
| 0 | 5 | 6 | 7 | 8 | 0 |
+----+----+----+----+----+----+
| 0 | 9 | 10 | 11 | 12 | 0 |
+----+----+----+----+----+----+
| 0 | 13 | 14 | 15 | 16 | 0 |
+----+----+----+----+----+----+
| 0 | 0 | 0 | 0 | 0 | 0 |
+----+----+----+----+----+----+
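In NumPy this zero-padding is just `np.pad`; a small sketch for our 4 x 4 example:

```python
import numpy as np

image = np.arange(1, 17).reshape(4, 4)
# p = 1: surround the image with one ring of zeros
padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)

print(padded.shape)  # (6, 6) -> each spatial dimension grows by 2p
print(padded)
```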
The formula for the resulting dimension after padding (with stride 1) is,
(n + 2p − f + 1) × (n + 2p − f + 1)
Generally, after one layer of Convolution, the resulting dimension changes.
Now we need to talk about two kinds of Convolution,
Valid Convolution
Valid Convolution means p = 0, where no padding is applied and the resulting dimension is reduced.
Also, if no padding is applied, consider the top-left corner of the image: the Filter covers that corner far fewer times than the central pixels, so fewer features are learned there. Padding helps!
Same Convolution
When "Same Convolution" is applied, we need to add padding so that the following holds,
Input Dimension = Resulting Dimension after Convolution Operation
How to calculate that specific Padding “p”?
Here is how we derive “p”,
Let Output Dimension = Input Dimension (with stride s = 1), so

n + 2p − f + 1 = n
2p − f + 1 = 0
2p = f − 1

p = (f − 1) / 2
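A quick sanity check of p = (f − 1)/2 in plain Python (helper name is my own); note it assumes stride 1 and an odd filter size:

```python
def same_padding(f):
    """Padding that preserves the input size for stride-1 convolution."""
    assert f % 2 == 1, "same padding needs an odd filter size"
    return (f - 1) // 2

n = 8  # any input size
for f in (3, 5, 7):
    p = same_padding(f)
    out = (n + 2 * p - f) + 1  # stride-1 output size
    print(f"f={f} -> p={p}, output {out} == input {n}")
```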
Then we add a bias and apply an Activation Function like ReLU to the result of the Filter Operation, i.e. A = ReLU(Z + b)
Then we will apply Pooling Layer
Pooling
Pooling helps speed up the Computation and also makes the detected features more robust
Two common types are :
Max Pooling
Unlike the Filter Operation, there is no Element-Wise Multiplication here; we simply take the maximum number in the Filter Region of the Input Image and place it in the respective position in the output
Example,
Here f = 2, s = 2
Input
+----+----+----+----+
| 1 | 2 | 3 | 4 |
+----+----+----+----+
| 5 | 6 | 7 | 8 |
+----+----+----+----+
| 9 | 10 | 11 | 12 |
+----+----+----+----+
| 13 | 14 | 15 | 16 |
+----+----+----+----+
Output
+----+----+
| 6 | 8 |
+----+----+
| 14 | 16 |
+----+----+
Average Pooling
Here we take the average of the numbers in the Filter Region of the Input Image and place it in the respective position in the output
Example,
Here f = 2, s = 2
Input
+----+----+----+----+
| 1 | 2 | 3 | 4 |
+----+----+----+----+
| 5 | 6 | 7 | 8 |
+----+----+----+----+
| 9 | 10 | 11 | 12 |
+----+----+----+----+
| 13 | 14 | 15 | 16 |
+----+----+----+----+
Output
+-------+-------+
| 3.5 | 5.5 |
+-------+-------+
| 11.5 | 13.5 |
+-------+-------+
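Both pooling types can be sketched in a few lines of NumPy (the function name is my own); the outputs match the tables above:

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Pooling with window size f and stride s; no learned parameters."""
    oh = (x.shape[0] - f) // s + 1
    ow = (x.shape[1] - f) // s + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = x[i * s:i * s + f, j * s:j * s + f]
            # Max pooling keeps the strongest activation; average pooling smooths
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

x = np.arange(1, 17).reshape(4, 4).astype(float)
print(pool2d(x, mode="max"))   # [[ 6.  8.] [14. 16.]]
print(pool2d(x, mode="avg"))   # [[ 3.5  5.5] [11.5 13.5]]
```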
This (Convolution + Activation + Pooling) is commonly considered one Layer; we repeat it as many times as required, then flatten the result and feed it to a Dense Neural Network (for the Image Classification case)
Typically, most ConvNets look like below.
Going from Left to Right :
- Height and Width of the image will be reduced
- Number of Channels will be increased
🆚 The Theoretical vs Practical Gap
Theoretical Expectation: Theoretically, a Deeper Network should perform at least as well as a Shallower one
Practical Reality: While performing experiments on Networks,
- A 34-layer plain Network showed a higher Validation Error than an 18-layer plain Network
- Training Error for the 34-layer plain Network was also found to be higher than for the 18-layer plain Network
So, we add a Residual Connection, also called a Skip Connection, to pass the current Layer's Activations to a later Layer, skipping some intermediate Layers. This allows us to train much Deeper Networks by mitigating the Vanishing Gradient Problem and fixing the Degradation Problem.
The ResNet authors found that adding Residual Connections allows the Network to be Deep
ResNet Paper Link : https://arxiv.org/abs/1512.03385
📌 Fact : The ResNet Paper popularized the Residual Connection, or Skip Connection, which is even used in the Transformer Architecture!
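A toy NumPy sketch of the idea (weights, shapes, and function names are made up): the skip connection adds the input back to the block's output, so when the learned transformation F(x) is zero, the block reduces to a plain ReLU of the input, which is why the identity is easy for a residual block to represent:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    """y = ReLU(F(x) + x): the skip connection adds the input back,
    so gradients can flow even if F learns (close to) nothing."""
    out = relu(x @ W1)      # first transformation
    out = out @ W2          # second transformation (pre-activation)
    return relu(out + x)    # skip connection: add the input, then activate

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
# If both weight matrices are zero, F(x) = 0 and the block is just ReLU(x):
W_zero = np.zeros((4, 4))
print(residual_block(x, W_zero, W_zero))  # equals relu(x)
```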
Tasks we can do with ConvNet 💪
With ConvNet we can do below tasks,
☝️Image Classification
Where we classify an image, i.e. identify what the image contains
Some Models are,
- AlexNet
  - Fact : AlexNet proved that "Scale + Depth" works (with Large Datasets and Deep Networks) and in general revolutionized Deep Learning
- VGGNet (16, 19)
- Inception (a very different approach)
  - Also called GoogLeNet, a tribute to LeNet by Yann LeCun
  - Inception Network Paper Link : https://arxiv.org/abs/1409.4842
- ResNet : https://arxiv.org/abs/1512.03385
✌️Object Detection
Detect and draw Bounding Boxes around the objects in the image, like a Car or a Pedestrian
Some Models are,
- R-CNN
- YOLO
  - YOLO is such an amazing Paper, covering cool topics like:
    - Intersection Over Union (IoU)
    - Non-Max Suppression
    - Anchor Boxes
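As a taste, IoU is easy to sketch in plain Python (the box format here is assumed to be (x1, y1, x2, y2) corners; the function name is my own):

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap)
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 2x2 boxes overlapping on a 1x2 strip: intersection 2, union 4 + 4 - 2 = 6
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))  # 0.333...
```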
🤟Image Segmentation
We draw a careful outline around each detected object, so we can see which pixels belong to an object and which do not.
It is like painting a color over every possible Object in the image
Some Models are,
FCN
U‑Net 😍
✌️✌️And many more CNN Models
Let us discuss more in another Post!
🤔 What Deep ConvNet Learns?
But the crucial question is: a CNN has all these Layers, so what do these Layers actually learn?
Let us Consider this same image and let us consider that this CNN has 4 Layers
Each Layer Learns Features like,
First Layer : “Straight Line”, “Cross Lines”, etc
Second Layer : “Curves”, “ Common Pattern”, etc
Third Layer : “Cat Nose”, “Dog Ears” etc
Fourth Layer : “Cat Face”, “Dog Face” etc
Earlier Layers learn small patterns like Lines and Shapes, while
Deeper Layers learn Complex Features like Noses, Ears, even whole Faces
💡 ViT vs CNN
CNN is very powerful, yes, even now, and is still competitive with ViT (Vision Transformer)
| CNN | ViT |
|---|---|
| Local receptive fields | Global Attention |
| Good for small datasets | Needs Large Datasets |
| Still competitive! 🥊 | Gaining popularity fast 🚀 |
Let us discuss more!
Thanks for reading