Introduction to Computer Vision

Mar 13, 2021

Computer vision is an application of deep learning that enables a computer to see, identify, and locate objects in an image or video, much as a human does.
One important application of computer vision is self-driving cars, where the machine learns to identify what is in front of it in order to decide on its next move.
Other applications include face recognition, video surveillance, and gesture recognition.

Classification Loss

The above loss is with respect to classification error.

1_i^obj is 1 only if an object is present in the i-th grid cell. This term is used because we don't want to penalize the classification error when there is no object present in the cell.

FC to Conv Layer

Sliding Window Convolution Implement

Let's say that we have a test image of size 16 x 16 x 3.
Using the previous model, trained on 14 x 14 x 3 images, we scan the test image by sliding the window two pixels at a time.

Sliding Window Drawback

As observed in the previous card, the CNN model has to run four times to scan the full image.
If the test image is of size 1000 x 1000, we need to run the model a very large number of times, and since we slide by only a few pixels each time, the model scans the same areas redundantly.
To overcome this, we implement the sliding window convolutionally, as shown in the next card.
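To make the cost concrete, here is a minimal sketch that counts how many times a naive sliding window has to run the model. The function name `sliding_window_runs` is hypothetical, and the stride of 2 for the 1000 x 1000 case is an assumption carried over from the 16 x 16 example.

```python
def sliding_window_runs(image_size, window_size, stride):
    """Count window positions for a square image and a square window."""
    # Positions along one axis, then squared for both axes.
    per_axis = (image_size - window_size) // stride + 1
    return per_axis * per_axis


print(sliding_window_runs(16, 14, 2))    # 4
print(sliding_window_runs(1000, 14, 2))  # 244036
```

With a 14 x 14 window and a stride of 2, almost all of each window overlaps its neighbours, which is the redundancy the convolutional implementation avoids.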

Sliding Window — Convolution

ELEMENTS OF OBJECT DETECTION

Object Localization

Unlike classification networks such as ResNet or VGG, an object detection algorithm has to identify multiple objects and specify their exact locations, as shown in the image.
This task of predicting bounding boxes around the objects is known as object localization.

Grid Cells

Object localization needs to predict the height, width, and location of the bounding box around each object.
Before specifying the bounding box attributes of each object, the image is divided into S × S grid cells, as shown in the picture.
If the center of an object falls in a grid cell, then that grid cell is responsible for predicting the object.
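The assignment rule above can be sketched in a few lines. The function name `responsible_cell`, the pixel-coordinate inputs, and the 448 x 448 image size are assumptions made for illustration only.

```python
def responsible_cell(center_x, center_y, image_w, image_h, S):
    """Return (row, col) of the grid cell containing the object's center."""
    # Scale the center into grid units and truncate to a cell index.
    # min() keeps a center lying exactly on the far edge inside the grid.
    col = min(int(center_x / image_w * S), S - 1)
    row = min(int(center_y / image_h * S), S - 1)
    return row, col


# Hypothetical 448 x 448 image split into a 7 x 7 grid.
print(responsible_cell(200, 300, 448, 448, 7))  # (4, 3)
```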

Bounding Box

Each bounding box is defined by its center (b_x, b_y), height b_h, and width b_w, each of which lies in the range 0 to 1.
The image shows the computation of the bounding box attributes. Note that (b_x, b_y) is calculated relative to the grid cell that contains the center of the object.
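As a rough sketch of that computation, the snippet below encodes a pixel-space box into (b_x, b_y, b_h, b_w). It assumes that the center is expressed as an offset within its grid cell and that height and width are normalized by the image size; the source only states that all four values lie between 0 and 1, so treat the normalization of b_h and b_w as an assumption. The function name `encode_box` is hypothetical.

```python
def encode_box(center_x, center_y, box_w, box_h, image_w, image_h, S):
    """Encode a pixel-space box as ((row, col), (b_x, b_y, b_h, b_w))."""
    cell_w = image_w / S
    cell_h = image_h / S
    # Grid cell that contains the box center.
    col = min(int(center_x / cell_w), S - 1)
    row = min(int(center_y / cell_h), S - 1)
    # Center as a fractional offset inside that cell (0 to 1).
    b_x = center_x / cell_w - col
    b_y = center_y / cell_h - row
    # Assumption: height and width are fractions of the whole image.
    b_h = box_h / image_h
    b_w = box_w / image_w
    return (row, col), (b_x, b_y, b_h, b_w)


cell, box = encode_box(200, 300, 128, 224, 448, 448, 7)
print(cell)  # (4, 3)
print(box)   # b_x = 0.125, b_y = 0.6875, b_h = 0.5, b_w is about 0.286
```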

Target Label

A target label y is defined for each of the grid cells.

y is a vector given by y = [p, b_x, b_y, b_h, b_w, c_1, c_2, …, c_n]^T

p is known as the object confidence; it gives the probability that an object is present in the bounding box.
c_1, c_2, …, c_n are the class confidences. For example, if you have two classes to identify, a car and a pedestrian, then c_1 gives the probability that the grid cell has a car and c_2 gives the probability that it has a pedestrian.
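A minimal sketch of building the target vector for one grid cell follows. The helper `make_target` is hypothetical, and filling the entries of an empty cell with zeros is an assumed convention (those entries are not used when no object is present).

```python
def make_target(has_object, box=(0.0, 0.0, 0.0, 0.0), class_index=None, num_classes=2):
    """Build y = [p, b_x, b_y, b_h, b_w, c_1, ..., c_n] for one grid cell."""
    y = [0.0] * (5 + num_classes)
    if has_object:
        y[0] = 1.0                  # p: an object is present
        y[1:5] = list(box)          # (b_x, b_y, b_h, b_w)
        y[5 + class_index] = 1.0    # one-hot class confidence
    return y


# Two classes as in the example above: index 0 is car (c_1), index 1 is pedestrian (c_2).
print(make_target(True, (0.125, 0.6875, 0.5, 0.25), class_index=0))
# [1.0, 0.125, 0.6875, 0.5, 0.25, 1.0, 0.0]
print(make_target(False))
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```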

Object Confidence vs Class Confidence

The object confidence p is different from the class confidence c.

p is the probability that an object is present within the bounding box, irrespective of the class of the object.
c is the probability that the object belongs to a particular class, given that an object is present.
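One way to see the difference is to combine the two values. Reading c as a probability conditioned on an object being present, the product p * c gives a class-specific score for the box. This is a sketch under that reading; the function name `class_scores` and the numbers are made up for illustration.

```python
def class_scores(p, class_confidences):
    """Multiply the object confidence by each class confidence."""
    return [p * c for c in class_confidences]


# Hypothetical cell: 80% sure there is an object; if so, 90% car and 10% pedestrian.
scores = class_scores(0.8, [0.9, 0.1])
print(scores)  # roughly [0.72, 0.08]
```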

IOU (Intersection Over Union)

Intersection over union (IOU) is a measure of the accuracy of the predicted bounding box against the ground truth box (the actual bounding box).
It is the ratio of the area covered by the intersection of the ground truth box and the predicted box to the area covered by the union of these two boxes.
The maximum possible value of IOU is 1. If the measured IOU is greater than the set threshold, we can conclude that the predicted bounding box is close to the ground truth box.
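The ratio described above can be sketched as follows. The corner format (x_min, y_min, x_max, y_max) is an assumption chosen for convenience; it is not the center/height/width format used for the target label.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union = sum of both areas minus the overlap counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 (identical boxes)
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0 (no overlap)
```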

IOU Illustration

YOLO (You Only Look Once) ALGORITHM

In the You Only Look Once (YOLO) algorithm, you run the image through a CNN model and detect the objects in a single pass.
This algorithm identifies multiple bounding boxes for the same object. Hence, we use a method called non-max suppression to keep a single prediction box for each object in the image. The rest of the cards show, step by step, how the YOLO algorithm works.

Yolo-v2 Configuration

The YOLO architecture recognizes objects falling in a 7 x 7 grid of cells.

Yolo Output

Once you pass an image through a YOLO model, the output will have lots of overlapping predictions across adjacent grid cells as shown in the image.
We selectively filter out bounding boxes using a non-max suppression technique to have one bounding box for each object.

Non-Max Suppression

The following steps are performed in non-max suppression:
1. Discard all the boxes with object confidence p <= 0.6.
2. While there are any remaining boxes:
   a. Pick the box with the largest p and output it as a prediction.
   b. Eliminate any remaining box that has IOU >= 0.5 with the box output in the previous step.
   c. Repeat the previous two steps for the box with the next largest p.
The thresholds for object confidence and IOU are not always fixed at 0.6 and 0.5 respectively; they may vary depending on the problem statement.
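The procedure above can be sketched as below. The input format, a list of (p, (x_min, y_min, x_max, y_max)) tuples, and the function names are assumptions for illustration; the default thresholds are the 0.6 and 0.5 values quoted above.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, conf_threshold=0.6, iou_threshold=0.5):
    """Keep one box per object from a list of (p, box) detections."""
    # Discard low-confidence boxes (p <= threshold), then sort by p, highest first.
    remaining = [d for d in detections if d[0] > conf_threshold]
    remaining.sort(key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        # Output the most confident remaining box as a prediction.
        best = remaining.pop(0)
        kept.append(best)
        # Eliminate boxes that overlap it with IOU >= threshold.
        remaining = [d for d in remaining if iou(best[1], d[1]) < iou_threshold]
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),    # overlaps the first box heavily
    (0.7, (20, 20, 30, 30)),  # a separate object
    (0.5, (0, 0, 10, 10)),    # below the confidence threshold
]
print(non_max_suppression(detections))
# [(0.9, (0, 0, 10, 10)), (0.7, (20, 20, 30, 30))]
```

This sketch ignores classes; in practice the suppression is often run separately for each class so that overlapping objects of different classes are not removed.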

Yolo Loss Function

The YOLO loss function includes four parts:

Error in bounding box centers
Error in bounding box dimensions
Loss related to confidence score
Object classification loss

The sum of the above four losses is the final loss J.
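To illustrate how the four parts combine, here is a simplified sketch for one predicted box per grid cell, using the target vector layout [p, b_x, b_y, b_h, b_w, c_1, …, c_n]. Several details are assumptions, not taken from this article: the squared-error form, the square root on height and width, and the weights 5 and 0.5, which follow the original YOLO paper. Using p itself as the confidence target is a further simplification, and predicted heights and widths are assumed to be non-negative.

```python
import math

LAMBDA_COORD = 5.0   # assumed weight on the box terms (original YOLO paper value)
LAMBDA_NOOBJ = 0.5   # assumed weight on confidence error in empty cells


def cell_loss(y_true, y_pred):
    """Loss for one grid cell; both vectors are [p, b_x, b_y, b_h, b_w, c_1, ..., c_n]."""
    p, bx, by, bh, bw = y_true[:5]
    p_hat, bx_hat, by_hat, bh_hat, bw_hat = y_pred[:5]
    if p == 1.0:
        # 1. Error in bounding box centers.
        center = LAMBDA_COORD * ((bx - bx_hat) ** 2 + (by - by_hat) ** 2)
        # 2. Error in bounding box dimensions (square roots soften large boxes).
        dims = LAMBDA_COORD * ((math.sqrt(bh) - math.sqrt(bh_hat)) ** 2
                               + (math.sqrt(bw) - math.sqrt(bw_hat)) ** 2)
        # 3. Confidence loss for a cell that contains an object.
        confidence = (p - p_hat) ** 2
        # 4. Classification loss, counted only when an object is present.
        classes = sum((c - c_hat) ** 2 for c, c_hat in zip(y_true[5:], y_pred[5:]))
    else:
        # Empty cell: only the down-weighted confidence error is penalized.
        center = dims = classes = 0.0
        confidence = LAMBDA_NOOBJ * (p - p_hat) ** 2
    return center + dims + confidence + classes


def total_loss(targets, predictions):
    """Final loss J: the sum of the per-cell losses over all grid cells."""
    return sum(cell_loss(t, p) for t, p in zip(targets, predictions))
```

A perfect prediction gives a loss of 0, and for an empty cell the predicted box and class values do not contribute at all.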