
Convolutional Neural Networks


Computer Vision Tasks

Generally, we want to automate the derivation of useful information from images. Some example tasks include:

  • classification: given an image, predict a class label
  • object detection: generate a bounding box around the object
  • semantic segmentation: assign every pixel in the image a class label
  • instance segmentation: differentiate between multiple instances of the same semantic class
  • pose recognition: for example, estimating the pose of a head, which can be used to determine where the person is looking
  • activity recognition: related to pose recognition; classify an activity from a pose or a series of poses
  • object tracking: propose correspondence of detected objects across frames of a video
  • image restoration: recover a clean image from a degraded or noisy one
  • feature matching: detection of features and correspondence between multiple views

Architecture

In this workshop, we will use RGB images. Each image has dimensionality H x W x 3, where H is the height of the image, W is the width of the image, and every pixel has three color channels (Red, Green, Blue).
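
For concreteness, here is a minimal sketch of what that representation looks like in code, assuming Pillow and NumPy are installed (the file name is a placeholder):

```python
import numpy as np
from PIL import Image

# Load an image as an H x W x 3 array; "cat.png" is a hypothetical file.
img = np.asarray(Image.open("cat.png").convert("RGB"))
print(img.shape)  # (H, W, 3): height, width, and the Red/Green/Blue channels
```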

To motivate the need for a specialized neural network architecture for images, consider classifying digits from MNIST with a fully connected neural network. In this case each pixel takes on a single greyscale value, so the image has H x W values. For the fully connected network, we unravel the image into a one-dimensional vector of length HW, reading values either row-wise or column-wise.
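
A quick sketch of that unraveling in NumPy, using a single white pixel at an arbitrary position for illustration:

```python
import numpy as np

# A 28 x 28 MNIST-style greyscale image with a single white pixel.
image = np.zeros((28, 28), dtype=np.float32)
image[10, 10] = 1.0

# Unravel row-wise: pixel (row, col) lands at index row * W + col.
flat = image.reshape(-1)
print(flat.shape)    # (784,)
print(flat.argmax()) # 290, i.e. 10 * 28 + 10
```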

Consider two such images that are identical except that the single white pixel is shifted. Despite both images having the same structure, the location of the white value is shifted in the vector, meaning it interacts with a completely different set of weights and biases. This is a toy example of a larger problem: for classification, we would like to recognize an image that features a cat whether the cat is in the upper left or the lower right of the image. This property is known as translation invariance, and convolution is the linear operator that enables it in CNNs.
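
This sliding behavior is easy to see numerically. Below is a small sketch using scipy.signal.correlate2d (what deep learning calls convolution is, strictly speaking, cross-correlation); the all-ones kernel is just a stand-in feature detector:

```python
import numpy as np
from scipy.signal import correlate2d

kernel = np.ones((3, 3))  # stand-in "feature detector"

a = np.zeros((8, 8)); a[2, 2] = 1.0   # feature in the upper left
b = np.zeros((8, 8)); b[5, 5] = 1.0   # same feature, lower right

ra = correlate2d(a, kernel, mode="valid")
rb = correlate2d(b, kernel, mode="valid")

# The response map is the same in both cases, shifted along with the feature.
print(np.argwhere(ra == ra.max())[0])  # [0 0]
print(np.argwhere(rb == rb.max())[0])  # [3 3]
```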

Some observations before we start: we have two new operations, convolution and pooling, alongside a familiar activation function, ReLU. Based on the output of the fully connected layers, we are performing a classification task over some number of classes. Note also that the network reduces the dimensionality of the input down to some number of features, which are then fed into a fully connected network that generates the predicted class from this reduced set of features.
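
As a rough sketch of that pipeline (a toy model, not any specific published architecture), assuming PyTorch is installed, 32 x 32 RGB inputs, and 10 classes:

```python
import torch
import torch.nn as nn

# Convolutions and pooling reduce the input to a small set of features,
# which a fully connected head then classifies.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3 channels in, 16 out
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 32 x 32 -> 16 x 16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 16 x 16 -> 8 x 8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # 10-way classification
)

x = torch.randn(1, 3, 32, 32)  # one 32 x 32 RGB image
print(model(x).shape)          # torch.Size([1, 10]): one score per class
```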

Convolution

Convolution passes kernels (filters) over an image, producing a high response where the image matches the filter and a low response elsewhere. The kernels have learnable parameters, which means that we can use the loss and backpropagation to learn kernels that find salient features.
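
As a small illustration of the matched-filter idea, consider a hand-written vertical-edge kernel applied to an image containing a vertical edge (again via scipy's cross-correlation):

```python
import numpy as np
from scipy.signal import correlate2d

# An image with a vertical edge: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A hand-written kernel that responds to dark-to-bright vertical edges.
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])

response = correlate2d(image, kernel, mode="valid")
print(response)  # 2.0 all along the edge (column 2), 0.0 everywhere else
```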

Here's an example of the learned first-layer filter bank of AlexNet (Krizhevsky et al.), trained on ImageNet. Note that some of the filters are what we would expect: the network has learned to look for lines at various angles, as well as for dots. Not all of the filters are easily interpretable.

Before neural networks, there was significant effort to hand-design kernels that detect small features such as edges and corners, and to use these for computer vision tasks. Neural networks allow us to set up a convolutional architecture that learns both the kernels that are useful for the given task and the mapping from the feature space to the output.

There is some additional complexity that we won't spend much time discussing: the stride, which sets how far the kernel moves between applications, and padding, which controls how the boundary of the image is handled. Both affect the size of the output feature map.
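
The standard formula for the output size along one spatial dimension is floor((n + 2p - k) / s) + 1, for input size n, kernel size k, padding p, and stride s. A small helper makes it concrete (the function name is ours, not from any library):

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size for one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

# 32x32 input, 5x5 kernel: no padding shrinks the map, padding=2 preserves
# the size, and stride=2 halves it.
print(conv_output_size(32, 5))                       # 28
print(conv_output_size(32, 5, padding=2))            # 32
print(conv_output_size(32, 5, stride=2, padding=2))  # 16
```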

Translation invariance: no matter where a feature is located in the image, the kernel will detect it as it slides around the input. Parameter sharing: the resulting network also benefits from a hugely reduced number of parameters, since the same small kernel is reused at every spatial location instead of each position having its own weights.
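
A back-of-the-envelope comparison makes the savings concrete. The layer sizes below are illustrative choices, not taken from any particular network:

```python
# Fully connected layer mapping a 224x224x3 image to 64 feature maps of the
# same resolution, vs. a 3x3 convolution producing the same 64 channels.
h, w, c_in, c_out = 224, 224, 3, 64

fc_params   = (h * w * c_in) * (h * w * c_out)   # one weight per input-output pair
conv_params = (3 * 3 * c_in) * c_out + c_out     # 3x3 kernels plus biases

print(f"{fc_params:,}")    # 483,385,147,392
print(f"{conv_params:,}")  # 1,792
```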

Pooling

These feature maps still have very high dimension, so we reduce their spatial resolution by pooling the filter activations through a pooling layer, which aggregates activations over small windows.
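
As an illustration, here is 2x2 max pooling, which keeps only the largest activation in each window (a common choice; average pooling is another), written from scratch in NumPy:

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling with stride 2: keep the largest activation per window."""
    h, w = x.shape
    return x[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

act = np.array([[1., 3., 2., 0.],
                [4., 2., 1., 1.],
                [0., 1., 5., 2.],
                [1., 0., 2., 3.]])
print(max_pool2x2(act))
# [[4. 2.]
#  [1. 5.]]
```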

Activations

Intuition: as we move deeper into the network, the activations progress from low-level, detailed features (edges, corners, dots) to a high-level semantic description that is useful for classification.

Case study: VGG-16 Architecture
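
torchvision ships a VGG-16 implementation that makes the architecture easy to inspect; a minimal sketch, assuming a recent PyTorch and torchvision are installed:

```python
import torch
from torchvision.models import vgg16

model = vgg16(weights=None)  # untrained VGG-16; pass pretrained weights to use it

x = torch.randn(1, 3, 224, 224)  # one 224 x 224 RGB image
features = model.features(x)     # the convolutional feature extractor
print(features.shape)            # torch.Size([1, 512, 7, 7])
print(model.classifier)          # the fully connected classification head
```

The feature extractor stacks thirteen 3x3 convolutions with five max-pooling stages, reducing a 224 x 224 input to a 7 x 7 map with 512 channels; the classifier is three fully connected layers that produce the 1000 ImageNet class scores.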

Sources for images and additional resources
