
Convolutional Neural Networks


Computer Vision Tasks

Generally, we want to automate the derivation of useful information from images. Some example tasks include:

  • classification: given an image, predict a class label
  • object detection: generate a bounding box around the object
  • semantic segmentation: assign every pixel in the image a class label
  • instance segmentation: differentiate between multiple instances of the same semantic class
  • pose recognition: for example, estimating the pose of a head, which can be used to determine where the person is looking
  • activity recognition: related to pose recognition; classify an activity from a pose or a series of poses
  • object tracking: propose correspondence of detected objects across frames of a video
  • image restoration: recover a clean image from a degraded or noisy one
  • feature matching: detection of features and correspondence between multiple views

Architecture

In this workshop, we will use RGB images. Each image has dimensionality H x W x 3, where H is the height of the image, W is the width of the image, and every pixel has three color channels (Red, Green, Blue).
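
For concreteness, here is a minimal sketch of what that representation looks like in code, assuming Pillow and NumPy are installed (the file name is a placeholder):

```python
import numpy as np
from PIL import Image

# Load an image as an H x W x 3 array; "cat.png" is a hypothetical file.
img = np.asarray(Image.open("cat.png").convert("RGB"))
print(img.shape)  # (H, W, 3): height, width, and the Red/Green/Blue channels
```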

To motivate the need for a specialized neural network architecture for images, consider classifying digits from MNIST with a fully connected neural network. In this case each pixel takes on a single greyscale value, so the image has H x W values. For the fully connected network, we unravel the image into a one-dimensional vector of length HW, reading values either row-wise or column-wise.
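
A quick sketch of that unraveling in NumPy, using a single white pixel at an arbitrary position for illustration:

```python
import numpy as np

# A 28 x 28 MNIST-style greyscale image with a single white pixel.
image = np.zeros((28, 28), dtype=np.float32)
image[10, 10] = 1.0

# Unravel row-wise: pixel (row, col) lands at index row * W + col.
flat = image.reshape(-1)
print(flat.shape)    # (784,)
print(flat.argmax()) # 290, i.e. 10 * 28 + 10
```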

Consider two such images that are identical except that the single white pixel is shifted. Despite both images having the same structure, the location of the white value is shifted in the vector, meaning it interacts with a completely different set of weights and biases. This is a toy example of a larger problem: for classification, we would like to recognize an image that features a cat whether the cat is in the upper left or the lower right of the image. This property is known as translation invariance, and convolution is the linear operator that enables it in CNNs.
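
This sliding behavior is easy to see numerically. Below is a small sketch using scipy.signal.correlate2d (what deep learning calls convolution is, strictly speaking, cross-correlation); the all-ones kernel is just a stand-in feature detector:

```python
import numpy as np
from scipy.signal import correlate2d

kernel = np.ones((3, 3))  # stand-in "feature detector"

a = np.zeros((8, 8)); a[2, 2] = 1.0   # feature in the upper left
b = np.zeros((8, 8)); b[5, 5] = 1.0   # same feature, lower right

ra = correlate2d(a, kernel, mode="valid")
rb = correlate2d(b, kernel, mode="valid")

# The response map is the same in both cases, shifted along with the feature.
print(np.argwhere(ra == ra.max())[0])  # [0 0]
print(np.argwhere(rb == rb.max())[0])  # [3 3]
```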

Some observations before we start: we have two new operations, convolution and pooling, alongside a familiar activation function, ReLU. Based on the output of the fully connected layers, we are performing a classification task over some number of classes. Note also that the network reduces the dimensionality of the input down to some number of features, which are then fed into a fully connected network that generates the predicted class from this reduced set of features.
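
As a rough sketch of that pipeline (a toy model, not any specific published architecture), assuming PyTorch is installed, 32 x 32 RGB inputs, and 10 classes:

```python
import torch
import torch.nn as nn

# Convolutions and pooling reduce the input to a small set of features,
# which a fully connected head then classifies.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3 channels in, 16 out
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 32 x 32 -> 16 x 16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 16 x 16 -> 8 x 8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # 10-way classification
)

x = torch.randn(1, 3, 32, 32)  # one 32 x 32 RGB image
print(model(x).shape)          # torch.Size([1, 10]): one score per class
```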

Convolution

Convolution passes kernels (filters) over an image, producing a high response where the image matches the filter and a low response elsewhere. The kernels have learnable parameters, which means that we can use the loss and backpropagation to learn kernels that find salient features.
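
As a small illustration of the matched-filter idea, consider a hand-written vertical-edge kernel applied to an image containing a vertical edge (again via scipy's cross-correlation):

```python
import numpy as np
from scipy.signal import correlate2d

# An image with a vertical edge: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A hand-written kernel that responds to dark-to-bright vertical edges.
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])

response = correlate2d(image, kernel, mode="valid")
print(response)  # 2.0 all along the edge (column 2), 0.0 everywhere else
```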

Here's an example of the learned first-layer filter bank of AlexNet (Krizhevsky et al.), trained on ImageNet. Note that some of the filters are what we would expect: the network has learned to look for lines at various angles, as well as for dots. Not all of the filters are easily interpretable.

Before neural networks, there was significant effort to hand-design kernels that detect small features such as edges and corners, and to use these for computer vision tasks. Neural networks allow us to set up a convolutional architecture that learns both the kernels that are useful for the given task and the mapping from the feature space to the output.

There is some additional complexity that we won't spend much time discussing: the stride, which sets how far the kernel moves between applications, and padding, which controls how the boundary of the image is handled. Both affect the size of the output feature map.
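
The standard formula for the output size along one spatial dimension is floor((n + 2p - k) / s) + 1, for input size n, kernel size k, padding p, and stride s. A small helper makes it concrete (the function name is ours, not from any library):

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size for one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

# 32x32 input, 5x5 kernel: no padding shrinks the map, padding=2 preserves
# the size, and stride=2 halves it.
print(conv_output_size(32, 5))                       # 28
print(conv_output_size(32, 5, padding=2))            # 32
print(conv_output_size(32, 5, stride=2, padding=2))  # 16
```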

Translation invariance: no matter where a feature is located in the image, the kernel will detect it as it slides around the input. Parameter sharing: the resulting network also benefits from a hugely reduced number of parameters, since the same small kernel is reused at every spatial location instead of each position having its own weights.
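
A back-of-the-envelope comparison makes the savings concrete. The layer sizes below are illustrative choices, not taken from any particular network:

```python
# Fully connected layer mapping a 224x224x3 image to 64 feature maps of the
# same resolution, vs. a 3x3 convolution producing the same 64 channels.
h, w, c_in, c_out = 224, 224, 3, 64

fc_params   = (h * w * c_in) * (h * w * c_out)   # one weight per input-output pair
conv_params = (3 * 3 * c_in) * c_out + c_out     # 3x3 kernels plus biases

print(f"{fc_params:,}")    # 483,385,147,392
print(f"{conv_params:,}")  # 1,792
```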

Pooling

These feature maps still have very high dimension, so we reduce their spatial resolution by pooling the filter activations through a pooling layer, which aggregates activations over small windows.
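
As an illustration, here is 2x2 max pooling, which keeps only the largest activation in each window (a common choice; average pooling is another), written from scratch in NumPy:

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling with stride 2: keep the largest activation per window."""
    h, w = x.shape
    return x[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

act = np.array([[1., 3., 2., 0.],
                [4., 2., 1., 1.],
                [0., 1., 5., 2.],
                [1., 0., 2., 3.]])
print(max_pool2x2(act))
# [[4. 2.]
#  [1. 5.]]
```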

Activations

Intuition: as we move deeper into the network, the activations progress from low-level, detailed features (edges, corners, dots) to a high-level semantic description that is useful for classification.

Case study: VGG-16 Architecture
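
torchvision ships a VGG-16 implementation that makes the architecture easy to inspect; a minimal sketch, assuming a recent PyTorch and torchvision are installed:

```python
import torch
from torchvision.models import vgg16

model = vgg16(weights=None)  # untrained VGG-16; pass pretrained weights to use it

x = torch.randn(1, 3, 224, 224)  # one 224 x 224 RGB image
features = model.features(x)     # the convolutional feature extractor
print(features.shape)            # torch.Size([1, 512, 7, 7])
print(model.classifier)          # the fully connected classification head
```

The feature extractor stacks thirteen 3x3 convolutions with five max-pooling stages, reducing a 224 x 224 input to a 7 x 7 map with 512 channels; the classifier is three fully connected layers that produce the 1000 ImageNet class scores.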

Sources for images and additional resources
