Convolutional Neural Networks

Computer Vision Tasks

Generally, we want to automate the derivation of useful information from digital images. A typical image is an N×M array where each location has 3-4 channels, e.g. RGBA: red, green, blue, and alpha. We extend this into the video realm with a sequence of images in time. Other data sources include 3D point clouds from LiDAR scanners.
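
As a quick illustration of this layout, a minimal NumPy sketch (the array sizes here are arbitrary):

```python
import numpy as np

# An RGBA image: an N x M grid with 4 channels per pixel.
N, M = 480, 640
image = np.zeros((N, M, 4), dtype=np.uint8)  # red, green, blue, alpha
image[100, 200] = [255, 0, 0, 255]           # one opaque red pixel

# A video is a sequence of such frames in time: (T, N, M, channels).
video = np.zeros((30, N, M, 3), dtype=np.uint8)  # 30 RGB frames
```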

Typical tasks include:

  • classification: given an image, assign it to a single category.
  • object detection: generate bounding boxes that give the extent of each detected object within the image.
  • semantic segmentation: assign every pixel in the image to a class: grass, cat, tree, sky.
  • instance segmentation: for objects of the same semantic class (e.g. three dogs), differentiate them by instance.
  • pose recognition: estimate the pose of a head (which can indicate where a person is looking) or of the whole body.
  • activity recognition: related to pose recognition; classify video by the activity taking place.
  • object tracking: propose correspondences between detected objects across the frames of a video.
  • image restoration: recover a clean image from a degraded one, e.g. denoising or deblurring.
  • feature matching: detect features and find correspondences between multiple views.

Architecture

Consider the case of classifying digits with MNIST. Even though the 1 in both images has the same structure, the black values are shifted in the flattened input vector, so they land on different input neurons of a fully connected network. We want an architecture that can recognize this structure wherever it occurs in the image.

[picture of shifted 1]
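
A toy NumPy version of the same problem, with a 5×5 image standing in for MNIST:

```python
import numpy as np

# The same vertical stroke, drawn one column apart, activates a
# completely disjoint set of input neurons once the image is flattened.
img_a = np.zeros((5, 5))
img_b = np.zeros((5, 5))
img_a[:, 1] = 1.0   # a "1" drawn in column 1
img_b[:, 2] = 1.0   # the same "1" shifted to column 2

print(np.nonzero(img_a.flatten())[0])  # [ 1  6 11 16 21]
print(np.nonzero(img_b.flatten())[0])  # [ 2  7 12 17 22]
# No overlap: a fully connected net sees two unrelated input patterns.
```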

A linear operation known as convolution enables this for us. We pass kernels (also called filters) over the image: the response is high where the image patch matches the filter and low elsewhere. [show an example of convolution]
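
As a minimal sketch of that filter response, here is a standard hand-designed vertical-edge (Sobel) kernel applied with SciPy:

```python
import numpy as np
from scipy.signal import convolve2d

# Dark left half, bright right half: a vertical edge at column 3.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Sobel-style vertical edge filter (hand-designed, as in classical vision).
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

response = convolve2d(image, kernel, mode='valid')
print(response)
# Each row is [0., -4., -4., 0.]: large-magnitude responses line up with
# the edge, flat regions give zero. (convolve2d flips the kernel, which
# is why the response here comes out negative.)
```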

Before neural networks, significant effort went into hand-designing filters to pass over an image to detect small features such as edges and corners. Neural networks instead allow us to set up a convolutional architecture that learns the filters useful for the given task. A convolutional network is a neural network that uses convolution in place of general matrix multiplication in at least one of its layers.
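
A minimal PyTorch sketch of that definition, assuming an MNIST-like 1×28×28 input (the layer sizes are arbitrary choices); the weights of the nn.Conv2d layer are exactly the filters that get learned during training:

```python
import torch
import torch.nn as nn

# Convolution replaces the matrix multiplication in the first layer.
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                 # 8 x 14 x 14
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),      # fully connected classifier head
)

x = torch.randn(1, 1, 28, 28)        # one dummy grayscale image
print(model(x).shape)                # torch.Size([1, 10])
```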

But how does convolution help?

  • Translation invariance: no matter where a feature is located in the image, the kernel will detect it as it slides across the input.
  • Parameter sharing: the same kernel weights are reused at every spatial location, so the network has a hugely reduced number of parameters compared to a fully connected layer over the same input (see the sketch below). (add scale information from the powerpoint)
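
To put rough numbers on the parameter-sharing point, a sketch assuming a 28×28 grayscale input:

```python
import torch.nn as nn

# A fully connected layer mapping 784 inputs to 784 outputs:
fc = nn.Linear(28 * 28, 28 * 28)
print(sum(p.numel() for p in fc.parameters()))    # 615440 (weights + biases)

# A convolutional layer producing one 28x28 output map with a 3x3 kernel:
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))  # 10 (9 weights + 1 bias)
# The same 9 weights are reused at every spatial location.
```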

Additional Resources

  • ConvNetJS CIFAR-10 demo (https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html): train and test a model on CIFAR-10 in your browser. Excellent demonstration of the activations at every step of the architecture!