Feature Selection in SIFT

SIFT stands for Scale Invariant Feature Transform, a commonly used image recognition algorithm. Like most image recognition algorithms, it works by projecting a 2D image to a 1D feature-vector that represents what’s in the image, then comparing that feature-vector to other feature-vectors produced for images with known contents (called training images). If the vectors for two different images are close to each other, the images may be of the same thing.
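
To make that comparison concrete, here’s a minimal sketch of nearest-neighbor matching between feature-vectors. The function name, the toy 4-element vectors, and the plain Euclidean distance are my own illustrative choices, not anything specific to SIFT:

```python
import numpy as np

def match_score(query_vec, training_vecs):
    # Distance from the query feature-vector to each training feature-vector;
    # the closest one is the best guess at "same thing".
    dists = np.linalg.norm(training_vecs - query_vec, axis=1)
    best = int(np.argmin(dists))
    return best, float(dists[best])

# Toy 4-element feature-vectors (real ones are much longer).
train = np.array([[0.1, 0.9, 0.3, 0.0],
                  [0.8, 0.2, 0.5, 0.7]])
query = np.array([0.15, 0.85, 0.25, 0.05])
print(match_score(query, train))  # -> (0, ...): the first training vector is closest
```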

I did a bunch of machine learning and pattern matching when I was in grad school, and the thing that was always most persnickety was choosing how to make the feature-vector. You’ve got a ton of data, and you want to keep only the values that are representative in some way of what you’re trying to find. Ideally, the feature-vector is much smaller (in terms of total bytes) than the original image. Hopefully that decrease in size comes from throwing away inconsequential data, and not data that would actually be pretty helpful. If you’re doing image recognition, it might make sense to use an image’s dominant colors, edge relationships, or something like that. SIFT is much more complicated than that.

SIFT, which has been pretty successful, creates feature-vectors by choosing keypoints from band-pass filtered images (using the difference-of-Gaussians method). Since an image may be blurry or a different size than the training images, SIFT generates a huge number of difference-of-Gaussian variants of the image (DoG variants). By carefully choosing how much the images are blurred before subtraction, DoG variants can be produced that each band-pass filter a successive section of the frequency content.
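
Here’s a minimal sketch of building such a DoG stack with numpy and scipy; the sigma ladder and number of levels are placeholder choices on my part, not SIFT’s actual settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_stack(image, base_sigma=1.6, k=2 ** 0.5, levels=5):
    # Blur the image at a ladder of increasing sigmas, then subtract adjacent
    # blurs. Each difference keeps only a band of the image's frequency content.
    blurred = [gaussian_filter(image.astype(float), base_sigma * (k ** i))
               for i in range(levels + 1)]
    return [blurred[i + 1] - blurred[i] for i in range(levels)]

# Usage: dogs = dog_stack(gray_image), where gray_image is a 2D numpy array.
```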

The DoG variants are then compared to each other. Each pixel in each DoG variant is compared to nearby pixels in DoG variants of nearby frequency content. Pixels that are a maximum or minimum compared to neighboring pixels in nearby-frequency DoGs are chosen as the features for the feature-vector, and saved as both location and frequency. These feature-vector elements (called keypoints) then encapsulate both space information (the location of the pixel in the original image) and frequency information (it’s a max or min compared to nearby-frequency DoGs).
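
A brute-force version of that scan might look like the following. It’s a straight translation of “compare each pixel against its 26 neighbors in space and scale,” not an optimized or faithful reimplementation:

```python
import numpy as np

def find_extrema(dogs):
    # dogs: list of same-sized 2D arrays, adjacent in frequency content.
    # A pixel is kept if it's a strict max or strict min of the 3x3x3 block
    # around it, across space and the neighboring DoG levels.
    stack = np.stack(dogs)
    levels, height, width = stack.shape
    keypoints = []
    for s in range(1, levels - 1):
        for r in range(1, height - 1):
            for c in range(1, width - 1):
                val = stack[s, r, c]
                cube = stack[s - 1:s + 2, r - 1:r + 2, c - 1:c + 2]
                unique_extreme = (cube == val).sum() == 1
                if unique_extreme and (val == cube.max() or val == cube.min()):
                    keypoints.append((s, r, c))  # (DoG level, row, col)
    return keypoints
```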

Pixels that are too similar to nearby pixels in the original image are thrown away. If the original pixel is low contrast, it’s discarded as a feature. If it’s too close to another keypoint that is more extreme in value, it’s also discarded. Each of the remaining feature-vector elements is then associated with a rotation by finding the gradient of the DoG the element was originally taken from. Finally, all of these points are assembled into 128-element arrays describing the point (position, frequency, rotation, and nearby pixel information).
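
Roughly sketched in code, with a made-up contrast threshold and a single-pixel gradient standing in for the real orientation histogram (and skipping the nearby-keypoint suppression), the filtering and orientation steps look something like this:

```python
import numpy as np

def filter_and_orient(dogs, keypoints, contrast_thresh=0.03):
    # Keep only keypoints whose DoG value is far enough from zero (i.e. not
    # low contrast), and tag each survivor with the gradient direction of its
    # DoG level at that pixel. The threshold value and the single-pixel
    # gradient are simplifications of what SIFT actually does.
    stack = np.stack(dogs)
    kept = []
    for s, r, c in keypoints:
        if abs(stack[s, r, c]) < contrast_thresh:
            continue
        dy = stack[s, r + 1, c] - stack[s, r - 1, c]
        dx = stack[s, r, c + 1] - stack[s, r, c - 1]
        kept.append((s, r, c, np.arctan2(dy, dx)))  # orientation in radians
    return kept
```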

This means that if there are a large number of keypoints in the image, the feature-vector used for image recognition can end up even larger than the image itself. So they aren’t doing this to speed up computation; it’s solely for accuracy.
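
Back-of-the-envelope, assuming a 640x480 8-bit image, 3000 keypoints, and descriptors stored as 128 four-byte floats each (all assumptions on my part):

```python
# Rough size comparison under the assumptions stated above.
image_bytes = 640 * 480                 # 307,200 bytes of raw pixels
descriptor_bytes = 3000 * 128 * 4       # 1,536,000 bytes of keypoint data
print(descriptor_bytes > image_bytes)   # True: the features outweigh the image
```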

And SIFT does get pretty accurate. It can identify objects in images even when they’re significantly rotated from the training image. Even a 50-degree rotation still leaves objects identifiable, which is pretty impressive.