Goliath couldn't implement David's Rock Detector either

CSE 576 Project Two – Spring 2006
by
Leith Caldwell

The challenge

The goal of this project was to write code that detected, described, and matched features in images. These features were supposed to be reasonably invariant to translation, rotation, illumination, and scale. As detailed further below, the feature detection methods and descriptors were then to be benchmarked against some of the current state-of-the-art algorithms and descriptors from the computer vision community.

The long and the short of it

I found this project extremely challenging, in that I could get absolutely nowhere, even after getting help, scouring the lecture slides, notes, and the web, and spending a very, very long time going over the code, writing, re-writing, and testing in an attempt to make something work.

The challenge that stopped me (in the manner of David's small rock to Goliath) was writing a working implementation of the Harris corner detector. This may be in part due to my own difficulty in understanding the given codebase and how to manipulate it, and in part due to my lack of preparatory background for this course.
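For reference, here is a minimal sketch of what I was trying to implement, written in Python with NumPy and SciPy rather than in the project codebase; the function names, parameter values, and thresholding scheme are all my own illustrative choices, not the project's:

    import numpy as np
    from scipy.ndimage import sobel, gaussian_filter, maximum_filter

    def harris_response(image, sigma=1.0, k=0.04):
        # Image gradients via Sobel filters.
        Ix = sobel(image.astype(float), axis=1)
        Iy = sobel(image.astype(float), axis=0)
        # Gaussian-weighted entries of the second-moment matrix M.
        Ixx = gaussian_filter(Ix * Ix, sigma)
        Iyy = gaussian_filter(Iy * Iy, sigma)
        Ixy = gaussian_filter(Ix * Iy, sigma)
        # Corner response: R = det(M) - k * trace(M)^2.
        return (Ixx * Iyy - Ixy * Ixy) - k * (Ixx + Iyy) ** 2

    def harris_corners(image, rel_threshold=0.01):
        # Keep pixels that are 3 x 3 local maxima of the response and
        # exceed a fraction of the global maximum response.
        R = harris_response(image)
        local_max = (R == maximum_filter(R, size=3))
        return np.argwhere(local_max & (R > rel_threshold * R.max()))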

Idea potential

Towards the end of this project, it became clear that I would not be able to implement and test my various ideas for feature descriptors. With that in mind, here are some of the ideas I was hoping to leverage.

Bare simplicity
Beginning with the simplest case, this would be to just store the 5 × 5 window of pixels around the point of interest as the descriptor itself. This could be extended experimentally by expanding the size of the window (7 × 7, 9 × 9, 11 × 11, etc.).
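A sketch of what this descriptor might look like (the function name and border handling are my own choices):

    import numpy as np

    def window_descriptor(image, y, x, size=5):
        # Raw pixel values of the size x size window around (y, x),
        # flattened into a vector; skip points too near the border.
        r = size // 2
        if y - r < 0 or x - r < 0 or \
           y + r >= image.shape[0] or x + r >= image.shape[1]:
            return None
        return image[y - r:y + r + 1, x - r:x + r + 1].astype(float).ravel()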

On relativity
In order to make the descriptor somewhat illumination invariant, each window of values around the interest point could be converted to relative intensities for the image, or perhaps just for a larger window around the interest point. For example, if the image had been darkened for some reason, the maximum intensity value in the image would be lower; on a relative scale, however, the brightest spots would still have the same corresponding value.
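One way this normalisation might look, as a sketch (whether the reference region is the whole image, a larger window, or the patch itself is left as a parameter):

    import numpy as np

    def relative_descriptor(patch, reference=None):
        # Rescale the patch into [0, 1] against the intensity range of a
        # reference region (the whole image, a larger window, or, if none
        # is given, the patch itself), so uniformly darkening or
        # brightening the image leaves the descriptor unchanged.
        ref = patch if reference is None else reference
        lo, hi = float(ref.min()), float(ref.max())
        if hi == lo:
            return np.zeros_like(patch, dtype=float)  # flat region
        return (patch.astype(float) - lo) / (hi - lo)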

On pyramids
By adapting the feature detector to detect at several levels of the image pyramid, you could then check against the feature locations in each level as a sort of confidence measure. Instead of only requiring a feature to be a local maximum in its 3 × 3 surrounding window, it could also be verified against the corresponding pixels in the upper levels.
Alternatively, if still storing only the window around the interest point, the feature descriptor could contain the window at several levels.
Furthermore, the average pixel intensity of each level's window could be stored instead of the full set of values at each level.
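A sketch of the pyramid-based descriptor, under the assumption of a simple Gaussian pyramid with subsampling by two per level (again, names and parameters are illustrative):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def build_pyramid(image, levels=3):
        # Gaussian pyramid: blur, then subsample by two at each level.
        pyramid = [image.astype(float)]
        for _ in range(levels - 1):
            pyramid.append(gaussian_filter(pyramid[-1], sigma=1.0)[::2, ::2])
        return pyramid

    def multilevel_descriptor(pyramid, y, x, size=5):
        # Concatenate the window around the interest point at every
        # level (coordinates halve per level). Replacing each window
        # with its mean intensity gives the cheaper variant above.
        r = size // 2
        parts = []
        for level, img in enumerate(pyramid):
            ly, lx = y >> level, x >> level
            if ly - r < 0 or lx - r < 0 or \
               ly + r >= img.shape[0] or lx + r >= img.shape[1]:
                return None
            parts.append(img[ly - r:ly + r + 1, lx - r:lx + r + 1].ravel())
        return np.concatenate(parts)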

Different windows
Since some kind of transformation would have been applied to the scene in order to produce the different viewpoint that the query image contains, storing the raw pixel values of the window around the interest point would have less meaning unless some other information is also stored, as in the image pyramid case above. Another sort of descriptor would be a kind of 'averaging window'. Like so:

Take a 9 × 9 window of pixels around the interest point, then divide that window into a grid of nine 3 × 3 smaller windows. Then compute the average intensity value for each smaller window, and store the resulting 3 × 3 grid (covering the 9 × 9 window) as the descriptor.

As with the first idea above, this approach could be extended by experimenting with differing window sizes; a sketch of the basic version follows.
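A minimal sketch of the averaging-window descriptor, parameterised by grid and cell size so the window-size experiments fall out naturally:

    import numpy as np

    def averaging_descriptor(image, y, x, cells=3, cell_size=3):
        # Average the intensity of each cell in a cells x cells grid of
        # cell_size x cell_size windows (a 3 x 3 grid over a 9 x 9
        # window by default) and store the grid as the descriptor.
        span = cells * cell_size
        half = span // 2
        if y - half < 0 or x - half < 0 or \
           y - half + span > image.shape[0] or \
           x - half + span > image.shape[1]:
            return None
        window = image[y - half:y - half + span,
                       x - half:x - half + span].astype(float)
        # Split into (cells, cell_size, cells, cell_size) blocks and
        # average within each block.
        return window.reshape(cells, cell_size,
                              cells, cell_size).mean(axis=(1, 3)).ravel()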

The result

After all that, I have a number of ideas about what I would like to implement for my feature descriptor, but without even being able to detect where points of interest lie, I have no evidence upon which to base my design decisions, nor any way to evaluate which descriptor is better and for what reasons.

While I do have a naive feature-matching implementation, without a working feature descriptor I am unable to create a matcher that takes advantage of knowing the specific size and layout of the descriptor, or to evaluate my naive matcher.
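For completeness, a naive matcher of the general kind I mean might look like the sketch below; this is not my actual implementation, and the ratio test against the second-best match is a standard refinement I have added for illustration:

    import numpy as np

    def match_features(desc_a, desc_b, ratio=0.8):
        # For each descriptor in desc_a, find the nearest neighbour in
        # desc_b by sum of squared differences, keeping the match only
        # if it beats the second-nearest by the given ratio.
        if len(desc_b) < 2:
            return []
        desc_b = np.asarray(desc_b)
        matches = []
        for i, d in enumerate(desc_a):
            ssd = np.sum((desc_b - d) ** 2, axis=1)
            best, second = np.argsort(ssd)[:2]
            if ssd[best] < ratio * ssd[second]:
                matches.append((i, int(best), float(ssd[best])))
        return matches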

Further development

While there are always more extensions one can add to any project, in this case the development of the feature detector would be key. From there the various ideas outlined previously could be implemented separately, then measured against one another for quantitative evidence in favour of a particular type of descriptor. This evidence could then be graphed in order to visualise the differences more clearly.

Hypothesis

I believe that, as in most cases, a hybrid of the previously mentioned approaches would yield the best results from among my ideas. That said, I do not believe that any of these approaches would necessarily perform better than the SIFT descriptor.

Acknowledgements

Thanks to Lillie Kittredge for helping me try to understand what on earth I'm doing.