Object detection with SURF, KNN, FLANN, OpenCV 3.X and CUDA

This prototype tests different implementations of real-time, feature-based object detection using SURF, KNN, FLANN, OpenCV 3.X and CUDA.

Object detection is the process of finding instances of real-world objects such as faces, bicycles, and buildings in images or videos. Object detection algorithms typically use extracted features and learning algorithms to recognize instances of an object category. It is commonly used in applications such as image retrieval, security, surveillance, and automated vehicle parking systems.

It is one of the fundamental problems in computer vision, and many techniques have been developed to solve it, including feature-based object detection, Viola-Jones object detection, SVM classification with histogram of oriented gradients (HOG) features, and image segmentation with blob analysis. Other methods include gradient-based, derivative-based, and template-matching approaches. More details can be found in this post.

This post tests the SURF algorithm, the FLANN-based matcher (Fast Library for Approximate Nearest Neighbors), and their OpenCV 3.x CPU and GPU/CUDA implementations. SURF falls into the first category, feature-based object detection.

Detecting a reference object (left) in a cluttered scene (right) using feature extraction and matching. RANSAC is used to estimate the location of the object in the test image.


SURF stands for Speeded Up Robust Features. It is an algorithm that extracts distinctive keypoints and descriptors from an image. Details on the algorithm can be found here and on its OpenCV implementation here and here. A set of SURF keypoints and descriptors can be extracted from a reference image and used later to detect the same object in other images.

SURF uses an intermediate image representation called the integral image, which is computed from the input image and speeds up the calculation of sums over any rectangular area. Each entry at (x, y) holds the sum of all pixel values in the rectangle between the origin and (x, y). The sum over any rectangle then needs only four lookups, so the computation time does not depend on the size of the rectangle, which is particularly useful when working with large images.

The SURF detector is based on the determinant of the Hessian matrix. The SURF descriptor describes how pixel intensities are distributed within a scale-dependent neighborhood of each interest point detected by the Fast-Hessian detector.
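The integral image described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the OpenCV implementation; the function names `integral_image` and `box_sum` are made up for this example, and the table is padded with one extra row and column of zeros to avoid special cases at the border.

```python
def integral_image(img):
    """Return a summed-area table padded with a leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of everything above this row plus the running sum of this row.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in the rectangle [x0, x1) x [y0, y1) using four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
ii = integral_image(image)
total = box_sum(ii, 0, 0, 3, 3)         # whole image
bottom_right = box_sum(ii, 1, 1, 3, 3)  # 5 + 6 + 8 + 9
```

Whatever the rectangle size, `box_sum` performs the same four lookups, which is why the box filters used by SURF cost the same at every scale.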

Object detection using SURF is scale and rotation invariant, which makes it very powerful. It also doesn’t require the long and tedious training needed for detection based on cascaded Haar classifiers. SURF detection takes a little longer than Haar, but a few tens of milliseconds more is not a problem in most situations. Because the method is rotation invariant, it can detect objects in any orientation, a case where the Haar classifier fails.


Feature keypoints are represented by the circle centers and the descriptors by their radius and orientation

A short description of what SURF does:
  • Find interest points in the image using Hessian matrices
  • Determine the orientation of these points
  • Use basic Haar wavelets in a suitably oriented square region around each interest point to find intensity gradients in the X and Y directions. The square region is divided into 4 × 4 = 16 sub-squares and each sub-square yields 4 features, so the SURF descriptor for every interest point has 16 × 4 = 64 dimensions.
  • For a description of the 4 features, please refer to the paper; they are basically sums of gradient changes.
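The descriptor layout in the list above can be illustrated with a short sketch. It assumes the four features per sub-square are the sums of dx, dy, |dx| and |dy|, as in the SURF paper, and it takes the Haar responses as given. The function name `surf_like_descriptor`, the 20 × 20 sample grid and the synthetic responses are assumptions for illustration only; the Gaussian weighting used by the real algorithm is left out.

```python
import math


def surf_like_descriptor(dx, dy, cells=4):
    """Build a 4 * cells * cells vector from square grids of Haar responses."""
    n = len(dx)
    step = n // cells  # samples per sub-square along one axis
    desc = []
    for cy in range(cells):
        for cx in range(cells):
            s_dx = s_dy = s_adx = s_ady = 0.0
            for y in range(cy * step, (cy + 1) * step):
                for x in range(cx * step, (cx + 1) * step):
                    s_dx += dx[y][x]
                    s_dy += dy[y][x]
                    s_adx += abs(dx[y][x])
                    s_ady += abs(dy[y][x])
            # Four features per sub-square.
            desc.extend([s_dx, s_dy, s_adx, s_ady])
    # Normalise to unit length so the vector is less sensitive to contrast.
    norm = math.sqrt(sum(v * v for v in desc)) or 1.0
    return [v / norm for v in desc]


# Synthetic responses on a 20 x 20 grid (hypothetical data, not from an image).
dx = [[(x - y) % 3 - 1 for x in range(20)] for y in range(20)]
dy = [[(x + y) % 3 - 1 for x in range(20)] for y in range(20)]
descriptor = surf_like_descriptor(dx, dy)
```

With 4 × 4 sub-squares and four sums each, the resulting vector has 64 entries, matching the dimension stated above.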

SURF is a non-free algorithm. Other feature detection and description algorithms implemented in OpenCV include SIFT, FAST, BRIEF and ORB, as listed here.

Feature Matching with FLANN (Fast Library for Approximate Nearest Neighbors)

After keypoints and descriptors have been extracted from both images, object and scene, they must be matched. OpenCV implements two matching strategies, the Brute-Force matcher and the FLANN matcher, as explained here and here.

The Brute-Force matcher is simple: it takes the descriptor of one feature in the first set and compares it with every feature in the second set using some distance measure. The closest one is returned.
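The brute-force strategy can be sketched as follows. This is a plain-Python illustration that assumes Euclidean distance; it is not OpenCV's BFMatcher, and the function name and the toy two-dimensional descriptors are made up for the example.

```python
import math


def brute_force_match(query, train):
    """For each query descriptor, return (query_idx, train_idx, distance)
    of the closest train descriptor, found by comparing against all of them."""
    matches = []
    for qi, q in enumerate(query):
        ti, dist = min(
            ((i, math.dist(q, t)) for i, t in enumerate(train)),
            key=lambda pair: pair[1],
        )
        matches.append((qi, ti, dist))
    return matches


# Toy descriptors (hypothetical values; real SURF descriptors have 64 entries).
object_desc = [[0.0, 0.0], [1.0, 1.0]]
scene_desc = [[0.9, 1.1], [5.0, 5.0], [0.1, -0.1]]
matches = brute_force_match(object_desc, scene_desc)
```

Every query descriptor is compared with every train descriptor, so the cost grows with the product of the two set sizes.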

The FLANN library contains a collection of algorithms optimized for fast nearest-neighbor search in large datasets and for high-dimensional features. It is faster than the Brute-Force matcher for large datasets.

Both matchers implement the same interface and provide different strategies for selecting the best matches: DescriptorMatcher::match(), which returns only the best match; DescriptorMatcher::knnMatch(), which returns the best k matches using the k-nearest neighbors algorithm; and DescriptorMatcher::radiusMatch(), which returns all training descriptors not farther than the specified distance. The three methods are explained in detail here and here.

This post uses the second method, DescriptorMatcher::knnMatch().
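A k-nearest-neighbour match with k = 2 is commonly followed by a ratio test, which keeps a match only when the best neighbour is clearly closer than the second best. The sketch below shows that idea in plain Python. It uses an exhaustive search rather than FLANN, and the 0.75 threshold is an assumed value, not one taken from this post.

```python
import math


def knn_match(query, train, k=2):
    """Return, for each query descriptor, its k nearest train descriptors
    as a list of (train_idx, distance) sorted by distance."""
    result = []
    for q in query:
        dists = sorted(
            ((i, math.dist(q, t)) for i, t in enumerate(train)),
            key=lambda pair: pair[1],
        )
        result.append(dists[:k])
    return result


def ratio_test(knn, ratio=0.75):
    """Keep matches whose best distance is below ratio * second-best distance."""
    good = []
    for qi, neighbours in enumerate(knn):
        if len(neighbours) < 2:
            continue  # cannot judge ambiguity without a second neighbour
        (best_idx, best), (_, second) = neighbours[0], neighbours[1]
        if best < ratio * second:
            good.append((qi, best_idx, best))
    return good


# Toy descriptors (hypothetical): the second query has two near-identical
# candidates in the scene, so its match is ambiguous and gets rejected.
query = [[0.0, 0.0], [2.0, 2.0]]
train = [[0.1, 0.0], [3.0, 3.0], [1.9, 2.0], [2.1, 2.0]]
good_matches = ratio_test(knn_match(query, train, k=2))
```

Only the first query survives here, which shows how the filter discards features that look alike in several places of the scene.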

OpenCV libraries

SURF is a non-free algorithm and is therefore included in a separate repository, opencv_contrib. For Linux it can be installed as explained here.

To get the code running with OpenCV 3.1 these libraries must be included in the path.

OpenCV libraries


This first implementation uses only the CPU methods of OpenCV 3.X.


This second implementation uses both GPU and CPU methods of OpenCV 3.X.


Localization is done after keypoints and feature descriptors have been extracted from both the object and scene images. It searches for the position, orientation and scale of the object in the scene based on the good_matches. This method also draws lines between the matched keypoints and, if the object was detected, a frame around it. Random sample consensus (RANSAC) is used to find the best transformation matrix.
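The RANSAC idea can be shown with a deliberately simplified model. The sketch below estimates only a 2D shift between matched points, whereas locating an object with position, orientation and scale needs a richer transformation. The function name, the threshold and the sample points are assumptions for illustration.

```python
import random


def ransac_translation(src, dst, threshold=3.0, iterations=100, seed=0):
    """Estimate a 2D shift (tx, ty) mapping src points onto dst points
    while ignoring outlier correspondences."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        # Minimal sample: one correspondence fully defines a translation.
        i = rng.randrange(len(src))
        tx, ty = dst[i][0] - src[i][0], dst[i][1] - src[i][1]
        # Consensus set: correspondences that agree with this shift.
        inliers = [
            j for j, (s, d) in enumerate(zip(src, dst))
            if abs(s[0] + tx - d[0]) <= threshold
            and abs(s[1] + ty - d[1]) <= threshold
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit on the largest consensus set by averaging its shifts.
    n = len(best_inliers)
    tx = sum(dst[j][0] - src[j][0] for j in best_inliers) / n
    ty = sum(dst[j][1] - src[j][1] for j in best_inliers) / n
    return (tx, ty), best_inliers


# Hypothetical matches: four agree on a shift of (100, 50), two are wrong.
src = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5), (2, 8)]
dst = [(100, 50), (110, 50), (100, 60), (110, 60), (300, 300), (-40, 7)]
shift, inliers = ransac_translation(src, dst)
```

The two wrong matches never gather more support than a single vote, so they have no influence on the final estimate.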

Execution times of the 2 implementations

Only the time spent extracting the keypoints and feature descriptors and the time spent computing the matches are taken into account. The time for loading images into CPU and GPU memory is ignored.

The console output shows that the GPU version finishes in about 20 ms for images no larger than 800×600, which means the GPU implementation can run in real time. The CPU version finishes in about 200 ms and could process only a few frames per second.
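To relate these timings to frame rates, the following sketch converts milliseconds per frame into frames per second and shows one way to time a processing step. The helper names are hypothetical, and the post's own measurement code may differ.

```python
import time


def time_ms(fn, *args, repeats=5):
    """Return the best wall-clock time of fn(*args) in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


def frames_per_second(ms_per_frame):
    """Convert a per-frame processing time into a frame rate."""
    return 1000.0 / ms_per_frame


gpu_fps = frames_per_second(20.0)   # timing reported for the GPU version
cpu_fps = frames_per_second(200.0)  # timing reported for the CPU version
```

At 20 ms per frame the GPU version reaches 50 frames per second, while 200 ms per frame corresponds to 5 frames per second.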

Original image and detected images

Result with many wrongly detected features when the scene contains similar images.

Correctly found object, but with many wrongly detected features and a wrong position.

Result with many wrongly detected descriptors and a wrong position.

Correctly found object and features, but with a wrong position.

Result with a correctly found object and an almost correct location.

Correctly found object and features, and almost correct location.

Correctly found object, features and location.


SURF is a powerful and fast algorithm for distinctive feature extraction. It uses only neighbouring pixels to extract keypoints and feature descriptors, so it has no information about the shape formed by the keypoints or their relative positions. For a correct detection it must be combined with another algorithm that takes relative position and shape into account.











