Vehicle Detection using OpenCV and SVM classifier

In this fifth project from the Self-Driving Car Engineer program designed by Udacity, our goals are the following:

  • Perform a Histogram of Oriented Gradients (HOG) feature extraction on a labeled training set of images and train a Linear SVM classifier
  • Optionally, you can also apply a color transform and append binned color features, as well as histograms of color, to your HOG feature vector.
  • Note: for those first two steps don’t forget to normalize your features and randomize a selection for training and testing.
  • Implement a sliding-window technique and use your trained classifier to search for vehicles in images.
  • Run your pipeline on a video stream (start with the test_video.mp4 and later implement on full project_video.mp4) and create a heat map of recurring detections frame by frame to reject outliers and follow detected vehicles.
  • Estimate a bounding box for each detected vehicle.

Histogram of Oriented Gradients (HOG)

The histogram of oriented gradients (HOG) is a feature descriptor used in computer vision and image processing for the purpose of object detection. The technique counts occurrences of gradient orientation in localized portions of an image. This method is similar to that of edge orientation histograms, scale-invariant feature transform descriptors, and shape contexts, but differs in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy.


Extraction of HOG features from the training images.

The code for this step is contained in the method ‘get_hog_features’.

I started by reading in all the vehicle and non-vehicle images. Here is an example of one of each of the vehicle and non-vehicle classes:


I then explored different color spaces and different skimage.hog() parameters (orientations, pixels_per_cell, and cells_per_block). I grabbed random images from each of the two classes and displayed them to get a feel for what the skimage.hog() output looks like. After several tests using images with different contrasts and luminosities, the best results were obtained using the YCrCb color space in combination with HOG features extracted with orientations=9, pixels_per_cell=(8, 8), and cells_per_block=(2, 2).

Here is an example using the YCrCb color space and HOG parameters of orientations=9, pixels_per_cell=(8, 8), and cells_per_block=(2, 2):
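As a reference, the extraction step can be sketched with skimage's hog function. The name get_hog_features matches the method mentioned above, but this body is a minimal illustration rather than the project's exact code:

```python
import numpy as np
from skimage.feature import hog

def get_hog_features(channel, orient=9, pix_per_cell=8, cell_per_block=2):
    # Compute HOG features for a single image channel, using the
    # parameters chosen above: orientations=9, pixels_per_cell=(8, 8),
    # cells_per_block=(2, 2).
    return hog(channel,
               orientations=orient,
               pixels_per_cell=(pix_per_cell, pix_per_cell),
               cells_per_block=(cell_per_block, cell_per_block),
               block_norm='L2-Hys',
               feature_vector=True)

# A 64x64 training patch yields 7x7 blocks x (2x2 cells) x 9 orientations
# = 1764 features.
patch = np.random.rand(64, 64)
features = get_hog_features(patch)
```

With this configuration, each channel of a 64x64 training image contributes 1,764 HOG features to the feature vector.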


Extraction of Spatial Binning of Color features

To make the algorithm more robust in identifying cars, an additional type of feature is used alongside the HOG features. Template matching is not a particularly robust method for finding vehicles unless you know exactly what your target object looks like. However, raw pixel values are still quite useful to include in your feature vector when searching for cars.

While it could be cumbersome to include three color channels of a full resolution image, we can perform spatial binning on an image and still retain enough information to help in finding vehicles.

As you can see in the example below, even going all the way down to 32 x 32 pixel resolution, the car itself is still clearly identifiable by eye, which means the relevant features are preserved at this resolution.


A convenient function for scaling down the resolution of an image is OpenCV’s cv2.resize().

Extraction of Histograms of Color features

Another technique used in this project to generate additional features is the histogram of color intensities, as shown in the image below.


And implemented as shown here:
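A typical implementation, sketched here with numpy, histograms each color channel separately and concatenates the results; the helper name color_hist and the 32-bin count are assumptions for illustration:

```python
import numpy as np

def color_hist(img, nbins=32, bins_range=(0, 256)):
    # Histogram each color channel separately, then concatenate the
    # three bin-count vectors into one feature vector.
    ch1 = np.histogram(img[:, :, 0], bins=nbins, range=bins_range)[0]
    ch2 = np.histogram(img[:, :, 1], bins=nbins, range=bins_range)[0]
    ch3 = np.histogram(img[:, :, 2], bins=nbins, range=bins_range)[0]
    return np.concatenate((ch1, ch2, ch3))

# 3 channels x 32 bins = 96 histogram features per image.
img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
hist_features = color_hist(img)
```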

Combining and normalizing the features

Now that we’ve got several feature extraction methods in our toolkit, we’re almost ready to train a classifier, but first, as in any machine learning application, we need to normalize our data. Python’s sklearn package provides the StandardScaler() class to accomplish this task. To read more about how we can choose different normalizations with StandardScaler(), check out the documentation.

All the different features of a single image are combined into a single feature array:

Normalization is needed so that no feature type outweighs the others:
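The two steps can be sketched as follows; the feature lengths are those implied by the parameters above, and random vectors stand in for real extracted features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Per image: concatenate the three feature types into one array.
hog_feat = np.random.rand(1764)      # HOG features (one channel)
spatial_feat = np.random.rand(3072)  # 32x32x3 spatial binning
hist_feat = np.random.rand(96)       # 3 x 32-bin color histograms
single = np.concatenate((hog_feat, spatial_feat, hist_feat))

# Per dataset: stack all images row-wise and scale every feature
# column to zero mean and unit variance.
X = np.random.rand(100, single.size)   # stand-in for the real feature matrix
X_scaled = StandardScaler().fit_transform(X)
```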

Training a classifier using normalized features.

I trained a linear SVM using two classes of images: vehicle and non-vehicle. The images are first loaded, then normalized features are extracted, shuffled, and split into two datasets: train (80%) and test (20%). The features were scaled to zero mean and unit variance with StandardScaler() before training the classifier. The source code can be found in

The entire dataset (train + test) has 17,767 items distributed evenly between vehicle and non-vehicle. After training, a train.p file is saved on disk in the sub-folder train/ for later reuse. The accuracy of the trained linear SVM classifier on the test dataset is quite high, ~0.989.
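The split-and-train step can be sketched with sklearn; the toy feature matrix below is a stand-in for the real 17,767-sample dataset, so the resulting accuracy is meaningless here:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Stand-in for the scaled feature matrix and its labels.
X = np.random.rand(200, 50)
y = np.concatenate((np.ones(100), np.zeros(100)))  # 1 = vehicle, 0 = non-vehicle

# Shuffle and split 80% train / 20% test, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

svc = LinearSVC().fit(X_train, y_train)
accuracy = svc.score(X_test, y_test)   # ~0.989 on the real dataset
```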

I decided to search for vehicles in the lower part of the image using an overlapping sliding-window search. Restricting the search to the lower part avoids looking for vehicles in the sky and makes the algorithm faster. The window size is 64 pixels, with 8 cells of 8 pixels per cell. At each step, the window moves by 2 cells, either to the right or toward the bottom. To make the search faster by avoiding repeated feature extraction for every window, the features are extracted only once over the whole search region, and each sliding window then uses only its portion of the precomputed features. The detection could be made more robust by using windows at different scales to accommodate cars at both long and short distances.

The implementation can be found in
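The window geometry described above can be sketched as follows; the exact search band (ystart=400, ystop=656 on a 1280x720 frame) is an assumption for illustration:

```python
def slide_window_positions(img_width, ystart, ystop,
                           window=64, pix_per_cell=8, cells_per_step=2):
    # Yield the top-left corners of 64x64 windows that step 2 cells
    # (16 pixels) at a time, restricted to the lower part of the image.
    step = cells_per_step * pix_per_cell
    positions = []
    for y in range(ystart, ystop - window + 1, step):
        for x in range(0, img_width - window + 1, step):
            positions.append((x, y))
    return positions

# Search only y in [400, 656) of a 1280-pixel-wide frame.
wins = slide_window_positions(1280, ystart=400, ystop=656)
```

Each returned position is then classified using the part of the precomputed feature map that falls under that window.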

As seen in the image below, the two cars are correctly detected, but some false positives are detected as well.


To avoid false positives, a heat map was used. A heat map adds up detection windows, so overlapping windows produce higher values. The values over a certain threshold are kept as true positives.
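A minimal sketch of the heat-map step, assuming the helper names add_heat and apply_threshold (conventional in this course material):

```python
import numpy as np

def add_heat(heatmap, bbox_list):
    # Each detection window votes +1 over its area; overlapping
    # windows accumulate higher values.
    for (x1, y1), (x2, y2) in bbox_list:
        heatmap[y1:y2, x1:x2] += 1
    return heatmap

def apply_threshold(heatmap, threshold):
    # Zero out pixels with too few overlapping detections.
    heatmap[heatmap <= threshold] = 0
    return heatmap

heat = np.zeros((720, 1280))
boxes = [((100, 400), (200, 500)), ((150, 420), (250, 520))]
heat = apply_threshold(add_heat(heat, boxes), threshold=1)
```

After thresholding, only the region covered by both overlapping windows survives; the single-window areas are rejected as likely false positives.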


To find the final boxes from the heat map, the label function is used.
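The labelling step groups the surviving heat-map pixels into connected blobs, one per car, and a bounding box is then taken over each blob's pixel extents. The writeup uses scipy.ndimage.measurements.label(), which newer scipy versions expose as scipy.ndimage.label; the helper below is a sketch:

```python
import numpy as np
from scipy.ndimage import label

# Toy heat map with two separate hot blobs.
heat = np.zeros((10, 10))
heat[1:4, 1:4] = 2
heat[6:9, 6:9] = 3

labels, n_cars = label(heat)   # n_cars connected regions, numbered 1..n

def labeled_bboxes(labels, n):
    # One bounding box per labelled blob: the min/max of its pixel
    # coordinates in x and y.
    boxes = []
    for car in range(1, n + 1):
        ys, xs = (labels == car).nonzero()
        boxes.append(((xs.min(), ys.min()), (xs.max(), ys.max())))
    return boxes

boxes = labeled_bboxes(labels, n_cars)
```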


Pipeline to process one image

As visible in the code below, we first extract the bounding boxes, which include both true and false positives. Then, using a heat map, we discard the false positives. After that, the final boxes are computed using the scipy.ndimage.measurements.label() method. At the end, the boxes are rendered.

This is the result of the pipeline on one of the test images:


Pipeline to process one video

The same pipeline, process_image(image, plot=False), used to process a single image was also used for video processing. Each frame is extracted from the video, processed by the image pipeline, and merged into the final video using VideoFileClip and ffmpeg.

Here is my video result

To avoid false positives a heat map was used, but each image was processed independently. When processing videos, the heat map implementation can be updated to accumulate over several subsequent frames, since vehicles don’t appear and disappear between frames.
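Such a multi-frame heat map could be sketched as below; the class name, the 5-frame history length, and the use of a deque are all assumptions for illustration, not part of the project's current code:

```python
import numpy as np
from collections import deque

class HeatHistory:
    # Keep the heat maps of the last n_frames frames and sum them, so a
    # detection must persist across frames to exceed the threshold.
    def __init__(self, n_frames=5, shape=(720, 1280)):
        self.history = deque(maxlen=n_frames)
        self.shape = shape

    def update(self, boxes):
        heat = np.zeros(self.shape)
        for (x1, y1), (x2, y2) in boxes:
            heat[y1:y2, x1:x2] += 1
        self.history.append(heat)
        return np.sum(self.history, axis=0)   # summed heat over the window

# A box detected in 3 consecutive frames accumulates heat of 3, while a
# one-frame false positive would only ever reach 1.
hist = HeatHistory(n_frames=3, shape=(10, 10))
box = [((2, 2), (5, 5))]
hist.update(box)
hist.update(box)
summed = hist.update(box)
```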


The current implementation using the SVM classifier works well for the tested images and videos, mainly because they were recorded in a similar environment. Testing this classifier in a very different environment will not produce similarly good results. A more robust classifier using deep learning and convolutional neural networks would generalize better to unknown data.

Another issue with the current implementation is that the video processing pipeline does not consider subsequent frames. Keeping a heat map across consecutive frames would discard false positives more reliably.

One more improvement to the current implementation would be multi-scale sliding windows, which would generalize better at finding vehicles at both short and long distances.


