Bilateral Filtering with CUDA and OpenCV 3.x

This prototype tests different implementations of bilateral filtering for smoothing images, using C++, CUDA and OpenCV 3.x.

Several smoothing algorithms exist, but the most popular are:

  • Normalized Box Filter: the simplest filter of all. Each output pixel is the mean of its kernel neighbors (all of them contribute with equal weights).
  • Gaussian Filter: probably the most useful filter (although not the fastest). Gaussian filtering is done by convolving each point in the input array with a Gaussian kernel and then summing them all to produce the output array.
  • Median Filter: the median filter runs through each element of the signal (in this case the image) and replaces each pixel with the median of its neighboring pixels (located in a square neighborhood around the evaluated pixel).
  • Bilateral Filter: like the Gaussian filter, the bilateral filter considers the neighboring pixels with a weight assigned to each of them. These weights have two components, the first of which is the same weighting used by the Gaussian filter. The second component takes into account the difference in intensity between the neighboring pixels and the evaluated one.

This link compares each filter in more detail and also contains several examples using C++ and OpenCV.

A bilateral filter is a non-linear, edge-preserving and noise-reducing smoothing filter for images. The intensity value at each pixel in an image is replaced by a weighted average of intensity values from nearby pixels. This weight can be based on a Gaussian distribution. Crucially, the weights depend not only on Euclidean distance of pixels, but also on the radiometric differences (e.g. range differences, such as color intensity, depth distance, etc.). This preserves sharp edges by systematically looping through each pixel and adjusting weights to the adjacent pixels accordingly.
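The two weight components described above can be sketched as follows, assuming both are Gaussian. The names spatialWeight, rangeWeight, bilateralWeight, sigmaSpace and sigmaColor are chosen for this illustration and are not taken from the prototype's source.

```cpp
#include <cmath>

// Spatial component: falls off with the Euclidean distance between the two
// pixels, exactly like the weights of a Gaussian filter.
double spatialWeight(int dx, int dy, double sigmaSpace) {
    return std::exp(-(dx * dx + dy * dy) / (2.0 * sigmaSpace * sigmaSpace));
}

// Range component: falls off with the intensity difference between the
// evaluated pixel and its neighbor.
double rangeWeight(double centerIntensity, double neighborIntensity, double sigmaColor) {
    const double diff = neighborIntensity - centerIntensity;
    return std::exp(-(diff * diff) / (2.0 * sigmaColor * sigmaColor));
}

// Combined bilateral weight of one neighbor at offset (dx, dy).
double bilateralWeight(int dx, int dy, double centerIntensity, double neighborIntensity,
                       double sigmaSpace, double sigmaColor) {
    return spatialWeight(dx, dy, sigmaSpace) *
           rangeWeight(centerIntensity, neighborIntensity, sigmaColor);
}
```

A neighbor with the same intensity as the evaluated pixel keeps its full spatial weight, whereas a neighbor on the other side of an edge gets a range weight close to zero and therefore barely contributes to the average. This is the mechanism that preserves the edge.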


Four different methods are compared to each other in this prototype:

  • OpenCV 3.x CPU-based method cv::blur from the imgproc module.
  • OpenCV 3.x GPU-based method cv::cuda::bilateralFilter() from the cudaimgproc module.
  • Own method bilateralFilterCpu(), implemented in C++ to run serially on the CPU.
  • Own method bilateralFilterCuda(), implemented in CUDA to run in parallel on the GPU.

Apart from the 4 methods above, the CUDA Toolkit provides a more advanced example, which can be found under toolkit/samples/3_imaging/bilateralFilter or, in an older version, online at

The code was built and compiled on an MSI laptop with a GeForce GTX 970M, 3 GB GDDR5 and 13 streaming multiprocessors.

OpenCV CPU implementation

In this first test, the CPU method from OpenCV's imgproc module is tested.

OpenCV GPU implementation

In this second test, the GPU version from OpenCV, cv::cuda::bilateralFilter(), is tested.

Own CPU implementation

In this third test, the own CPU implementation is tested.
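As a rough idea of what a serial implementation involves, here is a minimal sketch of a bilateral filter on a single-channel, row-major float image. It is not the prototype's bilateralFilterCpu(); the function name bilateralFilterSketch, the parameter names and the clamped border handling are all assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical serial bilateral filter on a row-major grayscale image.
std::vector<float> bilateralFilterSketch(const std::vector<float>& src, int width, int height,
                                         int kernelSize, float sigmaColor, float sigmaSpace) {
    std::vector<float> dst(src.size());
    const int radius = kernelSize / 2;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const float center = src[y * width + x];
            float weightedSum = 0.0f;
            float weightTotal = 0.0f;
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    // Assumed border policy: clamp to the nearest valid pixel.
                    const int nx = std::min(std::max(x + dx, 0), width - 1);
                    const int ny = std::min(std::max(y + dy, 0), height - 1);
                    const float neighbor = src[ny * width + nx];
                    // Spatial weight (distance) times range weight (intensity difference).
                    const float spatial =
                        std::exp(-(dx * dx + dy * dy) / (2.0f * sigmaSpace * sigmaSpace));
                    const float diff = neighbor - center;
                    const float range =
                        std::exp(-(diff * diff) / (2.0f * sigmaColor * sigmaColor));
                    const float weight = spatial * range;
                    weightedSum += weight * neighbor;
                    weightTotal += weight;
                }
            }
            // weightTotal is at least 1 because the center pixel has weight 1.
            dst[y * width + x] = weightedSum / weightTotal;
        }
    }
    return dst;
}
```

With a kernel size of 5, every output pixel evaluates 25 neighbors, so an 800×600 image needs 12,000,000 weight evaluations per pass when computed this way. Each output pixel depends only on the input image, which is what makes the algorithm a natural fit for one GPU thread per pixel.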

Execution times of the 4 implementations on simple_room-wallpaper-800×600.jpg (126.0 kB, 800×600 pixels) with a kernel size of 5

The 4 implementations were run 10 times in a row. Excluding the peaks from the output, the OpenCV CPU implementation took on average ~8 ms, the OpenCV GPU implementation ~0.6 ms, the own CPU implementation ~1200 ms and the own CUDA implementation ~8 ms.
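A measurement like this could be scripted along the following lines. This is only a sketch: timeMs and meanWithoutPeaks are hypothetical helpers, and dropping the largest samples is one assumed way of "excluding the peaks", since the post does not say how the peaks were identified.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Wall-clock time of one call to `work`, in milliseconds.
template <typename Work>
double timeMs(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Mean of the samples after dropping the `peaks` largest ones
// (assumed definition of "excluding the peaks").
double meanWithoutPeaks(std::vector<double> samplesMs, std::size_t peaks) {
    if (samplesMs.size() <= peaks) {
        return 0.0;  // nothing left to average
    }
    std::sort(samplesMs.begin(), samplesMs.end());
    samplesMs.resize(samplesMs.size() - peaks);
    return std::accumulate(samplesMs.begin(), samplesMs.end(), 0.0) / samplesMs.size();
}
```

Each of the 10 runs would be wrapped in timeMs, the results collected in a vector, and meanWithoutPeaks called with a small number of peaks to drop. The first GPU run is often such a peak, because it typically includes one-time initialization.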

None of these times includes the time spent loading the image into CPU memory, but they do include the time to copy the image from CPU to GPU memory.

Profiling of the OpenCV GPU and own CUDA implementation using NSight profiler

The profile also shows that the time spent on computing is about 0.6 ms for the OpenCV GPU implementation, 11% of the total compute time, compared to 89% for the own GPU implementation.


Original and smoothed images with OpenCV GPU and own method


Original image



Smoothed image using OpenCV GPU
