Image classification with Deep Learning, CNN, Caffe, OpenCV 3.x and CUDA

This prototype tests different implementations of image classification with Deep Learning, Convolutional Neural Networks (CNNs), Caffe, OpenCV 3.x and CUDA.

Topics covered:

  • Image Classification, Neural Networks vs Deep Learning, CNN vs R-CNN, cuDNN, Caffe, ImageNet & challenges
  • Testing of OpenCV’s DNN CPU classification using GoogLeNet, a trained network from Caffe model zoo
  • Testing of Caffe CPU and GPU/CUDA classification using the same GoogLeNet model
  • Building, training and testing of own Caffe CNN model starting from the Oxford 102 category flower dataset
  • Deployment and testing on NVIDIA Tegra K1
  • Installation and development in Ubuntu

Image Classification

In image classification, an image is classified according to its visual content: for example, whether or not it contains an airplane. An important application is image retrieval, that is, searching through an image dataset to obtain (or retrieve) those images with particular visual content.

While human visual image interpretation techniques rely on shape, size, pattern, tone, texture, shadows, and association, digital image interpretation relies mainly on color, i.e. on comparisons of digital numbers found in different bands in different parts of an image.
The objective of digital image classification procedures is to categorize the pixels in an image into land cover classes. The output is a thematic image with a limited number of feature classes as opposed to a continuous image with varying shades of gray or varying colors representing a continuous range of spectral reflectances.
Two major categories of image classification techniques include unsupervised (calculated by software) and supervised (human-guided) classification.
This post is focused on Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), which lie between the two main categories. Training images are labeled in a supervised way by an analyst, but the feature learning and classification are done automatically by software in an unsupervised way. More details can be found in this post.

Neural Networks vs Deep Learning

Deep-learning networks are distinguished from the more commonplace single-hidden-layer neural networks by their depth; that is, the number of node layers through which data passes in a multi-step process of pattern recognition.

Traditional machine learning relies on shallow nets, composed of one input and one output layer, and at most one hidden layer in between. A network with more than three layers (including input and output) qualifies as “deep” learning. So deep is a strictly defined, technical term that means more than one hidden layer.
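The distinction can be illustrated with a minimal sketch in plain Python (all layer sizes and values below are made up for illustration): a shallow net applies a single hidden layer, while a deep net simply stacks more of them.

```python
import math
import random

random.seed(0)

def dense(inputs, weights, biases):
    """One fully connected layer followed by a sigmoid nonlinearity."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + b)))
        for row, b in zip(weights, biases)
    ]

def random_layer(n_in, n_out):
    weights = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [random.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases

def forward(x, layers):
    for weights, biases in layers:
        x = dense(x, weights, biases)
    return x

# Shallow net: input -> 1 hidden layer -> output (three layers in total).
shallow = [random_layer(4, 8), random_layer(8, 2)]

# Deep net: the same idea, with several hidden layers stacked.
deep = [random_layer(4, 8), random_layer(8, 8), random_layer(8, 8), random_layer(8, 2)]

x = [0.5, -0.2, 0.1, 0.9]
print(forward(x, shallow))  # 2 output values
print(forward(x, deep))     # 2 output values, computed through more layers
```

Each extra `random_layer` entry in `deep` is one more hidden layer; nothing else changes, which is why “deep” reduces to a simple count of layers.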


Artificial Neural Network (left) vs Deep Neural Network (right)

In deep-learning networks, each layer of nodes trains on a distinct set of features based on the previous layer’s output. The further you advance into the neural net, the more complex the features your nodes can recognize, since they aggregate and recombine features from the previous layer.


This is known as feature hierarchy, and it is a hierarchy of increasing complexity and abstraction. It makes deep-learning networks capable of handling very large, high-dimensional data sets with billions of parameters that pass through nonlinear functions.

More details can be found here and here. This online book provides a lot of useful information and examples on machine learning, including Artificial Neural Networks and Deep Neural Networks. Differences between the two can be found in this thread.


In machine learning, a convolutional neural network (CNN, or ConvNet) is a type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex, whose individual neurons are arranged in such a way that they respond to overlapping regions tiling the visual field. Convolutional networks were inspired by biological processes and are variations of multilayer perceptrons designed to use minimal amounts of preprocessing. They have wide applications in image and video recognition, recommender systems and natural language processing.


In a CNN the input image passes through a series of convolutional, nonlinear, pooling (downsampling), and fully connected layers to produce an output. The output can be a single class or a probability over classes that best describes the image. More details can be found here and here. Some drawbacks of using a CNN for image classification are discussed in this thread.
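The layer series can be illustrated with a minimal plain-Python sketch (the toy image, kernel and weights below are made up for illustration), running a conv, ReLU, max-pool, fully connected, softmax pass:

```python
import math

def conv2d(image, kernel):
    """Valid 2-D convolution (really cross-correlation, as in most CNN libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [
        [
            sum(kernel[i][j] * image[y + i][x + j] for i in range(kh) for j in range(kw))
            for x in range(w - kw + 1)
        ]
        for y in range(h - kh + 1)
    ]

def relu(fmap):
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    return [
        [
            max(fmap[y + i][x + j] for i in range(size) for j in range(size))
            for x in range(0, len(fmap[0]) - size + 1, size)
        ]
        for y in range(0, len(fmap) - size + 1, size)
    ]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 6x6 "image", one 3x3 vertical-edge kernel, and a 2-class FC layer.
image = [[float((x + y) % 5) for x in range(6)] for y in range(6)]
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

features = max_pool(relu(conv2d(image, kernel)))      # conv -> ReLU -> pool
flat = [v for row in features for v in row]           # flatten: 2x2 -> 4 values
fc_weights = [[0.1] * len(flat), [-0.1] * len(flat)]  # fully connected layer
scores = [sum(w * v for w, v in zip(row, flat)) for row in fc_weights]
print(softmax(scores))  # class probabilities summing to 1
```

A real CNN repeats the conv/ReLU/pool stage many times with learned kernels; the structure of each stage is exactly this.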

A Region-based Convolutional Neural Network (R-CNN) is a visual object detection system that combines bottom-up region proposals with rich features computed by a convolutional neural network.


R-CNN was initially described in an arXiv tech report, with a newer version, Faster R-CNN, already available. Here is a MATLAB implementation of Faster R-CNN, and here a Python port of it.

This algorithm is implemented in popular frameworks like Caffe as described in this presentation from Berkeley Vision Lab.

Caffe & cuDNN

The Caffe framework from UC Berkeley is designed to let researchers create and explore CNNs and other Deep Neural Networks (DNNs) easily, while delivering the high speed needed for both experiments and industrial deployment. Caffe provides state-of-the-art modeling for advancing and deploying deep learning in research and industry, with support for a wide variety of architectures and efficient implementations of prediction and learning.

Caffe models and optimization are defined by a plain-text schema for ease of experimentation. For instance, a convolutional layer with 20 filters of size 5×5 is defined using the following text:
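In Caffe's prototxt format such a layer looks like the following sketch (the `conv1`/`data` layer and blob names follow the standard LeNet example and are assumptions here):

```
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 20    # 20 filters
    kernel_size: 5    # 5 x 5 kernels
    stride: 1
  }
}
```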

Every model layer is defined in this way. The LeNet tutorial included in the Caffe examples walks through defining and training Yann LeCun’s famous model for handwritten digit recognition. It can reach 99% accuracy in less than a minute with GPU training.

Deep networks require intense computation, so Caffe has taken advantage of both GPU and CPU processing from the project’s beginning. The new cuDNN library provides implementations, tuned and tested by NVIDIA, of the most computationally demanding routines needed for CNNs.

More details can be found here, here, here and here.

ImageNet & challenges

The ImageNet project is a large visual database designed for use in visual object recognition software research. As of 2016, over ten million URLs of images have been hand-annotated by ImageNet to indicate what objects are pictured; in at least one million of the images, bounding boxes are also provided. The database of annotations of third-party image URLs is freely available directly from ImageNet; however, the actual images are not owned by ImageNet. Since 2010, the ImageNet project has run an annual software contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where software programs compete to correctly classify and detect objects and scenes.

Similar contests include MIT’s Scene Parsing and Places challenges.

The models used in this post are trained on images from ImageNet, like the LSVRC2012 dataset.

Testing of OpenCV’s DNN CPU classification using GoogLeNet, a trained network from Caffe model zoo

The goal of this first test is to benchmark OpenCV’s DNN implementation, and it is based on this official example.

For the moment, OpenCV has implemented only the CPU version of its DNN module.

As input, OpenCV’s DNN needs a picture, a pre-trained model (text & binary) available online, and a synset (text), the same synset that was used when the model was trained. The synset file is used to match a synset ID to a category name.
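The role of the synset file can be sketched in plain Python (the sample lines and probabilities below are made up for illustration; the format, one synset ID followed by a category name per row, matches synset files like the one shipped with bvlc_googlenet):

```python
# A few sample lines in the synset file format: "<synset ID> <category name>"
synset_lines = [
    "n01440764 tench, Tinca tinca",
    "n02691156 airliner",
    "n02084071 dog",
]

# The class ID is simply the row index in the synset file.
id_to_name = {i: line.split(" ", 1)[1] for i, line in enumerate(synset_lines)}

# The net outputs one probability per class; the best class is the argmax.
probabilities = [0.01, 0.97, 0.02]
best = max(range(len(probabilities)), key=probabilities.__getitem__)
print(id_to_name[best], probabilities[best])  # airliner 0.97
```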

bvlc_googlenet is a CNN model trained on the LSVRC2012 dataset, which has over 1 million pictures from 1000 categories.

More details are available in the section on building your own model.

Classification of img_color_0.jpg



Classification of img_color_1.jpg



Classification of img_color_3.jpg



Classification of img_color_6.jpg



At least two important facts can be seen from the logs above. First, the time to classify remains the same no matter what the image size is; this happens because the image is first resized to a fixed size before being fed to the same net. Second, the precision can drop significantly when the images are distorted. This can be seen by comparing the first two pictures: by removing only a small part of the plane in the second picture, the probability dropped from 99.97% to 20.40%.

Testing of Caffe CPU and GPU/CUDA classification using GoogLeNet, a trained network from Caffe model zoo

The goal of this second test is to benchmark Caffe’s DNN in both its CPU and GPU implementations, and it is based on this official example.

As input, Caffe’s DNN needs a picture, a pre-trained model (text & binary & mean) which is available online as explained here, and a synset (text), the same synset that was used when the model was trained. The synset file is used to match a synset ID to a category name.

This second test uses the same bvlc_googlenet CNN model as the first OpenCV test, trained on the same LSVRC2012 dataset with over 1 million pictures from 1000 categories.

Additionally, Caffe needs a mean file.
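What the mean file contributes can be sketched in plain Python; the per-channel BGR mean values below are assumptions for illustration (the real values come from the binary mean file):

```python
# Per-channel BGR means (assumed values; the real ones come from the mean file).
mean_bgr = (104.0, 117.0, 123.0)

def subtract_mean(pixel):
    """Center one BGR pixel around zero, as Caffe does before the first layer."""
    return tuple(channel - m for channel, m in zip(pixel, mean_bgr))

print(subtract_mean((104.0, 117.0, 123.0)))  # -> (0.0, 0.0, 0.0)
print(subtract_mean((255.0, 255.0, 255.0)))  # -> (151.0, 138.0, 132.0)
```

Subtracting the dataset mean centers the input distribution around zero, which is why the model expects at prediction time the same mean it was trained with.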

Caffe can run on the CPU only, on an NVIDIA GPU if present, or on a mixture of the two, where some layers are implemented on the CPU and others on the GPU.

More details are available in the section on building your own model.

In this second test the same pictures are classified as in the first test.

Classification of img_color_0.jpg

Classification of img_color_1.jpg

Classification of img_color_3.jpg

Classification of img_color_6.jpg

Profiling of Caffe GPU on img_color_0.jpg

Caffe has two ways of running on an NVIDIA GPU. The first uses Caffe’s own API for accessing the GPU, and the second forwards the calls to the highly optimized cuDNN library implemented by NVIDIA. This example runs using Caffe’s own API.

As can be seen in the picture below, the computing part of the classification takes only ~7 ms, while the remaining ~20 ms is spent copying data between the CPU’s and the GPU’s memories.


Comparing the logs shows that when running on the GPU the classification time dropped from ~900 ms to ~30 ms, which makes this GPU implementation applicable to real-time processing applications.
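The figures in this section can be cross-checked with a bit of arithmetic (a plain-Python sketch using the ~900 ms, ~30 ms and ~7 ms values reported above):

```python
cpu_ms = 900.0    # CPU classification time from the logs
gpu_ms = 30.0     # total GPU classification time
compute_ms = 7.0  # pure computing part on the GPU

speedup = cpu_ms / gpu_ms      # 30x faster on the GPU
copy_ms = gpu_ms - compute_ms  # roughly the ~20 ms spent on CPU<->GPU copies
fps = 1000.0 / gpu_ms          # achievable frame rate

print(speedup)  # 30.0
print(copy_ms)  # 23.0
print(fps)      # ~33 frames per second, hence usable in real time
```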

Building, training and testing of own Caffe CNN model starting from the Oxford 102 category flower dataset

This part explains how to configure a Caffe model, how to build the image dataset (training, validation and testing), and how to train a model on a custom image dataset.

To make it easy, a prototype model was created and it is located under ./classification_dnn/data/model_prototype/.

It is based on the AlexNet CNN (BVLC Reference CaffeNet in the Model Zoo), changed to classify only 7 classes. A very small image training dataset was also created and is located at /model_prototype/jpg/. The training dataset has only 56 pictures and is used as follows: 5×7 (training), 2×7 (validation), 1×7 (testing).

The model directory should look like this after it is trained.


The model has the following files:

  • solver.prototxt : this defines the strategy with which the model is trained, with parameters like the total number of iterations.
  • train_val.prototxt : this is the configuration file used to train the model, with paths to images, layers and their connections, the size of the output (number of classes), etc.
  • deploy.prototxt : this is a copy of train_val.prototxt and it is used in production.
  • imagenet_mean.binaryproto : the mean image file every Caffe model needs; it can be generated as explained here.
  • snapshot_iter_xx.caffemodel :  this is a temporary model after xx iterations were executed.
  • snapshot_iter_xx.solverstate :  this file stores the state of the training process after xx iterations, and is useful to stop and resume the training for huge datasets which could take days to finish.
  • synset_words.txt : it is used to match a class ID to a class name. The row index in synset_words.txt represents the class ID, starting at 0, so the third row means classId=2.
  • train.txt : lists the images used to train the model and has the format “…/jpg/image_06759.jpg 0”. The number after image path is the class ID.
  • valid.txt : lists the images used to validate the model and has the same format as train.txt.
  • test.txt : lists the images used to test the model and has the same format as train.txt. Here is a discussion on what training, validation and testing mean and how to choose their sizes.
  • jpg/*.jpg : all the dataset images together (train, valid and test).
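The 5/2/1-per-class split and the list-file format described above can be sketched in plain Python (the directory layout and image names below are hypothetical; the real dataset uses the paths listed in train.txt):

```python
# 7 classes x 8 images each = 56 pictures, as in the prototype dataset.
# Image names are hypothetical, for illustration only.
images = {class_id: [f"../jpg/image_{class_id:02d}{i:02d}.jpg" for i in range(8)]
          for class_id in range(7)}

train, valid, test = [], [], []
for class_id, files in images.items():
    # Per class: 5 images for training, 2 for validation, 1 for testing.
    train += [f"{path} {class_id}" for path in files[:5]]
    valid += [f"{path} {class_id}" for path in files[5:7]]
    test  += [f"{path} {class_id}" for path in files[7:]]

print(len(train), len(valid), len(test))  # 35 14 7
print(train[0])  # each line: "<image path> <class ID>"
```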

To build this model, the following command must be called:
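As a sketch, training with Caffe’s command-line tool typically looks like the following (the solver path is an assumption based on this prototype’s directory layout):

```
caffe train --solver=./classification_dnn/data/model_prototype/solver.prototxt
```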

It should take only several seconds to train the model with these 56 pictures.