Discriminative Compact Pyramids for Object and Scene Recognition

Problem Domain

The spatial pyramid scheme proposed by [2] has recently proven very successful. A pyramid is formed by representing an image with weighted multi-resolution histograms: the image is repeatedly sub-divided into increasingly finer sub-regions by doubling the number of divisions along each axis, and histograms of features are computed over the resulting sub-regions. The histograms of all sub-regions at all levels are then concatenated to form the final image representation. For example, a level-2 spatial pyramid is constructed by concatenating a total of 1 + 4 + 16 = 21 histograms.
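The construction above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a precomputed 2-D map of visual-word assignments, and it omits the per-level weighting used in [2]; all names are illustrative.

```python
import numpy as np

def spatial_pyramid(word_map, vocab_size, levels=3):
    """Concatenate visual-word histograms over a spatial pyramid.

    word_map: 2-D integer array of visual-word indices, one per
    feature location. Per-level weighting (as in [2]) is omitted.
    """
    h, w = word_map.shape
    parts = []
    for level in range(levels):
        n = 2 ** level  # n x n grid of sub-regions at this level
        for i in range(n):
            for j in range(n):
                region = word_map[i * h // n:(i + 1) * h // n,
                                  j * w // n:(j + 1) * w // n]
                parts.append(np.bincount(region.ravel(),
                                         minlength=vocab_size))
    return np.concatenate(parts)

# Toy example: an 8x8 map of word indices from a 4-word vocabulary.
wm = np.random.randint(0, 4, size=(8, 8))
pyr = spatial_pyramid(wm, vocab_size=4)
print(pyr.shape)  # (84,) = 21 regions x 4 words
```

With a three-level pyramid the 1 + 4 + 16 = 21 region histograms are simply stacked, so the output dimensionality is 21 times the vocabulary size.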

Although the spatial pyramid method achieves a notable performance gain, the resulting histogram is often an order of magnitude higher in dimensionality than its standard bag-of-words counterpart.


Here we test Divisive Information-Theoretic feature Clustering (DITC) [3] on the problem of finding a discriminative, compact pyramid image representation [1], using the 15-class scene recognition data set proposed by [2]. This data set contains 15 scene categories and 4485 images, divided into 1500 training and 2985 test images.

To construct the spatial pyramid representation, we use standard k-means to build a vocabulary of size 1024, followed by a three-level pyramid built with the publicly available Spatial Pyramid code. Each image is thus represented by a 21504-dimensional histogram, giving a 21504 x 4485 data matrix for the full data set.
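The dimensionality quoted above follows directly from the pyramid structure:

```python
vocab_size = 1024
regions = 1 + 4 + 16        # three-level pyramid: 1x1, 2x2, 4x4 grids
dim = regions * vocab_size  # histogram length per image
n_images = 4485
print(dim, (dim, n_images))  # 21504 (21504, 4485)
```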

DITC-based Pyramid Compression


Method         Size    Score (%)
Pyramid        21000   84.1
Pyramid DITC    5000   85.4
Pyramid DITC    1000   84.4
Pyramid DITC     500   84.2

Classification score (percentage) on the 15-class Scenes data set. The results demonstrate that DITC successfully compresses the vocabularies while preserving their discriminative power.

DITC Overview

DITC begins by partitioning the words into the desired number of clusters, k, during initialization: the words are first clustered into l clusters, where l is the number of classes, and each of these clusters is then split arbitrarily into k/l clusters.
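A minimal sketch of this initialization, assuming the per-word class posteriors p(c|w) are available as a NumPy array and that k is divisible by l; the function name and the round-robin split are illustrative choices, not prescribed by [3]:

```python
import numpy as np

def ditc_init(p_c_given_w, k):
    """Split words into k initial clusters: assign each word to the
    class it predicts most strongly (l clusters), then split each
    class cluster arbitrarily (round-robin) into k // l sub-clusters."""
    n_words, l = p_c_given_w.shape
    splits = k // l
    best_class = p_c_given_w.argmax(axis=1)
    labels = np.empty(n_words, dtype=int)
    for c in range(l):
        members = np.flatnonzero(best_class == c)
        for rank, t in enumerate(members):
            labels[t] = c * splits + rank % splits
    return labels

# Toy example: 12 words, 3 classes, compress to k = 6 clusters.
rng = np.random.default_rng(0)
p = rng.random((12, 3))
p /= p.sum(axis=1, keepdims=True)
init = ditc_init(p, k=6)
print(init)
```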

The algorithm then iteratively improves the quality of these clusters. Each iteration monotonically decreases the loss in mutual information between the words (w) and the class labels (c) incurred by the clustering, so the algorithm is guaranteed to terminate at a local minimum in a finite number of iterations.

We apply DITC to compress the input words w_t into the optimal clusters W_j. The KL divergence is the measure of similarity between the class-conditional distributions of input words and word clusters.
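The refinement loop can be sketched as follows: each word w_t is reassigned to the cluster W_j whose class-conditional distribution p(C|W_j) is closest in KL divergence, and the cluster distributions are then recomputed. This is a simplified illustration of [3], not the paper's code; p_w holds the word priors p(w_t), and all names are assumptions of this sketch.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def ditc_refine(p_c_given_w, p_w, labels, k, n_iter=50):
    """Alternate between updating the cluster class-conditionals
    p(C|W_j) and reassigning each word to its KL-nearest cluster.
    Each pass monotonically decreases the mutual-information loss."""
    n_words, n_classes = p_c_given_w.shape
    for _ in range(n_iter):
        q = np.full((k, n_classes), 1.0 / n_classes)
        for j in range(k):
            m = labels == j
            if m.any():  # p(C|W_j): prior-weighted mean over member words
                q[j] = (p_w[m, None] * p_c_given_w[m]).sum(0) / p_w[m].sum()
        new = np.array([min(range(k), key=lambda j: kl(p_c_given_w[t], q[j]))
                        for t in range(n_words)])
        if np.array_equal(new, labels):
            break  # no reassignments: a local minimum has been reached
        labels = new
    return labels

# Toy run: 12 words, 3 classes, uniform word priors, k = 6 clusters.
rng = np.random.default_rng(0)
p = rng.random((12, 3))
p /= p.sum(axis=1, keepdims=True)
p_w = np.full(12, 1 / 12)
labels = ditc_refine(p, p_w, rng.integers(0, 6, 12), k=6)
print(labels)
```

Weighting each word's divergence by its prior p(w_t), as in the full DITC objective, does not change the per-word argmin, which is why the sketch can drop it from the reassignment step.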

Code Available

Data Available

We also provide the compressed histograms for the data sets used in our paper:


[1] N. Elfiky, F. Shahbaz Khan, J. van de Weijer, J. Gonzàlez. Discriminative Compact Pyramids for Object and Scene Recognition. Pattern Recognition, doi: 10.1016/j.patcog.2011.09.020, 2011.

[2] S. Lazebnik, C. Schmid, J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc.: CVPR, 2006.

[3] I. S. Dhillon, S. Mallela, R. Kumar, I. Guyon, A. Elisseeff. A divisive information-theoretic feature clustering algorithm for text classification. JMLR, 2003.