Chapter 12. Classification with Caltech 101

In this tutorial, we’ll go through the steps required to build and evaluate a near state-of-the-art image classifier. Although for the purposes of this tutorial we’re using features extracted from images, everything you’ll learn about using classifiers can be applied to features extracted from other forms of media.

To get started you’ll need a new class in an existing OpenIMAJ project, or a new project created with the archetype. The first thing we need is a dataset of images to work with. For this tutorial we’ll use a well-known set of labelled images called the Caltech 101 dataset. The Caltech 101 dataset contains labelled images of 101 object classes together with a set of background images. OpenIMAJ has built-in support for working with the Caltech 101 dataset, and will even automatically download the dataset for you. To use it, enter the following code:

GroupedDataset<String, VFSListDataset<Record<FImage>>, Record<FImage>> allData = 
			Caltech101.getData(ImageUtilities.FIMAGE_READER);

You’ll remember from the image datasets tutorial that GroupedDatasets are Java Maps with a few extra features. In this case, our allData object is a GroupedDataset with String keys and the values are lists (actually VFSListDatasets) of Record objects which are themselves typed on FImages. The Record class holds metadata about each Caltech 101 image. Records have a method called getImage() that will return the actual image in the format specified by the generic type of the Record (i.e. FImage).

For this tutorial we’ll work with a subset of the classes in the dataset to minimise the time it takes our program to run. We can create a subset of groups in a GroupedDataset using the GroupSampler class:

GroupedDataset<String, ListDataset<Record<FImage>>, Record<FImage>> data = 
			GroupSampler.sample(allData, 5, false);

This creates a new dataset called data from the first 5 classes in the allData dataset. To do an experimental evaluation with the dataset we need to create two sets of images: a training set from which we’ll learn the classifier, and a testing set with which we’ll evaluate it. The common approach with the Caltech 101 dataset is to choose a fixed number of training and testing instances for each class of images. Programmatically, this can be achieved using the GroupedRandomSplitter class:

GroupedRandomSplitter<String, Record<FImage>> splits = 
			new GroupedRandomSplitter<String, Record<FImage>>(data, 15, 0, 15);

In this case, we’ve created a training dataset with 15 images per group, and 15 testing images per group. The zero in the constructor is the number of validation images which we won’t use in this tutorial. If you take a look at the GroupedRandomSplitter class you’ll see there are methods to get the training, validation and test datasets.
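To make the idea of a per-group split concrete, here is a minimal plain-Java sketch. It is not the OpenIMAJ implementation: the class name PerGroupSplitSketch, the Split holder and the seed parameter are hypothetical names introduced purely for illustration. Each group is shuffled independently, the first numTrain items go to training and the next numTest items go to testing, so the two subsets never overlap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Plain-Java illustration of a per-group random split (hypothetical, not OpenIMAJ code). */
final class PerGroupSplitSketch {
    /** Holds the two disjoint subsets produced for every group. */
    static final class Split<T> {
        final Map<String, List<T>> training = new LinkedHashMap<>();
        final Map<String, List<T>> testing = new LinkedHashMap<>();
    }

    static <T> Split<T> split(Map<String, List<T>> groups, int numTrain, int numTest, long seed) {
        Random rng = new Random(seed);
        Split<T> result = new Split<>();
        for (Map.Entry<String, List<T>> group : groups.entrySet()) {
            // Copy before shuffling so the caller's list is left untouched
            List<T> shuffled = new ArrayList<>(group.getValue());
            if (shuffled.size() < numTrain + numTest) {
                throw new IllegalArgumentException("Group " + group.getKey() + " is too small");
            }
            Collections.shuffle(shuffled, rng);
            // Consecutive, non-overlapping slices of the shuffled list
            result.training.put(group.getKey(), new ArrayList<>(shuffled.subList(0, numTrain)));
            result.testing.put(group.getKey(),
                    new ArrayList<>(shuffled.subList(numTrain, numTrain + numTest)));
        }
        return result;
    }
}
```

Calling PerGroupSplitSketch.split(groups, 15, 15, seed) on a map of class names to image lists would mirror the 15/15 split used in this tutorial.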

Our next step is to consider how we’re going to extract suitable image features. For this tutorial we’re going to use a technique commonly known as the Pyramid Histogram of Words (PHOW). PHOW is itself based on the idea of extracting Dense SIFT features, quantising the SIFT features into visual words and then building spatial histograms of the visual word occurrences.

The Dense SIFT features are just like the features you used in the SIFT and feature matching tutorial, but rather than extracting the features at interest points detected using a difference-of-Gaussian, the features are extracted on a regular grid across the image. The idea of a visual word is quite simple: rather than representing each SIFT feature by a 128-dimensional feature vector, we represent it by an identifier. Similar features (i.e. those that have similar, but not necessarily identical, feature vectors) are assigned the same identifier. A common approach to assigning identifiers to features is to train a vector quantiser (just another fancy name for a type of classifier) using k-means, just like we did in the Introduction to clustering tutorial. To build a histogram of visual words (often called a Bag of Visual Words), all we have to do is count how many times each identifier appears in an image and store the counts in a histogram. If we’re building spatial histograms, then the process is the same, but we effectively cut the image into blocks and compute the histogram for each block independently before concatenating the histograms from all the blocks into a larger histogram.
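The following plain-Java sketch illustrates the two steps just described: assigning each feature to its nearest centroid, and counting the resulting identifiers. It is only an illustration under simplifying assumptions. The class VisualWordSketch and its methods are hypothetical, the features are double arrays rather than byte-valued SIFT descriptors, and the centroids are assumed to have been learnt already (for example by k-means).

```java
/** Plain-Java illustration of hard assignment and a bag-of-visual-words histogram. */
final class VisualWordSketch {
    /** Returns the index of the centroid closest (squared Euclidean distance) to the feature. */
    static int assign(double[] feature, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int k = 0; k < centroids.length; k++) {
            double dist = 0;
            for (int d = 0; d < feature.length; d++) {
                double diff = feature[d] - centroids[k][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = k;
            }
        }
        return best;
    }

    /** Counts how many features fall on each visual word (one histogram bin per centroid). */
    static int[] histogram(double[][] features, double[][] centroids) {
        int[] counts = new int[centroids.length];
        for (double[] feature : features) {
            counts[assign(feature, centroids)]++;
        }
        return counts;
    }
}
```

With two centroids at (0, 0) and (10, 10), the features (1, 0) and (0, 2) map to word 0 while (9, 9), (11, 10) and (8, 12) map to word 1, giving the histogram [2, 3].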

To get started writing the code for the PHOW implementation, we first need to construct our Dense SIFT extractor - we’re actually going to construct two objects: a DenseSIFT object and a PyramidDenseSIFT object:

DenseSIFT dsift = new DenseSIFT(5, 7);
PyramidDenseSIFT<FImage> pdsift = new PyramidDenseSIFT<FImage>(dsift, 6f, 7);

The PyramidDenseSIFT class takes a normal DenseSIFT instance and applies it to different sized windows on the regular sampling grid, although in this particular case we’re only using a single window size of 7 pixels.
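To visualise what sampling on a regular grid means, the sketch below enumerates the top-left corners of square patches placed every step pixels across an image. This is a deliberate simplification and not how DenseSIFT is implemented: the class DenseGridSketch, the patchSize parameter and the rule that a patch must fit entirely inside the image are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of regular-grid sampling positions (hypothetical helper). */
final class DenseGridSketch {
    /** Top-left corners {x, y} of every patch of side patchSize that fits fully inside the image. */
    static List<int[]> patchOrigins(int width, int height, int step, int patchSize) {
        List<int[]> origins = new ArrayList<>();
        for (int y = 0; y + patchSize <= height; y += step) {
            for (int x = 0; x + patchSize <= width; x += step) {
                origins.add(new int[] { x, y });
            }
        }
        return origins;
    }
}
```

For a 20 by 10 pixel image with a step of 5 and a patch size of 7, this yields three positions along the top row, at x = 0, 5 and 10. A smaller step gives more, heavily overlapping, patches and therefore more features per image.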

The next stage is to write some code to perform k-means clustering on a sample of SIFT features in order to build a HardAssigner that can assign features to identifiers. Let’s wrap up the code for this in a new method that takes as input a dataset and a PyramidDenseSIFT object:

static HardAssigner<byte[], float[], IntFloatPair> trainQuantiser(
	            Dataset<Record<FImage>> sample, PyramidDenseSIFT<FImage> pdsift)
{
    List<LocalFeatureList<ByteDSIFTKeypoint>> allkeys = new ArrayList<LocalFeatureList<ByteDSIFTKeypoint>>();

    // Extract the dense SIFT features of every image in the sample
    for (Record<FImage> rec : sample) {
        FImage img = rec.getImage();

        pdsift.analyseImage(img);
        allkeys.add(pdsift.getByteKeypoints(0.005f));
    }

    // Cap the number of per-image feature lists used for clustering
    if (allkeys.size() > 10000)
        allkeys = allkeys.subList(0, 10000);

    // Cluster the features into 300 visual words
    ByteKMeans km = ByteKMeans.createKDTreeEnsemble(300);
    DataSource<byte[]> datasource = new LocalFeatureListDataSource<ByteDSIFTKeypoint, byte[]>(allkeys);
    ByteCentroidsResult result = km.cluster(datasource);

    return result.defaultHardAssigner();
}

The above method extracts dense SIFT features from the images in the dataset (keeping at most the first 10000 per-image feature lists) and then clusters them into 300 separate classes. The method then returns a HardAssigner which can be used to assign SIFT features to identifiers. To use this method, add the following to your main method after the PyramidDenseSIFT construction:

HardAssigner<byte[], float[], IntFloatPair> assigner = 
			trainQuantiser(GroupedUniformRandomisedSampler.sample(splits.getTrainingDataset(), 30), pdsift);

Notice that we’ve used a GroupedUniformRandomisedSampler to get a random sample of 30 images across all the groups of the training set with which to train the quantiser. The next step is to write a FeatureExtractor implementation with which we can train our classifier:

static class PHOWExtractor implements FeatureExtractor<DoubleFV, Record<FImage>> {
    PyramidDenseSIFT<FImage> pdsift;
    HardAssigner<byte[], float[], IntFloatPair> assigner;

    public PHOWExtractor(PyramidDenseSIFT<FImage> pdsift, HardAssigner<byte[], float[], IntFloatPair> assigner)
    {
        this.pdsift = pdsift;
        this.assigner = assigner;
    }

    public DoubleFV extractFeature(Record<FImage> object) {
        FImage image = object.getImage();
        // The image must be analysed before its keypoints can be retrieved
        pdsift.analyseImage(image);

        BagOfVisualWords<byte[]> bovw = new BagOfVisualWords<byte[]>(assigner);

        // Compute one histogram per block of a 2x2 grid over the image
        BlockSpatialAggregator<byte[], SparseIntFV> spatial = new BlockSpatialAggregator<byte[], SparseIntFV>(
                bovw, 2, 2);

        return spatial.aggregate(pdsift.getByteKeypoints(0.015f), image.getBounds()).normaliseFV();
    }
}

This class uses a BlockSpatialAggregator together with a BagOfVisualWords to compute 4 histograms across the image (by splitting the image in two both horizontally and vertically). The BagOfVisualWords uses the HardAssigner to assign each Dense SIFT feature to a visual word and then computes the histogram. The resultant spatial histograms are then appended together and normalised before being returned. Back in the main method of our code we can construct an instance of our PHOWExtractor:

FeatureExtractor<DoubleFV, Record<FImage>> extractor = new PHOWExtractor(pdsift, assigner);
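If the block-wise aggregation is hard to picture, the plain-Java sketch below shows one way it could work. It is not the BlockSpatialAggregator implementation: the class SpatialHistogramSketch and its parameters are hypothetical, coordinates are assumed to be non-negative, and dividing by the total feature count (so the vector sums to 1) is an assumption made for illustration that may differ from what normaliseFV() does.

```java
/** Plain-Java illustration of block-wise spatial histograms (hypothetical helper). */
final class SpatialHistogramSketch {
    /**
     * words[i] is the visual word of the feature located at (xs[i], ys[i]).
     * Returns blocksX * blocksY histograms of length numWords, concatenated
     * row by row and scaled so that the whole vector sums to 1.
     */
    static double[] aggregate(int[] words, float[] xs, float[] ys, int numWords,
            float width, float height, int blocksX, int blocksY) {
        double[] out = new double[numWords * blocksX * blocksY];
        for (int i = 0; i < words.length; i++) {
            // Work out which block the feature falls in, clamping points on the far edge
            int bx = Math.min((int) (xs[i] / width * blocksX), blocksX - 1);
            int by = Math.min((int) (ys[i] / height * blocksY), blocksY - 1);
            int block = by * blocksX + bx;
            out[block * numWords + words[i]] += 1;
        }
        if (words.length > 0) {
            for (int j = 0; j < out.length; j++) {
                out[j] /= words.length;
            }
        }
        return out;
    }
}
```

With 300 visual words and a 2 by 2 grid, the resulting vector would have 1200 dimensions: four 300-bin histograms laid end to end.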

Now we’re ready to construct and train a classifier - we’ll use the linear classifier provided by the LiblinearAnnotator class:

LiblinearAnnotator<Record<FImage>, String> ann = new LiblinearAnnotator<Record<FImage>, String>(
		            extractor, Mode.MULTICLASS, SolverType.L2R_L2LOSS_SVC, 1.0, 0.00001);
ann.train(splits.getTrainingDataset());
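For intuition about what a multiclass linear classifier does at prediction time, here is a small plain-Java sketch. It assumes the common scheme in which each class has its own weight vector and bias and the class with the highest score wins. The class LinearMulticlassSketch is hypothetical and says nothing about how LiblinearAnnotator is implemented internally.

```java
/** Plain-Java illustration of prediction with a multiclass linear classifier. */
final class LinearMulticlassSketch {
    /** Returns the index of the class whose linear score (w . x + b) is largest. */
    static int predict(double[][] weights, double[] biases, double[] x) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < weights.length; c++) {
            double score = biases[c];
            for (int d = 0; d < x.length; d++) {
                score += weights[c][d] * x[d];
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }
}
```

Training is the process of choosing the weights and biases so that, on the training set, the correct class tends to receive the highest score.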

Finally, we can use the OpenIMAJ evaluation framework to perform an automated evaluation of our classifier’s accuracy for us:

ClassificationEvaluator<CMResult<String>, String, Record<FImage>> eval = 
			new ClassificationEvaluator<CMResult<String>, String, Record<FImage>>(
				ann, splits.getTestDataset(), new CMAnalyser<Record<FImage>, String>(CMAnalyser.Strategy.SINGLE));
Map<Record<FImage>, ClassificationResult<String>> guesses = eval.evaluate();
CMResult<String> result = eval.analyse(guesses);
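The analysis step summarises the guesses in a confusion matrix. As a rough illustration of what that involves, the plain-Java sketch below builds a confusion matrix from lists of actual and predicted labels and derives the overall accuracy from it. The class AccuracySketch is hypothetical and is not part of the OpenIMAJ evaluation framework. It assumes every label appears in the classes list.

```java
import java.util.List;

/** Plain-Java illustration of a confusion matrix and the accuracy derived from it. */
final class AccuracySketch {
    /** Rows are actual classes, columns are predicted classes. */
    static int[][] confusionMatrix(List<String> classes, List<String> actual, List<String> predicted) {
        int[][] cm = new int[classes.size()][classes.size()];
        for (int i = 0; i < actual.size(); i++) {
            cm[classes.indexOf(actual.get(i))][classes.indexOf(predicted.get(i))]++;
        }
        return cm;
    }

    /** Fraction of instances on the diagonal, i.e. those classified correctly. */
    static double accuracy(int[][] cm) {
        int correct = 0;
        int total = 0;
        for (int r = 0; r < cm.length; r++) {
            for (int c = 0; c < cm[r].length; c++) {
                total += cm[r][c];
                if (r == c) {
                    correct += cm[r][c];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }
}
```

For example, if three out of four test images are labelled correctly, the diagonal of the matrix sums to 3 and the accuracy is 0.75.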

12.1. Exercises

12.1.1. Exercise 1: Apply a Homogeneous Kernel Map

A Homogeneous Kernel Map transforms data into a compact linear representation such that applying a linear classifier approximates, to a high degree of accuracy, the application of a non-linear classifier over the original data. Try using the HomogeneousKernelMap class with a KernelType.Chi2 kernel and WindowType.Rectangular window on top of the PHOWExtractor feature extractor. What effect does this have on performance?

Tip

Construct a HomogeneousKernelMap and use the createWrappedExtractor() method to create a new feature extractor around the PHOWExtractor that applies the map.
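As background for this exercise, the sketch below computes the chi-squared kernel between two histograms directly, assuming the usual additive form in which each pair of bins contributes 2xy / (x + y). This is the kind of non-linear similarity that a homogeneous kernel map approximates with a linear one. The class Chi2KernelSketch is hypothetical and is not part of OpenIMAJ, and the sketch does not solve the exercise itself.

```java
/** Plain-Java illustration of the additive chi-squared kernel between two histograms. */
final class Chi2KernelSketch {
    /** Sums 2 * x_i * y_i / (x_i + y_i) over all bins, skipping bins that are empty in both. */
    static double chi2(double[] x, double[] y) {
        double k = 0;
        for (int i = 0; i < x.length; i++) {
            double denom = x[i] + y[i];
            if (denom > 0) {
                k += 2 * x[i] * y[i] / denom;
            }
        }
        return k;
    }
}
```

For histograms whose bins sum to 1, identical histograms score 1 and histograms with no bins in common score 0.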

12.1.2. Exercise 2: Feature caching

The DiskCachingFeatureExtractor class can be used to cache features extracted by a FeatureExtractor to disk. It will generate and save features if they don’t exist, or read from disk if they do. Try to incorporate the DiskCachingFeatureExtractor into your code. You’ll also need to save the HardAssigner using IOUtils.writeToFile and load it using IOUtils.readFromFile because the features must be kept with the same HardAssigner that created them.

12.1.3. Exercise 3: The whole dataset

Try running the code over all the classes in the Caltech 101 dataset. Also try increasing the number of visual words to 600, adding extra scales to the PyramidDenseSIFT (try [4, 6, 8, 10]), reducing the step size of the DenseSIFT to 3, and replacing the BlockSpatialAggregator with the PyramidSpatialAggregator using [2, 4] blocks. What level of classifier performance does this achieve?