In this tutorial, we’ll go through the steps required to build and evaluate a near state-of-the-art image classifier. Although for the purposes of this tutorial we’re using features extracted from images, everything you’ll learn about using classifiers can be applied to features extracted from other forms of media.
To get started you’ll need a new class in an existing OpenIMAJ project, or a new project created with the archetype. The first thing we need is a dataset of images with which we’ll work. For this tutorial we’ll use a well-known set of labelled images called the Caltech 101 dataset. The Caltech 101 dataset contains labelled images of 101 object classes together with a set of background images. OpenIMAJ has built-in support for working with the Caltech 101 dataset, and will even automatically download the dataset for you. To use it, enter the following code:
GroupedDataset<String, VFSListDataset<Record<FImage>>, Record<FImage>> allData = Caltech101.getData(ImageUtilities.FIMAGE_READER);
You’ll remember from the image datasets tutorial that GroupedDatasets are Java Maps with a few extra features. In this case, our allData object is a GroupedDataset with String keys, and the values are lists (actually VFSListDatasets) of Record objects which are themselves typed on FImages. The Record class holds metadata about each Caltech 101 image. Records have a method called getImage() that will return the actual image in the format specified by the generic type of the Record (i.e. FImage).
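If you’d like to poke around the dataset before going further, a few lines like the following rough sketch (it just picks an arbitrary group key and displays the first image in that group, using the DisplayUtilities class from the earlier tutorials) show how the pieces fit together:

String someClass = allData.getGroups().iterator().next();          // an arbitrary class name (group key)
ListDataset<Record<FImage>> examples = allData.get(someClass);      // the list of Records for that class
System.out.println(someClass + " contains " + examples.size() + " images");
DisplayUtilities.display(examples.get(0).getImage(), someClass);    // getImage() returns the FImage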
For this tutorial we’ll work with a subset of the classes in the dataset to minimise the time it takes our program to run. We can create a subset of groups in a GroupedDataset using the GroupSampler class:
GroupedDataset<String, ListDataset<Record<FImage>>, Record<FImage>> data = GroupSampler.sample(allData, 5, false);
This creates a new dataset called data from the first 5 classes in the allData dataset. To perform an experimental evaluation with the dataset we need to create two sets of images: a training set, which we’ll use to learn the classifier, and a testing set, with which we’ll evaluate the classifier. The common approach with the Caltech 101 dataset is to choose a number of training and testing instances for each class of images. Programmatically, this can be achieved using the GroupedRandomSplitter class:
GroupedRandomSplitter<String, Record<FImage>> splits = new GroupedRandomSplitter<String, Record<FImage>>(data, 15, 0, 15);
In this case, we’ve created a training dataset with 15 images per group and a testing dataset with 15 images per group. The zero in the constructor is the number of validation images, which we won’t use in this tutorial. If you take a look at the GroupedRandomSplitter class you’ll see there are methods to get the training, validation and test datasets.
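As a quick sanity check, the splits can be inspected directly; numInstances(), from the Dataset interface, counts the items across all groups, so with 5 classes of 15 images each both lines should report 75:

System.out.println("Training images: " + splits.getTrainingDataset().numInstances());
System.out.println("Testing images: " + splits.getTestDataset().numInstances());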
Our next step is to consider how we’re going to extract suitable image features. For this tutorial we’re going to use a technique commonly known as the Pyramid Histogram of Words (PHOW). PHOW is itself based on the idea of extracting Dense SIFT features, quantising the SIFT features into visual words and then building spatial histograms of the visual word occurrences.
The Dense SIFT features are just like the features you used in the “SIFT and feature matching” tutorial, but rather than extracting the features at interest points detected using a difference-of-Gaussian, the features are extracted on a regular grid across the image. The idea of a visual word is quite simple: rather than representing each SIFT feature by a 128-dimensional feature vector, we represent it by an identifier. Similar features (i.e. those that have similar, but not necessarily the same, feature vectors) are assigned the same identifier. A common approach to assigning identifiers to features is to train a vector quantiser (just another fancy name for a type of classifier) using k-means, just like we did in the “Introduction to clustering” tutorial. To build a histogram of visual words (often called a Bag of Visual Words), all we have to do is count up how many times each identifier appears in an image and store the values in a histogram, as the short sketch below illustrates. If we’re building spatial histograms, then the process is the same, but we effectively cut the image into blocks and compute the histogram for each block independently before concatenating the histograms from all the blocks into a larger histogram.
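To make the counting step concrete, here is a tiny illustrative snippet (plain Java, with made-up word identifiers rather than real quantised SIFT features) showing that a bag-of-visual-words histogram is nothing more than a count of how often each identifier occurs:

int numWords = 300;                        // size of the visual-word vocabulary
int[] wordIds = { 7, 42, 7, 133, 42, 7 };  // identifiers assigned to the features of one image
int[] histogram = new int[numWords];
for (int id : wordIds)
    histogram[id]++;                       // e.g. histogram[7] == 3, histogram[42] == 2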
To get started writing the code for the PHOW implementation, we first
need to construct our Dense SIFT extractor - we’re actually going to
construct two objects: a DenseSIFT
object and a
PyramidDenseSIFT
object:
DenseSIFT dsift = new DenseSIFT(5, 7);
PyramidDenseSIFT<FImage> pdsift = new PyramidDenseSIFT<FImage>(dsift, 6f, 7);
The PyramidDenseSIFT class takes a normal DenseSIFT instance and applies it to different-sized windows on the regular sampling grid, although in this particular case we’re only using a single window size of 7 pixels.
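If you want to see the extractor in action before going further, a rough sketch like the following (the 0.005f energy threshold matches the one used in the quantiser-training code below) analyses a single training image and reports how many dense SIFT features were found:

Record<FImage> sampleRecord = splits.getTrainingDataset().getRandomInstance();
pdsift.analyseImage(sampleRecord.getImage());
System.out.println("Extracted " + pdsift.getByteKeypoints(0.005f).size() + " dense SIFT features");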
The next stage is to write some code to perform
K-Means clustering on a sample of
SIFT features in order to build a HardAssigner
that
can assign features to identifiers. Let’s wrap up the code for this in
a new method that takes as input a dataset and a
PyramidDenseSIFT
object:
static HardAssigner<byte[], float[], IntFloatPair> trainQuantiser(
        Dataset<Record<FImage>> sample, PyramidDenseSIFT<FImage> pdsift)
{
    List<LocalFeatureList<ByteDSIFTKeypoint>> allkeys = new ArrayList<LocalFeatureList<ByteDSIFTKeypoint>>();

    for (Record<FImage> rec : sample) {
        FImage img = rec.getImage();

        pdsift.analyseImage(img);
        allkeys.add(pdsift.getByteKeypoints(0.005f));
    }

    if (allkeys.size() > 10000)
        allkeys = allkeys.subList(0, 10000);

    ByteKMeans km = ByteKMeans.createKDTreeEnsemble(300);
    DataSource<byte[]> datasource = new LocalFeatureListDataSource<ByteDSIFTKeypoint, byte[]>(allkeys);
    ByteCentroidsResult result = km.cluster(datasource);

    return result.defaultHardAssigner();
}
The above method extracts the first 10000 dense SIFT features from the
images in the dataset, and then clusters them into 300 separate
classes. The method then returns a HardAssigner
which can be used to assign SIFT features to identifiers. To use this
method, add the following to your main method after the
PyramidDenseSIFT
construction:
HardAssigner<byte[], float[], IntFloatPair> assigner = trainQuantiser(GroupedUniformRandomisedSampler.sample(splits.getTrainingDataset(), 30), pdsift);
Notice that we’ve used a
GroupedUniformRandomisedSampler
to get a random
sample of 30 images across all the groups of the training set with
which to train the quantiser. The next step is to write a
FeatureExtractor
implementation with which we can
train our classifier:
static class PHOWExtractor implements FeatureExtractor<DoubleFV, Record<FImage>> {
    PyramidDenseSIFT<FImage> pdsift;
    HardAssigner<byte[], float[], IntFloatPair> assigner;

    public PHOWExtractor(PyramidDenseSIFT<FImage> pdsift, HardAssigner<byte[], float[], IntFloatPair> assigner)
    {
        this.pdsift = pdsift;
        this.assigner = assigner;
    }

    public DoubleFV extractFeature(Record<FImage> object) {
        FImage image = object.getImage();
        pdsift.analyseImage(image);

        BagOfVisualWords<byte[]> bovw = new BagOfVisualWords<byte[]>(assigner);

        BlockSpatialAggregator<byte[], SparseIntFV> spatial = new BlockSpatialAggregator<byte[], SparseIntFV>(
                bovw, 2, 2);

        return spatial.aggregate(pdsift.getByteKeypoints(0.015f), image.getBounds()).normaliseFV();
    }
}
This class uses a BlockSpatialAggregator together with a BagOfVisualWords to compute 4 histograms across the image (by breaking the image into 2 blocks both horizontally and vertically). The BagOfVisualWords uses the HardAssigner to assign each Dense SIFT feature to a visual word and then compute the histogram. The resultant spatial histograms are then appended together and normalised before being returned. Back in the main method of our code we can construct an instance of our PHOWExtractor:
FeatureExtractor<DoubleFV, Record<FImage>> extractor = new PHOWExtractor(pdsift, assigner);
Now we’re ready to construct and train a classifier - we’ll use the
linear classifier provided by the
LiblinearAnnotator
class:
LiblinearAnnotator<Record<FImage>, String> ann = new LiblinearAnnotator<Record<FImage>, String>(
            extractor, Mode.MULTICLASS, SolverType.L2R_L2LOSS_SVC, 1.0, 0.00001);
ann.train(splits.getTrainingDataset());
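At this point the trained annotator can already be used on its own. As a rough sketch (assuming, as the evaluator below does, that the annotator can be used as a Classifier, and that ClassificationResult exposes a getPredictedClasses() accessor), classifying a single test image looks something like this:

Record<FImage> testRecord = splits.getTestDataset().getRandomInstance();
ClassificationResult<String> prediction = ann.classify(testRecord);
System.out.println("Predicted: " + prediction.getPredictedClasses());   // assumed accessor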
Finally, we can use the OpenIMAJ evaluation framework to perform an automated evaluation of our classifier’s accuracy for us:
ClassificationEvaluator<CMResult<String>, String, Record<FImage>> eval =
        new ClassificationEvaluator<CMResult<String>, String, Record<FImage>>(
            ann, splits.getTestDataset(), new CMAnalyser<Record<FImage>, String>(CMAnalyser.Strategy.SINGLE));

Map<Record<FImage>, ClassificationResult<String>> guesses = eval.evaluate();
CMResult<String> result = eval.analyse(guesses);
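To see how well the classifier did, the analysed result can simply be printed; getDetailReport() is assumed here from the AnalysisResult interface that CMResult implements:

System.out.println(result.getDetailReport());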
A Homogeneous Kernel Map transforms data into a compact linear representation such that applying a linear classifier approximates, to a high degree of accuracy, the application of a non-linear classifier over the original data. Try using the HomogeneousKernelMap class with a KernelType.Chi2 kernel and WindowType.Rectangular window on top of the PHOWExtractor feature extractor. What effect does this have on performance?
Tip: Construct a HomogeneousKernelMap and use its createWrappedExtractor() method to create a new feature extractor around the PHOWExtractor that applies the map.
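Put together, the wrapped extractor might look something like the following sketch (the two-argument HomogeneousKernelMap constructor is assumed; check the Javadoc for the exact signature). The wrappedExtractor is then passed to the LiblinearAnnotator in place of the plain extractor:

HomogeneousKernelMap map = new HomogeneousKernelMap(KernelType.Chi2, WindowType.Rectangular);
FeatureExtractor<DoubleFV, Record<FImage>> wrappedExtractor =
        map.createWrappedExtractor(new PHOWExtractor(pdsift, assigner));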
The DiskCachingFeatureExtractor class can be used to cache features extracted by a FeatureExtractor to disk. It will generate and save features if they don’t exist, or read from disk if they do. Try to incorporate the DiskCachingFeatureExtractor into your code. You’ll also need to save the HardAssigner using IOUtils.writeToFile and load it using IOUtils.readFromFile because the features must be kept with the same HardAssigner that created them.
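A rough sketch of how this could fit together is shown below; the cache directory and file name are hypothetical, and the exact DiskCachingFeatureExtractor constructor should be checked against the Javadoc:

// Cache extracted PHOW features under a (hypothetical) local directory
FeatureExtractor<DoubleFV, Record<FImage>> cachedExtractor =
        new DiskCachingFeatureExtractor<DoubleFV, Record<FImage>>(new File("phow-cache"), extractor);

// Persist the HardAssigner so cached features always pair with the quantiser that produced them
IOUtils.writeToFile(assigner, new File("assigner.dat"));
// ...and on later runs load it back instead of retraining:
// HardAssigner<byte[], float[], IntFloatPair> assigner = IOUtils.readFromFile(new File("assigner.dat"));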
Try running the code over all the classes in the Caltech 101 dataset. Also try increasing the number of visual words to 600, adding extra scales to the PyramidDenseSIFT (try [4, 6, 8, 10] and reduce the step-size of the DenseSIFT to 3), and instead of using the BlockSpatialAggregator, try the PyramidSpatialAggregator with [2, 4] blocks. What level of classifier performance does this achieve?
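As a hedged sketch, the suggested modifications amount to swapping in constructions like the following, each replacing its counterpart in the earlier code (the PyramidSpatialAggregator line belongs inside PHOWExtractor.extractFeature(), where bovw is defined):

DenseSIFT dsift = new DenseSIFT(3, 7);                                   // step size reduced to 3
PyramidDenseSIFT<FImage> pdsift = new PyramidDenseSIFT<FImage>(dsift, 6f, 4, 6, 8, 10);

ByteKMeans km = ByteKMeans.createKDTreeEnsemble(600);                    // 600 visual words (in trainQuantiser)

PyramidSpatialAggregator<byte[], SparseIntFV> spatial =
        new PyramidSpatialAggregator<byte[], SparseIntFV>(bovw, 2, 4);   // [2, 4] blocks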