Chapter 14. Parallel Processing

Modern computers tend to have multiple processor cores. By making use of all the processing power of your machine, your programs can run much faster. Writing code that takes advantage of multiple processors in Java usually involves either manually creating and managing threads, or using a higher-level concurrency abstraction, such as the classes found in the excellent java.util.concurrent package.

A common use case for multithreading in multimedia analysis is the application of an operation to a collection of objects - for example, the extraction of features from a list of images. This kind of task can be effectively parallelised using Java’s concurrent classes, but requires a large amount of boilerplate code to be written each time, as the sketch below illustrates. To help reduce the programming overhead associated with this kind of parallel processing, OpenIMAJ includes a Parallel class containing a number of methods that let you write efficient multithreaded loops simply.
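For comparison, here is roughly what such a loop looks like when written directly against java.util.concurrent. This is a generic sketch rather than OpenIMAJ code (the class and variable names are illustrative only):

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ManualThreadingSketch {
    public static void main(String[] args) throws InterruptedException {
        List<String> items = Arrays.asList("a", "b", "c", "d");

        // Create a pool with one thread per available processor core
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // Submit one task per item
        for (final String item : items) {
            pool.execute(new Runnable() {
                public void run() {
                    System.out.println(item);
                }
            });
        }

        // Wait for all the tasks to finish before continuing
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}

Every one of those steps - sizing the pool, submitting the tasks, shutting down and waiting - is boilerplate that the Parallel class handles for you.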

To get started playing with the Parallel class, create a new OpenIMAJ project using the archetype, or add a new class and main method to an existing project. Firstly, let’s see how we can write the parallel equivalent of a for (int i=0; i<10; i++) loop:

Parallel.forIndex(0, 10, 1, new Operation<Integer>() {
    public void perform(Integer i) {
        System.out.println(i);
    }
});

Try running this code; you’ll see that all the numbers from 0 to 9 are printed, although not necessarily in the correct order. If you run the code again, you’ll probably see the order change. It’s important to note that when parallelising a loop, the order of operations is not deterministic.
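A practical consequence of this non-determinism is that any state shared between iterations must be thread-safe. For example, if we wanted to collect results from the loop body, we could wrap a list with Collections.synchronizedList. The following minimal sketch (our own example, not from the OpenIMAJ documentation) demonstrates the idea:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.openimaj.util.function.Operation;
import org.openimaj.util.parallel.Parallel;

public class CollectResults {
    public static void main(String[] args) {
        // Wrap the list so that concurrent add() calls from different threads are safe
        final List<Integer> squares = Collections.synchronizedList(new ArrayList<Integer>());

        Parallel.forIndex(0, 10, 1, new Operation<Integer>() {
            public void perform(Integer i) {
                squares.add(i * i);
            }
        });

        // All ten values will be present, but their order depends on thread scheduling
        System.out.println(squares);
    }
}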

Now let’s explore a more realistic scenario in which we might want to apply parallelisation. We’ll build a program to compute the normalised average of the images in a dataset. Firstly, let’s write the non-parallel version of the code. We’ll start by loading a dataset of images; in this case we’ll use the CalTech 101 dataset we used in the Classification 101 tutorial, but rather than loading the record objects, we’ll load the images directly:

VFSGroupDataset<MBFImage> allImages = Caltech101.getImages(ImageUtilities.MBFIMAGE_READER);
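If you’re writing this program from scratch, note that getImages performs disk and network I/O, so the exception it declares must be propagated. A minimal skeleton might look like the following; the package names in the imports are a best guess at the OpenIMAJ 1.x layout (your IDE can resolve them if they differ in your version):

import java.io.IOException;

import org.openimaj.data.dataset.VFSGroupDataset;
import org.openimaj.image.ImageUtilities;
import org.openimaj.image.MBFImage;
import org.openimaj.image.annotation.evaluation.datasets.Caltech101;

public class ParallelProcessingTutorial {
    public static void main(String[] args) throws IOException {
        // getImages downloads/reads the dataset, hence the declared IOException
        VFSGroupDataset<MBFImage> allImages =
                Caltech101.getImages(ImageUtilities.MBFIMAGE_READER);
        System.out.println("Loaded " + allImages.numInstances() + " images");
    }
}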

We’ll also restrict ourselves to using a subset of the first 8 groups (image categories) in the dataset:

GroupedDataset<String, ListDataset<MBFImage>, MBFImage> images = GroupSampler.sample(allImages, 8, false);

We now want to do the processing. For each group we want to build the average image. We do this by looping through the images in the group, resampling and normalising each image before drawing it in the centre of a white image, and then adding the result to an accumulator. At the end of the loop we divide the accumulated image by the number of samples used to create it. The code to perform these operations would look like this:

final List<MBFImage> output = new ArrayList<MBFImage>();
final ResizeProcessor resize = new ResizeProcessor(200);
for (ListDataset<MBFImage> clzImages : images.values()) {
    MBFImage current = new MBFImage(200, 200, ColourSpace.RGB);

    for (MBFImage i : clzImages) {
        MBFImage tmp = new MBFImage(200, 200, ColourSpace.RGB);
        tmp.fill(RGBColour.WHITE);

        MBFImage small = i.process(resize).normalise();
        int x = (200 - small.getWidth()) / 2;
        int y = (200 - small.getHeight()) / 2;
        tmp.drawImage(small, x, y);

        current.addInplace(tmp);
    }
    current.divideInplace((float) clzImages.size());
    output.add(current);
}

We can use the DisplayUtilities class to display the results:

DisplayUtilities.display("Images", output);

Before we try running the program, we’ll also add some timing code to see how long it takes. Before the outer loop (the one over the groups provided by images.values()), add the following:

Timer t1 = Timer.timer();

and, after the outer loop (just before the display code) add the following:

System.out.println("Time: " + t1.duration() + "ms");

You can now run the code, and after a short while (on my laptop it takes about 7248 milliseconds, or roughly 7.2 seconds) the resultant averaged images will be displayed. An example is shown below. Can you tell what object is depicted by each average image? For many object types in the CalTech 101 dataset it is quite easy, and this is one of the reasons that the dataset has been criticised in the literature as being too easy for classification experiments.

Now we’ll look at parallelising this code. We essentially have three options: we could parallelise the outer loop, parallelise the inner one, or parallelise both. There are many trade-offs to consider when deciding how best to parallelise code, including memory usage and task granularity. For the purposes of this tutorial, we’ll work with the inner loop. Using the Parallel.forEach method, we can re-write the inner loop as follows:

Parallel.forEach(clzImages, new Operation<MBFImage>() {
    public void perform(MBFImage i) {
        final MBFImage tmp = new MBFImage(200, 200, ColourSpace.RGB);
        tmp.fill(RGBColour.WHITE);

        final MBFImage small = i.process(resize).normalise();
        final int x = (200 - small.getWidth()) / 2;
        final int y = (200 - small.getHeight()) / 2;
        tmp.drawImage(small, x, y);

        synchronized (current) {
            current.addInplace(tmp);
        }
    }
});

For this to compile, you’ll also need to make the current image final by adding the final keyword to the line in which it is created:

final MBFImage current = new MBFImage(200, 200, ColourSpace.RGB);

Notice that in the parallel version of the loop we have to put a synchronized block around the part where we accumulate into the current image. This stops multiple threads from trying to alter the image at the same time. If we now run the code, we should see an improvement in the time it takes to compute the averages. You might also need to increase the amount of memory available to Java (for example, by passing the -Xmx option to the JVM) for the program to run successfully.

On my laptop, with 8 CPU cores, the running time drops to ~3100 milliseconds. You might be thinking that, because we have gone from 1 to 8 CPU cores, the speed-up would be greater; there are many reasons why this is not the case in practice, but the biggest is that the process we’re running is rather I/O bound: the underlying dataset classes retrieve the images from disk each time they’re needed. A second issue is that there are a couple of slight bottlenecks in our code; in particular, notice that we create a temporary image for each image we process, and that we also have to synchronise on the current accumulator image for each image. We can factor out these problems by modifying the code to use the partitioned variant of the for-each loop in the Parallel class. Instead of giving each thread a single image at a time, the partitioned variant will feed each thread a collection of images (provided as an Iterator) to process:

Parallel.forEachPartitioned(new RangePartitioner<MBFImage>(clzImages), new Operation<Iterator<MBFImage>>() {
    public void perform(Iterator<MBFImage> it) {
        MBFImage tmpAccum = new MBFImage(200, 200, ColourSpace.RGB);
        MBFImage tmp = new MBFImage(200, 200, ColourSpace.RGB);

        while (it.hasNext()) {
            final MBFImage i = it.next();
            tmp.fill(RGBColour.WHITE);

            final MBFImage small = i.process(resize).normalise();
            final int x = (200 - small.getWidth()) / 2;
            final int y = (200 - small.getHeight()) / 2;
            tmp.drawImage(small, x, y);
            tmpAccum.addInplace(tmp);
        }
        synchronized (current) {
            current.addInplace(tmpAccum);
        }
    }
});

The RangePartitioner in the above code will break the images in clzImages into as many (approximately equally sized) chunks as there are available CPU cores. This means that the perform method will be called far fewer times, but each call will do more work; what we’ve done is called increasing the task granularity. Notice that we created an extra working image called tmpAccum to hold the intermediate results, which slightly increases memory usage. Also notice that whilst we still have to synchronise on the current image, we do so far fewer times (once per CPU core, in fact). Try running the improved version of the code; on my laptop it reduces the total running time even further, to ~2900 milliseconds.
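To get a feel for what the partitioner is doing, you can run it over a simple list and print what each call to perform receives. The following toy sketch (our own example, using the same classes as above) shows each thread receiving a contiguous chunk of the data:

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.openimaj.util.function.Operation;
import org.openimaj.util.parallel.Parallel;
import org.openimaj.util.parallel.partition.RangePartitioner;

public class PartitionDemo {
    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);

        Parallel.forEachPartitioned(new RangePartitioner<Integer>(data),
                new Operation<Iterator<Integer>>() {
                    public void perform(Iterator<Integer> chunk) {
                        // Each thread gets a contiguous chunk of the list; build the
                        // string first so the printed output isn't interleaved
                        StringBuilder sb = new StringBuilder();
                        while (chunk.hasNext())
                            sb.append(chunk.next()).append(' ');
                        System.out.println(Thread.currentThread().getName() + ": " + sb);
                    }
                });
    }
}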

14.1. Exercises

14.1.1. Exercise 1: Parallelise the outer loop

As we discussed earlier in the tutorial, there were three primary ways in which we could have approached the parallelisation of the image-averaging program. Instead of parallelising the inner loop, can you modify the code to parallelise the outer loop instead? Does this make the code faster? What are the pros and cons of doing this?
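If you get stuck, the following sketch shows one possible shape for a solution; it is a starting point, not the only answer, and measuring the effect on the running time is left to you. Note that output must now be thread-safe, since several groups can finish at the same time:

// One possible shape for Exercise 1 (a sketch): parallelise over the groups
// and keep the inner, per-image loop sequential.
final List<MBFImage> output = Collections.synchronizedList(new ArrayList<MBFImage>());
final ResizeProcessor resize = new ResizeProcessor(200);

Parallel.forEach(images.values(), new Operation<ListDataset<MBFImage>>() {
    public void perform(ListDataset<MBFImage> clzImages) {
        // current is local to each task, so no synchronisation is needed here
        MBFImage current = new MBFImage(200, 200, ColourSpace.RGB);

        for (MBFImage i : clzImages) {
            MBFImage tmp = new MBFImage(200, 200, ColourSpace.RGB);
            tmp.fill(RGBColour.WHITE);

            MBFImage small = i.process(resize).normalise();
            int x = (200 - small.getWidth()) / 2;
            int y = (200 - small.getHeight()) / 2;
            tmp.drawImage(small, x, y);

            current.addInplace(tmp);
        }
        current.divideInplace((float) clzImages.size());

        // output is a synchronised list, so concurrent add() calls are safe,
        // but the order of the averages within it is no longer deterministic
        output.add(current);
    }
});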