Chapter 10. Feature Extraction from Audio

Just like images, we can extract features that can be used to get a higher-level understanding of the audio. There are some features that have become de-facto in audio processing, and one of these is the Mel-Frequency Cepstrum Coefficients (MFCCs). They give an overview of the shape (or envelope) of the frequency components of the audio based on some perceptual scaling of the frequency domain.

OpenIMAJ provides an MFCC class based around the jAudio implementation. Unsurprisingly, it's called MFCC! We can use it in exactly the same way as the FFT processor, so if you take the code from FFT example in the previous chapter you can change the FFT processor to be the MFCC processor.

		MFCC mfcc = new MFCC( xa );
		...
		while( (sc = mfcc.nextSampleChunk()) != null )
		{
			double[][] mfccs = mfcc.getLastGeneratedFeature();
			vis.setData( mfccs[0] );
		}
	

MFCCs were specifically developed for speech recognition tasks and so are much more suitable for describing speech signals than a sine wave sweep. So, let's switch to using the JavaSoundAudioGrabber so we can speak into the computer. Secondly, we'll fix the analysis window that we're using. The literature shows that 30ms windows with 10ms overlaps are often used in speech processing. At 44.1KHz, 10ms is 441 samples, so we'll use the FixedSizeSampleAudioProcessor to deal with giving us the appropriate size sample chunks.

		JavaSoundAudioGrabber jsag = new JavaSoundAudioGrabber( new AudioFormat( 16, 44.1, 1 ) );
		FixedSizeSampleAudioProcessor fssap = new FixedSizeSampleAudioProcessor( jsag, 441*3, 441 );
	

You should see that when you speak into the computer, the MFCCs show a noticeable change compared to the sine wave sweep.

[Tip] Tip

You might want to try fixing the axis of the visualisation bar graph using the method setAxisLocation( 100 ).

10.1. Exercises

10.1.1. Exercise 1: Spectral Flux

Update the program to use the SpectralFlux feature. Set the bar visualisation to use a maximum value of 0.0001. What do you think this feature is showing? How would this feature be useful?