Algorithms for determining the key of an audio sample

It’s worth being aware that this is a very tricky problem: if you don’t have a background in signal processing (or an interest in learning about it) then you have a very frustrating time ahead of you. If you’re expecting to throw a couple of FFTs at the problem, you won’t get very far. I hope you do have the interest, as it is a really fascinating area.

Initially there is the problem of pitch recognition, which is reasonably easy to do for simple monophonic sources (eg voice) using a method such as autocorrelation or the harmonic sum spectrum (eg see Paul R’s link). However, you’ll often find that this gives the wrong results: you’ll often get half or double the pitch you were expecting. These are called pitch period doubling or octave errors, and they occur essentially because the FFT or autocorrelation assumes that the data has constant characteristics over time. If an instrument is played by a human there will always be some variation.
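As a minimal sketch of the autocorrelation approach (the function name and parameters here are my own illustration, not a standard API): find the lag at which the signal best correlates with itself, restricted to lags corresponding to plausible pitches. Note that on real recordings this is exactly where the octave errors described above creep in.

```python
import numpy as np

def autocorrelation_pitch(signal, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency of a monophonic frame by
    picking the strongest autocorrelation peak in a plausible lag range."""
    signal = signal - np.mean(signal)
    # Full autocorrelation; keep only the non-negative lags.
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    # Restrict the search to lags corresponding to fmin..fmax.
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sample_rate / lag

# Synthetic 220Hz tone with one harmonic; a real instrument is far messier.
sr = 44100
t = np.arange(0, 0.05, 1.0 / sr)
tone = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
print(autocorrelation_pitch(tone, sr))  # close to 220
```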

Some people approach key recognition by doing the pitch recognition first and then finding the key from the sequence of pitches. This is incredibly difficult if you have anything other than a monophonic sequence of pitches. Even if you do have a monophonic sequence, there is still no clear-cut way to determine the key: how do you deal with chromatic notes, for instance, or decide whether the key is major or minor? So you’d need to use a method similar to Krumhansl’s key-finding algorithm.
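The Krumhansl approach boils down to correlating a pitch-class duration profile of the piece against 24 rotated key profiles (the profile values below are the published Krumhansl–Kessler probe-tone ratings, to the best of my recollection; the function itself is just my sketch):

```python
import numpy as np

# Krumhansl–Kessler key profiles (probe-tone ratings, C at index 0).
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NAMES = ["C", "C#", "D", "D#", "E", "F",
         "F#", "G", "G#", "A", "A#", "B"]

def krumhansl_key(pc_durations):
    """Given total note duration per pitch class (12 values, C..B),
    return the key whose rotated profile correlates best with them."""
    best = None
    for tonic in range(12):
        for profile, mode in ((MAJOR, "major"), (MINOR, "minor")):
            r = np.corrcoef(np.roll(profile, tonic), pc_durations)[0, 1]
            if best is None or r > best[0]:
                best = (r, NAMES[tonic] + " " + mode)
    return best[1]

# Toy input: only notes of C major, with the tonic weighted most heavily.
durations = [4, 0, 2, 0, 2, 1, 0, 3, 0, 2, 0, 1]
print(krumhansl_key(durations))
```

Note how this sidesteps the chromatic-note and major/minor questions: every key is scored, and stray notes merely lower a correlation rather than breaking the method.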

So, given the complexity of this approach, an alternative is to look at all the notes being played at the same time. If you have chords, or more than one instrument, then you’re going to have a rich spectral soup of many sinusoids playing at once. Each individual note comprises multiple harmonics of a fundamental frequency, so A (at 440Hz) will be made up of sinusoids at 440, 880, 1320… Furthermore, if you play an E (see this diagram for pitches) then that is 659.25Hz, which is almost one and a half times the frequency of A (actually 1.498). This means that every 3rd harmonic of A coincides with every 2nd harmonic of E. This is why chords sound pleasant: they share harmonics. (As an aside, the whole reason that western harmony works is due to the quirk of fate that the twelfth root of 2 raised to the power 7 is nearly 1.5.)
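The arithmetic behind that aside is worth checking for yourself:

```python
# Equal-temperament arithmetic behind the "nearly 1.5" claim.
fifth = 2 ** (7 / 12)             # seven semitones = a perfect fifth
a4 = 440.0
e5 = a4 * fifth
print(round(fifth, 4))            # 1.4983, close to the just ratio 3/2
print(round(e5, 2))               # 659.26
print(3 * a4, round(2 * e5, 1))   # 3rd harmonic of A ~ 2nd harmonic of E
```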

If you look beyond this interval of a 5th to major, minor and other chords, you’ll find other ratios. I think that many key-finding techniques enumerate these ratios and then fill a histogram for each spectral peak in the signal. So in the case of detecting the chord A5 you would expect to find peaks at 440, 880, 659, 1320, 1760 and 1977. For B5 they’ll be at 494, 988, 740, etc. So create a frequency histogram and, for every sinusoidal peak in the signal (eg from the FFT power spectrum), increment the corresponding histogram entry. Then for each key A–G tally up the bins in your histogram; the one with the most entries is most likely to be your key.
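A sketch of that tallying idea, restricted to power chords (root plus fifth) for simplicity; the function names, the tolerance and the candidate list are all my own illustrative choices:

```python
import numpy as np

def expected_peaks(root_hz, n_harmonics=4):
    """Expected sinusoid frequencies for a power chord (root + fifth):
    each note contributes its first few harmonics."""
    fifth_hz = root_hz * 2 ** (7 / 12)
    peaks = []
    for note in (root_hz, fifth_hz):
        peaks.extend(note * h for h in range(1, n_harmonics + 1))
    return peaks

def score_chords(spectral_peaks, roots, tolerance=0.02):
    """Tally, for each candidate root, how many observed spectral peaks
    fall within a relative `tolerance` of an expected harmonic."""
    scores = {}
    for name, root in roots.items():
        expected = expected_peaks(root)
        scores[name] = sum(
            any(abs(p - e) / e < tolerance for e in expected)
            for p in spectral_peaks
        )
    return scores

# The peaks one might measure for an A5 chord (the figures given above).
observed = [440, 659, 880, 1320, 1760, 1977]
roots = {"A5": 440.0, "B5": 493.88, "C5": 523.25}
print(score_chords(observed, roots))
```

In a real system the observed peaks would come from picking maxima in the FFT power spectrum rather than being typed in by hand.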

That’s just a very simple approach, but it may be enough to find the key of a strummed or sustained chord. You’d also want to chop the signal into small intervals (eg 20ms) and analyse each one, building up a more robust estimate.
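The chopping itself is straightforward; a minimal sketch (frame size and the averaging strategy are my assumptions) that splits a signal into 20ms frames and averages their magnitude spectra, so transient noise averages out while sustained pitches reinforce each other:

```python
import numpy as np

def frames(signal, sample_rate, frame_ms=20):
    """Split a signal into non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        yield signal[start:start + frame_len]

# 1 second of a pure A4 at an 8kHz sample rate, for illustration.
sr = 8000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)

# Window each frame, take its magnitude spectrum, then average.
spectra = [np.abs(np.fft.rfft(f * np.hanning(len(f))))
           for f in frames(signal, sr)]
avg_spectrum = np.mean(spectra, axis=0)

frame_len = int(sr * 20 / 1000)                # 160 samples per frame
peak_hz = np.argmax(avg_spectrum) * sr / frame_len
print(peak_hz)  # within one 50Hz bin of 440
```

Note the trade-off: 20ms frames give only 50Hz frequency resolution at this sample rate, which is why per-frame estimates are combined rather than trusted individually.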

EDIT:
If you want to experiment then I’d suggest downloading a package like Octave or CLAM which makes it easier to visualise audio data and run FFTs and other operations.

Other useful links:

  • My PhD thesis on some aspects of pitch recognition — the maths is a bit heavy going but chapter 2 is (I hope) quite an accessible introduction to the different approaches of modelling musical audio
  • http://en.wikipedia.org/wiki/Auditory_scene_analysis — Bregman’s Auditory Scene analysis which though not talking about music has some fascinating findings about how we perceive complex scenes
  • Dan Ellis has done some great papers in this and similar areas
  • Keith Martin has some interesting approaches
