Music Feature Extraction in Python

Packages

We’ll be using librosa for analyzing and extracting features of an audio signal.

For playing audio we will use IPython.display.Audio, which lets us play music directly in a Jupyter notebook.

```python
import librosa

audio_path = 'audio-path'  # placeholder: path to your audio file
x, sr = librosa.load(audio_path)  # x: audio samples, sr: sample rate
print(type(x), type(sr))
```

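x: the audio time series, a one-dimensional NumPy array of samples

sr: the sample rate of x in Hz. By default librosa.load resamples to 22050 Hz; pass sr=44100 to resample to 44.1 kHz instead, or sr=None to keep the file's native sample rate.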
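
The two values are linked: sample n of x occurs at n / sr seconds, and the whole clip lasts len(x) / sr seconds. A minimal sketch of those conversions, assuming a synthetic one-second 440 Hz sine in place of a loaded file so it runs without audio (the stand-in signal is hypothetical):

```python
import numpy as np

# Hypothetical stand-in for librosa.load output: 1 s of a 440 Hz sine at 22050 Hz
sr = 22050
t = np.arange(sr) / sr              # time in seconds of each sample index
x = 0.5 * np.sin(2 * np.pi * 440 * t)

duration = len(x) / sr              # clip length in seconds
n0, n1 = 9000, 9100                 # sample window (same values as the zoom-in below)
window_start = n0 / sr              # sample index -> seconds
window_ms = 1000 * (n1 - n0) / sr   # window length in milliseconds

print(f"duration: {duration:.2f} s")
print(f"samples {n0}-{n1}: start at {window_start:.3f} s, span {window_ms:.2f} ms")
```

The same arithmetic shows how short the zoom window used later is: at 22050 Hz, 100 samples cover well under 5 ms. librosa also offers librosa.get_duration(y=x, sr=sr) and librosa.samples_to_time for these conversions.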

Playing audio

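```python
import IPython.display as ipd

ipd.Audio(audio_path)  # shows an audio player in the notebook
# ipd.Audio(x, rate=sr) plays the loaded array instead of the file
```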

Display methods

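wave plot: the amplitude (loudness) of the audio signal over time
spectrogram: a visual representation of how the strength of each frequency in the signal changes over time
colormap: the color scale that maps amplitude (usually in decibels) to color in a spectrogram
stft (short-time Fourier transform): computes the amplitude of the various frequencies within short, successive windows of the signal, so it shows which frequencies are present at a given time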
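
To make the STFT and spectrogram concrete, here is a minimal NumPy sketch (not librosa's implementation): it cuts the signal into overlapping frames, applies a Hann window, takes the FFT of each frame and converts magnitudes to decibels. The result is the grid a spectrogram displays, with one row per frequency bin and one column per time frame; a colormap then turns each dB value into a color. The frame size of 2048 and hop of 512 mirror librosa's defaults, but this sketch uses np.hanning (a symmetric window) and does not pad the edges, so its output will not match librosa.stft exactly. The 1 kHz test tone is an assumption standing in for real audio.

```python
import numpy as np

def stft_magnitude_db(x, n_fft=2048, hop_length=512):
    """Minimal STFT sketch: Hann-windowed frames -> FFT magnitude in dB.

    Returns an array of shape (1 + n_fft // 2, n_frames): one row per
    frequency bin and one column per time frame, the grid a spectrogram shows.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop_length
    # Each column is one windowed frame of n_fft samples
    frames = np.stack(
        [x[i * hop_length : i * hop_length + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=0))
    # Decibels relative to the loudest bin; the floor avoids log10(0)
    return 20 * np.log10(np.maximum(magnitude, 1e-10) / magnitude.max())

sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)             # synthetic 1 kHz test tone

S_db = stft_magnitude_db(x)
freqs = np.fft.rfftfreq(2048, d=1 / sr)      # frequency (Hz) of each row
peak_hz = freqs[S_db[:, 0].argmax()]         # loudest frequency in the first frame
print(S_db.shape, round(peak_hz, 1))
```

With librosa installed, librosa.amplitude_to_db(np.abs(librosa.stft(x)), ref=np.max) produces a comparable dB matrix, which librosa.display.specshow(..., sr=sr, x_axis='time', y_axis='hz') draws and plt.colorbar() labels.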

Feature Extraction

Zero Crossing Rate: the rate at which the signal changes sign, from positive to negative or back

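```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

x, sr = librosa.load(audio_path)
# Plot the signal (waveplot was removed in librosa 0.10; waveshow replaces it)
plt.figure(figsize=(14, 5))
librosa.display.waveshow(x, sr=sr)
```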

Zooming in:

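```python
import matplotlib.pyplot as plt

# Plot only samples 9000-9099 so individual zero crossings are visible
n0 = 9000
n1 = 9100
plt.figure(figsize=(14, 5))
plt.plot(x[n0:n1])
plt.grid()
```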
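
The crossings visible in the zoomed plot can also be counted directly. A minimal NumPy sketch, assuming a synthetic 440 Hz sine in place of the loaded file (the helper names are hypothetical): a crossing is any pair of neighbouring samples whose signs differ, and dividing by the segment length gives a rate.

```python
import numpy as np

def count_zero_crossings(segment):
    """Hypothetical helper: number of sign changes between neighbouring samples."""
    negative = np.signbit(segment)               # True where a sample is negative
    return int(np.count_nonzero(negative[1:] != negative[:-1]))

def zero_crossing_rate(segment):
    """Hypothetical helper: crossings per sample over the segment."""
    return count_zero_crossings(segment) / len(segment)

# Synthetic stand-in for the loaded audio: 1 s of a 440 Hz sine at 22050 Hz
sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t + 0.1)

n0, n1 = 9000, 9100
segment = x[n0:n1]
print(count_zero_crossings(segment), zero_crossing_rate(segment))
```

A 440 Hz tone crosses zero 880 times per second, so a 100-sample window at 22050 Hz (about 4.5 ms) should contain about four crossings. With librosa, sum(librosa.zero_crossings(x[n0:n1], pad=False)) gives a comparable count, and librosa.feature.zero_crossing_rate(x) reports the rate frame by frame.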

Spectral centroid: indicates where the "centre of mass" of the spectrum is located; it is the magnitude-weighted mean of the frequencies present in the sound
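
For a single frame with FFT magnitudes |X_k| at bin frequencies f_k, the centroid is sum(f_k * |X_k|) / sum(|X_k|). A minimal NumPy sketch of that formula, assuming a Hann window and two synthetic test signals (a 300 Hz tone, and the same tone plus an equally loud 5 kHz tone); the function name is hypothetical:

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency (Hz) of one frame's spectrum."""
    magnitude = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1 / sr)   # frequency of each FFT bin
    return float(np.sum(freqs * magnitude) / np.sum(magnitude))

sr = 22050
t = np.arange(2048) / sr
low = np.sin(2 * np.pi * 300 * t)
bright = low + np.sin(2 * np.pi * 5000 * t)         # add high-frequency energy

print(spectral_centroid(low, sr), spectral_centroid(bright, sr))
```

Adding the high tone pulls the centroid from about 300 Hz to roughly midway between the two tones, which is why the centroid is often read as the "brightness" of a sound. librosa.feature.spectral_centroid(y=x, sr=sr) applies the same idea to every STFT frame.
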
Spectral rolloff: the frequency below which a specified percentage of the total spectral energy lies (85% by default in librosa)
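
The rolloff can be found by accumulating the spectrum from the lowest bin upward and stopping at the first bin where the running total reaches the chosen percentage of the overall total. A minimal NumPy sketch, assuming one Hann-windowed frame and synthetic tones; it accumulates FFT magnitudes (librosa's default input is also a magnitude spectrogram), although some definitions use power instead, and the function name is hypothetical:

```python
import numpy as np

def spectral_rolloff(frame, sr, roll_percent=0.85):
    """Lowest frequency (Hz) below which `roll_percent` of the spectral magnitude lies."""
    magnitude = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1 / sr)
    cumulative = np.cumsum(magnitude)                 # running total from 0 Hz upward
    threshold = roll_percent * cumulative[-1]
    # First bin whose running total reaches the threshold
    return float(freqs[np.searchsorted(cumulative, threshold)])

sr = 22050
t = np.arange(2048) / sr
low = np.sin(2 * np.pi * 500 * t)
mixed = low + np.sin(2 * np.pi * 5000 * t)            # add an equally loud high tone

print(spectral_rolloff(low, sr), spectral_rolloff(mixed, sr))
```

With only the 500 Hz tone the rolloff sits near 500 Hz; adding an equally loud 5 kHz tone moves it close to 5 kHz, because the low tone alone holds only about half of the total. The librosa equivalent is librosa.feature.spectral_rolloff(y=x, sr=sr, roll_percent=0.85).
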
MFCC: mel frequency cepstral coefficients of a signal are a small set of features which concisely describe the overall shape of a spectral envelope
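```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

mfccs = librosa.feature.mfcc(y=x, sr=sr)  # y is keyword-only in librosa >= 0.10
print(mfccs.shape)  # (n_mfcc, n_frames)
# Display the MFCCs:
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
```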
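
The printed shape is (number of coefficients, number of frames). Assuming librosa's defaults (n_mfcc=20, hop_length=512 and centred frames, which give one more frame than len(x) // hop_length), the shape can be predicted before computing anything; the helper below is hypothetical, not part of librosa:

```python
def expected_mfcc_shape(n_samples, n_mfcc=20, hop_length=512):
    """Hypothetical helper: predicted shape of librosa.feature.mfcc output.

    Assumes librosa's default centred framing, which yields
    1 + n_samples // hop_length frames.
    """
    n_frames = 1 + n_samples // hop_length
    return (n_mfcc, n_frames)

# Example: a 30-second clip loaded at librosa's default 22050 Hz
print(expected_mfcc_shape(30 * 22050))  # (20, 1292)
```

If you pass a different n_mfcc or hop_length to librosa.feature.mfcc, pass the same values here.
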
Dummy guide to MFCC

Summary: MFCC features represent phonemes (distinct units of sound) because the shape of the vocal tract, which is responsible for sound generation, is manifest in them.

Source: The Dummy’s Guide to MFCC

Term explanations

Cepstral is ‘spectral’ with ‘spec’ reversed (‘ceps’). The cepstrum captures the rate of change across spectral bands; formally, it is the inverse Fourier transform of the logarithm of the signal's spectrum.
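
A common concrete definition is the real cepstrum: the inverse FFT of the log magnitude spectrum. Any regular ripple across the spectral bands turns into a peak at a position called the quefrency. A minimal NumPy sketch, assuming a synthetic test case (white noise plus a circular echo delayed by 200 samples, both hypothetical): the echo makes the spectrum ripple with a period of n / delay bins, so the cepstrum peaks at the delay itself.

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    log_magnitude = np.log(np.abs(np.fft.fft(x)) + 1e-12)  # small offset avoids log(0)
    return np.fft.ifft(log_magnitude).real

rng = np.random.default_rng(0)
n = 4096
source = rng.standard_normal(n)       # hypothetical broadband source
delay = 200                           # hypothetical echo delay in samples
signal = source + 0.5 * np.roll(source, delay)   # circular echo at half amplitude

c = real_cepstrum(signal)
peak = 1 + int(np.argmax(c[1 : n // 2]))   # skip quefrency 0, which holds the overall level
print(peak)
```

MFCCs use the same log-then-transform idea, but apply a DCT to log mel filterbank energies instead of an inverse FFT to the full log spectrum.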

Pitch is one of the characteristics of a speech signal. It is a perceived quantity, usually estimated from the signal's fundamental frequency (F0).
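
Pitch itself is perceptual, but F0 can be estimated from a frame of samples. One simple approach, not taken from the source, is autocorrelation: a periodic signal lines up best with itself when shifted by one period, so the lag with the highest autocorrelation within a plausible range gives the period, and sr divided by that lag gives F0. A minimal NumPy sketch, assuming a synthetic 220 Hz frame with two harmonics and a 50 to 500 Hz search range (all hypothetical choices):

```python
import numpy as np

def estimate_f0_autocorr(frame, sr, fmin=50.0, fmax=500.0):
    """Rough F0 estimate (Hz): the lag with the strongest autocorrelation."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]  # lags 0..N-1
    min_lag = int(sr / fmax)              # shortest period considered
    max_lag = int(sr / fmin)              # longest period considered
    best_lag = min_lag + int(np.argmax(corr[min_lag : max_lag + 1]))
    return sr / best_lag

sr = 8000
t = np.arange(2048) / sr
# Hypothetical voiced frame: 220 Hz fundamental plus two weaker harmonics
frame = (np.sin(2 * np.pi * 220 * t)
         + 0.5 * np.sin(2 * np.pi * 440 * t)
         + 0.25 * np.sin(2 * np.pi * 660 * t))
print(estimate_f0_autocorr(frame, sr))
```

The estimate is quantised to whole-sample lags (here 8000 / 36, about 222 Hz). For real recordings, librosa.yin and librosa.pyin provide more robust F0 tracking.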

The mel scale relates the perceived frequency (pitch) of a tone to its actual measured frequency. It is needed because human hearing is not linear: we are much better at noticing small changes in pitch at low frequencies than at high frequencies.
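
One widely used formula (the HTK variant) is m = 2595 * log10(1 + f / 700). A minimal NumPy sketch of that conversion and its inverse, showing that the same 100 Hz step covers fewer mels as frequency rises, which matches the reduced sensitivity at high frequencies:

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style mel scale: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# Equal 100 Hz steps at low, middle and high frequencies
pairs = [(100, 200), (1000, 1100), (8000, 8100)]
widths = [float(hz_to_mel(high) - hz_to_mel(low)) for low, high in pairs]
for (low, high), width in zip(pairs, widths):
    print(f"{low}-{high} Hz spans {width:.1f} mel")
```

Note that librosa.hz_to_mel uses the Slaney formula by default (linear below 1 kHz, logarithmic above); pass htk=True to get the formula used here.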

Getting the Cepstral coefficients

MFCC Tutorial

Source: MFCC Tutorial
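
The MFCC Tutorial breaks the computation into six steps: frame the signal into short frames (20-40 ms, with 25 ms standard), estimate the power spectrum of each frame with a periodogram, apply a mel-spaced triangular filterbank (26 filters is typical) and sum the energy in each filter, take the log of the filterbank energies, take the DCT of the log energies, and keep coefficients 2-13. Below is a NumPy sketch of those steps, not the tutorial author's code; the Hamming window, the 10 ms step, the helper names and the 440 Hz test tone are assumptions. librosa.feature.mfcc follows the same overall recipe with different defaults (for example, 128 mel bands and a decibel scale), so its numbers will not match this sketch.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale, shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):              # rising edge
            fbank[i - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):             # falling edge
            fbank[i - 1, k] = (right - k) / (right - centre)
    return fbank

def dct_ii_ortho(x):
    """Orthonormal DCT-II along the last axis."""
    n = x.shape[-1]
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))  # basis[k, m]
    scale = np.full(n, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return (x @ basis.T) * scale

def mfcc(signal, sr, frame_ms=25, step_ms=10, n_filters=26, n_keep=12):
    """Hypothetical MFCC sketch following the six tutorial steps."""
    frame_len = int(round(sr * frame_ms / 1000))
    step = int(round(sr * step_ms / 1000))
    n_fft = 1 << (frame_len - 1).bit_length()      # next power of two >= frame_len
    n_frames = 1 + (len(signal) - frame_len) // step
    # 1. Frame the signal (Hamming window is an assumed, common choice)
    frames = np.stack([signal[i * step : i * step + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 2. Periodogram estimate of each frame's power spectrum
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    # 3. Energy in each mel filter
    energies = power @ mel_filterbank(n_filters, n_fft, sr).T
    # 4. Log of the filterbank energies (floor avoids log(0))
    log_energies = np.log(np.maximum(energies, 1e-10))
    # 5-6. DCT, then keep coefficients 2-13 (indices 1..12)
    return dct_ii_ortho(log_energies)[:, 1 : n_keep + 1]

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)    # hypothetical 1-second test signal
coeffs = mfcc(tone, sr)
print(coeffs.shape)                   # one row per frame, 12 coefficients each
```

At 16 kHz the 25 ms frame is 400 samples, so the FFT size works out to 512, the value the tutorial uses for that sample rate.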

Implementations

An implementation of MFCCs in Python is available here; use the ‘Download ZIP’ button on the right-hand side of the page to get the code.

There is a good MATLAB implementation of MFCCs over here.


MFCC Articles Summary