Speech Warping and Audio Restoration

Matt Montag - EEN 540 Speech Signal Processing - Project 3

MATLAB Files

proj3.m - project script
pad.m - utility function to make two vectors match in size
fftplot.m - utility function
averageLogSpectrum.m - utility function to compute average log spectrum

Discussion

Linear predictive coding (LPC) is a signal processing technique that separates a signal into components using a source-filter model. Applied to speech, LPC estimates the resonances of the vocal tract (the formants), removes their effect with an inverse filter, and codes the resulting residual signal. For voiced speech, the residual is ideally an impulse train that represents the glottal pulses. The vocal tract is modeled as an all-pole filter whose frequency response follows the spectral envelope of the formants.
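The analysis and resynthesis steps described above can be sketched as follows. This is a minimal illustration in Python (standard library only), not the code in proj3.m. It assumes the autocorrelation method with the Levinson-Durbin recursion, a single unwindowed frame, and a made-up second-order test filter. All function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of one frame."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations. Returns (a, err) with a[0] == 1,
    where A(z) = sum(a[k] * z**-k) is the inverse (whitening) filter."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # remaining prediction error energy
    return a, err


def inverse_filter(a, x):
    """FIR filter A(z): turns the signal into the prediction residual."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]


def synthesis_filter(a, e):
    """All-pole filter 1/A(z): rebuilds the signal from the residual."""
    y = []
    for n, sample in enumerate(e):
        acc = sample
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


# Assumed test signal: an impulse train (the "glottal" source) driving a
# made-up all-pole "vocal tract" with a single resonance.
true_a = [1.0, -1.2, 0.8]
excitation = [1.0 if n % 50 == 0 else 0.0 for n in range(400)]
signal = synthesis_filter(true_a, excitation)

a, err = levinson_durbin(autocorrelation(signal, 2), 2)
residual = inverse_filter(a, signal)        # close to the impulse train
rebuilt = synthesis_filter(a, residual)     # inverse of the step above
max_error = max(abs(u - v) for u, v in zip(signal, rebuilt))
```

Because the synthesis filter is the exact inverse of the analysis filter, filtering the unmodified residual reproduces the input up to rounding error. Real speech would normally be processed in short windowed frames with a higher prediction order than the 2 used here.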

In this project, linear prediction is performed and various manipulations are applied to the residual signal and to the spectral envelope. This allows pitch and formants to be scaled independently, which gives a much better gender conversion result than simple pitch shifting. A simple 200% pitch shift, for example, yields a squirrel-like timbre. It is not a physically plausible conversion, because it scales the formants, and therefore the implied vocal tract, by the same large factor as the pitch.

I experimented a great deal with the pitch shifting and pole-scaling process and arrived at a tweaked method for gender conversion.
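To illustrate what independent scaling of pitch and formants means, the sketch below uses a single two-pole resonance as a stand-in for the vocal tract filter. It is an assumption-laden toy in Python (standard library only), not the tweaked method mentioned above: the formant frequency, bandwidth, scaling factor, and pitch periods are made-up values, and every function name is hypothetical. The formant is moved by scaling the pole angle, while the pitch is set separately by the spacing of the excitation impulses.

```python
import cmath
import math


def resonator(formant_hz, bandwidth_hz, fs):
    """Denominator [1, a1, a2] of a two-pole resonator (one formant)."""
    radius = math.exp(-math.pi * bandwidth_hz / fs)
    angle = 2.0 * math.pi * formant_hz / fs
    return [1.0, -2.0 * radius * math.cos(angle), radius * radius]


def scale_pole_angle(a, factor):
    """Move the conjugate pole pair of [1, a1, a2] to factor times its
    angle while keeping the pole radius. Assumes factor * angle < pi."""
    radius = math.sqrt(a[2])
    angle = math.acos(-a[1] / (2.0 * radius))
    return [1.0, -2.0 * radius * math.cos(factor * angle), radius * radius]


def peak_frequency(a, fs, step_hz=10):
    """Frequency in Hz where |1/A| is largest, searched on a coarse grid."""
    best_f, best_mag = 0, 0.0
    for f in range(0, fs // 2, step_hz):
        z = cmath.exp(-2j * math.pi * f / fs)
        mag = 1.0 / abs(a[0] + a[1] * z + a[2] * z * z)
        if mag > best_mag:
            best_f, best_mag = f, mag
    return best_f


def impulse_train(period, length):
    """Idealized residual: one unit impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def all_pole(a, e):
    """Filter the excitation e through 1/A(z)."""
    y = []
    for n, sample in enumerate(e):
        acc = sample
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


fs = 8000
original = resonator(500.0, 80.0, fs)       # illustrative formant at 500 Hz
shifted = scale_pole_angle(original, 1.2)   # formant raised by 20%

# Pitch is changed separately, through the excitation period.
source = all_pole(original, impulse_train(80, 800))    # fs / 80 = 100 Hz pitch
converted = all_pole(shifted, impulse_train(50, 800))  # fs / 50 = 160 Hz pitch
```

Here the pitch rises by 60% while the formant rises by only 20%, which a plain pitch shift cannot do. For a full LPC filter of higher order, the same idea would require factoring the predictor polynomial into its roots, scaling the pole angles, and rebuilding the polynomial.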

Speech Gender Conversion

All Audio Samples

Barack Obama - obama.wav, obama_to_female.wav, obama_to_child.wav, obama_to_child_audition.wav
Sarah Palin - palin.wav, palin_to_male.wav, palin_to_female.wav
Katie Couric - couric.wav, couric_to_male.wav, couric_to_male_audition.wav
Al Gore - algore.wav, algore_to_female.wav, algore_to_child.wav
Hip-Hop - turnstiles.wav, turnstiles_to_child.wav, turnstiles_to_child_audition.wav

Files labeled "audition" were processed in Adobe Audition using the time stretch/pitch shift effect and are included for comparison.

Female to Male

Original
female_speech.wav

Converted
female_speech_to_male.wav

Male to Female

Original
male_speech.wav

Converted
male_speech_to_female.wav

Male to Child

Original
male_speech.wav

Converted
male_speech_to_child.wav

Residual Signal

Perfect Reconstruction

Original Signal
female_speech.wav

Error Signal
female_speech_error.wav

Reproduced Signal
female_speech_reproduced.wav

Stylized Signal

One Sample Per Period

female_speech_stylized1.wav

Two Samples Per Period

female_speech_stylized2.wav

Four Samples Per Period

female_speech_stylized4.wav

Audio Restoration

Original Signal
caruso.wav

Restored Signal
caruso_restored.wav

Discussion

Homomorphic signal processing can be used to separate two convolved signals. In this case, the recording of Caruso singing is modeled as the original performance convolved with the poor response of the transducer used in recording. In homomorphic processing, components that have been convolved are converted into components that are added by taking the Fourier transform followed by the logarithm. After linear filtering to separate the added components, the original steps are undone. Here, we take the log of the FFT of the Caruso recording and subtract it from the log of the FFT of a modern Pavarotti recording, which serves as a reference, and then exponentiate and IFFT the result, giving us a correction filter. Processing the Caruso recording with this filter yields spectral content that more closely matches the reference Pavarotti recording.
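The log-spectrum arithmetic behind such a correction filter can be sketched as follows. This is a small Python illustration (standard library only) under stated assumptions, not the project's MATLAB code: it works on log magnitudes averaged over frames, uses a naive DFT on tiny made-up frames, and simulates the degradation with a made-up two-tap "transducer". The sign convention is reference minus degraded, so that multiplying the degraded spectrum by the resulting gain moves it toward the reference. All names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (fine for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Real part of the inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def average_log_spectrum(frames, floor=1e-12):
    """Mean of log|DFT| over equal-length frames (floor avoids log(0))."""
    total = [0.0] * len(frames[0])
    for frame in frames:
        for k, value in enumerate(dft(frame)):
            total[k] += math.log(abs(value) + floor)
    return [t / len(frames) for t in total]


def correction_gain(reference_frames, degraded_frames):
    """exp(reference log spectrum - degraded log spectrum), per bin."""
    ref = average_log_spectrum(reference_frames)
    deg = average_log_spectrum(degraded_frames)
    return [math.exp(r - d) for r, d in zip(ref, deg)]


def circular_convolve(x, h):
    """Circular convolution, so the DFTs multiply exactly in this demo."""
    n = len(x)
    return [sum(h[j] * x[(t - j) % n] for j in range(len(h)))
            for t in range(n)]


n = 16
reference = [[0.9 ** t for t in range(n)],       # stand-in "clean" frames
             [(-0.7) ** t for t in range(n)]]
transducer = [1.0, 0.5]                          # made-up poor response
degraded = [circular_convolve(frame, transducer) for frame in reference]

gain = correction_gain(reference, degraded)      # per-bin magnitude correction
correction_filter = idft_real(gain)              # zero-phase impulse response
response = [abs(v) for v in dft(transducer + [0.0] * (n - len(transducer)))]
```

In this controlled demo the degraded frames are the reference frames passed through the transducer, so the gain comes out as the reciprocal of the transducer's magnitude response. With two unrelated recordings, as in the Caruso and Pavarotti case, the averaging over many frames is what lets the program material cancel approximately and leaves the difference in overall spectral shape.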

The processing technique is not perfect, since the correction filter also amplifies the noise in the recording. A more advanced processing technique could be employed to isolate the voice from the noise.


© 2011 Matt Montag