Speech Recognition/Dynamic Time Warping

Introduction: dynamic time warping (DTW)-based speech recognition

Dynamic time warping (DTW) is an approach that was historically used for speech recognition but has now largely been displaced by the more successful approach based on hidden Markov models (HMMs).

Dynamic time warping is an algorithm for measuring similarity between two sequences that may vary in time or speed. For instance, similarities in walking patterns would be detected even if the person walked slowly in one video and more quickly in another, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics; indeed, any data that can be turned into a linear representation can be analyzed with DTW.
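The similarity measure is commonly written as a recurrence over accumulated alignment costs. The formulation below is a sketch of the standard textbook form, not taken from this page: the symbols x, y (the two sequences, of lengths n and m), d (a local distance between single elements) and D (the accumulated cost) are names assumed here for illustration.

```latex
% D(i, j): minimal accumulated cost of aligning x_1..x_i with y_1..y_j
D(0, 0) = 0, \qquad D(i, 0) = D(0, j) = \infty \quad (i, j \ge 1)

D(i, j) = d(x_i, y_j) + \min \bigl\{ D(i-1, j),\; D(i, j-1),\; D(i-1, j-1) \bigr\}

\mathrm{DTW}(x, y) = D(n, m)
```

The three terms inside the minimum correspond to repeating an element of y, repeating an element of x, or advancing in both sequences at once, which is what lets a slow and a fast version of the same pattern line up.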

A well-known application has been automatic speech recognition, where DTW is used to cope with different speaking speeds. In general, DTW is a method that allows a computer to find an optimal match between two given sequences (e.g., time series) under certain restrictions. That is, the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
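The idea of finding an optimal match under restrictions can be sketched in a few lines of Python. This is a minimal illustration, not a production recognizer: the function name `dtw_distance`, the use of plain numbers instead of acoustic feature vectors, and the absolute difference as local distance are all assumptions made for the example. The restrictions assumed here are the usual ones: the alignment starts at the first elements, ends at the last elements, and never moves backwards in either sequence.

```python
def dtw_distance(a, b):
    """Return the accumulated DTW cost between two sequences of numbers.

    Hypothetical helper for illustration; real speech systems compare
    sequences of feature vectors rather than single numbers.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0  # both alignments must start at the beginning
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            # Allowed steps: advance in a only, in b only, or in both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]  # both alignments must end at the last elements


# The same pattern produced at half speed and at full speed.
slow = [0, 0, 1, 1, 2, 2, 3, 3]
fast = [0, 1, 2, 3]
print(dtw_distance(slow, fast))  # 0.0
```

The two sequences differ in length, yet the cost is zero because every repeated value in `slow` can be matched to the same element of `fast`. A plain element-by-element comparison could not even be computed here without first resampling one of the sequences.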

Learning Task

Explain why measuring the similarity between two sequences that may vary in time or speed helps to assign recorded speech to the corresponding recognized text. What causes variation in speech beyond medical causes such as flu or a cough? [1]
Robustness to variation in timing, speed, and pronunciation is relevant for all speech recognition methods. Analyze whether HMMs perform better than DTW on the speech recognition task.

References

[1] Yu, K., Mason, J., & Oglesby, J. (1995). Speaker recognition using hidden Markov models, dynamic time warping and vector quantisation. IEE Proceedings-Vision, Image and Signal Processing, 142(5), 313-318.