Speech Recognition/Neural Networks

Neural networks

Neural networks emerged as an attractive acoustic modeling approach in ASR in the late 1980s. Since then, neural networks have been used in many aspects of speech recognition such as phoneme classification,[1] isolated word recognition,[2] audiovisual speech recognition, audiovisual speaker recognition and speaker adaptation.

Neural networks make fewer explicit assumptions about the statistical properties of features than HMMs and have several qualities that make them attractive recognition models for speech recognition. When used to estimate the probabilities of a speech feature segment, neural networks allow discriminative training in a natural and efficient manner. However, despite their effectiveness in classifying short-time units such as individual phonemes and isolated words,[3] early neural networks were rarely successful for continuous recognition tasks because of their limited ability to model temporal dependencies.
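
For illustration, here is a minimal sketch of a frame-level phoneme classifier of the kind described above. It is a hedged sketch, not a method from the cited papers: PyTorch is assumed as the toolkit, and the feature dimension (39 MFCC coefficients) and phoneme inventory size (40 classes) are illustrative choices. Training with a cross-entropy loss is what makes the model discriminative: it directly raises the score of the correct class relative to all competing classes.

  # Minimal sketch of a discriminatively trained frame classifier.
  # PyTorch, the 39-dimensional features and the 40 phoneme classes
  # are all illustrative assumptions, not details from the text.
  import torch
  import torch.nn as nn

  n_features, n_phonemes = 39, 40
  model = nn.Sequential(
      nn.Linear(n_features, 256),
      nn.Sigmoid(),
      nn.Linear(256, n_phonemes),   # one score per phoneme class
  )
  loss_fn = nn.CrossEntropyLoss()   # discriminative training criterion
  optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

  # One training step on a batch of labelled feature frames.
  x = torch.randn(32, n_features)           # placeholder feature frames
  y = torch.randint(0, n_phonemes, (32,))   # placeholder phoneme labels
  loss = loss_fn(model(x), y)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()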

One approach to this limitation was to use neural networks as a pre-processing, feature-transformation, or dimensionality-reduction step[4] prior to HMM-based recognition. More recently, however, long short-term memory (LSTM) networks and related recurrent neural networks (RNNs)[5][6][7][8] and time delay neural networks (TDNNs)[9] have demonstrated improved performance in this area.
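
The advantage of recurrent models is that the hidden state carries information across frames, which is exactly the temporal dependency that early feedforward networks could not capture. The sketch below, under the same PyTorch assumption and with illustrative dimensions, shows an LSTM that maps an utterance of feature frames to per-frame class scores.

  # Sketch of an LSTM acoustic model; all dimensions are illustrative.
  import torch
  import torch.nn as nn

  class LSTMAcousticModel(nn.Module):
      def __init__(self, n_features=39, n_hidden=320, n_classes=40):
          super().__init__()
          self.lstm = nn.LSTM(n_features, n_hidden, batch_first=True)
          self.out = nn.Linear(n_hidden, n_classes)

      def forward(self, frames):            # frames: (batch, time, features)
          states, _ = self.lstm(frames)     # hidden state flows across time
          return self.out(states)           # per-frame class scores

  utterance = torch.randn(1, 200, 39)       # one 200-frame placeholder utterance
  scores = LSTMAcousticModel()(utterance)   # shape (1, 200, 40)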

Deep feedforward and recurrent neural networks

Deep neural networks and denoising autoencoders[10] are also under investigation. A deep feedforward neural network (DNN) is an artificial neural network with multiple hidden layers of units between the input and output layers.[11][12][13][14] Like shallow neural networks, DNNs can model complex non-linear relationships. DNN architectures generate compositional models, where extra layers enable the composition of features from lower layers, giving them a huge learning capacity and thus the potential to model complex patterns of speech data.[15]
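
As a concrete picture of this layer-wise composition, the sketch below (same PyTorch assumption; all sizes chosen only for illustration) builds a DNN whose input is a context window of several stacked frames, a common hybrid-era design, with each hidden layer re-combining the features computed by the layer beneath it.

  # Hedged sketch of a deep feedforward network; sizes are assumptions.
  import torch.nn as nn

  def make_dnn(n_in, n_hidden, n_layers, n_out):
      layers, width = [], n_in
      for _ in range(n_layers):                 # stack the hidden layers
          layers += [nn.Linear(width, n_hidden), nn.ReLU()]
          width = n_hidden                      # each layer feeds the next
      layers.append(nn.Linear(width, n_out))    # output layer
      return nn.Sequential(*layers)

  # e.g. 11 stacked frames of 40 filter-bank channels as the input window
  dnn = make_dnn(n_in=11 * 40, n_hidden=2048, n_layers=6, n_out=40)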

An early success of DNNs in large-vocabulary speech recognition occurred in 2010, when industrial researchers, in collaboration with academic researchers, adopted large DNN output layers based on context-dependent HMM states constructed by decision trees.[16][17][18] See comprehensive reviews of this development and of the state of the art as of October 2014 in the Springer book from Microsoft Research.[19] See also the related background of automatic speech recognition and the impact of various machine learning paradigms, notably including deep learning, in recent overview articles.[20][21]
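
In hybrid systems of this kind, the DNN's output layer has one unit per tied context-dependent HMM state (a "senone"), and its posteriors are divided by the state priors to obtain the scaled likelihoods an HMM decoder expects. The sketch below illustrates only that conversion step; it is a standard hybrid recipe rather than a detail spelled out above, and the senone count and uniform priors are placeholder assumptions (in practice priors are estimated from training alignments).

  # Converting DNN state posteriors into scaled likelihoods for HMM decoding.
  import torch

  n_senones = 9000                                    # illustrative count
  posteriors = torch.softmax(torch.randn(1, n_senones), dim=-1)  # DNN output
  priors = torch.full((n_senones,), 1.0 / n_senones)  # placeholder priors

  # Bayes' rule: p(x|s) is proportional to p(s|x) / p(s)
  scaled_log_likelihoods = posteriors.log() - priors.log()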

One fundamental principle of deep learning is to do away with hand-crafted feature engineering and to use raw features. This principle was first explored successfully in the architecture of a deep autoencoder on "raw" spectrogram or linear filter-bank features,[22] showing its superiority over Mel-cepstral features, which involve a few stages of fixed transformation from spectrograms. The truly "raw" features of speech, waveforms, have more recently been shown to produce excellent larger-scale speech recognition results.[23]
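
To make the autoencoder idea concrete, the sketch below (PyTorch and all sizes again assumed, not taken from [22]) compresses a filter-bank frame through a narrow code and is trained to reconstruct its own input, so the code becomes a learned representation of the raw spectral frame rather than a hand-crafted one.

  # Hedged sketch of a deep autoencoder over filter-bank frames.
  import torch
  import torch.nn as nn

  n_bins = 40                                  # filter-bank channels (assumed)
  encoder = nn.Sequential(nn.Linear(n_bins, 512), nn.ReLU(),
                          nn.Linear(512, 64))  # compact learned code
  decoder = nn.Sequential(nn.Linear(64, 512), nn.ReLU(),
                          nn.Linear(512, n_bins))

  frames = torch.randn(8, n_bins)              # placeholder spectral frames
  reconstruction = decoder(encoder(frames))
  loss = nn.functional.mse_loss(reconstruction, frames)  # reconstruction error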

Learning Tasks

  • Artificial neural networks (ANNs) are designed to learn, adapt, or discover patterns in training data. Explain the concept of artificial neural networks and why these techniques are used in the context of speech recognition.
  • Users speak differently: their voices change with mood and emotion, they may have an accent, and so on. Explain the role of machine learning for pattern recognition and identify the benefits and constraints of this approach in comparison to other techniques.

References

  1. Waibel, A.; Hanazawa, T.; Hinton, G.; Shikano, K.; Lang, K. J. (1989). "Phoneme recognition using time-delay neural networks". IEEE Transactions on Acoustics, Speech, and Signal Processing 37 (3): 328–339. doi:10.1109/29.21701. 
  2. Wu, J.; Chan, C. (1993). "Isolated Word Recognition by Neural Network Models with Cross-Correlation Coefficients for Speech Dynamics". IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (11): 1174–1185. doi:10.1109/34.244678. 
  3. Zahorian, S. A.; Zimmer, A. M.; Meng, F. (2002). "Vowel Classification for Computer based Visual Feedback for Speech Training for the Hearing Impaired". ICSLP 2002.
  4. Hu, Hongbing; Zahorian, Stephen A. (2010). "Dimensionality Reduction Methods for HMM Phonetic Recognition". ICASSP 2010. Archived from the original on 6 July 2012. http://bingweb.binghamton.edu/~hhu1/paper/Hu2010Dimensionality.pdf. 
  5. Sepp Hochreiter; J. Schmidhuber (1997). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276.
  6. Sak, Haşim; Senior, Andrew; Rao, Kanishka; Beaufays, Françoise; Schalkwyk, Johan (September 2015). "Google voice search: faster and more accurate". Archived 9 March 2016 at the Wayback Machine.
  7. Fernandez, Santiago; Graves, Alex; Schmidhuber, Jürgen (2007). "Sequence labelling in structured domains with hierarchical recurrent neural networks". Proceedings of IJCAI. Archived from the original on 15 August 2017. http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-124.pdf. 
  8. Graves, Alex; Mohamed, Abdel-rahman; Hinton, Geoffrey (2013). "Speech recognition with deep recurrent neural networks". arXiv:1303.5778 [cs.NE]. ICASSP 2013.
  9. Waibel, Alex (1989). "Modular Construction of Time-Delay Neural Networks for Speech Recognition". Neural Computation 1 (1): 39–46. doi:10.1162/neco.1989.1.1.39. Archived from the original on 29 June 2016. https://web.archive.org/web/20160629180846/http://isl.anthropomatik.kit.edu/cmu-kit/Modular_Construction_of_Time-Delay_Neural_Networks_for_Speech_Recognition.pdf. 
  10. Maas, Andrew L.; Le, Quoc V.; O'Neil, Tyler M.; Vinyals, Oriol; Nguyen, Patrick; Ng, Andrew Y. (2012). "Recurrent Neural Networks for Noise Reduction in Robust ASR". Proceedings of Interspeech 2012. 
  11. Hinton, Geoffrey; Deng, Li; Yu, Dong; Dahl, George; Mohamed, Abdel-Rahman; Jaitly, Navdeep; Senior, Andrew; Vanhoucke, Vincent et al. (2012). "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups". IEEE Signal Processing Magazine 29 (6): 82–97. doi:10.1109/MSP.2012.2205597. 
  12. Deng, L.; Hinton, G.; Kingsbury, B. (2013). "New types of deep neural network learning for speech recognition and related applications: An overview". 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. p. 8599. doi:10.1109/ICASSP.2013.6639344. ISBN 978-1-4799-0356-6. 
  13. Hinton, Geoffrey (2013). Keynote talk: "Recent Developments in Deep Neural Networks". ICASSP 2013.
  14. Deng, Li (September 2014). Keynote talk: "Achievements and Challenges of Deep Learning: From Speech Analysis and Recognition To Language and Multimodal Processing". Interspeech 2014.
  15. Deng, Li; Yu, Dong (2014). "Deep Learning: Methods and Applications". Foundations and Trends in Signal Processing 7 (3–4): 197–387. doi:10.1561/2000000039. Archived from the original on 22 October 2014. https://web.archive.org/web/20141022161017/http://research.microsoft.com/pubs/209355/DeepLearning-NowPublishing-Vol7-SIG-039.pdf. 
  16. Yu, D.; Deng, L.; Dahl, G. (2010). "Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition". NIPS Workshop on Deep Learning and Unsupervised Feature Learning. 
  17. Dahl, George E.; Yu, Dong; Deng, Li; Acero, Alex (2012). "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition". IEEE Transactions on Audio, Speech, and Signal Processing 20 (1): 30–42. doi:10.1109/TASL.2011.2134090. 
  18. Deng, L.; Li, J.; Huang, J.; Yao, K.; Yu, D.; Seide, F. et al. (2013). "Recent Advances in Deep Learning for Speech Research at Microsoft". ICASSP 2013.
  19. Yu, D.; Deng, L. (2014). Automatic Speech Recognition: A Deep Learning Approach. Springer. 
  20. Deng, L.; Li, Xiao (2013). "Machine Learning Paradigms for Speech Recognition: An Overview". IEEE Transactions on Audio, Speech, and Language Processing. 
  21. Schmidhuber, Jürgen (2015). "Deep Learning". Scholarpedia 10 (11): 32832. doi:10.4249/scholarpedia.32832. 
  22. Deng, L.; Seltzer, M.; Yu, D.; Acero, A.; Mohamed, A.; Hinton, G. (2010). "Binary Coding of Speech Spectrograms Using a Deep Auto-encoder". Interspeech 2010.
  23. Tüske, Zoltán; Golik, Pavel; Schlüter, Ralf; Ney, Hermann (2014). "Acoustic Modeling with Deep Neural Networks Using Raw Time Signal for LVCSR". Interspeech 2014. Archived from the original on 21 December 2016. https://www-i6.informatik.rwth-aachen.de/publications/download/937/T%7Bu%7DskeZolt%7Ba%7DnGolikPavelSchl%7Bu%7DterRalfNeyHermann--AcousticModelingwithDeepNeuralNetworksUsingRawTimeSignalfor%7BLVCSR%7D--2014.pdf.