Speech Recognition/Performance

From Wikiversity

The performance of speech recognition systems is usually evaluated in terms of accuracy and speed.[1][2] Accuracy is usually rated with word error rate (WER), whereas speed is measured with the real-time factor. Other measures of accuracy include Single Word Error Rate (SWER) and Command Success Rate (CSR).
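The two main measures can be sketched in a few lines of Python. This is a minimal illustration, not taken from the sources cited here: it assumes the common definitions, in which WER is the number of word substitutions, deletions and insertions divided by the number of words in the reference transcript, and the real-time factor is the processing time divided by the duration of the audio. The function names `word_error_rate` and `real_time_factor` are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the word-level edit distance between the reference words
    # processed so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time needed to process the audio / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds
```

With these definitions, recognizing "the apple is red" as "the apple was red" is one substitution in a four-word reference, so the WER is 0.25. A real-time factor below 1 means the system processes audio faster than it is spoken.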

Speech recognition by machine is a very complex problem, however. Vocalizations vary in terms of accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed. Speech is also distorted by background noise, echoes, and electrical characteristics. Accuracy of speech recognition may vary with the following:[3][citation needed]

  • Vocabulary size and confusability
  • Speaker dependence versus independence
  • Isolated, discontinuous or continuous speech
  • Task and language constraints
  • Read versus spontaneous speech
  • Adverse conditions

Learning Tasks

  • Explain how the size of the vocabulary can be significantly reduced if the speaker selects an application domain for the speech recognition (e.g. a medical doctor dictates the results of a computed tomography image, then drives home and dictates an e-mail to a friend about meeting for sports). Both applications need a specific vocabulary: a subset of the words used in the medical environment might not be used in the private environment and vice versa, while other subsets of the vocabulary are used in both domains. Select a few words as examples for each of these sets, and select words in the intersection of both domains.
  • Reducing the size of the vocabulary helps to improve accuracy and performance, especially on mobile devices. How can the speech recognition system itself detect the application domain?
  • What are the options for applying machine learning to perform the domain detection?


Accuracy

As mentioned earlier in this article, accuracy of speech recognition may vary depending on the following factors:

Vocabulary Size

  • Error rates increase as the vocabulary size grows:
e.g. the 10 digits "zero" to "nine" can be recognized essentially perfectly, but vocabulary sizes of 200, 5000 or 100000 may have error rates of 3%, 7% or 45% respectively.
  • Vocabulary is hard to recognize if it contains confusable words:
e.g. the 26 letters of the English alphabet are difficult to discriminate because many of them sound alike (most notoriously, the E-set: "B, C, D, E, G, P, T, V, Z"); an 8% error rate is considered good for this vocabulary.[citation needed]
  • Speaker dependence vs. independence:
A speaker-dependent system is intended for use by a single speaker.
A speaker-independent system is intended for use by any speaker (more difficult).
  • Isolated, discontinuous or continuous speech:
With isolated speech, single words are used, which makes the speech easier to recognize.
With discontinuous speech, full sentences separated by silence are used, which makes the speech about as easy to recognize as isolated speech.
With continuous speech, naturally spoken sentences are used, which makes the speech harder to recognize than either isolated or discontinuous speech.
  • Task and language constraints:
    • e.g. a querying application may dismiss the hypothesis "The apple is red."
    • e.g. constraints may be semantic, rejecting "The apple is angry."
    • e.g. constraints may be syntactic, rejecting "Red is apple the."

Grammar and Constraints

Constraints are often represented by a grammar.
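As a rough illustration of how a grammar can reject recognition hypotheses, the following Python sketch checks the three example sentences from this article against a toy grammar. Everything in it is an assumption made for illustration: the lexicon, the single sentence pattern, the table of allowed adjectives, and the function names are hypothetical, and real recognizers use far richer grammars or statistical language models.

```python
# Toy lexicon (hypothetical): each known word and its part of speech.
LEXICON = {
    "the": "DET",
    "apple": "NOUN",
    "is": "VERB",
    "red": "ADJ",
    "angry": "ADJ",
}

# The only sentence pattern this toy grammar accepts.
SENTENCE_PATTERN = ["DET", "NOUN", "VERB", "ADJ"]

# Toy semantic constraint: which adjectives may describe which nouns.
ALLOWED_ADJECTIVES = {"apple": {"red"}}


def tokenize(sentence):
    """Lower-case the sentence, drop a trailing full stop, split into words."""
    return sentence.lower().rstrip(".").split()


def is_syntactic(sentence):
    """True if the part-of-speech sequence matches the sentence pattern."""
    tags = [LEXICON.get(word) for word in tokenize(sentence)]
    return tags == SENTENCE_PATTERN


def is_semantic(sentence):
    """True if the sentence is syntactic and the adjective fits the noun."""
    if not is_syntactic(sentence):
        return False
    _, noun, _, adjective = tokenize(sentence)
    return adjective in ALLOWED_ADJECTIVES.get(noun, set())
```

Under these assumptions, "Red is apple the." fails the syntactic check, "The apple is angry." passes the syntactic check but fails the semantic one, and "The apple is red." passes both.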

  • Read vs. spontaneous speech – When a person reads, it is usually in a context that has been previously prepared. Spontaneous speech is more difficult to recognize because of disfluencies (like "uh" and "um", false starts, incomplete sentences, stuttering, coughing, and laughter) and limited vocabulary.
  • Adverse conditions – Environmental noise (e.g. noise in a car or a factory) and acoustical distortions (e.g. echoes, room acoustics).

Speech recognition is a multi-levelled pattern recognition task.

  • Acoustical signals are structured into a hierarchy of units, e.g. phonemes, words, phrases, and sentences;
  • Each level provides additional constraints, e.g. known word pronunciations or legal word sequences, which can compensate for errors or uncertainties at a lower level;
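A small Python sketch can show how a higher-level constraint compensates for an error at a lower level. The numbers and words are invented for illustration: `SEGMENT_SCORES` stands in for the phoneme probabilities a lower level might report for three sound segments, and `PRONUNCIATIONS` is a hypothetical list of known word pronunciations. Multiplying the probabilities is one simple way to combine the lower-level decisions, not necessarily what a particular recognizer does.

```python
from math import prod

# Hypothetical phoneme probabilities for three consecutive sound segments.
SEGMENT_SCORES = [
    {"k": 0.6, "g": 0.4},
    {"ae": 0.9, "eh": 0.1},
    {"t": 0.45, "d": 0.55},
]

# Hypothetical known word pronunciations (the word-level constraint).
PRONUNCIATIONS = {
    "cat": ["k", "ae", "t"],
    "gap": ["g", "ae", "p"],
    "get": ["g", "eh", "t"],
}


def best_phonemes(segment_scores):
    """Decide each segment on its own, ignoring the word level."""
    return [max(scores, key=scores.get) for scores in segment_scores]


def best_word(segment_scores, pronunciations):
    """Pick the known word whose phonemes have the highest combined probability."""
    def score(phonemes):
        if len(phonemes) != len(segment_scores):
            return 0.0
        return prod(s.get(p, 0.0) for s, p in zip(segment_scores, phonemes))

    return max(pronunciations, key=lambda word: score(pronunciations[word]))
```

Deciding each segment separately yields the phonemes k, ae, d, which match no known word because the last segment slightly favours "d". Scoring whole pronunciations instead selects "cat", so the word level corrects the uncertain decision made at the phoneme level.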

This hierarchy of constraints is exploited by combining decisions probabilistically at all lower levels, and making more deterministic decisions only at the highest level.

Speech recognition by a machine is a process broken into several phases. Computationally, it is a problem in which a sound pattern has to be recognized or classified into a category that represents a meaning to a human. Every acoustic signal can be broken into smaller, more basic sub-signals. As the more complex sound signal is broken into smaller sub-sounds, different levels are created: at the top level are complex sounds, which are made of simpler sounds on the level below, and each lower level contains more basic, shorter and simpler sounds. At the lowest level, where the sounds are the most fundamental, a machine would check simple, more probabilistic rules for what a sound should represent. Once these sounds are put together into more complex sounds on an upper level, a new set of more deterministic rules should predict what the new complex sound should represent. The uppermost level of deterministic rules should figure out the meaning of complex expressions.

To expand our knowledge about speech recognition, we need to take neural networks into consideration. There are four steps of neural network approaches:

  • Digitize the speech that we want to recognize; for telephone speech the sampling rate is 8000 samples per second.
  • Compute spectral-domain features of the speech (with a Fourier transform); these are computed every 10 ms, with one 10 ms section called a frame.

The four-step neural network approach can be better understood with some background on sound. Sound is produced by the vibration of air (or some other medium), which humans register with their ears and machines register with receivers. A basic sound creates a wave that has two descriptions: amplitude (how strong it is) and frequency (how often it vibrates per second).
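The digitizing and feature steps, together with the notions of amplitude and frequency, can be sketched in plain Python. This is a simplified illustration under stated assumptions: a synthetic sine wave stands in for recorded speech, frames do not overlap, and a naive discrete Fourier transform is used for readability where real systems would use a fast Fourier transform and further feature processing. All function names are hypothetical.

```python
import cmath
import math

SAMPLE_RATE = 8000                            # samples per second (telephone speech)
FRAME_MS = 10                                 # one frame covers 10 ms
FRAME_SIZE = SAMPLE_RATE * FRAME_MS // 1000   # 80 samples per frame


def sine_wave(amplitude, frequency_hz, num_samples):
    """Digitized basic sound: a sine wave sampled at SAMPLE_RATE."""
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * n / SAMPLE_RATE)
        for n in range(num_samples)
    ]


def split_into_frames(samples):
    """Cut the signal into consecutive, non-overlapping 10 ms frames."""
    last_start = len(samples) - FRAME_SIZE
    return [samples[i:i + FRAME_SIZE] for i in range(0, last_start + 1, FRAME_SIZE)]


def magnitude_spectrum(frame):
    """Naive discrete Fourier transform; returns one magnitude per frequency bin."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
        )
        spectrum.append(abs(total))
    return spectrum


def dominant_frequency(frame):
    """Frequency in Hz of the strongest bin in the frame's spectrum."""
    spectrum = magnitude_spectrum(frame)
    peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * SAMPLE_RATE / len(frame)
```

For example, `split_into_frames(sine_wave(0.5, 1000, 800))` cuts 0.1 seconds of a 1000 Hz tone into ten frames of 80 samples each, and `dominant_frequency` applied to any of those frames recovers 1000 Hz. The height of the spectral peak grows with the amplitude of the wave, so the spectrum reflects both descriptions of the sound.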

Security concerns

Speech recognition can become a means of attack, theft, or accidental operation. For example, activation words like "Alexa" spoken in an audio or video broadcast can cause devices in homes and offices to start listening for input inappropriately, or possibly take an unwanted action.[4] Voice-controlled devices are also accessible to visitors to the building, or even those outside the building if they can be heard inside. Attackers may be able to gain access to personal information, like calendar, address book contents, private messages, and documents. They may also be able to impersonate the user to send messages or make online purchases.

Two attacks have been demonstrated that use artificial sounds. One transmits ultrasound and attempts to send commands without nearby people noticing.[5] The other adds small, inaudible distortions to other speech or music that are specially crafted to confuse the specific speech recognition system into recognizing music as speech, or to make what sounds like one command to a human sound like a different command to the system.[6]

References

  1. Ciaramella, Alberto. "A prototype performance evaluation report." Sundial workpackage 8000 (1993).
  2. Gerbino, E., Baggia, P., Ciaramella, A., & Rullent, C. (1993, April). Test and evaluation of a spoken dialogue system. In Acoustics, Speech, and Signal Processing, 1993. ICASSP-93., 1993 IEEE International Conference on (Vol. 2, pp. 135–138). IEEE.
  3. National Institute of Standards and Technology. "The History of Automatic Speech Recognition Evaluation at NIST". Archived 8 October 2013 at the Wayback Machine.
  4. "Listen Up: Your AI Assistant Goes Crazy For NPR Too". NPR. 6 March 2016. Archived from the original on 23 July 2017.
  5. Claburn, Thomas (25 August 2017). "Is it possible to control Amazon Alexa, Google Now using inaudible commands? Absolutely". The Register. Archived from the original on 2 September 2017.
  6. "Attack Targets Automatic Speech Recognition Systems". vice.com. 31 January 2018. Archived from the original on 3 March 2018. Retrieved 1 May 2018.