Psycholinguistics/Models of Speech Perception

Introduction

Speech perception is the process by which speech is interpreted. It involves three processes: hearing, interpreting, and comprehending the sounds produced by a speaker. A main function of speech perception is combining these sounds into an order that resembles the speech of a given language. Speech perception draws not only on the phonology and phonetics of the speech to be perceived, but also on the syntax of the language and the semantics of the spoken message. An adequate account of speech perception therefore requires a model that unites the various components of speech and produces a comprehensive message. Various models have been developed to help explain how the different components of speech are perceived. Some models address only the production or only the perception of speech, while others combine the two. The first models date back to about the mid-1900s, and new models continue to be developed today.

Models of Speech Perception

TRACE Model

The TRACE model was one of the first models developed for speech perception, and is one of the better known. It is a framework whose primary function is to take the various sources of information found in speech and integrate them to identify single words. The TRACE model, proposed by McClelland and Elman (1986), is based on the principles of interactive activation^[1]. All components of speech (features, phonemes, and words) have their own role in creating intelligible speech, and using TRACE to unite them yields a complete stream of speech instead of individual components. The TRACE model is divided into two distinct components: TRACE I deals mainly with short segments of real speech, whereas TRACE II deals with the identification of phonemes and words in speech. The model as a whole consists of a very large number of units organized into three separate levels. Each level comprises a bank of detectors for distinguishing the components of that level.

Feature level - At this level, there are several banks of feature detectors. Each feature has its own place in time, and the features are organized in successive order.
Phoneme level - At this level, there is a bank of detectors for each phoneme present in the speech sounds.
Word level - At this level, there is a bank of detectors for each individual word spoken by the speaker.

The TRACE model works in two directions: it allows either words or phonemes to be derived from a spoken message. By segmenting the individual sounds, phonemes can be determined from spoken words. By combining the phonemes, words can be built up and perceived by the listener.
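The two directions of processing across the three levels can be illustrated with a minimal sketch. This is a toy illustration, not McClelland and Elman's implementation: the feature sets, the two-word lexicon, the `feedback` weight, and the function names are all assumptions made for the example, and the inhibition between competing units and the repeated activation cycles of the full model are left out.

```python
# Toy sketch of activation flowing between TRACE's three levels.
# Feature sets, lexicon, and the feedback weight are illustrative assumptions.

PHONEME_FEATURES = {
    "b": {"voiced", "bilabial", "stop"},
    "p": {"voiceless", "bilabial", "stop"},
    "i": {"voiced", "high", "front", "vowel"},
    "t": {"voiceless", "alveolar", "stop"},
    "g": {"voiced", "velar", "stop"},
}
LEXICON = ["bit", "big"]


def phoneme_support(heard_features):
    """Score each phoneme by the share of its features that were heard."""
    return {
        ph: len(feats & heard_features) / len(feats)
        for ph, feats in PHONEME_FEATURES.items()
    }


def perceive(feature_slots, feedback=0.3):
    """One bottom-up pass followed by one top-down pass."""
    # Feature level -> phoneme level (bottom-up), one time slot per sound.
    phonemes = [phoneme_support(slot) for slot in feature_slots]
    # Phoneme level -> word level (bottom-up).
    words = {
        word: sum(phonemes[i][ph] for i, ph in enumerate(word)) / len(word)
        for word in LEXICON
        if len(word) == len(feature_slots)
    }
    # Word level -> phoneme level (top-down): active words boost their phonemes.
    for word, activation in words.items():
        for i, ph in enumerate(word):
            phonemes[i][ph] = min(1.0, phonemes[i][ph] + feedback * activation)
    return words, phonemes


# The first sound is ambiguous between /b/ and /p/ (its voicing was not heard).
heard = [
    {"bilabial", "stop"},
    PHONEME_FEATURES["i"],
    PHONEME_FEATURES["t"],
]
words, phonemes = perceive(heard)
print(max(words, key=words.get))            # the best-supported word
print(phonemes[0]["b"] > phonemes[0]["p"])  # whether feedback favours /b/
```

Because no word in this small lexicon begins with /p/, only /b/ receives top-down support, which is how word-level knowledge can resolve an ambiguous sound in this sketch.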

Motor Theory Model

This model was developed in 1967 by Liberman and colleagues. The basic principle of this model lies with the production of speech sounds in the speaker's vocal tract. The Motor Theory proposes that a listener specifically perceives a speaker's phonetic gestures while they are speaking. A phonetic gesture, for this model, is a representation of the speaker's vocal tract constriction while producing a speech sound^[2]. Each phonetic gesture is produced uniquely in the vocal tract. The different places of producing gestures permit the speaker to produce salient phonemes for listeners to perceive. The Motor Theory model functions by using separate embedded models within the main model. It is the interaction of these models that makes Motor Theory possible.

Human Vocal Tract: Areas of constriction and relaxation within this tract create various vocal gestures

Trading Relations - This is the concept that not every phonetic gesture can be directly translated and defined into acoustic terms. This means that there must be another step for interpreting the vocal gestures. Some gestures can be cognitively switched with others to make interpretation simpler. If the produced gesture is similar enough to another gesture that already has a known articulatory cause, they can be switched. The perceived gesture can be traded with the known gesture and interpretation can be achieved.

Coarticulation - This is the idea that there is variability in how gestures are produced. There are variations in the place of articulation of the vocal gestures produced by speakers, so the same gesture may be produced in more than one place. Phonemes are recovered and perceived through the listener's ability to compensate for all the variations in speech that coarticulation makes possible.

Categorical Perception

Categorical Perception is the concept that phonemes in speech can be divided into categories once they are produced. The main categories into which speech can be divided are place of articulation and voice onset time. Some vocal gestures can only arise from a single type of articulation. Other gestures have a variety of coarticulations. This means that the same sound can either be produced at a single place in the vocal tract, or it can be produced at a few different places in the vocal tract^[3]. Being able to determine where a sound is being produced assists in determining which sound has been produced. Vocal gestures also differ in the timing of their voice onset. Different vocal gestures begin voicing at different times, depending on the sound being produced. For example, /b/ has a different voice onset time than /p/, yet they are produced in the same place in the vocal tract^[4]. Knowing the voice onset time of a sound helps when trying to assess which sound the speaker has produced. Distinguishing between articulation and voice onset enables gestures to be grouped and defined based on the ways they are produced.
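The /b/ versus /p/ example can be sketched as a simple category boundary on voice onset time. This is only an illustration: the 25 ms boundary, the constant name, and the function names are assumptions chosen for the example, and real boundaries vary with language, speaker, and context.

```python
# Sketch of categorical labelling by voice onset time (VOT).
# The 25 ms boundary is an illustrative assumption, not a value from the text.
VOT_BOUNDARY_MS = 25.0


def categorise_bilabial_stop(vot_ms, boundary_ms=VOT_BOUNDARY_MS):
    """Label a bilabial stop as /b/ (short VOT) or /p/ (long VOT)."""
    return "b" if vot_ms < boundary_ms else "p"


def heard_as_same(vot_a_ms, vot_b_ms):
    """Two sounds are heard as the same phoneme if they share a category."""
    return categorise_bilabial_stop(vot_a_ms) == categorise_bilabial_stop(vot_b_ms)


# Equal 10 ms steps along the continuum are not perceived equally:
print(heard_as_same(10, 20))  # both fall on the /b/ side of the boundary
print(heard_as_same(20, 30))  # the pair straddles the boundary
```

The design point is that a continuous acoustic value is mapped onto a small set of discrete labels, so differences within a category are ignored while the same-sized difference across the boundary changes the percept.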

Cohort Model

Proposed in the 1980s by Marslen-Wilson, the Cohort Model is a model of lexical retrieval. An individual's lexicon is his or her mental dictionary: the vocabulary of all the words he or she is familiar with. According to one estimate, the average individual has a lexicon of about 45,000 to 60,000 words^[5]. The premise of the Cohort Model is that a listener maps novel auditory information onto words that already exist in his or her lexicon in order to interpret the new word. Each auditory utterance can be broken down into segments. The listener attends to the individual segments and maps them onto pre-existing words in the lexicon. As more and more segments of the utterance are perceived, the listener can rule out words in the lexicon that do not follow the same pattern.

Example: Grape

1. The listener hears the /gr/ sound, begins thinking about which words in his or her lexicon begin with /gr/, and cancels out all the others.

2. The listener hears /gra/; all words following this pattern are thought of, and all the rest are omitted.

3. This pattern continues until the listener has run out of speech segments and is left with a single option: Grape
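The narrowing steps in the example above can be sketched as repeated prefix filtering. This is a minimal sketch under stated assumptions: written letters stand in for speech segments, and the small lexicon and the function name `cohort_steps` are made up for the illustration.

```python
# Sketch of cohort narrowing: each new segment prunes the candidate set.
# Letters stand in for speech segments; the lexicon is a made-up example.

def cohort_steps(segments, lexicon):
    """Return the surviving candidates after each segment is heard."""
    cohort = list(lexicon)
    heard = ""
    history = []
    for segment in segments:
        heard += segment
        # Keep only the words that still match everything heard so far.
        cohort = [word for word in cohort if word.startswith(heard)]
        history.append((heard, list(cohort)))
        if len(cohort) <= 1:
            break  # recognition point: at most one candidate remains
    return history


lexicon = ["grape", "grass", "great", "green", "goat", "table"]
for heard, candidates in cohort_steps("grape", lexicon):
    print(heard, candidates)
```

With this lexicon the cohort shrinks to a single candidate before the final segment arrives, which reflects the model's claim that a word can be identified as soon as it becomes unique among the listener's known words.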

The ideas behind the Cohort Model resemble a technique now used to make internet searches faster and more convenient. Google's search suggestions, for example, narrow down candidate queries in a similar way.

As the first letter is typed into the search bar, Google begins "guessing" what the word is going to be. The guesses are generally based on what the most common searches tend to be, and what makes sense syntactically. As more letters are typed, different options corresponding to those letters appear in the menu.
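The search-bar behaviour described above can be sketched as prefix matching ranked by how common each query is. This is a hypothetical illustration only: it does not reflect how Google actually generates suggestions, and the query counts and the function name `suggest` are invented for the example.

```python
# Sketch of frequency-ranked prefix suggestions.
# The query counts below are invented for illustration.

def suggest(prefix, query_counts, limit=3):
    """Return up to `limit` queries starting with `prefix`, most common first."""
    matches = [query for query in query_counts if query.startswith(prefix)]
    # Sort by descending count, breaking ties alphabetically.
    return sorted(matches, key=lambda q: (-query_counts[q], q))[:limit]


counts = {"grape": 50, "grammar": 120, "graph": 80, "great": 200, "green": 150}
print(suggest("gr", counts))
print(suggest("gra", counts))
print(suggest("grap", counts))
```

Each extra letter shrinks the candidate set, as in the Cohort Model, while the frequency ranking decides which of the surviving candidates are shown first.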

Exemplar Theory

The main premise of the Exemplar Theory is very similar to that of the Cohort Model. Exemplar Theory is based on the connection between memory and previous experience with words. It aims to account for the way in which a listener can remember acoustic episodes. An acoustic episode is an experience with spoken words. There is evidence that listeners remember details of specific acoustic episodes when those episodes are familiar to them^[6]. It is believed that listeners may be better at recognizing previously heard words if they are repeated by the same speaker at the same speaking rate, meaning that the episode is familiar. According to the Exemplar Theory, every word leaves a unique imprint on the listener's memory, and this imprint is what helps the listener remember words. When new words enter memory, their imprints are matched against previous ones to determine any similarities^[7]. The Exemplar Theory states that as the lexicon gains experience, through new words being learned or heard, the stability of the memory increases. With this lexical plasticity, the Ganong Effect comes into play. The Ganong Effect states that real-word memory traces are perceived much more readily than nonsense-word memory traces^[8].

Ganong Effect Example:

Soot, Boot, and Root will be easier to remember because they are similar to words already in the listener's memory.

Snoyb, Bnoyb, and Rnoyb, which are not similar to anything in the listener's memory, will be difficult to remember.
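The matching of a new imprint against stored episodes can be sketched as a similarity score summed over memory. This is a toy illustration, not Goldinger's model: the `Episode` fields, the similarity weights, and the function names are assumptions made for the example.

```python
# Toy sketch of exemplar matching: a probe is compared with stored episodes.
# The fields and weights are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    word: str
    speaker: str
    rate: str  # speaking rate, e.g. "slow" or "fast"


def similarity(probe, stored):
    """Score how closely a stored episode matches the probe (0.0 to 1.0)."""
    if probe.word != stored.word:
        return 0.0
    score = 0.5  # the same word was heard before
    if probe.speaker == stored.speaker:
        score += 0.25  # same voice
    if probe.rate == stored.rate:
        score += 0.25  # same speaking rate
    return score


def familiarity(probe, memory):
    """Sum the similarity of the probe to every stored episode."""
    return sum(similarity(probe, episode) for episode in memory)


memory = [
    Episode("boot", "anna", "slow"),
    Episode("boot", "ben", "fast"),
    Episode("root", "anna", "slow"),
]
print(familiarity(Episode("boot", "anna", "slow"), memory))     # familiar voice
print(familiarity(Episode("boot", "cara", "medium"), memory))   # new voice
print(familiarity(Episode("bnoyb", "anna", "slow"), memory))    # nonsense word
```

In this sketch a word repeated by the same speaker at the same rate scores highest, a known word in a new voice scores lower, and a nonsense word finds nothing in memory to match.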

Neurocomputational Model

Kroger and colleagues (2009) worked on a speech perception model based on neurophysiological and neuropsychological facts about speech^[9]. The model simulates the neural pathways in the various brain areas that are involved when speech is being produced and perceived. Using this model, knowledge of the brain areas involved in speech is obtained by training neural networks to detect speech in the cortical and sub-cortical regions of the brain. Through their research, Kroger and colleagues determined that the neurocomputational model is capable of embedding important features of speech production and perception in these brain areas to achieve comprehension of speech^[10].

This model differs from the previously discussed models in the role it assigns to production. The authors developed their model to demonstrate that speech perception involves not only the perception of spoken language but also relies heavily on the production of language^[11]. This model strongly reflects the findings of Liberman and associates in their work on the Motor Theory of speech perception. Both models hold that speech perception is a product of both producing and receiving speech. The work conducted by Huang and associates shows that very similar areas of the brain are activated for the production and the perception of language^[12]. This neurocomputational model is one of the few that adequately map the pathways of both speech functions in the brain.

Dual Stream Model

The Dual Stream Model, proposed by Hickok and Poeppel (2007), posits two functionally distinct neural networks that process speech and language information^[13]. One network deals primarily with sensory and phonological information as it relates to conceptual and semantic systems. The other network deals with sensory and phonological information as it relates to motor and articulatory systems. In this sense, the Dual Stream Model encompasses both key aspects of speech: production and perception. The Dual Stream Model also challenges previous assumptions about the lateralization of the human brain. It was previously thought that the left hemisphere dealt only with fast temporal information, but as Hickok & Poeppel (2007) demonstrate, this might not necessarily be the case. With the development of the Dual Stream Model, it has been shown that the left hemisphere of the brain is also capable of representing acoustic information as readily as the right hemisphere^[14]. Along with changing how the brain was thought to deal with incoming information, the basic concept of the Dual Stream Model is that acoustic information must interface with conceptual and motor information for the entire message to be perceived^[15]. This combining of roles is what makes the Dual Stream Model unique and plausible as a model for speech perception.

Problems with Speech Perception Models

One of the main issues in producing a speech perception model is deciding which method of perception the model is going to adopt. Speech perception can occur in one of two ways: top-down processing or bottom-up processing. With top-down processing, listeners perceive the entire word and break it down into its components to determine its meaning, whereas in bottom-up processing, listeners perceive the individual segments of a word first and build them up to form and determine meaning. When designing a speech perception model, both of these processes need to be taken into account. The processing direction the model takes will depend on how the researchers believe speech perception occurs.

The TRACE model and the Dual Stream model each employ both top-down and bottom-up processing. This means that the models not only explain how words can be built from the phonemes up, but can also explain how phonemes are derived from complete words. The TRACE and Dual Stream models are exceptions, as most speech perception models involve processing in only one direction. For instance, the Cohort Model uses strictly bottom-up processing: building upon the segments of a word until the entire word is identified is an example of bottom-up processing. Processing information in only one direction is a weakness for a speech perception model.

The Exemplar Theory and Motor Theory each pose a different type of problem for speech perception. Both theories involve operationally defining certain aspects of speech that make the model work. It is in these definitions that errors may arise. In the Exemplar Theory, how can similarity between words be defined adequately when the level of similarity will differ for each individual^[16]? The same goes for defining an episode. It is difficult to ensure that listening to someone speak will be exactly the same experience for more than one person, or that it will be the same experience the second time around. With the Motor Theory, how can the gestures made by speakers be defined properly if each speaker has a unique vocal tract and way of producing sound?

Each model is unique and capable of functioning the way it was designed to. Each has some limitations, and there is no perfect model of speech perception. Despite these limitations, the models still work both independently of and in combination with one another. If a perception problem cannot be solved using one of the models, there is a strong chance that another model will work.

Conclusion

In conclusion, as shown above, there are many different models that can be used to account for speech perception. Each model has its own method of working and its own uses. Which model you select depends on the aspect of speech, or the purpose, you have in mind. The TRACE model and the Dual Stream model can both be used if speech is going to be processed from the phonemes upwards to the words, or broken down from the words into the phonemes. This is possible because, of the models discussed here, these two are the only ones capable of perceiving speech in both directions. The other models mentioned in this chapter have their own significant purposes, and are best used in those particular circumstances. Speech perception models were designed to help detect and interpret speech for a great number of reasons, notably to help understand what utterances are being produced when they are difficult to distinguish. These models have also been produced to help computers, microphones, and other electronic devices receive human voices and translate them into intelligible messages. Speech perception models should be used with the understanding that each one differs from the others, and selecting the appropriate model will help solve a speech perception problem much more easily than if the wrong model is used.

Learning Activity

Upon reviewing the above material on Models of Speech Perception, answer the following questions to test your knowledge about the models of speech perception. The answers to each of the questions are posted below this section for you to review once you've answered.

Name the Model

For this section, provide the speech perception model that may be described by the following words. There may be one or more answers for some of the questions.

lexicon, mapping, segmentation = __________.
gestures, coarticulation, trading relations = _________.
conceptual/semantics, motor/articulatory, networks = __________.
acoustic episode, imprint, Ganong Effect = __________.
word, feature, phoneme = __________.
places of articulation, voice onset, gestures = _________.
neurophysiological, neuropsychological, pathways = __________.

Short Answer

For this section please provide a written answer in the form of a paragraph. Make note of which speech perception model is being utilized, and unless otherwise specified, state what the model(s) is/are.

Sally, an undergraduate student, is preparing an argument about speech perception models. Her argument must include models that work best at processing words or phonemes in both a top-down and a bottom-up fashion. Which model(s) should Sally use in her argument, and why?
You are a tutor for a grade 4 student. The student has been learning spellings, working especially on vowel sounds. You have noticed that the student can easily remember how to pronounce and spell the words read, heal, and veal, however he struggles greatly with the words kead, peaf, and feam. What could be causing a problem for this learner, and which speech perception model does this problem fall under?
Joe is a neuroscience student who is very familiar with neural pathways. His professor has asked him to prepare a presentation on speech perception models. The professor is also interested in the pathways involved in speech and would like to see this incorporated into the presentation. Which model could Joe use for this assignment to make his professor happy, and interest himself?

Mini Quiz

Fill in the blanks using the information found above.

Coarticulation and __________ are used as two models within the main model for __________.
Different vocal gestures have a different __________ for the different sounds they are making.
The premise of __________ is that a listener maps novel words onto pre-existing words in his __________ to interpret the new word.
The TRACE model is based on the principles of __________ activation.
One stream of the Dual stream model deals with _________ and the other stream deals with __________.
The Neurocomputational Model is based on _________ and __________ facts about speech.
Each phonetic gesture is produced __________ in the vocal tract.
The main categories speech can be divided into are __________ and __________.

References

[1] McClelland, J., & Elman, J. (1986). The TRACE Model of Speech Perception. Cognitive Psychology, 18, 1-86
[2] Liberman et al. (1967). Perception of the Speech Code. Psychological Review, 74, 431-461
[3] Goldstone, L. (1994). Influences of categorization on perceptual discrimination. Journal of Experimental Psychology, 123, 178-200
[4] Truckenbrodt, H. (2007). Spectrogram readings and other acoustics. Introduction to Phonetics and Phonology. May 27, 2007
[5] Aitchison, J. (1987). Words in the Mind. Oxford: Basil Blackwell
[6] Goldinger, S. (1996). Words and voices: episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology: Learning, Memory and Cognition, 22(5), 1166-1183
[7] Goldinger, S. (1998). Echo of echoes? An episodic theory of lexical access. Psychological Review, 105(2), 251-279
[8] Goldinger, S. (1998)
[9] Kroger et al. (2009). Towards a neurocomputational model of speech production and perception. Speech Communication 15. 793-809
[10] Kroger et al. (2009)
[11] Hickok & Poeppel (2000). Towards a functional neuroanatomy of speech perception. Trends in Cognitive Science, 4, 131-138
[12] Huang et al. (2001). Comparing cortical activations for silent and overt speech using event-related fMRI. Human Brain Mapping, 15, 39-53
[13] Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8(5), 393-402
[14] Luce, P., & Pisoni, D. (1998). Recognizing spoken words: the neighborhood activation model. Ear Hear, 19, 1-36
[15] Milner, A., & Goodale, M. (1995). The visual brain in action. Oxford: Oxford University Press
[16] Johnson et al. (1999). Auditory-visual integration of talker gender in vowel perception. Journal of Phonetics, 27, 359-384

Learning Activity Answers

Part A

1. Cohort Model

2. Motor Theory

3. Dual Stream

4. Exemplar Theory

5. TRACE

6. Categorical Perception

7. Neurocomputational model

Part B

1. TRACE and Dual Stream

The TRACE model and the Dual Stream model both use top-down and bottom-up processing. Both models function in such a way that they are able either to segment entire words into their phonemes (top-down) or to build words up from their individual phonemes (bottom-up). Of the models discussed here, these are the only two with this capability, which makes them capable of handling many speech perception needs. For her argument, Sally should discuss both of these models.

2. The 4th grader is struggling with what is known as the Ganong Effect. The first group of words is easy for him to remember because they are familiar words which he has probably heard used in conversation before. The second group of words is hard to remember because they are unfamiliar to him, and he has nothing stored in his mental lexicon to map them onto. The Ganong Effect comes into play with the Exemplar Theory, because the Exemplar Theory works with memory and experience of words. Words that are used in everyday life, or that are familiar, will be easier to recall and learn than words that are not.

3. Neurocomputational Model

Joe should use the neurocomputational model for his presentation. This model of speech perception draws on information about where in the brain speech is produced and perceived. It works by training specific neural pathways not only to detect speech, but also to produce it. If Joe is interested in and knowledgeable about the brain and its neural pathways, this model would be ideal for him to present.

Part C

1. trading relations, Motor Theory

2. voice onset time

3. Cohort Model, lexicon

4. interactive

5. conceptual and semantics, motor and articulatory

6. neurophysiological, neuropsychological

7. uniquely

8. place of articulation, voice onset time
