One of the challenges in studying spoken language is that it happens very quickly: within a single second, a listener may hear several words, each containing multiple speech sounds, each determined by a number of acoustic features in the signal. Despite the amount of information packed into this short timeframe, listeners have little difficulty correctly recognizing words under everyday listening conditions. How do listeners accomplish this remarkable task?
Our lab studies this question using several techniques that allow us to observe spoken language processing as it happens, over millisecond time-scales. For example, we use the event-related brain potential (ERP) technique to measure brain activity non-invasively with electrodes attached to the head. Electrical activity produced by the brain can be detected at the scalp and recorded by these electrodes in real time. Because of its temporal precision, the ERP technique allows us to study spoken word recognition as it unfolds and to identify components associated with different processes.
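As a rough illustration of how an ERP is derived, the sketch below (in Python, with simulated stand-in data rather than a real recording; the sampling rate, epoch window, and event timing are placeholders) cuts the continuous EEG into epochs time-locked to each stimulus and averages across trials, which is what isolates the event-related response from ongoing background activity.

```python
import numpy as np

# Minimal sketch of ERP averaging with simulated stand-in data.
# All parameter values are illustrative placeholders.
fs = 500                                   # sampling rate (Hz)
rng = np.random.default_rng(0)
eeg = rng.standard_normal(60 * fs)         # 60 s of single-channel "EEG"
events = np.arange(fs, 59 * fs, 2 * fs)    # stimulus onsets (sample indices)

pre, post = int(0.1 * fs), int(0.6 * fs)   # -100 ms to +600 ms window

# One epoch per stimulus, stacked into a trials x time matrix
epochs = np.stack([eeg[s - pre:s + post] for s in events])

# Baseline-correct each trial using the 100 ms before stimulus onset
epochs = epochs - epochs[:, :pre].mean(axis=1, keepdims=True)

# Averaging across trials attenuates activity that is not time-locked
# to the stimulus, leaving the event-related potential (ERP)
erp = epochs.mean(axis=0)
time_ms = np.arange(-pre, post) / fs * 1000
```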
Across several studies, we have used the ERP technique to show that listeners are sensitive to fine-grained acoustic detail in the speech signal. In these experiments, listeners are presented with sounds that vary from one phonetic category to another. For example, the sound clip below varies in voice onset time (VOT) along a continuum from the word dart to the word tart. When you play it, you will hear the words varying in nine VOT steps from 0 ms (a good dart) to 40 ms (a good tart):
The results of these experiments show that the amplitude of the auditory N1 varies linearly with changes in VOT and is not influenced by the phonological category the subject is listening for, nor by how they categorize the stimuli. A later response, the P3, also varies with VOT, but it does depend on which category the subject is listening for. These results tell us that perception is continuous with respect to changes in the speech signal and that the category effects observed in behavioral responses are the result of later-occurring processes.
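The sketch below illustrates these two analyses in miniature, using made-up amplitude values rather than our actual ERP data: a linear fit of N1 amplitude against VOT, and a fit of P3 amplitude against distance from a category endpoint (here, arbitrarily, the /t/ end of the continuum).

```python
import numpy as np

rng = np.random.default_rng(1)
vot = np.linspace(0, 40, 9)   # the nine VOT steps (ms) from the continuum above

# Hypothetical mean N1 amplitudes (µV); placeholders, not measured values
n1 = -4.0 + 0.05 * vot + rng.normal(0, 0.2, vot.size)

# N1 analysis: a single linear term in VOT captures the response,
# consistent with encoding of the continuous acoustic cue
n1_slope, n1_intercept = np.polyfit(vot, n1, 1)

# P3 analysis: the predictor is distance from a category endpoint
# (arbitrarily, the /t/ end of the continuum at 40 ms)
dist = 40 - vot
p3 = 6.0 - 0.08 * dist + rng.normal(0, 0.3, vot.size)
p3_slope, p3_intercept = np.polyfit(dist, p3, 1)
```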
ERP data for speech sounds varying in voice onset time (VOT). (A) ERP waveforms as a function of VOT during the time range of the N1. (B) Mean N1 amplitude as a function of VOT, showing a linear effect across the VOT continuum consistent with encoding of continuous acoustic cues. (C) ERP waveforms as a function of distance from category endpoints. (D) Mean P3 amplitude as a function of distance from category endpoints. Unlike the N1, the P3 is affected by both listeners' phonological categories and graded acoustic cues (Toscano et al., 2010).
In other experiments, we have used the ERP technique to study how listeners encode different phonological contrasts in natural speech and how speech processing differs cross-linguistically. Importantly, these early brain responses are also affected by semantic context, demonstrating the role of interactivity in spoken language processing.
Fast optical imaging data showing (A) a linear trend reflecting cue-level responses and (B,C) quadratic trends reflecting category-level responses (Toscano et al., 2018).
Sound frequency encoding in the auditory brainstem (Tabachnick & Toscano, 2018).
We are also studying how the auditory brainstem response (ABR) varies in response to speech sounds. This project will help us better understand the role of the auditory brainstem in speech perception, and it may lead to improved diagnosis and treatment of hearing loss. Currently, we are examining whether the ABR can detect changes in silence duration and rise time, two acoustic cues that distinguish /ʃ/ and /tʃ/ (i.e., "sh" and "ch"). If the two cues are processed independently, that would mean the brain combines acoustic cue information later in processing, which would be consistent with models of cue integration and would help explain why human listeners are so good at understanding language despite the high degree of variability in speech. Early results show a clear difference in ABRs for both prototypical and ambiguous sounds, suggesting that listeners encode each cue separately early in processing.
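To make the logic of the independence question concrete, here is a sketch of one way to test it, using simulated placeholder data rather than our recordings: fit ABR amplitude with an additive model of the two cues and ask whether adding a silence-by-rise-time interaction term improves the fit. A negligible interaction would be consistent with the cues being encoded separately.

```python
import numpy as np
from scipy import stats

# Simulated placeholder data: binary codes for short/long silent gap and
# fast/slow rise time, with a purely additive effect on ABR amplitude
rng = np.random.default_rng(2)
n = 200
silence = rng.choice([0.0, 1.0], size=n)
rise = rng.choice([0.0, 1.0], size=n)
abr = 0.6 * silence + 0.4 * rise + rng.normal(0, 0.5, n)


def rss(X):
    """Residual sum of squares for an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, abr, rcond=None)
    return np.sum((abr - X @ beta) ** 2)


# Additive (main-effects) model vs. model with an interaction term
X_add = np.column_stack([np.ones(n), silence, rise])
X_int = np.column_stack([X_add, silence * rise])
rss_add, rss_int = rss(X_add), rss(X_int)

# F-test for the single added interaction parameter; a non-significant
# result is consistent with independent encoding of the two cues
df_resid = n - X_int.shape[1]
F = (rss_add - rss_int) / (rss_int / df_resid)
p = stats.f.sf(F, 1, df_resid)
```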