Audiovisual integration plays a fundamental role in human information processing. During development, one is exposed to highly correlated auditory and visual speech stimulation [1, 2]. Indeed, multisensory perception is essential in everyday communication; audiovisual interactions often accelerate and improve perception. For example, seeing the speaker's articulatory lip movements has been shown to enhance speech comprehension, especially in acoustic noise. In contrast, conflicting audiovisual features can degrade or even alter perception. In the so-called McGurk illusion, conflicting visual and auditory phonemes can produce an illusory percept of a third phoneme (e.g., visual /ga/ combined with auditory /ba/ is perceived as /da/).
Recent functional magnetic resonance imaging (fMRI) studies have provided insight into the neural basis of how seeing articulatory gestures influences speech processing in the auditory cortex. Audiovisual speech has frequently been shown to elicit hemodynamic activity in the posterior superior temporal sulcus (STS)/superior temporal gyrus (STG) [5-11], and seeing articulatory gestures can modulate hemodynamic activity even in the primary auditory cortex [12, 13]. Further, attention to audiovisual stimuli has been observed to enhance hemodynamic responses in the planum temporale (PT, a part of the secondary auditory cortex). These findings on audiovisual speech processing somewhat deviate from findings (initially suggested by primate electrophysiological studies) showing that the human auditory cortex is organized to process auditory stimuli in parallel anterior “what” and posterior “where” pathways, with speech being processed within the anterior stream [16, 17]. Supporting this notion, previous neuroimaging studies on auditory speech have consistently demonstrated hemodynamic activations in areas anterior to Heschl’s gyrus (HG, i.e., the primary auditory cortex [18-23]), even bilaterally [17, 22, 24].
One possible explanation for the apparent discrepancy between studies on the neural basis of auditory vs. audiovisual speech processing is that the dorsal “where” pathway involves a broader set of functions than mere spatial processing, specifically the mapping of auditory inputs to motor schemas (i.e., the “how” pathway). Furthermore, the effects of lipreading are not restricted to the sensory-specific areas (i.e., visual and auditory cortices) but have been reported to activate Broca’s area [9, 10, 26], motor cortex [10, 26, 27], posterior parietal cortex [9, 26-28], claustrum, and insular cortex. Such large-scale motor-system activations tentatively suggest that audiovisual speech perception is closely related to speech production, and it has been specifically suggested that seeing speech modulates the auditory cortex via top-down inputs from the speech motor system [2, 10, 25, 29].
Magnetoencephalography (MEG) studies have supported the notion that the speech motor system mediates auditory-cortex modulation during lipreading. The electromagnetic N1 response to audiovisual phonetic stimuli, generated ~100 ms after sound onset in the auditory cortex and its posterior areas, has been reported to be suppressed compared with the response to purely auditory phonetic stimuli [2, 30-32]. Furthermore, silent lipreading suppresses N1 responses to phonemic F2 transitions, and suppression of N1 responses to auditory speech stimuli has been observed during overt and covert speech production. In our recent study, we observed that N1 responses to pure tones were suppressed similarly during lipreading and covert speech production, most prominently in the left hemisphere. These effects were seen for all tone frequencies ranging from 125 to 8000 Hz, but due to the ill-posed nature of MEG inverse estimation the precise anatomical loci of this effect remained ambiguous.
There are at least two auditory-cortex source generators that contribute to the N1 response [30, 36, 37] and, further, fMRI studies have demonstrated several tonotopic maps in the human superior temporal lobe [38-43]. While the functional specialization of these tonotopically organized areas has not been elucidated, their loci are such that they could potentially contribute to the generation of the N1 response. Thus, it is feasible to assume that seeing articulatory gestures might suppress activity in some of these tonotopic areas, resulting in suppressed responses even to non-linguistic auditory stimuli.
Here, we used fMRI to study how the human auditory cortex processes narrow-band (1/3 octave) noise bursts centered at low frequencies (i.e., outside the frequency band that is critical for speech processing) and mid-frequencies (i.e., within that band) during presentation of linguistic and non-linguistic visual stimuli. We hypothesized that seeing articulatory gestures suppresses auditory-cortex processing of the narrow-band noise bursts compared with the non-linguistic control condition. We restricted the region of interest to the superior temporal plane covering the primary and secondary auditory cortex, and further investigated whether the suppressive effects specifically concern some of the tonotopic areas described by Talavage et al. (2004).
Materials and Methods
Fifteen healthy right-handed native Finnish speakers (four men; age range 21–55 years; mean 26.9 years) gave informed consent to participate in the experiment. Two subjects were discarded due to technical problems during fMRI scanning. The subjects had normal hearing and normal or corrected-to-normal vision, and they reported no history of neurological or auditory diseases or symptoms. The study protocol was approved by the Ethics Committee of the Hospital District of Helsinki and Uusimaa, Finland, and the research was conducted in accordance with the Declaration of Helsinki. Subjects received no financial compensation for their participation.
Stimuli and Task
The auditory stimuli were 100-ms noise bursts with a bandwidth of 1/3 octave and a center frequency of either 250 Hz (low frequency, LF) or 2000 Hz (mid-frequency, MF; Figure 1). The frequency borders were 223 and 281 Hz for the LF noise burst and 1782 and 2245 Hz for the MF noise burst. Both sounds had 5-ms Hanning-windowed onset and offset ramps. The onset-to-onset inter-stimulus interval was 500 ms. Each condition (LF, MF, and silence) was presented for 30 s in random order. The sound files were generated with Matlab (R14, MathWorks, Natick, MA, USA) at a 44.1-kHz sampling rate with 16-bit precision.
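The original stimuli were generated in Matlab; the band-limiting and ramping steps can be illustrated with the following Python sketch (numpy assumed; the function name and exact synthesis route, frequency-domain filtering of white noise, are ours, not taken from the study):

```python
import numpy as np

def noise_burst(center_hz, fs=44100, dur_ms=100, ramp_ms=5, octave_frac=1/3):
    """Sketch of a 1/3-octave noise burst with Hanning onset/offset ramps."""
    n = int(fs * dur_ms / 1000)
    # 1/3-octave band edges around the center frequency
    lo = center_hz * 2 ** (-octave_frac / 2)
    hi = center_hz * 2 ** (octave_frac / 2)
    # Band-limit white noise in the frequency domain
    spec = np.fft.rfft(np.random.randn(n))
    freqs = np.fft.rfftfreq(n, 1 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0
    burst = np.fft.irfft(spec, n)
    burst /= np.max(np.abs(burst))          # normalize peak amplitude
    # 5-ms Hanning-windowed onset and offset ramps
    r = int(fs * ramp_ms / 1000)
    ramp = np.hanning(2 * r)
    burst[:r] *= ramp[:r]
    burst[-r:] *= ramp[r:]
    return burst

lf = noise_burst(250)    # band edges ~223-281 Hz
mf = noise_burst(2000)   # band edges ~1782-2245 Hz
```

With these parameters the computed band edges (250 Hz x 2^(+/-1/6)) reproduce the 223/281-Hz borders quoted above.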
The visual stimuli were the same as those used in our previous experiment. In the lipreading condition a female face articulated the Finnish vowels /a/, /i/, /o/, and /y/. The digitized videoclips of the articulations lasted 1320, 1360, 1400, and 1440 ms, respectively. We combined the videoclips in pseudorandom order to create a continuous 10-min video. The subjects were requested to press a magnet-compatible button whenever they detected two consecutive articulations of the same category. In the expanding-circle condition a transparent blue circle was overlaid on the mouth area of a still-face image, and the circle transformed into an oval in one of four directions (horizontal, vertical, right oblique, and left oblique). The time scale and spatial frequency of the circle-to-oval transformation approximated those of the mouth opening in the lipreading condition. The videoclips of the transformations were concatenated pseudorandomly in the same way as the vowel videoclips to form a continuous 10-min video. The subjects were instructed to press the response paddle whenever they detected two consecutive oval expansions in the same direction. In the covert vowel-production condition we used a still face of the same female, and the subjects were asked to covertly produce vowels at roughly the visual stimulation rate of the other conditions. The screen resolution was 640 x 480 pixels at 60 Hz with 32-bit color depth.
Each visual condition (lipreading, expanding-circle, and covert self-production) formed one 10-min run, in which we used a block design with alternating 30-s noise-burst (LF, MF) and silence blocks (Figure 1). Each run consisted of 20 counterbalanced auditory blocks. The order of the runs was randomized across subjects. During the lipreading and expanding-circle conditions the target of the one-back task occurred 10 times per run. Each visual condition was repeated twice in random order; the total functional scanning time was thus 60 minutes.
Before the experiment, the tasks were explained and the subjects practiced one test run per condition to familiarize themselves with the paradigm. The stimuli in the practice runs differed from those used in the main experiment.
Prior to the measurements, the individual hearing threshold was assessed using the LF noise burst as a test sound, and the intensity of the test sound was adjusted to 40 dB above the hearing threshold. Because pneumatic headphones in the fMRI environment may attenuate sound, especially at high frequencies, the LF and MF noise bursts were individually adjusted to the same hearing level. In this attenuation test the subjects heard the LF and MF noise bursts consecutively (an attenuation pair) and had to choose the louder of the pair using a response paddle. The response automatically changed the attenuation of the MF noise burst for the next pair, and the procedure continued until loudness equality was reached. During both adjustments the subjects were in the MRI bore.
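The loudness-matching procedure is not specified in algorithmic detail; a minimal sketch of one plausible implementation, a 1-up/1-down adaptive staircase that estimates the point of subjective equality from the reversal levels, is shown below (all names and parameters are hypothetical):

```python
def match_loudness(respond, start_db=0.0, step_db=2.0, reversals_needed=6):
    """1-up/1-down staircase sketch for matching MF loudness to LF loudness.

    `respond(atten_db)` is a callback returning True if, at the given MF
    attenuation, the subject judged the MF burst to be the louder one.
    Returns the mean attenuation at the reversal points (PSE estimate).
    """
    atten = start_db
    last_direction = None
    reversal_levels = []
    while len(reversal_levels) < reversals_needed:
        mf_louder = respond(atten)
        # MF judged louder -> attenuate MF more; else attenuate less
        direction = 1 if mf_louder else -1
        if last_direction is not None and direction != last_direction:
            reversal_levels.append(atten)   # response flipped: a reversal
        atten += direction * step_db
        last_direction = direction
    return sum(reversal_levels) / len(reversal_levels)
```

With a simulated observer whose equality point lies at 7 dB, the staircase oscillates around that level and the reversal mean converges to it.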
We scanned the subjects with a 3-Tesla GE Signa MRI scanner (Milwaukee, WI, USA) with an eight-channel quadrature head coil. We used a sparse sampling technique in which a 0.8-s functional acquisition was followed by 9.2 s of silence. The coolant pump of the magnet was switched off during the prescanning sound adjustments and during functional imaging to further reduce acoustic noise. The functional gradient-echo echo-planar (GE-EP) T2*-weighted volumes had the following parameters: repetition time 800 ms, echo time 30 ms, matrix 64 x 64, flip angle 90 degrees, field of view 22 cm, slice thickness 3 mm, no gap, 12 near-axially oriented slices, and a delay of 9.2 s between the acquisition of successive GE-EP volumes. The effective voxel size was 3.43 x 3.43 x 3.00 mm. The functional volume consisted of 12 slices set parallel to the superior temporal sulcus, with the most superior slices covering the HG. Each experimental run produced 60 GE-EP volumes. The subjects were instructed to avoid body movements throughout the experiment, and the head was fixed with foam cushions on both sides.
After functional imaging, whole-head GE-EP images were obtained in the same slice orientation as the functional volume, followed by 3D T1-weighted axial slices for co-registration and T2 Fluid Attenuation Inversion Recovery (FLAIR) images. All auditory stimuli were presented binaurally through MRI-compatible narrow tubes terminating in ear plugs with central pores. The visual stimuli were projected via a mirror system mounted on the head coil inside the magnet bore. Subjects were instructed to focus on the mouth region of the face. We delivered the stimuli with the Presentation program (Neurobehavioral Systems, Albany, CA, USA).
The data analysis was conducted with BrainVoyager QX software version 2.1 (Brain Innovation, Maastricht, The Netherlands). Two participants were discarded prior to analysis, and one run of one participant was discarded for technical reasons. The first three volumes of each run were discarded to allow for T1 saturation, leaving 342 GE-EP volumes per subject for the final analysis. Preprocessing of single-participant data included 3D motion correction, Gaussian spatial smoothing (full-width-half-maximum 5 mm), and linear trend removal. Slice timing correction was not applied because we used the sparse sampling technique. Each individual’s functional images were co-registered with their high-resolution anatomical images and transformed into the standard Talairach coordinate system.
The group data were analyzed with a multi-subject random-effects general linear model (GLM) in standard space. Statistical analysis was carried out using a GLM with the three auditory conditions (LF noise burst, MF noise burst, and silence baseline) as independent predictors. The model was not convolved with a hemodynamic response function because of the sparseness of the data.
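The core of such a block-design GLM without HRF convolution can be sketched as follows (a minimal numpy illustration, not the BrainVoyager implementation; condition names and the toy data are hypothetical). Each predictor is a boxcar that is 1 for the volumes acquired during that condition, and condition differences such as LF minus silence remain estimable even though the boxcars plus the constant are collinear:

```python
import numpy as np

def fit_block_glm(y, blocks, n_vols):
    """Least-squares GLM with unconvolved boxcar predictors (sketch).

    `blocks` maps condition name -> list of volume indices in that condition.
    Returns a dict of condition betas (minimum-norm solution)."""
    names = sorted(blocks)
    X = np.zeros((n_vols, len(names) + 1))
    for j, name in enumerate(names):
        X[blocks[name], j] = 1.0   # boxcar: 1 during that condition's volumes
    X[:, -1] = 1.0                 # constant term (session mean)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(names, beta[:-1]))

# Toy example: 9 volumes, 3 per condition, noiseless signal levels 2/1/0
betas = fit_block_glm(
    np.array([2.0, 2, 2, 1, 1, 1, 0, 0, 0]),
    {"LF": [0, 1, 2], "MF": [3, 4, 5], "silence": [6, 7, 8]},
    n_vols=9,
)
contrast_lf_vs_silence = betas["LF"] - betas["silence"]
```

In the toy data the LF-vs-silence contrast recovers the built-in amplitude difference of 2.0 exactly.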
First we created regions of interest (ROIs) within which the further analyses were carried out. Using a GLM, the LF and MF noise-burst stimuli were contrasted as separate predictors against silence in each condition at the significance level P < 0.05. The ensuing clusters encompassing the auditory cortical areas in each hemisphere in each condition were merged to form one ROI per hemisphere (Figure 2). Further statistical group-level contrasts were carried out inside these ROIs at the significance level P < 0.05.
The high-resolution images were brought into anterior commissure (AC) – posterior commissure (PC) space. Segmentation into gray and white matter and hemispheric segregation were done with the BrainVoyager semiautomatic segmentation program using sigma filtering and automatic bridge-removal algorithms. As signal-intensity homogeneity varied extensively in the temporal and frontal lobes, all anatomical images needed additional manual correction. Reconstruction of the cortex allowed us to display activations on the gyri and sulci in both flattened and inflated brain views; the group analysis was presented on the inflated or flattened hemispheres of a single subject’s brain.
Analysis of Behavioral Task
We measured both the reaction time (RT) for detecting the target stimuli and the hit rate (HR). The RT was measured from the beginning of the visual target clip to the button press, which increases the RT compared with simple target identification, as the video clip type could be identified at the earliest 300–400 ms after onset (see Figure 1A). The HR was the proportion of targets detected. We compared HR and RT between conditions using Student’s t-test.
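For illustration, a hand-rolled Student's t statistic for matched samples is sketched below (the text does not specify the statistics software or whether a paired or unpaired test was used, so the paired form and the toy data are our assumptions):

```python
import math

def paired_t(x, y):
    """Paired Student's t statistic and degrees of freedom (sketch).

    `x` and `y` are matched per-subject measurements (e.g., mean RTs
    in two conditions)."""
    d = [a - b for a, b in zip(x, y)]          # per-subject differences
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)   # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1

# Hypothetical toy data: four subjects, two conditions
t_stat, df = paired_t([3, 5, 4, 6], [1, 2, 3, 4])
```

The resulting t would then be compared against the t distribution with n - 1 degrees of freedom.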
Results and Observations
There was no significant difference in HR between the expanding-circle (mean ± SD = 82.7 ± 3.4%) and lipreading (76.7 ± 4.5%) conditions (p > 0.14). The RT was 1512 ± 26 ms for detection of target vowel articulations and 1418 ± 23 ms for detection of expanding-oval targets. The RT for the expanding-circle targets was significantly shorter than that for vowel-articulation detection (p < 0.05), presumably because the oval direction was apparent from the first deformation, whereas distinct differences between mouth movements appeared only later.
Conjunction analysis of all group-level activations disclosed a wide range of temporal-lobe activations that extended to subcortical regions (Figure 2). The temporal-lobe cortical areas revealed by this analysis served as the regions of interest (ROIs) for the further planned contrasts between the experimental conditions.
Figure 3 depicts significant group activations to LF and MF noise bursts in the flattened right and left superior temporal ROIs of one subject. The noise bursts activated the middle and posterior STS, including Heschl’s sulcus (HS) and HG, with some extension into the STG bilaterally. In all conditions the activations caused by the LF and MF noise bursts were more extended in the left than in the right hemisphere.
Activity elicited by the MF noise bursts encompassed a slightly more limited region of the superior temporal plane than the hemodynamic responses to the LF noise bursts; the activations caused by the LF noise bursts surrounded those caused by the MF noise bursts. The activations were more anterior in the right than in the left hemisphere, and there was a slight left–right asymmetry in the posterior STS activations in all conditions.
When contrasting the non-linguistic (i.e., expanding-circle) vs. linguistic (i.e., lipreading and covert self-production) conditions within the temporal-lobe ROIs, significant suppression of hemodynamic activity to MF noise bursts was observed in the linguistic conditions in the left-hemisphere first transverse sulcus (FTS) and the right-hemisphere STG lateral to HS (Figure 4). There were no significant differences in the contrast between the lipreading and covert self-production conditions.
In the present study, we used fMRI to investigate in which auditory cortical regions silent lipreading and covert self-production suppress the processing of simple auditory stimuli. Using two different frequency bands, we found that contrasting the non-linguistic (i.e., expanding-circle) vs. linguistic (i.e., lipreading and covert self-production) conditions during MF noise-burst stimulation results in bilateral but asymmetric activations on the superior temporal cortex, involving the FTS in the left and the STG lateral to HS in the right hemisphere (Figure 4). This suggests that the speech motor system modulates the processing of non-linguistic sounds at speech-relevant frequencies by suppressing hemodynamic reactivity in specific auditory cortical areas. Several tonotopically organized areas have previously been documented in the auditory cortex, suggesting that it is functionally highly heterogeneous. We tentatively propose that the suppressed areas in the left FTS and right STG could be related to the tonotopic areas described by Talavage et al.
The activations to the LF and MF noise bursts overlapped the auditory cortex in both hemispheres within the auditory cortical ROIs across the conditions (Figure 3). Since we specifically selected LF and MF noise bursts, the results are not directly comparable to previous tonotopic mapping results [38-42]. However, it can be speculated that the areas activated by the MF noise bursts are more relevant for speech processing than those activated by the LF noise bursts, as the MF noise bursts occupy the frequency band that is critical for speech perception. While there seem to be some frequency-dependent differences across the conditions (Figure 3), it has to be noted that the maps were thresholded against the silent baseline rather than contrasted between the conditions. Nevertheless, the right-hemisphere activations disclosed an area located more anteriorly than in the left hemisphere. Such activity distributions could reflect hemispheric differences in auditory processing: the right hemisphere has been suggested to process acoustic sound features such as pitch, and the left hemisphere to be specialized in processing speech-related temporal dynamics [18, 23, 48]. Noting that even non-semantic, non-speech audiovisual information has access to the superior temporal cortex [49-51], the extensive hemodynamic activations in our study could also be associated with audiovisual integration. All conditions showed left-hemispheric activations posterior to HG and medial to the planum temporale, or adjacent to the somatosensory areas of the tongue and pharynx (Figure 3). This could tentatively suggest that speech-related stimuli have access to speech motor areas and thus support left-hemispheric lateralization of linguistic processing, or even mirroring of action into perception.
However, we cannot exclude subvocalization during covert speech production, although we encouraged the subjects to avoid mouth movements and to articulate the vowels only in their minds in the covert articulation condition. Because the ROIs were limited to the auditory cortex, no visual cortex or other heteromodal cortices were imaged in the present study.
Contrasting the non-linguistic (i.e., expanding-circle) conditions with the linguistic (i.e., lipreading and covert self-production) conditions revealed asymmetric supratemporal activations (Figure 4). Within the temporal-lobe ROIs, significant suppression of hemodynamic activity to the speech-related MF noise bursts was shown in the linguistic conditions in the left-hemisphere FTS and the right-hemisphere STG lateral to HS. In previous studies, the auditory cortex has been documented to contain several tonotopic areas, whose functions, interactions, and connectivity are presumably highly heterogeneous. In their tonotopic mapping experiment, Talavage et al. revealed several tonotopically organized areas using amplitude-modulated noise. Our FTS activation in the left hemisphere could correspond to Talavage’s “higher-frequency sensitivity endpoint 2’”. Interpretation of the right-hemisphere activation is more speculative. Although our right-hemispheric STG activation could be seen as corresponding to Talavage’s “higher-frequency endpoint 5’”, one has to be cautious, as Talavage’s tonotopic mapping involved only the left hemisphere. Striem-Amit et al. investigated bilateral tonotopic maps across the left and right supratemporal lobes with rising tone chirps ranging from 250 to 4000 Hz. Our asymmetric suppression of responses to the speech-related MF noise bursts could be pinpointed to the areas where Striem-Amit et al. observed activations at medium frequencies. We found no significant differences in the contrast between the lipreading and covert self-production conditions, which tentatively suggests that similar mechanisms were at work.
Our present study restricted the analysis to the supratemporal lobes, as our focus was on the primary auditory cortex and the surrounding areas. It is thus possible that our region-specific analysis excluded other speech-related regions that are known to support lipreading, such as Broca’s area. Likewise, our slice and ROI selection excluded the visual cortex and other heteromodal cortices. The limited temporal resolution of fMRI is a further limitation of our experiment. Therefore, the speech-related hemodynamic activations cannot be related in a straightforward manner to the suppressed N1 observed in our previous MEG experiment. Nonetheless, with a similar experimental setup, MEG demonstrated N1-response suppression in the superior temporal lobes during lipreading and covert speech production, and fMRI exhibited suppression in the left FTS and the right STG lateral to HS. Together with the previous MEG study, the present results show that the speech-related suppression was bilateral, but with distinct areas suppressed asymmetrically. In both studies, the suppression effect was highly similar between the covert speech self-production and lipreading tasks, suggesting a common underlying mechanism that mediates suppression of auditory processing during lipreading and covert self-production. Importantly, the present results suggest that top-down input from speech motor areas induced suppression lasting several seconds (beyond the sub-second, transient suppression revealed by MEG) in at least two distinct areas of the auditory cortex, more prominently in the left hemisphere.
In conclusion, in the present study lipreading and covert speech self-production suppressed the processing of non-linguistic sounds in the auditory cortex. We suggest that this speech-related suppressive effect arises in tonotopic subareas of the auditory cortex, in the left FTS and in the right STG lateral to HS.
Conflicts of interest
The authors declare no conflicts of interest.
This study was financially supported by the Academy of Finland, the EVO Fund of the HUS Medical Imaging Center, Helsinki University Central Hospital, and by the US National Institutes of Health grants R21DC010060, R01MH083744, R01HD040712, and R01NS037462. We thank all the volunteers for participating.
2014 Ross Science Publishers