Introduction

As one of the most influential mediums of auditory and visual art, cinema holds the power to evoke emotions. In fact, natural scenery paired with an appropriate melody can evoke an emotional response even without dialogue. The mechanisms behind this response raise the question of how emotional impact can be enhanced in film-making. Accordingly, studies have examined which stimulus, auditory or visual, primarily evokes emotions during film viewing1,2.

Recent studies using functional magnetic resonance imaging (fMRI), particularly voxel-wise encoding and decoding models, have revealed relationships between brain activity and comprehensive cognitive states3,4. For example, Kragel et al. demonstrated that certain emotional states can be decoded from brain activity patterns3. Numerous studies have similarly shown that visual and auditory experiences can be read out from the brain activity of an audience using fMRI decoding techniques5,6,7,8. Nishimoto et al. further demonstrated that cinema-evoked visual images can be reconstructed from specific brain patterns6. Given these advances in fMRI decoding techniques, it should be possible to determine which sensory stimulus, audition or vision, drives emotion during film viewing.

In this study, we explored whether fMRI decoding techniques can identify which sensory stimulus evokes emotional responses during film viewing. We hypothesized that the neural processing underlying emotional responses would exhibit distinct patterns in specific brain regions depending on whether the emotion was auditory- or visual-driven. To test this hypothesis, we conducted two main experiments, one inside the fMRI scanner and one outside (Fig. 1). First, participants watched movies featuring natural scenery and piano music while undergoing fMRI. To ensure alertness, that is, keeping their eyes and ears open, participants also performed a simple audio-visual task (see Methods). Outside the fMRI scanner, participants then revisited the same movies and reported their evoked emotional feelings using Russell's Circumplex model9. In addition, they indicated whether the emotions were auditory- or visual-driven. Based on these reports, we labeled brain activity as reflecting auditory- or visual-driven emotion (see Methods). Part of the data was used to train the fMRI decoder, and the remaining data were used to test whether the decoder could distinguish stimuli that evoked auditory- versus visual-driven emotions (Figs. 1 and 2). Furthermore, we investigated which brain regions successfully decoded the emotional responses, allowing us to examine the neural processing of audiovisual emotions10 (see Methods).

Fig. 1

Experimental scheme. Participants watched selected movies inside and outside the fMRI scanner. Based on the subjective film-viewing evaluations made outside the fMRI scanner, the data were first categorized as either auditory-driven (AD) or visual-driven (VD) and then divided into training and testing datasets for the fMRI decoder model10,11,12,13. The accuracy and predictive ability of the model in distinguishing AD and VD emotional states were then evaluated.

Fig. 2

Experimental results and analyses. (a) The mean ratio of visual-driven emotions for all movies. Error bars show standard errors. Data were sorted in ascending order and split at the midpoint into auditory- and visual-driven categories. (b) Schematic illustration of the cross-validation scheme used to build the fMRI decoder in this study (four-fold cross-validation).

Results

Prior to the emotional decoding analyses, we analyzed the behavioral data to verify participants' alertness during the fMRI experiment (see Methods). Using data from the simple audio-visual task, we calculated the mean accuracy of detecting target signals (hearing a beep or seeing a change in color) within a 2-s window. The mean accuracy was 88.2 ± 2.9% across 288 trials (4 trials/movie × 24 movies × 3 repetitions), indicating that participants were sufficiently engaged and attentive throughout the session.
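As a minimal sketch of how this detection accuracy could be computed (the timestamp format and variable names below are assumptions, not the actual logging format), a target counts as detected when a button press falls within 2 s of its onset:

```python
import numpy as np

def detection_accuracy(target_onsets, press_times, window=2.0):
    """Fraction of targets followed by a button press within `window` seconds."""
    hits = 0
    for onset in target_onsets:
        # A target counts as detected if any press falls in (onset, onset + window].
        if np.any((press_times > onset) & (press_times <= onset + window)):
            hits += 1
    return hits / len(target_onsets)

# Hypothetical example: 4 targets in one 40-s movie, three detected in time.
targets = np.array([8.3, 17.9, 27.1, 36.4])   # target onset times (s)
presses = np.array([9.0, 18.5, 30.2, 37.0])   # button press times (s)
print(f"accuracy = {detection_accuracy(targets, presses):.2f}")  # 0.75
```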

We then analyzed the subjective ratings of the emotion-driving stimulus collected after the fMRI experiment. Interestingly, a one-way ANOVA showed that the distribution of these subjective ratings in the behavioral experiment outside the fMRI scanner differed significantly from the pre-experiment movie assignment (F(23, 264) = 1.57; p = 0.04; partial η² = 0.11) (Fig. 2a, top bar graph). Therefore, we divided the movies into two categories, auditory-driven (AD) and visual-driven (VD) (Fig. 2a, bottom bar graph), which served as labels for training the fMRI decoding models (Fig. 1).
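A sketch of this analysis, assuming the ratings are arranged as a 24-movie × 12-participant array (the data below are simulated and the variable names are ours), might look as follows:

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical ratings: vd_ratio[m, s] = visual-driven ratio (0-1) reported by
# participant s for movie m (24 movies x 12 participants).
rng = np.random.default_rng(0)
vd_ratio = rng.uniform(0.2, 0.8, size=(24, 12))

# One-way ANOVA with movie as the factor, testing whether the reported
# visual-driven ratio differs across the 24 movies.
F, p = f_oneway(*[vd_ratio[m] for m in range(24)])
print(f"F(23, 264) = {F:.2f}, p = {p:.3f}")

# Split of the sorted per-movie means into auditory-driven (AD) and
# visual-driven (VD) labels, mirroring the categorization in Fig. 2a.
means = vd_ratio.mean(axis=1)
order = np.argsort(means)
labels = np.empty(24, dtype=object)
labels[order[:12]] = "AD"   # lower half: auditory-driven
labels[order[12:]] = "VD"   # upper half: visual-driven
```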

The fMRI decoding analyses showed that mean decoding accuracies were significantly higher than chance level in the auditory area (trained vs. chance level: t(11) = 4.73; p = 0.00031; r = 0.82) and the insula (trained vs. chance level: t(11) = 3.55; p = 0.0023; r = 0.73). Conversely, the mean accuracies in the visual areas and amygdala did not differ significantly from chance (Fig. 3).

Fig. 3

Box plots of decoding accuracies in the visual areas, auditory area, insular cortices, and amygdala. The boxes represent the 25th to 75th percentiles of the data distribution, and vertical bars indicate the lowest and highest values. Red plots represent the mean decoding accuracies of the trained decoding model (visual areas, 50.6%; auditory area, 61.9%*; insula, 57.7%*; amygdala, 49.1%). Green plots show the mean accuracies obtained with the same training and testing data, except that the training data were randomly labeled as auditory- or visual-driven. Asterisks indicate that the decoding accuracies of the trained model were significantly higher than those of the random model and chance level. The blue dashed line represents chance-level accuracy (50%).

To rule out chance effects, we compared the mean accuracies of the trained model with those of a random model, which was trained using randomly assigned labels. Compared with the random model, the trained model showed significantly higher mean accuracies in the auditory area (trained vs. random: t(11) = 2.81; p = 0.0085; r = 0.65) and the insula (trained vs. random: t(11) = 2.36; p = 0.020; r = 0.58). To further confirm that using only the mean subjective evaluations (Fig. 2a, bottom bar graph) did not bias the results, we also calculated the mean accuracies for both brain regions using individual subjective evaluations, which revealed tendencies similar to those obtained with the mean subjective evaluations (see Supplementary Analysis 1 in the Supplementary Information). Given these findings, brain activities in the auditory area and insula appear to represent which stimulus (auditory or visual) evoked emotional responses during film viewing.
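For illustration, the two comparisons could be computed as below on simulated per-participant accuracies; the use of one-tailed tests (alternative="greater") and the t-to-r conversion are our assumptions:

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

def effect_size_r(t, df):
    # Convert a t statistic to the effect size r = sqrt(t^2 / (t^2 + df)).
    return np.sqrt(t**2 / (t**2 + df))

# Hypothetical per-participant decoding accuracies (%) for one ROI (n = 12).
trained = np.array([58, 63, 55, 67, 60, 72, 59, 66, 61, 64, 57, 61], dtype=float)
random_ = np.array([49, 52, 47, 55, 50, 53, 48, 51, 49, 52, 50, 48], dtype=float)
df = len(trained) - 1

# Trained model vs. chance level (50%).
t_c, p_c = ttest_1samp(trained, 50, alternative="greater")
# Trained model vs. the randomly labeled model, paired across participants.
t_r, p_r = ttest_rel(trained, random_, alternative="greater")

print(f"vs chance: t({df}) = {t_c:.2f}, p = {p_c:.4f}, r = {effect_size_r(t_c, df):.2f}")
print(f"vs random: t({df}) = {t_r:.2f}, p = {p_r:.4f}, r = {effect_size_r(t_r, df):.2f}")
```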

Discussion

In this study, we utilized fMRI decoding techniques to examine which brain regions predicted auditory- or visual-driven emotions during film viewing. The results showed that brain activities in both the auditory area and insula represented which stimulus (auditory or visual) predominantly elicited an emotional response, suggesting their involvement in the neural processing of emotion during film viewing. Our findings are consistent with previous studies in human and non-human primates showing wide reciprocal connections between these two regions14,15,16,17,18. For instance, Zhang et al. demonstrated significant roles of the insula and auditory areas in emotional processing and further suggested that the insula plays a crucial role in the conscious perception of emotions14,15. These findings suggest that observing brain activity in the auditory area and insula allows us to objectively determine whether an evoked emotional response was auditory- or visual-driven. In the cinematic arts, directors could utilize these findings when deciding what music and scenery to present in their films. In contrast, the present results revealed that brain activities in the visual areas and amygdala did not represent which stimulus, audition or vision, mainly elicited the emotion. This is likely due to the specialized functions of these regions; for example, the visual areas (V1-V4) primarily process basic visual information (e.g., luminance contrast, shape) without linking it to emotional value. Moreover, auditory stimuli have been associated with greater influence over emotional responses than visual stimuli2,19. It therefore seems reasonable that the visual areas were not strongly involved in the processing of stimulus-driven emotion. The lack of decoding accuracy in the amygdala also seems reasonable, since neural responses in this region correlate with fundamental emotional responses such as fear and anxiety20,21,22,23,24, rather than with the stimulus driving the emotion.

One might argue that brain activity in the auditory area represented the stimulus driving the emotional response simply because the auditory area was stimulated by basic sound features, such as sound volume, tempo, and tonality25,26. To examine this possibility, we extracted these sound features (sound volume, tempo, and tonality) using the MIRtoolbox (https://www.mathworks.com/matlabcentral/fileexchange/24583-mirtoolbox) and compared them with the ratio of stimulus-driven emotions (Fig. 2a; the “AD” and “VD” labels used for the decoding analyses). This analysis revealed no significant correlations, indicating that basic sound features did not influence the participants' emotional responses (see Supplementary Analysis 2 in the Supplementary Information).
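As an illustration of how such a check might proceed once the sound features have been exported (for example from MIRtoolbox functions such as mirrms or mirtempo), a simple per-movie correlation against the visual-driven ratio could be computed as follows; all values and names below are simulated or assumed:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-movie values: sound features exported from the MIRtoolbox
# and the visual-driven ratio from Fig. 2a. Shapes and values are illustrative.
n_movies = 24
rng = np.random.default_rng(1)
features = {
    "volume": rng.normal(0.1, 0.02, n_movies),   # RMS energy
    "tempo": rng.normal(90, 15, n_movies),       # beats per minute
    "tonality": rng.uniform(0, 1, n_movies),     # key clarity
}
vd_ratio = rng.uniform(0.3, 0.7, n_movies)       # visual-driven ratio per movie

# Correlate each sound feature with the stimulus-driven emotion ratio.
for name, values in features.items():
    r, p = pearsonr(values, vd_ratio)
    print(f"{name}: r = {r:.2f}, p = {p:.3f}")
```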

The present study offers valuable insights for personalizing and enhancing people's audio-visual experiences. For instance, monitoring brain activity using electroencephalography could allow the creation of advanced and comfortable audio-visual environments. Studies have pointed to the growing popularity of such “neuro-designed” services and products in the near future27,28. Someday, our brain might select our favorites instead of “us”. This idea also relates to controversial aspects of consciousness29,30.

Despite the significant findings of this study, the highest decoding accuracies reached only 70–75%. Further research is required to establish more effective fMRI decoding methods to address this. Increasing the number of training sessions or using functional ROIs may help in constructing a more accurate decoder and performing more detailed analyses. Furthermore, both the scientific approach to complex human emotion and the technical methods of fMRI brain decoding should be reconsidered. Finally, the generalizability of the current study is limited: although the results were statistically and neuroscientifically meaningful, we tested only twelve participants aged twenty to thirty-nine years. Larger and more diverse samples are needed to obtain more generalizable findings.

In conclusion, the current results highlight fMRI decoding techniques as an excellent tool for determining which stimulus evoked an emotional response during film viewing, providing new insights into the mechanism of stimulus-driven emotion.

Methods

Participants

A total of twelve individuals (six males, six females) with normal or corrected-to-normal vision and normal hearing participated in the study (mean age, 26.58 years; SD, 3.23; range, 20–39). Gender differences were not expected to affect this kind of emotion and perception experiment, but we nevertheless recruited equal numbers of men and women. Similarly, age effects were not expected in this perception experiment, but we restricted the age range to avoid possible perceptual development or decline with aging. All participants gave written informed consent, and the study was approved by the Ethics Committee of the National Institute of Information and Communications Technology, Japan, in compliance with the Declaration of Helsinki.

Stimuli

The experiment employed 24 movies presented on a screen inside the fMRI scanner. Each movie comprised a grayscale natural scene paired with piano music. The selection of scenes and music was based on the results of a preliminary experiment (see samples in the Supplementary Information). We avoided overly familiar or bizarre stimuli, as they could bias participants' performance. These stimuli had also been used in our previous studies and pilot testing without apparent bias. Multimedia elements were sourced from the following royalty-free libraries: PIXTA (https://pixta.jp/), Panasonic (https://www2.panasonic.biz/jp/lighting/shop/effect-projector/spaceplayer/contents/free/), and SOUND-JUKE (https://www.sound-juke.com/). Each movie was presented for 40 s and repeated three times in random order. Between movies, a white fixation point on a gray background was presented (Fig. 4a). Inside the fMRI scanner, participants viewed the movies through a mirror mounted on the head coil, ensuring their gaze remained centered on the fixation point. For audio, they wore MRI-compatible headphones throughout the experiment.

Fig. 4

Experiments inside and outside the fMRI scanner. (a) Inside the fMRI scanner, each movie was presented for 40 s with a 4-s interval in between. While watching, participants were required to press a button upon seeing a red circle at the center or hearing a beep sound. These audio-visual targets appeared randomly every 10 s. (b) Outside the fMRI scanner, participants rewatched the movies and reported their evoked emotional responses using Russell's Circumplex model. Participants also reported which sensory stimulus (auditory or visual) evoked each emotional response.

Outside the fMRI scanner, participants viewed and listened to audio-visual stimuli identical to those presented inside the scanner (Fig. 4b). They reported their evoked emotional feelings using Russell's Circumplex model9 and indicated whether the emotions were auditory- or visual-driven on a 100% scale (Fig. 2a).

Behavioral task inside the fMRI scanner

To ensure alertness (keeping their eyes and ears open) while minimizing direct interference with their viewing, participants performed a simple audio-visual task. Specifically, participants were instructed to press a button whenever they detected a target signal: a color change of the central fixation point (white to red) or a brief beeping sound. During each 40-s film viewing, four randomly timed target signals were presented (Fig. 4a).

Behavioral task outside the fMRI scanner

Following the session inside the fMRI scanner, participants rewatched the same movies. This time, they were asked to report their emotional response using Russell's Circumplex model9 and to indicate which sensory stimulus (auditory or visual) evoked that emotion on a scale of 1–100% (e.g., 70% audition and 30% vision) (Fig. 2a).

fMRI data acquisition

MRI data were acquired using a 3T MRI scanner (Magnetom Prisma; Siemens, Erlangen, Germany) equipped with a 12-channel head coil at the Center for Information and Neural Networks (Osaka, Japan). To achieve high-spatial-resolution whole-brain fMRI with a standard repetition time (TR), multi-band echo planar imaging developed by CMRR (https://www.cmrr.umn.edu/multiband/) was utilized. Interleaved T2*-weighted gradient-echo planar imaging (EPI) was performed to acquire functional images covering the entire brain (TR, 2000 ms; TE, 30 ms; flip angle, 75°; interleaved acquisition; multi-band acceleration factor, 2; partial Fourier, 6/8; voxel size, 3 × 3 × 3 mm; number of slices, 50). T1-weighted anatomical images were also collected using MP-RAGE (TR, 1900 ms; TE, 2.48 ms; flip angle, 9°; voxel size, 1 × 1 × 1 mm). Finally, field map imaging was performed with a double-echo spoiled gradient echo sequence (gre_field_map; TR, 753.0 ms; TE, 4.92/7.38 ms; voxel size, 3 × 3 × 3 mm with 0.8-mm gaps; flip angle, 90°), generating one magnitude image and two phase images, which were used to compute the field map.
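For reference, the acquisition parameters listed above can be collected into a simple configuration structure (a plain summary of the values in the text, not scanner-software syntax):

```python
# Summary of the acquisition parameters described above; the keys are
# illustrative labels, not scanner identifiers.
scan_params = {
    "functional_EPI": {
        "TR_ms": 2000, "TE_ms": 30, "flip_angle_deg": 75,
        "multiband_factor": 2, "partial_fourier": "6/8",
        "voxel_size_mm": (3, 3, 3), "n_slices": 50,
    },
    "anatomical_MPRAGE": {
        "TR_ms": 1900, "TE_ms": 2.48, "flip_angle_deg": 9,
        "voxel_size_mm": (1, 1, 1),
    },
    "field_map": {
        "TR_ms": 753.0, "TE_ms": (4.92, 7.38), "flip_angle_deg": 90,
        "voxel_size_mm": (3, 3, 3), "slice_gap_mm": 0.8,
    },
}
```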

fMRI data analyses

Image pre-processing and statistical analyses were conducted using SPM12 (Wellcome Trust Centre for Neuroimaging, UCL). Field map correction was applied, and the first 8 s of each scan were discarded to account for instability at the start of data acquisition. The acquired fMRI data underwent slice-timing and motion correction, and the results were co-registered to the within-session high-resolution anatomical images (matched to the slices used for EPI) and to the whole-head high-resolution anatomical image. Images were then spatially normalized to the Montreal Neurological Institute template, resampled to a uniform voxel size of 3 × 3 × 3 mm, and linearly detrended within each run. Following normalization and smoothing, the signal amplitude was normalized relative to the mean amplitude in each run to minimize baseline differences across runs. To compensate for hemodynamic delays, the fMRI signals of each voxel were shifted by 4 s and averaged within each stimulus block. Specifically, data samples were created by averaging the fMRI volumes within each 40-s stimulus block (on average, 20 volumes) and sorting voxels into each region of interest (ROI). Anatomical labels for the ROI analysis were defined using the SPM Anatomy Toolbox (version 3.0).
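A minimal sketch of the hemodynamic-delay shift and block-averaging step, assuming the pre-processed data for one run are available as a volumes × voxels array (the function and variable names are ours), is shown below:

```python
import numpy as np

TR = 2.0            # repetition time (s)
HEMO_SHIFT_S = 4.0  # hemodynamic delay compensation (s) = 2 volumes
BLOCK_S = 40.0      # stimulus block duration (s) = 20 volumes

def block_average(run_data, block_onsets_s):
    """Average pre-processed voxel time courses within each stimulus block.

    run_data: array of shape (n_volumes, n_voxels), already detrended and
    amplitude-normalized; block_onsets_s: stimulus onset times in seconds.
    Returns one sample (n_voxels,) per block, shifted by the hemodynamic delay.
    """
    shift = int(HEMO_SHIFT_S / TR)
    n_vols = int(BLOCK_S / TR)
    samples = []
    for onset in block_onsets_s:
        start = int(onset / TR) + shift
        samples.append(run_data[start:start + n_vols].mean(axis=0))
    return np.vstack(samples)
```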

fMRI decoding analyses

We used the support vector classifier ("SVC") implemented in the scikit-learn Python toolbox (https://scikit-learn.org/stable/) as the decoder, with the default value of the soft-margin parameter (C = 1). Sample code is provided in the Supplementary Information. In this study, fMRI decoders were trained to classify the sensory stimulus that evoked the emotional response by categorizing data samples as "auditory-driven (AD)" or "visual-driven (VD)" based on participants' subjective evaluations outside the fMRI scanner. To compensate for variability due to the specific choice of training and test datasets, decoding performance was computed with four-fold cross-validation, in which 25% of the data were left out as the test set (Fig. 2b), because a previous study suggested that leaving out roughly 20% or more of the data yields better estimates with less computation13. This procedure was repeated 10 times with shuffled data splits to ensure generalizability (Fig. 2b). For readers interested in the other cross-validation schemes possible in this study, we present them in Supplementary Analysis 3 of the Supplementary Information; only 12-fold (8.3% of the data left out), six-fold (16.7%), four-fold (25%; Fig. 2), and two-fold (50%) schemes were feasible, because equal numbers of AD and VD samples had to be chosen as training data.
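A sketch of this decoding procedure using scikit-learn is shown below; the wrapper function, variable names, and the choice of leaving the kernel at its default are our assumptions:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def roi_decoding_accuracy(X, y, n_repeats=10, n_folds=4, seed=0):
    """Mean cross-validated accuracy of an SVC classifying AD vs. VD samples.

    X: (n_samples, n_voxels) block-averaged data for one ROI and participant;
    y: array of "AD"/"VD" labels. Four-fold cross-validation is repeated
    n_repeats times with shuffled splits, as illustrated in Fig. 2b.
    """
    accuracies = []
    for rep in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + rep)
        # C = 1 as in the study; other settings (e.g., the kernel) are left at
        # scikit-learn defaults here, since they are not specified in the text.
        scores = cross_val_score(SVC(C=1.0), X, y, cv=cv)
        accuracies.append(scores.mean())
    return float(np.mean(accuracies))
```

Stratified folds with shuffling keep the proportions of AD and VD samples balanced in each split, consistent with the requirement above that equal numbers of AD and VD samples enter the training data.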

To evaluate the success of the decoder in predicting the primary stimulus of the evoked emotion, separate decoding accuracies were calculated for each participant and specific ROIs as defined by the SPM anatomy toolbox (version 3.0)10, including the visual areas (from visual area 1 to 4), auditory area (from TE1 to 5), insula, and amygdala. These brain regions have been indicated to play significant roles in the perceptual and emotional processing of audio-visual stimuli during film viewing5,6,7,8,14,15,16,17,18.