Introduction

In communication scenarios, listeners typically engage in head and eye movements for several purposes, for example to follow social norms1,2,3 or to gain additional sensory input from different modalities, particularly visual and auditory information4,5,6. The movements of the head and eyes play an important role in spatial auditory perception and speech intelligibility. Assuming that listeners can actively maximize information gain through movements in a given scene, their motion behavior might, in turn, provide insights into the complexity of a scene and the significance of different information for scene perception. This knowledge would be valuable for a deeper understanding of listeners' behavior based on the available multimodal information and scene complexity. Additionally, it could be relevant for the development of hearing aid technology; a wearable hearing device could utilize measures of motion behavior to adjust algorithms based on the difficulty perceived by the listener.

The auditory system can exploit binaural disparities between a target and an interfering source to enhance speech intelligibility6,7,8. Grange and Culling5 demonstrated that optimizing these binaural cues through head movement can improve speech intelligibility in specific target-interferer conditions. However, listeners did not fully exploit this benefit, either refraining from movement unless explicitly instructed or employing a movement strategy to maximize the target speech signal level rather than the signal-to-noise ratio (SNR). Apart from task instruction, various factors influence head movement behavior in listening experiments. Brimijoin et al.9 investigated head-orienting behavior with auditory and visual stimuli in listeners with normal and impaired hearing, revealing differences in head-motion behavior between the two groups. The head movement behavior of hearing-impaired listeners was more complex, characterized by a delayed movement onset and offset and a less direct final head orientation. In a subsequent study10, they demonstrated that the microphone modes of hearing instruments induced head movements with different patterns and complexities dependent on the target location. Whitmer et al.11 found that hearing instruments with a lower spectral bandwidth did not compromise listeners’ ability to accurately locate a sound source but did affect their motion behavior, such as delayed movement onset. Moving closer to the source generally increases its level and might serve as a social signal for listening difficulties, as suggested by Hadley et al.12 and Weisser et al.13, who revealed that listeners move closer to each other in situations where communication becomes more challenging. This evidence highlights that an individual’s hearing ability and differences in sensory input, including the availability of visual information, can influence head motion behavior.

While turning the head generates physically different signals at the ears, eye movement can enhance auditory perception by providing visual information that integrates into a multimodal percept. Visual information and gaze direction have been shown to improve speech comprehension14,15 and support auditory source localization16,17,18,19. Visual scene perception requires frequent eye movement as the sharp foveal vision only covers a small angle20. Consequently, movement patterns consist of fixations interrupted by fast saccades to the next fixation point. Monitoring eye gaze allows exploration of visual search behavior, revealing that the focus of visual attention often aligns with the focus of auditory attention21. This alignment is particularly evident in challenging listening scenarios, as indicated by prolonged fixation durations on a target talker with increasing noise levels22. Furthermore, various gaze parameters, including fixation durations, rate and saccade velocity, are associated with task difficulty and cognitive load23,24. Since eye-gaze behavior reflects visual attention and cognitive load, it may also indicate the complexity of an acoustic scene, contributing to difficulties in speech comprehension.

The benefit of eye movements may extend beyond providing visual scene information. The integration of information from both modalities could occur early in auditory perception, altering the internal representation of an acoustic signal based on eye-gaze direction25,26. Indeed, Maddox et al.27 demonstrated enhanced detection of interaural time and level differences when eye-gaze was directed towards the stimulus location. Along with evidence that gaze position can alter activity in auditory neurons in macaques26, these findings suggest that eye gaze can modulate auditory stimulus processing and improve speech perception28.

When exploring listener behavior in virtual audio-visual environments, virtual reality (VR) glasses have proven to be a valuable tool, offering a high level of realism with greater experimental control and reproducibility than field experiments. Through VR glasses, visual environments can be presented to participants, providing visual cues that have been shown to improve sound localization accuracy and speech intelligibility and to affect movement behavior15,27,29. Modern VR glasses with integrated eye trackers, together with development platforms that provide interfaces for movement tracking, allow reliable simultaneous measurement of eye and head movements during the presentation of visual scenes30,31,32. Although VR glasses can affect the wearer's head-related transfer function (HRTF)33 and their head and eye movements34, when the virtual environment is designed carefully, the behavior exhibited in it appears to mirror natural behavior16,29.

Most studies investigating head motion and eye-gaze behavior have utilized stimuli of short durations5,9,29, often employing structures like the Coordinate Response Measure (CRM) task29. However, real-life acoustic environments are typically more complex, with speech localization and comprehension occurring simultaneously, as in ordinary conversations. Previous research suggests that the complexity of the acoustic scene, including factors like the number of competing talkers, their similarity to the target, the type and level of masking noise, and room acoustic parameters such as reverberation, impacts listener behavior35,36,37. While most studies on listener behavior concerning scene complexity have not focused on movement, findings by Hadley et al.35 suggested that a listener’s movement behavior and conversational strategies depend on the complexity of the acoustic scenario. To enhance our understanding of how different factors of complexity influence listeners’ movement, tasks are required that replicate challenges similar to those encountered in conversations.

In this study, we examined the head-motion and eye-gaze of young normal-hearing listeners during a speech localization and comprehension task conducted in virtual audio-visual scenes. The visual scenes consisted of a room with fifteen semi-transparent humanoid shapes arranged equidistantly between −105° and 105° azimuthal angle around the listener (Fig. 1A and B), presented using VR glasses. Participants were instructed to locate one of ten target speech streams (stories), identified by a visually presented icon (Fig. 1B and D), among other stories. The task difficulty varied across three different levels of reverberation (anechoic, mid-reverberant and high-reverberant, corresponding to different visual rooms; Fig. 1C) and the number of concurrent talkers, ranging from two to eight. We hypothesized that the difficulty of the acoustic scene would influence listeners' scene analyses, increasing cognitive load and resulting in changes in head and eye movement behavior. Features of head-motion and eye-gaze behavior were identified and analyzed alongside response time and accuracy, as reported in a previous study38. These comparisons aimed to provide insights into potential strategies employed by listeners to comprehend speech in complex listening situations.

Figure 1

Experimental paradigm in an audio-visual virtual reality. (A) Top view of the virtual rooms. Between 2 and 8 simultaneous talkers were randomly presented from the 15 possible source locations, ranging between ± 105° with a spacing of 15°. (B) The listeners were asked to locate the talker discussing a topic indicated by an icon on the room’s wall. Potential source locations were depicted as humanoid semi-transparent avatars. (C) Three different rooms with varying acoustical features but equal sizes were simulated both acoustically and visually. (D) Icons representing ten different stories, each spoken by ten talkers, were used38.

Results

Movement trajectories were segmented into four distinct phases, as illustrated in Fig. 2: a pre-stimulus phase occurring before the onset of the speech stimuli lasting three seconds in all trials; an initialization phase concluding with the first eye movement; a search phase concluding with the final head movement; and a decision phase. For a comprehensive definition of the movement phases and the extracted features, see Materials and Methods.

Figure 2

Example trajectory of one subject for one trial. The trajectory shows both the horizontal head (red) and the gaze (blue) position with the gaze and head saccades marked for both movement trajectories. The trajectory is divided into four phases: The 3 s preparation phase before the onset of the speech stimuli, the initialization phase ending with the first gaze saccade, the search phase ending with the last head saccade and the final decision phase before pressing the response button. The position of the target talker at − 90° is marked in green.

Durations of the movement phases

Figure 3 shows the durations of the three movement phases (initialization phase, search phase and decision phase, as defined in Fig. 2) in relation to the number of talkers in the scene across the three reverberation conditions. Following the 3-s pre-stimulus period, the initialization phase (Fig. 2) ended with the initiation of the first saccade. No significant effects of the number of talkers (F(6,614.28) = 1.5, p = 0.18) or reverberation (F(2,614.67) = 2.8, p = 0.06) on the duration of the initialization phase were observed (Fig. 3A). The subsequent search phase (Fig. 3B) exhibited a significant increase with the number of talkers (F(6,614.40) = 44.3, p < 0.001) and the degree of reverberation (F(2,614.82) = 48.0, p < 0.001). The interaction between the number of talkers and the reverberation condition was significant (F(12,614.54) = 4.1, p < 0.001), with the high-reverberation condition extending the search phase duration when more than four talkers were present (p < 0.05), while no significant difference was noted between the anechoic and mid-reverberation conditions (p > 0.05). The decision phase (Fig. 3C) displayed an increase with a greater number of simultaneous talkers in the scene (F(6,614.18) = 4.8, p < 0.001), but reverberation had no impact on the time between the offset of the last head movement and the final decision (F(2,614.46) = 1.2, p = 0.31). A significant interaction effect was observed between the number of talkers and the reverberation conditions (F(12,614.28) = 2.3, p < 0.01). To further understand the movement behavior during the search and decision phases, various features were extracted from individual trajectories to characterize movement behavior and its relation to scene difficulty, i.e. the number of concurrent talkers and the level of reverberation. Since the initialization phase was defined as the period before any movement onset, no movement features were derived for this phase.

Figure 3

Duration of the different movement phases. Movement trajectories were categorized into three phases: An initialization phase concluding with the first eye movement (grey), a search phase ending with the last head movement (purple), and a decision phase concluding with the pressing of the response button (green). The durations in seconds, derived from data collected from 13 subjects with three repetitions per condition, are presented in relation to the number of talkers in a scene and the room reverberation conditions (indicated by the saturation). The boxplots illustrate the median (horizontal line), the 25th and 75th percentile range (boxes), and the overall range (whiskers). Dots on the plot indicate outliers.

Search phase

Table 1 provides a summary of the outcomes from the ANOVAs conducted on the movement measures extracted during the search phase. Reverberation demonstrated a medium effect on the average head fixation duration, the number of gaze and head jumps, the number of gaze jumps per head fixation, and the number of fixated talkers. The number of simultaneous talkers in a scene had a strong effect on the average head fixation duration, the standard deviation of the head and the gaze position, the number of gaze and head jumps, the number of gaze jumps per head fixation, and the number of fixated talkers. Furthermore, the location of the target talker had a medium effect on the standard deviation of the head position and the maximum speed of head movements. With the exception of a medium interaction between reverberation and the number of talkers regarding the number of gaze jumps, only small significant interaction effects were observed.

Table 1 Main effects of the ANOVAs for the movement measures during the search phase.

Head movement measures during the search phase are presented in Fig. 4, while eye movement measures are illustrated in Fig. 5. The number of talkers influenced the likelihood of initially directing the head in the wrong direction (Fig. 4A).

Figure 4

Head movement features during search phases. The extracted head movement features, obtained from trajectories during the search phase, are presented based on data from 13 subjects and three repetitions per condition, considering the number of talkers in a scene and the room reverberation conditions (indicated by saturation). (A) The misorientation rate of the initial head orientation increases with more talkers. (B) The average fixation duration increases with more talkers and reverberation. (C) The maximum speed increases with scene complexity and is higher for targets presented more than 30° azimuth away from the center. (D) The standard deviation of the position, serving as a measure of the amount of the overall head movement, increases with scene complexity and is higher for targets presented more than 30° azimuth away from the center. Boxplots represent the median (horizontal line), the 25th and 75th percentile range (boxes), and the overall range (whiskers). Outliers are indicated by dots.

Figure 5

Eye movement features during search phases. Eye movement features, extracted from trajectories during the search phase of data from 13 subjects and three repetitions per condition, are presented in relation to the number of talkers in a scene and the room reverberation conditions (indicated by the saturation). All features displayed are influenced by both reverberation and the number of talkers in a scene. (A) The number of gaze jumps was counted for each head fixation. (B) The average number of gaze jumps per second decreased with difficulty. (C) Subjects scanned more sources, and (D) the standard deviation of the position, as a measure of the overall degree of eye movement, increased with scene difficulty. Boxplots illustrate the median (horizontal line), the 25th and 75th percentile range (boxes), and the overall range (whiskers). Outliers are indicated by dots.

For both head and eye movements, the average speed and the frequency of movements per second decreased with an increasing number of talkers (Fig. 5B), while the total number of movements increased. In more complex conditions, both head and gaze were fixated at various positions, as evidenced by an increase in the standard deviation of the azimuth position (Figs. 4D, 5D) and the number of visually fixated target talkers (Fig. 5C). During head saccades, the head moved more rapidly in scenes with more talkers and increased reverberation (Fig. 4C). Overall, the mean head fixation duration increased with rising complexity (Fig. 4B). Moreover, during these head fixations, participants exhibited more gaze jumps (Fig. 5A), reflected in the total number of gaze jumps per head fixation and relative to the duration of the head fixation.

Decision phase

Figure 6 shows head (blue) and eye (red) movement measures during the decision phase. Participants spent less time looking at the target (Fig. 6A) when more simultaneous talkers were present (F(6,625.55) = 11.3, p < 0.001) and with more reverberation (F(2,625.58) = 24.9, p < 0.001). No significant difference between the anechoic and mid-reverberation conditions was found (p > 0.05). Instead, participants fixated talkers in the proximity of their final response, with an increasing standard deviation of the gaze position for more talkers (F(6,626.61) = 3.0, p < 0.01) and more reverberation (F(2,626.69) = 4.0, p = 0.02), as shown in Fig. 6B. Figure 6C shows the head direction with respect to the target location. The amount of reverberation affected the difference between the response location and the final head position (F(2,624.19) = 8.3, p < 0.001). This difference increased for less centrally positioned targets (F(7,624.79) = 58.9, p < 0.001). The number of talkers had no effect on the error of the final head location (F(6,618.20) = 1.4, p = 0.20).

Figure 6

Eye movement features during decision phases. Movement features extracted from trajectories during the decision phase of data from 13 subjects and three repetitions per condition are shown with respect to the number of talkers in a scene (A and B), the target location (C), and the room reverberation conditions (indicated by saturation). (A) The average percentage of time spent on the target ± standard error decreased with scene difficulty. (B) The standard deviation of the horizontal gaze position showed a slight increase. Participants scanned more potential talkers in proximity to the target before making the final decision. (C) The final head orientation difference to the response location showed a small increase for reverberant scenes and strongly depended on the target position, with the head facing the target more directly for centrally presented targets. Boxplots illustrate the median (horizontal line), the 25th and 75th percentile range (boxes), and the overall range (whiskers). Dots indicate outliers.

Relation to response time

We also computed correlation coefficients between the overall response time and the various movement measures to explore their relationship. Figure 7 shows the corresponding correlation plots and values (as listed in supplementary Table S1), with the color indicating the number of talkers and the symbols indicating the level of reverberation.
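
As an illustration of this analysis, the per-feature correlations and the linear fits shown in Fig. 7 could be computed in R as in the following sketch, which assumes a trial-level data frame dat with a response_time column and one column per movement feature (all column names here are illustrative, not taken from the original analysis):

    # Sketch only; feature names are illustrative placeholders.
    features <- c("n_gaze_jumps", "sd_head_position", "prop_time_on_target")

    # Pearson correlation of each movement feature with the trial response time
    sapply(features, function(f) cor(dat[[f]], dat$response_time,
                                     use = "complete.obs"))

    # Linear fit corresponding to the blue regression lines in Fig. 7
    summary(lm(n_gaze_jumps ~ response_time, data = dat))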

Figure 7

Movement features as a function of response time. The different panels illustrate the specific movement features extracted for each individual trial against the trial's response time. Data points are color-coded based on reverberation, with saturation indicating the number of talkers. The first row shows gaze features extracted from the search phase, the second row presents head features during this phase, and the third row shows gaze features extracted from the decision phase. Blue lines represent the linear model fitted to the data, with the 95% confidence interval shown in grey.

Discussion

In this study, we examined alterations in the head and gaze movement patterns of normal-hearing listeners in response to variations in the complexity of the acoustic scene. Our aim was to enhance our understanding of the listeners' strategies for analyzing scenes and the challenges they face. Using VR environments, participants were tasked with indicating the position of a talker narrating a specific target story, while the scene complexity was manipulated by adjusting the number of simultaneous talkers and the degree of reverberation. The movement trajectory of the listeners showed three distinct phases: an initialization phase preceding any movement onset, followed by a search phase concluding with the last head movement, and finally, a decision phase prior to the participant's final response. We analyzed the behavior by extracting various features from the observed trajectories during each movement phase. Several of these extracted features indicated changes in movement behavior corresponding to alterations in scene difficulty, thereby elucidating previously observed difficulty-related changes in response time.

Although no differences were observed between the anechoic and mid-reverberant conditions across various movement measures, we found distinctions between the two low-reverberation conditions (anechoic and mid-reverberation) and the high-reverberation condition. Participants spent more time analyzing the scene, reflected in a prolonged search phase in the high-reverberant condition. During this phase, participants exhibited increased head and eye movement, particularly evident in greater positional variability, a higher frequency of head and gaze movements, and an increased number of fixated talkers. Their heads remained fixated longer at one position during the search phase under high-reverberant conditions. In the decision phase, high reverberation led to increased eye movement with less focus on the target talker, despite the overall duration of the decision phase remaining unaffected. Moreover, the head was less directly oriented towards the response location under high-reverberant conditions.

The prolonged duration of the search phase aligns with previous findings regarding the overall response time reported in Ahrens and Lund38, where the search phase was the primary driver of the response time. Head movements alter the physical signal arriving at the ears and are crucial for improving the SNR in scenarios with target and masking signals. However, Grange and Culling5 reported that a significant proportion of their participants failed to optimize head movement behavior for SNR improvement, with almost half not moving their heads until informed of the potential advantage. In our study, listeners consistently exhibited head movement behavior, possibly influenced by heightened realism induced by the virtual environment with visual cues, randomly positioned target talkers, and the complexity of the task involving multiple talkers. This task required listeners to scan the auditory scene, considering interfering talkers as temporary targets to comprehend their stories. With increasing reverberation, listeners can make less use of binaural cues, as interaural level differences39 and the coherence of the ear signals decrease40, leading to challenges in precisely locating the target position. This explains the observed increase in both head and eye movement. The increase in eye movement aligns with some of the findings by Hendrikse et al.41, who observed increased variability in gaze position for older and hearing-impaired participants listening to conversations in virtual environments. The longer head fixations in high-reverberant conditions may indicate an extended listening period before participants moved their heads to a new position.

The movement in our search task can also be considered as a series of orientation tasks, with listeners orienting to a new target position after each head fixation. Previous studies reported a delayed onset of these head orientations under conditions of hearing impairment or limited stimulus bandwidth9,11. While a less direct head orientation can improve the SNR, the task used in this study triggered visual search behavior, even though there was no benefit from the visual information. This might have led participants to make less use of the SNR benefit of indirect head orientations, similar to findings in studies investigating head orientations during conversations35,41. The SNR benefit of less direct head orientations, however, may explain the higher final head orientation offset at the end of the decision phase for high-reverberant scenes35,41. The reduced amount of time spent on the target talker in the high-reverberation condition during the decision phase indicates that subjects were less certain about the exact target position, scanning targets in the proximity before making the final decision.

The increase in search time duration with more simultaneous talkers in the scene was accompanied by an increase in head and eye movements, evidenced by more saccades, wider angles covered in their movements, and more fixated talkers. The peak velocity of head movements increased with more talkers, indicating larger movement angles that require higher velocities. However, the average speed of head and gaze movements was reduced. The reduction in the average head speed was likely caused by the observed longer head fixation durations. During these head fixations, participants performed more gaze saccades and fixated an increasing number of talkers, indicating more difficulty in identifying the target talker. Additionally, the first head movement started later, even though there was no effect of the number of talkers on the onset of the first eye movement, and the head movements were more often directed towards the wrong side when there were more talkers in the scene. This indicates that listeners start scanning the scene visually independently of the scene's difficulty while they wait before making a more costly change in head orientation. The length of the decision phase increased with more talkers in the scene, and participants fixated more targets while spending less time on the target talker. This indicates increased uncertainty about the exact location of the target talker before making the final decision and matches the general increase in head fixation durations and the number of gaze jumps during these fixations.

A higher number of talkers reduces the SNR as well as the opportunity for dip listening while increasing the amount of informational masking up to eight talkers42, making the scene analysis more difficult. This was reflected in the overall response time, response accuracy38, and several movement measures related to visual and acoustic scene scanning. While changing the head orientation affects the effective SNR at the ear, supporting the separation of target and interferers43,44, the increase in gaze movement and the clearly expressed visual search behavior of the participants are especially remarkable, since there was no actual benefit of visual information beyond the avatars marking potential source locations. The high amount of gaze movement and the exact correspondence to the response location suggest that the gaze behavior expressed here indicated the focus not only of visual but also of auditory attention. It is possible that the lack of any visual information benefit despite the presence of humanoid avatars created a rather artificial situation that nevertheless triggered natural visual scanning behavior. Another explanation is that the visual search provided an acoustic benefit by steering internal auditory receptive fields. The latter is supported by neurophysiological studies on rhesus monkeys26,45, and psychophysical experiments in humans27,28.

The later onset of the head movement, increased head fixation durations, and the longer decision phase are all in line with the previously found increases in movement latencies for orientation responses9,11. They suggest increased time spent on scanning the acoustic scene before moving the head. The high number of initial misorientations aligns with previous findings5,10, suggesting that when the scene becomes too difficult, listeners initially move their head randomly to one side, likely in order to change the SNR of the candidate talkers under consideration. This might be even more pronounced in young, normal-hearing listeners, since they might not yet have developed more advanced strategies to analyze the scene. Although this difference in strategies still requires more investigation, findings by Hendrikse et al.41 support this argument, as they demonstrated that older and hearing-impaired listeners move their heads in a way that leads to a higher SNR benefit than young normal-hearing listeners. However, it is still unclear how exactly their movement strategies differ.

Targets located more peripherally influenced the head movement behavior, resulting in increased movement, higher peak velocities, and decreased mean velocities of the movements. This aligns with previously reported increases in peak velocity and movement complexity related to angle9,11 and can be attributed to the greater movement required for peripherally located targets, leading to faster velocities. Additionally, final head orientations were less directly oriented towards the response location for more peripheral targets. The offsets observed here are similar in magnitude to the angle-dependent offsets reported by Brimijoin et al.10, and it has been previously noted that subjects perform a larger portion of the movement with their eyes for lateral targets46.

We varied scene complexity by manipulating reverberation and the number of concurrent talkers. Target and interferer locations, genders, and stories were randomly assigned, resulting in differences in difficulty even among conditions considered similar in complexity. For example, a scene with two talkers of the same sex positioned close together posed a greater challenge than one where these talkers were farther apart. While these variations in difficulty introduced greater variability in response and movement measures, making comparisons between subjects challenging, they also allowed for more generalizable findings across different target/interferer setups. However, future studies employing a similar experimental design might consider imposing stricter constraints on talker selection and distribution to enhance comparability of scene difficulty.

Using VR environments and continuous stories, we aimed to achieve a high level of realism while maintaining experimental control. Although the speech material in our study was more naturalistic than single-sentence corpora typically used in speech comprehension tasks, it still differed from real-life communication scenarios where listeners must navigate between multiple talkers, prepare their own speech turns, and account for the social context. Additionally, the talkers in our study were distributed evenly around a circle, facing the listener directly, and all speech stimuli were presented at the same sound pressure level. While these constraints diverge from typical communication settings, they were chosen to eliminate additional cues during the task. The absence of social feedback likely led listeners to optimize their movement behavior solely for target localization and speech comprehension without considering social norms. This may affect the generalizability of our findings, as everyday listening behavior often involves balancing speech comprehension with adhering to social norms. Consequently, it remains unclear to what extent the movement measures identified as sensitive to scene complexity in our study predict real-life difficulty, particularly among hearing-impaired listeners.

Wearing a VR headset likely affected the listener’s HRTFs33, potentially resulting in localization errors and altered movement behavior compared to real-world scenarios. The limited field of view of HMDs has been shown to influence head and eye movement behavior, resulting in more head movements and smaller gaze angles34. However, studies by Ahrens et al.16 and Huisman et al.47 suggested that the impact of VR headsets on sound localization accuracy is minimal when ample visual information is provided. Nonetheless, the influence of VR headsets on movement behavior and the generalizability of our findings to unrestricted movement remain uncertain, as these effects could vary significantly depending on the task, the device and the individual listener.

Visually, the humanoid phantoms used here provided no additional visual information beyond potential target locations, such as lip movements that could aid speech localization and comprehension. The effect of this visual limitation on subjects' movement behavior is unclear. Hendrikse et al.14 examined different visual conditions using avatars in VR with or without synchronized lip movement and found no significant effect on subjects' movement behavior.

This study exclusively examined young, normal-hearing listeners. The movement behaviors of older listeners and those with hearing impairment differ in several aspects29,48. Previous research indicates that they tend to rely more on head movements compared to eye movements and derive a greater SNR benefit from the head movement41. When engaged in conversations, these listeners demonstrate increased gaze movements, likely due to their reliance on visual cues to aid speech comprehension. Additionally, Brimijoin et al.9 observed differences in the onset and offset latency of movements and in the complexity of movements for orientation responses among hearing-impaired listeners compared to their normal-hearing counterparts. Therefore, investigating whether the complexity-related movement measures in this study exhibit stronger effects among hearing-impaired listeners would be interesting.

In this study, we investigated the head and eye movement patterns of young, normal-hearing listeners in response to varying levels of acoustic scene complexity. Their behavior revealed three distinct movement phases: initialization, search, and decision, each of which responded differently to changes in scene complexity. The influence of reverberation was apparent in increased exploration during the search phase and heightened visual scanning during the decision phase. In scenes with high reverberation, the final head position was less directly oriented towards the response location, indicating an acoustically more favorable orientation. A greater number of talkers led to increased scene scanning, more pronounced than changes related to reverberation, and lengthened the decision phase. Interestingly, participants displayed a notable visual search behavior, reflecting their focus of auditory attention, despite the lack of visual information benefits. Rather than optimizing head movements for improved acoustic information, the observed movement behavior appeared to be a balance between visual and acoustic scene scanning. Understanding such movement patterns is crucial for developing future hearing assistance devices with personalized settings. Additionally, movement indicators could offer insights into the acoustic scene and potential listening challenges, valuable for both hearing aid design and evaluation.

Materials and methods

Participants

Data from thirteen native Danish, normal-hearing participants (7 female and 6 male) aged between 20 and 26 years were analyzed. All participants reported normal hearing, provided written informed consent, and received compensation on an hourly basis. The study was approved by the Science-Ethics Committee for the Capital Region of Denmark (reference H-16036391) and was carried out in accordance with these guidelines. The data were collected concurrently with those reported in Ahrens and Lund38, which presented localization and response times but did not include analyses of motion features.

Speech material

Details of the speech material used in this study are reported in Lund et al.49. In summary, both target and interferer speech were taken from a database of ten Danish monologues, each featuring substantially distinct content. These monologues were spoken by ten different native Danish speakers (5 female and 5 male), resulting in a compilation of 100 different speech stimuli with an average duration of 93 s. Additionally, each of the ten narratives was associated with an icon of different color related to the content of the respective story (see Fig. 1D).

Virtual audio-visual scenes

This study utilized an anechoic chamber, a mid-reverberant room, and a high-reverberant room, all possessing identical dimensions (length × width × height: 12.0 × 9.0 × 2.8 m). However, these rooms featured different visual surface materials tailored to match their acoustic properties. The three virtual environments used in the experiment are shown in Fig. 1C. The visual scenes were created in Unity (Unity Technologies, San Francisco, California, USA) and displayed on HTC Vive Pro Eye (HTC Vive system, HTC Corporation, New Taipei City, Taiwan) VR glasses. The setup had a refresh rate of 90 Hz and a field of view of 110° azimuthal angle. In all experimental conditions, the listener, seated in the room's center, was surrounded by 15 semi-transparent humanoid shapes positioned between −105° and 105° azimuthal angle at a distance of 2.4 m with 15° separation (see Fig. 1A).

The acoustic scenes were presented with a 64-channel spherical loudspeaker array located in an anechoic chamber. The loudspeaker signals were generated using the LoRA toolbox50, employing acoustic simulations of the rooms created with Odeon (Odeon A/S, Kgs. Lyngby, Denmark). For the reproduction of the anechoic room, only the direct sound from individual loudspeakers was used. In the case of the reverberant rooms, a nearest-loudspeaker mapping was employed for the early reflections, and an ambisonics-based method reproduced the late reverberant tail50. A more detailed description of the audio-visual environments, including a room acoustical analysis, can be found in Ahrens and Lund38.

Task

Listeners were first familiarized with the speech material, having the opportunity to hear each story and each talker outside of the loudspeaker environment. Subsequently, they were moved to the audio-visual laboratory, where they were seated in the center of the loudspeaker array and introduced to the virtual environments. Their task was to swiftly localize the target talker among a varying number of interfering talkers, using a virtual laser pointer integrated into the visual scene to indicate their decision. The target story was indicated by the corresponding icon displayed in the background of the virtual room.

In each trial, talkers, stories, and positions were pseudo-randomly selected for the target and the interferers, ensuring that they were distinct within a trial. Each trial started with a 3-s pre-stimulus period before presenting the speech stimuli at a sound pressure level (SPL) of 55 dB for up to 120 s or until the listener responded. The onset of the stories was randomly assigned, and if a story concluded before the trial's end, it restarted from the beginning. Listeners could indicate the target location after the speech offset.

The scene's complexity was manipulated along two dimensions: the number of simultaneous talkers, ranging from two to eight, and the level of reverberation, categorized as anechoic, mid-reverberant and high-reverberant. This manipulation resulted in 21 different conditions, each repeated three times, yielding a total of 63 trials. Ahrens and Lund38 additionally explored audio-visually incongruent conditions, which were excluded for the purposes of this study.
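
For illustration, the resulting design can be enumerated in a few lines of R (a minimal sketch; the variable names are not from the original analysis code):

    # Cross the 7 talker counts with the 3 reverberation levels
    conditions <- expand.grid(n_talkers     = 2:8,
                              reverberation = c("anechoic", "mid", "high"))
    nrow(conditions)      # 21 conditions
    nrow(conditions) * 3  # 63 trials with three repetitions per condition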

Head- and eye-motion tracking

Listeners’ head- and eye-movements were tracked using the VR glasses. The device employed SteamVR Lighthouse for head-motion tracking. The glasses are equipped with a built-in eye tracker, enabling the recording of participants’ eye gaze and pupil dilation through the SRanipal software development kit. The eye-tracker operated at a sampling frequency of 120 Hz, maintaining an accuracy between 0.5° and 1.1° (HTC Corporation31). Head-motion and eye-movement data were collected at a 90 Hz sampling rate in Unity (Unity, San Francisco, CA, USA), similarly to Schuetz and Fiehler31.

Outcome measures

To characterize the gaze and head movement behavior, we selected a set of features previously demonstrated to be associated with changes in orientation and visual search behavior11,29. Additionally, we included eye gaze measures that serve as potential markers for cognitive load23,24.

We segmented each individual trajectory into four different phases, as illustrated in Fig. 2. (1) The preparation phase was the time period from trial onset to the onset of the speech stimuli, lasting exactly three seconds by definition (see Task). (2) This was succeeded by an initialization phase marked by the absence of significant head or eye movement, concluding with the onset of the initial saccadic eye movement. To ensure consistency, trials where the head or eye position at the beginning of the initialization phase fell outside an azimuth range of –15° to 15° were excluded (39 trials). (3) The subsequent search phase terminated with the conclusion of the final head movement. (4) The final phase was a decision phase where only the eyes were in motion, continuing until the trial concluded with the indication of a target talker. Twenty-six trials without any head movements were removed; targets in all these trials were presented in an azimuth range of –15° to 15°. Trials with the initialization, search or decision phase lasting less than 100 ms, representing the motor planning time for eye movements51, were excluded from further analysis (123 trials; similar to Rorden and Driver21). Additionally, trials with response times exceeding 120 s or response locations more than 45° (three talker positions) away from the target were not included, as these responses likely resulted from guessing. In total, we excluded 21% of the original trials. For each of the different phases, we selected a set of features expected to describe the characteristics of the movement behavior and demonstrate sensitivity to variations in the acoustic conditions.
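
The following R sketch illustrates how a single trial could be segmented into these four phases, assuming that the onset of the first gaze saccade, the offset of the last head movement, and the response time have already been determined for the trial (function and variable names are illustrative and not taken from the original analysis code):

    # Sketch only: label each sample of one trial with its movement phase.
    # `trial` is assumed to be a data frame with a `time` column in seconds;
    # the three event times are assumed to be pre-computed for this trial.
    segment_phases <- function(trial, first_gaze_saccade_on,
                               last_head_move_off, response_time,
                               stim_onset = 3) {
      breaks <- c(-Inf, stim_onset, first_gaze_saccade_on,
                  last_head_move_off, response_time)
      labels <- c("preparation", "initialization", "search", "decision")
      trial$phase <- cut(trial$time, breaks = breaks, labels = labels,
                         right = FALSE)
      trial
    }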

Head movement features

Onsets and offsets of head movements were determined using a velocity threshold of 10°/s. Movements occurring less than 100 ms apart were merged, and head movements with an amplitude of less than 2° between the onset and offset were treated as noise and therefore disregarded. This decision was based on the understanding that such movements fall within the range of variability observed in natural, unfixed head movements52.
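
A minimal R sketch of these detection rules, assuming a head azimuth trace az in degrees sampled at fs Hz (all names are illustrative, not the original implementation):

    # Sketch only: detect head movements from a head azimuth trace.
    detect_head_movements <- function(az, fs, vel_thresh = 10,
                                      merge_gap = 0.1, min_amp = 2) {
      vel <- c(0, diff(az)) * fs                     # angular velocity in deg/s
      r <- rle(abs(vel) > vel_thresh)                # runs above/below threshold
      off <- cumsum(r$lengths)
      on  <- off - r$lengths + 1
      seg <- data.frame(on = on[r$values], off = off[r$values])
      # merge movements separated by less than merge_gap seconds
      i <- 1
      while (i < nrow(seg)) {
        if ((seg$on[i + 1] - seg$off[i]) / fs < merge_gap) {
          seg$off[i] <- seg$off[i + 1]
          seg <- seg[-(i + 1), ]
        } else i <- i + 1
      }
      # discard movements with an amplitude below min_amp degrees
      amp <- abs(az[seg$off] - az[seg$on])
      seg[amp >= min_amp, , drop = FALSE]
    }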

Previous studies have reported increased delays in head movement onset9,11 with growing difficulty of the acoustic situation. However, this does not necessarily imply a delay in actual search behavior; rather, the head movement may be preceded by a more extended visual search period before individuals incur the higher energetic cost of moving their heads53. In such cases, the onset of the head movements would be delayed, while the onset of the search period remains constant. How often these initial movements are directed in the wrong direction depends on the arrangement of the target and maskers as well as on the saliency of the target5,10,11. We therefore calculated the misorientation rate for each condition, excluding trials where the target talker was located centrally at 0° azimuth.

The maximum speed of head movements has been demonstrated to be slower in more challenging acoustic environments11, and this parameter was consequently included in our feature set. Additionally, we quantified the number of head movements as a measure of the overall amount of movement. While one might anticipate an increase in head movements with longer response times, it has been argued that some individuals make minimal head movements and instead predominantly move their eyes54,55. To measure the variation of head positions, we calculated the standard deviation of the position29. Furthermore, we calculated the average head fixation duration, setting it to zero in cases where only a single head movement occurred.
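
Under the same assumptions, the head movement features described above could be derived per trial roughly as follows (a sketch with illustrative names; in particular, the misorientation flag shown here is one plausible formalization of an initial movement towards the wrong side):

    # Sketch only: per-trial head features, given the movement table `moves`
    # returned by the detection sketch above, the azimuth trace `az` for the
    # relevant phase, the sampling rate `fs`, and the target azimuth `target_az`.
    head_features <- function(az, fs, moves, target_az) {
      vel <- c(0, diff(az)) * fs
      fix_dur <- if (nrow(moves) > 1)
        mean((moves$on[-1] - moves$off[-nrow(moves)]) / fs) else 0
      data.frame(
        n_head_moves = nrow(moves),
        max_speed    = max(abs(vel)),   # peak head velocity
        sd_position  = sd(az),          # overall extent of head movement
        mean_fix_dur = fix_dur,         # zero if only one head movement occurred
        # initial misorientation: first movement directed away from the target
        misoriented  = sign(az[moves$off[1]] - az[moves$on[1]]) != sign(target_az)
      )
    }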

Eye gaze features

We categorized the gaze trajectories into saccades and fixations using an identification by velocity threshold (IVT) algorithm56. Saccade onset and offset were determined using a threshold value of 50°/s for gaze speed. Only saccades meeting specific criteria (peak velocity below 1000°/s57, a duration of at least 10 ms, and an amplitude above 2°) were included in the analysis to filter out unlikely or noisy eye movements. Intersaccadic intervals were labeled as fixations. Consecutive fixations were merged if they occurred less than 75 ms apart and the difference in visual angle was less than 2°, with the chosen threshold values designed to avoid merging fixations disrupted by blinks or larger eye movements. Following the merging of adjacent fixations, fixations lasting less than 100 ms were excluded from the analysis, considering them too short to signify any attention shift.
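
A compact R sketch of this classification, assuming a gaze azimuth trace gaze in degrees sampled at fs Hz (names, structure, and the handling of blinks are simplified and illustrative, not the original implementation):

    # Sketch only: IVT-style classification of a gaze trace into saccades
    # and fixations using the thresholds described above.
    classify_gaze <- function(gaze, fs, sac_thresh = 50, max_peak_vel = 1000,
                              min_sac_dur = 0.01, min_sac_amp = 2,
                              merge_gap = 0.075, merge_amp = 2,
                              min_fix_dur = 0.1) {
      vel <- c(0, diff(gaze)) * fs
      r <- rle(abs(vel) > sac_thresh)
      off <- cumsum(r$lengths)
      on  <- off - r$lengths + 1
      sac <- data.frame(on = on[r$values], off = off[r$values])
      # keep only plausible saccades (peak velocity, duration and amplitude)
      keep <- vapply(seq_len(nrow(sac)), function(i) {
        idx <- sac$on[i]:sac$off[i]
        max(abs(vel[idx])) < max_peak_vel &&
          length(idx) / fs >= min_sac_dur &&
          abs(gaze[sac$off[i]] - gaze[sac$on[i]]) > min_sac_amp
      }, logical(1))
      sac <- sac[keep, , drop = FALSE]
      # intersaccadic intervals are fixations
      fix <- data.frame(on = c(1, sac$off + 1), off = c(sac$on - 1, length(gaze)))
      fix <- fix[fix$off >= fix$on, , drop = FALSE]
      # merge consecutive fixations that are close in time and position
      i <- 1
      while (i < nrow(fix)) {
        if ((fix$on[i + 1] - fix$off[i]) / fs < merge_gap &&
            abs(gaze[fix$on[i + 1]] - gaze[fix$off[i]]) < merge_amp) {
          fix$off[i] <- fix$off[i + 1]
          fix <- fix[-(i + 1), ]
        } else i <- i + 1
      }
      # drop fixations shorter than 100 ms after merging
      fix <- fix[(fix$off - fix$on + 1) / fs >= min_fix_dur, , drop = FALSE]
      list(saccades = sac, fixations = fix)
    }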

From the classified gaze trajectories, we computed a set of features to characterize eye-movement behavior. To measure the extent of visual search during a head fixation, we counted the number of saccades for each individual head fixation and calculated the average over all fixations per trial. As gaze fixation durations have been shown to increase with rising task difficulty22,58, we also computed the average fixation duration. The standard deviation of the gaze position and the average gaze movement were used as measures of movement quantity, as suggested by Hendrikse and colleagues29. To determine how many potential talkers were considered, we counted the number of fixated sources. A fixated source was defined as any potential talker location that fell at least once within a 5° visual radius around a fixated position. Additionally, we calculated the proportion of time spent looking at the target talker.
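
Building on the classification sketch above, these gaze features could be computed per trial roughly as follows (illustrative names; the 5° criterion for the time spent on the target is an assumption carried over from the fixated-source definition, and counting gaze jumps per head fixation would additionally require the head fixation intervals from the head-movement sketch):

    # Sketch only: per-trial gaze features. `cls` is the output of the
    # classification sketch, `source_az` the 15 avatar azimuths, and
    # `target_az` the target azimuth.
    gaze_features <- function(gaze, fs, cls, source_az, target_az, radius = 5) {
      fix <- cls$fixations
      fix_pos <- mapply(function(a, b) mean(gaze[a:b]), fix$on, fix$off)
      data.frame(
        n_saccades     = nrow(cls$saccades),
        mean_fix_dur   = mean((fix$off - fix$on + 1) / fs),
        sd_gaze_pos    = sd(gaze),
        # a source counts as fixated if any fixation lies within `radius` of it
        n_fixated_srcs = sum(sapply(source_az,
                                    function(s) any(abs(fix_pos - s) <= radius))),
        # proportion of samples with gaze within `radius` of the target
        time_on_target = mean(abs(gaze - target_az) <= radius)
      )
    }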

Statistical analyses

The impact of scene difficulty on the various outcome measures was statistically analyzed through analyses of variance of linear mixed-effects models. The fundamental model employed was:

$$y \sim \#\text{talker} * \text{Reverberation} * \text{target position} + (1 \mid \text{Participants})$$

Here, y represents one of the outcome measures detailed in Outcome measures. Unless specified otherwise, the backward-reduced model was used. In instances where effects did not persist in the final model, statistics for those effects are provided as they were in the model before removal. The reported effect sizes are partial η2, where 0.01 is considered a small effect, 0.06 a medium effect, and ≥ 0.14 a large effect. Main effects were computed in R59 using the lmerTest package60; partial η2 values were calculated with the effectsize package61, and post hoc analyses were performed with the emmeans package62.
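
A minimal sketch of this analysis in R, assuming a long-format data frame dat with one row per trial, factor-coded predictors, and illustrative column names:

    library(lmerTest)    # linear mixed models with Satterthwaite F-tests
    library(effectsize)  # partial eta-squared
    library(emmeans)     # post hoc contrasts

    # One outcome measure at a time, e.g. the duration of the search phase
    m <- lmer(search_duration ~ n_talkers * reverberation * target_position +
                (1 | participant), data = dat)
    anova(m)                          # main and interaction effects
    eta_squared(m, partial = TRUE)    # effect sizes
    emmeans(m, pairwise ~ reverberation | n_talkers)  # post hoc comparisons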