Introduction

Parkinson’s disease (PD) is the second most common neurodegenerative disorder1, and its prevalence is expected to increase with an aging population. It is multisymptomatic, with a number of motor and non-motor impairments2,3. Its diagnosis is based on clinical assessment: the presence of two or more of the motor symptoms of tremor, rigidity, bradykinesia, or postural impairment, or of non-motor symptoms such as dysarthria, functional impairment, or cognitive impairment, is indicative of the disease4.

One of the early symptoms of PD is speech impairment, termed Parkinsonian hypokinetic dysarthria. Speech symptoms are reported by 90% of people with PD5,6. The evaluation of Parkinsonian speech reveals a variety of disturbances such as reduced voice intensity, increased voice nasality, increased acoustic noise, reduced speech prosody, imprecise articulation, significantly narrower pitch range, mono loudness, longer pauses, vocal tremor, harsh and breathy voice quality, and disfluency7,8. Many of these assessments are based on speech and are therefore limited by factors such as language skills or poor visual and auditory function. Voice-based assessments have the advantage of being more universal9,10.

Hypokinetic dysarthria is caused by poor activation and coordination of the speech production muscles8,11. Stiffness and tremor of the laryngeal muscles stiffen the vocal cords, which affects their vibration and causes changes to the fundamental frequency, inadequate closed phases, and irregular or asymmetrical vocal motion during phonation8,12. The reduced controllability of the diaphragm muscles causes unstable phonatory airflow and pneumatic pressure to the larynx8,13,14. People with PD also have reduced control of other vocal tract muscles such as the tongue and lips.

The standard clinical method for classifying parkinsonian voice is perceptual evaluation, which is, however, subjective15. Computerized voice analysis has been proposed as a more accurate, objective, and quantifiable alternative, with potential for telehealth and remote monitoring of patients.

Studies of effective Parkinsonian speech and voice biomarkers are clustered into four aspects: phonatory, articulatory, prosodic, and linguistic16. Studies based on the articulatory, prosodic, and linguistic aspects17 involve broad factors such as the psychological, linguistic, and cognitive condition of patients. The phonatory aspects of a sustained phoneme, on the other hand, are less influenced by these factors.

Studies have investigated the effectiveness of sustained phoneme parameters in representing the phenomenon of Parkinsonian hypokinetic dysarthria16,18,19,20,21. Most of these studies focused on parameters that are closely related to impairments in vocal cord vibration. The pitch frequency variation, number of pulses, jitter (perturbation of the glottal vibration period), shimmer (amplitude perturbation of glottal vibration), autocorrelation, and harmonics to noise ratio (HNR/NHR) were used in the authors’ previous work22, as well as in the work of Orozco-Arroyave23, Behroozi et al.24, Tsanas and Little25, Ali et al.26, Sakar et al.19, and Rusz et al.6.

Machine-based analysis can be correlated with perceptual features such as voice quality, loudness, pitch, and resonance. Some of the characteristics that have been assessed and found suitable for Parkinsonian voice are vocal intensity, jitter (frequency variability), shimmer (amplitude variability), harmonics to noise ratio (HNR), fundamental frequency (F0), and formant frequency profiles19,23,25,26,27,28,29.

Speech production features extracted from the glottal waveform remove the effect of articulation on the acoustic signal. They approximate the volume velocity of the air flowing through the vocal folds and may have an advantage for the analysis of the pathological voice.

Physiologically, these glottic source features are associated with (1) the frequency, amplitude, symmetry, and periodicity of vocal fold vibration; (2) the competency of glottic closure; and (3) the speed of the vibratory cycle and the ratio of its open to closed phases. Breathiness, the hallmark perceptual voice quality of parkinsonian speech, is associated with incomplete closure of the vocal folds leading to air escape, and thus with relatively higher noise in the voice, lowered intensity, and a predominance of the open phase of the glottic pulse8,30. People with PD have higher jitter and lower HNR, associated with aperiodicity of vocal fold vibration and perceived as roughness. Connected speech of people with PD is monotonous and has reduced pitch and loudness variation.

Perez31 combined the above parameters with thirteen Mel Frequency Cepstral Coefficients (MFCCs) that represent the energy and articulatory positions. Fractal dimension (FD) features that measure the complexity of the signal were used by Viswanathan et al.32. More recently, multivariate deep features have been found to be effective33.

Even though the above studies have demonstrated some significant differences between the voice parameters of controls and people with PD, their implementation in a generalized automatic system is not straightforward34. There is also evidence of inconsistent results between different studies32.

Gillivan-Murphy35 published preliminary findings based on nasolaryngoscopy showing that PD voice tremor is not associated with the vocal folds. PD voice tremor is likely to be related to oscillatory movement in structures across the vocal tract rather than just the vocal folds. Furthermore, pronouncing a phoneme is a voluntary activity, while PD tremors exist during rest. This may result in an inconsistent appearance of voice tremor in sustained, steady phoneme recordings, on which the glottal vibration parameters depend.

The parameters other than the glottal vibration parameters that may potentially be used in PD identification are the parameters related to phonatory airflow and pneumatic pressure to the larynx such as voice intensity and the parameters related to vocal tract muscles such as formants and Vocal Tract Length (VTL)36,37.

This study has investigated and compared the effectiveness of three groups of parameters in differentiating the voice of people with PD from that of age-matched healthy participants. These groups relate to three domains of speech production control: (i) the stability of lung control, (ii) the periodicity and stability of glottal vibration control, and (iii) the stability of vocal tract control. The standard deviation (SD) and range of phoneme intensity were used to measure lung stability, while the shimmer, jitter, SD of pitch, and harmonics parameters were used for the stability of glottal vibration. Vocal tract stability was represented by the SD of the first four formants and the apparent Vocal Tract Length (VTL).

The comparison was examined using a statistical hypothesis test, followed by classification using a Support Vector Machine (SVM). The parameters were extracted from recordings of the sustained phonemes /a/, /e/, /i/, /o/, and /u/. The public PC-GITA database was used for this study. To evaluate the consistency of the method across datasets, the SVM classifications were also applied to Viswanathan’s dataset38, which contains recordings of /a/, /o/, and /m/.

Methods

Database of recordings

Two databases of recordings were used in this study. The first is the publicly available database, PC-GITA, provided by Rafael Orozco et al.23. It contains the recordings of 100 native speakers of Colombian Spanish: 50 were diagnosed with PD, and the other 50 were age- and gender-matched participants with no PD or any other neurological disease symptoms. Table 1 presents the participants’ demographic and clinical information. The p-values in the table confirm that there was no significant age difference between the groups and show that the clinical stage was matched between the male and female groups of PD subjects. The speech recording of the PD subjects was conducted within 3 h after their morning medication, so the subjects were in the pharmacological ON-state. The procedure complied with the Helsinki Declaration and was approved by the Ethics Committee of the Clinica Noel, in Medellin, Colombia.

Table 1 Participants’ demographics of PC-GITA database.

The recordings were captured in noise-controlled conditions and sampled at 44,100 Hz with 16-bit resolution, using a dynamic omnidirectional microphone (Shure, SM 63L). In this study, we used the recordings of the five vowels /a/, /e/, /i/, /o/, and /u/. The participants produced three repetitions of each sustained vowel, each held as long as possible in one breath, at their natural pitch and loudness. Figure 1 illustrates the waveforms of the five vowels recorded from control and PD subjects.

Figure 1 The waveforms of the five vowels recorded from the control subjects and the PD subjects.

The second is Viswanathan’s dataset32, which is publicly available on request. It contains recordings from 24 people with PD and 22 age-matched people with no neurological disease, referred to as Controls. The people with PD were recruited from the Movement Disorders Clinic at Monash Medical Centre, Australia. All people with PD had been diagnosed within the last ten years. Three sustained phonemes, /a/, /o/, and /m/, were recorded from each participant in a noise-restricted environment using a Samson-SE50 microphone. The recordings were stored in single-channel WAV format with a sampling rate of 48 kHz and 16-bit resolution. The sustained phonemes of people with PD in the database were recorded in both the on-state and off-state of medication; for this study, only the on-state recordings were used. Table 2 provides the demographics of the subjects. Detailed information can be found in22,32.

Table 2 Participants’ demographics of Viswanathan’s database.

Parameter extraction

A publicly available speech analysis software, Praat39, was used to extract speech features from the recordings. Before feature extraction, the recordings were trimmed to a uniform duration of 0.5 s, based on the assumption that vowels correspond to largely stationary signals. The recordings were filtered with a 4th-order IIR Butterworth band-pass filter with a passband of 50 Hz to 4 kHz.

Voice intensity parameters

The voice intensity is controlled by the subglottal pressure, which is in turn controlled by the respiratory muscles and the lung volume40. It is thus hypothesized that people with PD will have increased variation and reduced range of voice intensity. The standard deviation and range of intensity are proportional to the fluctuation of lung pressure during the pronunciation of the sustained phoneme, and may therefore capture the tremor or rigidity due to Parkinson's disease.

The standard deviation and range of voice intensity were obtained for each recording. These parameters measure the ability of the subject to keep the air pressure produced by the lungs stable. The intensity, I (in dB), of an input voice s(t) with a duration of T was calculated using Praat’s function with the energy averaging method, as in Eq. (1).

$$I=10 {\log}_{10}\left(\frac{1}{T}{\int }_{0}^{T}{10}^{\frac{s(t)}{10}}dt\right)$$
(1)
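
As an illustration only, the energy averaging in Eq. (1) can be sketched in discrete form. The sketch assumes that s(t) is available as a list of intensity samples in dB taken at equal time steps, so that the integral reduces to a plain average. The function names and the example contour are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not state which form was used.

```python
import math
import statistics


def mean_intensity_db(levels_db):
    """Energy-averaged mean of an intensity contour in dB (discrete form of Eq. 1)."""
    # Convert each dB value to a linear quantity, average, then convert back to dB.
    mean_energy = sum(10 ** (x / 10) for x in levels_db) / len(levels_db)
    return 10 * math.log10(mean_energy)


def intensity_sd_and_range(levels_db):
    """Sample standard deviation and range (max - min) of the contour, in dB."""
    return statistics.stdev(levels_db), max(levels_db) - min(levels_db)


# Hypothetical intensity contour (dB) of one sustained vowel; not real data.
contour = [68.0, 70.0, 72.0, 70.0]
print(mean_intensity_db(contour))
print(intensity_sd_and_range(contour))
```

Because the averaging is done on the linear scale, louder frames weigh more, so the energy-averaged value is never below the arithmetic mean of the dB values.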

Periodicity and stability of glottal vibration

It is commonly assumed that Parkinsonian dysarthria is affected by abnormal vibration of the vocal cords, such as inadequate or excessive closing of the vocal cords and irregular or asymmetrical vocal fold motion, as well as tremor in their muscles8,34,35. A total of seven parameters related to the periodicity and stability of glottal vibration were extracted from each recording: the absolute jitter (abs), the relative jitter (rel), the absolute shimmer (in dB), the relative shimmer, the standard deviation of the pitch frequency (f0), the HNR, and the NHR.

The jitter parameters41 relate to the time perturbation of the glottal pulse periods, Ti. The two jitter parameters41 were calculated with Eqs. (2) and (3):

$$Jitter\left(abs\right)=\frac{1}{N-1}\sum_{i=1}^{N-1}\left|{T}_{i+1}-{T}_{i}\right|$$
(2)
$$Jitter\left(rel\right)=\frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left|{T}_{i+1}-{T}_{i}\right|}{\frac{1}{N}\sum_{i=1}^{N}{T}_{i}}$$
(3)

The shimmer parameters41 were related to amplitude perturbation of the glottal cycles. The parameters were calculated with Eqs. (4) and (5):

$$Shimmer\left(abs,dB\right)=\frac{1}{N-1}\sum_{i=1}^{N-1}\left|20*\mathrm{log}\left(\frac{{A}_{i+1}}{{A}_{i}}\right)\right|$$
(4)
$$Shimmer\left(rel\right)=\frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left|{A}_{i+1}-{A}_{i}\right|}{\frac{1}{N}\sum_{i=1}^{N}{A}_{i}}$$
(5)
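
A minimal sketch of Eqs. (2) to (5) is given below, assuming the glottal periods Ti (in seconds) and the per-cycle peak amplitudes Ai have already been extracted (for example by Praat) and are passed in as plain lists. The function names and example values are hypothetical, and the logarithm in Eq. (4) is assumed to be base 10.

```python
import math


def mean_abs_successive_diff(values):
    """Mean absolute difference between consecutive values (numerator of Eqs. 2-5)."""
    return sum(abs(b - a) for a, b in zip(values, values[1:])) / (len(values) - 1)


def jitter_abs(periods):
    """Eq. (2): mean absolute difference of consecutive glottal periods."""
    return mean_abs_successive_diff(periods)


def jitter_rel(periods):
    """Eq. (3): absolute jitter divided by the mean period."""
    return jitter_abs(periods) / (sum(periods) / len(periods))


def shimmer_abs_db(amplitudes):
    """Eq. (4): mean absolute log ratio of consecutive amplitudes, in dB."""
    ratios = [abs(20 * math.log10(b / a)) for a, b in zip(amplitudes, amplitudes[1:])]
    return sum(ratios) / len(ratios)


def shimmer_rel(amplitudes):
    """Eq. (5): mean absolute amplitude difference divided by the mean amplitude."""
    return mean_abs_successive_diff(amplitudes) / (sum(amplitudes) / len(amplitudes))


# Hypothetical periods (s) and peak amplitudes of four glottal cycles; not real data.
periods = [0.0100, 0.0102, 0.0099, 0.0101]
amplitudes = [1.0, 0.9, 1.0, 1.1]
print(jitter_abs(periods), jitter_rel(periods))
print(shimmer_abs_db(amplitudes), shimmer_rel(amplitudes))
```

A perfectly periodic signal with constant amplitude gives zero for all four measures, so larger values indicate less stable glottal vibration.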

The standard deviation of the pitch was calculated based on the instantaneous pitch frequency f0,i = 1/Ti. The HNR and NHR were calculated based on the normalized autocorrelation function of the segment, Rxx. Rxx[T0] is the peak next to the center of Rxx, at a distance corresponding to the T0 of the recording. The HNR and NHR were calculated as described in Eqs. (6) and (7)42,43:

$$HNR=10*log\frac{{R}_{xx}[{T}_{0}]}{1-{R}_{xx}[{T}_{0}]}$$
(6)
$$NHR=1-{R}_{xx}[{T}_{0}]$$
(7)
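
The pitch variability and Eqs. (6) and (7) can be sketched as follows, assuming the glottal periods Ti and the normalized autocorrelation peak Rxx[T0] (a value strictly between 0 and 1) are already available. The function names are hypothetical, the logarithm is assumed to be base 10, and the sample (n − 1) standard deviation is an assumption.

```python
import math
import statistics


def pitch_sd(periods):
    """Standard deviation of the instantaneous pitch f0,i = 1 / Ti, in Hz."""
    f0 = [1.0 / t for t in periods]
    return statistics.stdev(f0)


def hnr_db(r_peak):
    """Eq. (6): harmonics-to-noise ratio in dB from the autocorrelation peak."""
    return 10 * math.log10(r_peak / (1 - r_peak))


def nhr(r_peak):
    """Eq. (7): noise fraction implied by the autocorrelation peak."""
    return 1 - r_peak


# Hypothetical values; not taken from either database.
print(pitch_sd([0.0100, 0.0102, 0.0099, 0.0101]))
print(hnr_db(0.99), nhr(0.99))
```

A peak of 0.5 corresponds to equal harmonic and noise energy (0 dB), and the HNR grows without bound as the peak approaches 1.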

Formant parameters

The limited control that people with PD have over the speech production process leads to disturbances, including changes in phonatory and resonant characteristics34. The disturbances in the resonant characteristics are due to inaccurate positioning of the articulators or a lack of control of the vocal tract muscles. The accuracy of positioning and control of the vocal tract muscles can be observed in the fluctuation of the formant frequencies. The stability of vocal tract control in this study was measured with the standard deviation of the first four formants (F1, F2, F3, and F4) and the Vocal Tract Length (VTL). The formants of each recording were extracted in Praat using Burg’s method44 with a maximum formant value of 5.5 kHz, a window length of 25 ms, a time step of 6.25 ms, and pre-emphasis from 50 Hz. The mean and standard deviation were then calculated for each recording.

Vocal tract length

The other parameter that captures the resonant characteristics of the vocal tube model of voice production is the apparent vocal tract length (VTL). VTL is an estimate, based on the formant frequencies, of the physical vocal tract length of a subject while pronouncing a specific sound. VTL has been used in other voice analyses such as speaker verification45 and the identification of body measures36,46.

The VTL of each recording was calculated (in cm) from the mean of each of the four formants, Fi, with the formula in Pisanski et al.36:

$$VTL({F}_{i})=(2i-1)\frac{c}{4 {F}_{i}}$$
(8)

The constant, c = 33,500 cm/s, is the speed of sound in a uniform tube with one end closed. A total of four VTL values were calculated for each recording, one for each formant, Fi.
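
Eq. (8) can be illustrated with a short sketch, assuming the mean formant frequencies of a recording are given in Hz in the order F1 to F4. The function names and the example formant values are hypothetical.

```python
SPEED_OF_SOUND_CM_S = 33500.0  # c in Eq. (8), as stated in the text


def apparent_vtl(formant_hz, index):
    """Eq. (8): apparent vocal tract length (cm) from the index-th formant."""
    return (2 * index - 1) * SPEED_OF_SOUND_CM_S / (4 * formant_hz)


def vtl_per_formant(mean_formants_hz):
    """One VTL estimate per formant, for formants listed as F1, F2, F3, F4."""
    return [apparent_vtl(f, i) for i, f in enumerate(mean_formants_hz, start=1)]


# Hypothetical mean formants (Hz) of an ideal uniform tube; not real data.
print(vtl_per_formant([500.0, 1500.0, 2500.0, 3500.0]))
```

For an ideal uniform tube closed at one end, whose resonances sit at odd multiples of the first, the four estimates coincide. In real recordings they differ, which is why each VTL(Fi) is kept as a separate parameter.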

Statistical analysis

The mean and standard deviation of all the parameters were computed for the two groups of the PC-GITA database: PD and CO. The normality of the extracted parameters was examined with the Anderson–Darling test47. The Mann Whitney U-test48 was used to compare the speech parameters between PD and control subjects. A 95% confidence level was used for the analysis, with a p-value < 0.05 taken to indicate that the groups were significantly different. All the statistical analyses were performed using MATLAB R2018b (MathWorks).
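
For illustration, the group comparison can be sketched with a plain-Python Mann Whitney U-test. This is not the MATLAB routine used in the study: it is a simplified version that assumes the large-sample normal approximation for a two-sided p-value, without continuity or tie corrections. The function names and sample values are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U, two-sided p) using the normal approximation (assumption)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - n1 * n2 / 2) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u, p


# Hypothetical parameter values for a control group and a PD group; not real data.
control = [0.21, 0.25, 0.19, 0.23, 0.22]
pd_group = [0.31, 0.28, 0.35, 0.30, 0.27]
print(mann_whitney_u(control, pd_group))
```

With samples as small as these the normal approximation is rough, so a statistics package that offers an exact test is preferable in practice.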

Support vector machine classification

The effectiveness of the parameters in classifying PD and control subjects was investigated with a Support Vector Machine (SVM)49 classifier. The SVM was trained with a Gaussian kernel and validated using “leave-one-out” cross-validation. The Gaussian kernel was selected empirically, since it yielded the best result compared to the other kernels. The inputs to the SVM were the sets of voice parameters and the ten highest-ranked features, selected using the Relief-F algorithm50 with 10 nearest neighbors (k = 10). The classification accuracy, sensitivity, and selectivity (i.e., specificity) were evaluated based on the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
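
The validation scheme and the three performance measures can be sketched as below. To keep the example self-contained, a nearest-centroid rule stands in for the Gaussian-kernel SVM; it is not the classifier used in the study. Selectivity is assumed to mean TN / (TN + FP), and all names and feature values are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, and selectivity from TP, TN, FP, and FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "selectivity": tn / (tn + fp),
    }


def leave_one_out_predictions(features, labels, fit_predict):
    """Predict each sample with a model trained on all the other samples."""
    predictions = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(fit_predict(train_x, train_y, features[i]))
    return predictions


def nearest_centroid(train_x, train_y, sample):
    """Stand-in classifier (NOT an SVM): label of the closest class centroid."""
    best_label, best_dist = None, float("inf")
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical one-feature dataset (e.g. a VTL value in cm); 1 = PD, 0 = control.
features = [[16.0], [16.2], [16.1], [17.5], [17.8], [17.6]]
labels = [0, 0, 0, 1, 1, 1]
predicted = leave_one_out_predictions(features, labels, nearest_centroid)
print(classification_metrics(labels, predicted))
```

Any classifier with the same `fit_predict(train_x, train_y, sample)` shape, including a wrapper around an SVM library, could be passed to `leave_one_out_predictions` in place of the stand-in.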

Ethics

This paper reports the analysis of two datasets: Viswanathan and PC-GITA. The Viswanathan dataset was developed using a research protocol that was approved by the RMIT University Committee for Ethics in Human Research, and the experiments were performed in accordance with the Helsinki Declaration for ethical experiments, revised 2013. The PC-GITA dataset was developed based on a procedure that complied with the Helsinki Declaration and was approved by the Ethics Committee of the Clinica Noel, in Medellin, Colombia. Both databases confirm that all participants provided written consent for the experiments.

Results

Statistical analysis

The Anderson–Darling test confirmed that, except for some VTL parameters, the parameters were not normally distributed. The Mann Whitney U-test, a non-parametric test, was thus used to test for group differences in each of the features. Table 3 provides the statistical distribution (mean ± SD) of all the features, together with the p-value and effect size of the Mann Whitney U-test between CO and PD. The table shows that the parameters of people with PD fluctuated more than those of CO. The voice intensity of people with PD has both higher SD and higher range, which indicates their diminished ability to produce sustained phonemes with stable air pressure. The p < 0.05 shows that the group difference was significant.

Table 3 Statistical distribution and the result of Mann Whitney U-test.

The statistical distribution of the glottal vibration parameters, i.e., jitter, shimmer, SD of pitch, was significantly higher for people with PD compared to the CO, with p-value < 0.05. The HNR and NHR distribution show that PD voice had higher noise (non-periodic) components compared to healthy people.

For vocal tract parameters, except for phoneme /o/ and /u/, the first three formants (F1, F2, and F3) of PD patients have a significantly higher standard deviation compared to the normal subjects. The majority of VTL parameters did not show significant differences between PD and normal subjects. The p-value and effect size confirm that statistically, the mean of the groups was not significantly different.

SVM classification

The SVM classification results for the recordings from the PC-GITA database, for the four groups of input parameters, are shown in Table 4. It presents the accuracy, sensitivity, and selectivity when considering each vowel independently and with the combination of the five vowels. For simplicity of presentation, and without affecting the outcome of this work, the table presents only the results of the vowel combination with significant accuracy. A classification accuracy of 84.3% was obtained with the combination of all the vowels when the SVM inputs were VTL(Fi); the overall observation is that VTL is the most effective feature for distinguishing between the voices of PD and CO. The SVM classification accuracy was 71.2% when the input was the ten highest-ranked features selected by the Relief-F algorithm. These ten features were dominated by the VTL (VTL(F4) of /o/; VTL(F1) of /i/; VTL(F2) of /o/; VTL(F3) of /u/; std(F1) of /o/; std(F2) of /o/; VTL(F1) of /e/; VTL(F1) of /a/; VTL(F2) of /i/; VTL(F2) of /u/). Comparing the vowels, the VTL of /i/ was the most effective parameter, with an accuracy of 73.0%. The sensitivity and selectivity were at about the same level as the accuracy for almost all the input configurations.

Table 4 The SVM classification results of PC-GITA database.

To evaluate the consistency of SVM classification using VTL(Fi) in different databases, the SVM classifications using VTL(Fi) were also applied to Viswanathan’s dataset38 which contains the recordings of /a/, /o/, and /m/. Table 5 provides the classification results of the recordings in the database. The table shows that the SVM classification using VTL(Fi) as input parameters performs consistently with different databases. The highest accuracy was 96.0% with the combination of VTL(Fi) of /a/ and /m/, while an accuracy of 94.0% was obtained with the combination of /a/, /o/, and /m/.

Table 5 The SVM classification results of Viswanathan’s database.

Discussion

Several earlier studies have proposed the use of voice-based diagnosis and assessment of Parkinson's disease16,18,19,20,21,22. These studies used vocal cord vibration parameters such as pitch frequency variation, number of pulses, jitter, shimmer, autocorrelation, and harmonics to noise ratio (HNR/NHR). While these studies showed the potential of voice-based biomarkers for Parkinson’s disease, they show inconsistent results across databases6,23. As an example, analysis based on vocal cord vibration parameters gave a classification accuracy of 78.1% on Viswanathan’s dataset22 but performed more poorly on the PC-GITA dataset, as shown in Table 4 (70.9% accuracy).

This study has identified VTL as a potential parameter for the classification of PD patients based on sustained phoneme recordings. The VTL parameters achieved 84.3% accuracy, 84.0% sensitivity, and 84.7% specificity on the PC-GITA database with the five vowels /a/, /e/, /i/, /o/, and /u/. This study also showed the consistency of the parameters when applied to different datasets. Table 5 shows that, when applied to Viswanathan's dataset, VTL parameters could distinguish PD patients from healthy subjects with an accuracy of 96.0%.

This study has shown that, among the features reported in the literature, VTL features are the most suitable for differentiating the voice of people with PD from that of Controls. VTL is an approximate measure of the physical vocal tract length while producing voice. The shape and length of the vocal tract affect the values and spacing of the formants. Longer vocal tracts produce lower, more closely spaced formants36. Although the length of the vocal tract mainly depends on the physical body structure, the study of Pisanski et al.37 found that a person may voluntarily modify the length of the vocal tract by up to 25%. The result reported in this paper indicates a possible relation between the modification of vocal tract length by a subject and the symptoms of PD. When a PD patient modifies the length of the vocal tract, owing to the reduced ability to control the speech muscles, the properties of voice modulation in the vocal tract change. This relation is of higher order, so linear separation by the statistical test could not properly separate people with PD from healthy subjects.

The novelty of this study is the high performance in differentiating the voices of people with PD from those of Controls, which is consistent across two different databases. This is the first study to investigate the use of VTL to identify the voices of people with PD, and it found that VTL parameters outperformed the features reported in the literature that are related to perturbation of glottal vibration, such as jitter, shimmer, pitch frequency, and harmonics ratio. The finding supports the argument in35 that the neurophysiological change in PD patients is manifested more in the change of vocal tract control than in glottal vibration or air pressure control by the lungs. This opens the potential for computerized and remote monitoring of people with PD.

The limitation of this study is that we have investigated only two databases: native speakers of Colombian Spanish and Australian native speakers. Further studies need to be conducted on people from other demographics and ethnicities to validate the findings for global use. While the size of the datasets is sufficient, larger datasets are required to allow the examination of the various confounding factors. There is also a need to investigate the effect of PD medication such as Levodopa on these parameters and to test this over repeated voice recordings.

Conclusion

This study has investigated the effectiveness of three sets of voice features of sustained phonemes in differentiating people with PD from age-matched healthy participants, using two independent and different publicly available databases. It has found that the most effective feature set was the apparent vocal tract length (VTL). The classification accuracy in identifying PD from control was 84.3% when combining the VTL features of all five vowels /a/, /e/, /i/, /o/, and /u/. With /a/, /o/, and /m/ from the Viswanathan dataset, obtained using a smartphone, the classification accuracy was 96%. This performance was significantly higher than the accuracy obtained when using the glottal vibration parameters (jitter, shimmer, pitch, and harmonics) and voice intensity. Another advantage of VTL parameters is that they were obtained automatically and are thus suitable for computerized analysis of voice recordings made using smartphones. Unlike deep-learning approaches, this method has the benefit of identifying the specific voice parameters, which allows the clinician to understand the differences. This has the potential for telephone-based diagnosis of PD.