Introduction

Precise and sensitive measures of the motor severity of Parkinson’s disease (PD) remain elusive but are critical if new therapies are to be evaluated fairly and quickly. The internationally validated, familiar and easily interpretable Movement Disorder Society-Unified Parkinson’s Disease Rating Scale Part III (MDS-UPDRS III)1 continues to be the favoured primary endpoint of major trials in PD2. However, it is time-consuming for clinicians, and its poor calibration and sensitivity3 may have played a part in the gross failure of novel therapies to translate into clinical practice4,5.

Mobile devices, such as wearables and smartphones, enable low-cost, objective, repeated monitoring of motor severity that is likely to improve the sensitivity of clinical trials in which such measures are used as endpoints. Digital assessments show promise6,7,8 but have yet to be widely adopted, partly due to the lack of transparency and harmonisation of the analysis methods used and the lack of subsequent high-quality subject-level evidence comparing them to current reference-standard measures9.

Feasibility studies have reported that digital assessments correlate with total MDS-UPDRS III8 and subcomponents of it6,10 but the equivalence of digital and clinical measures at the individual level remains unclear. Cross-validated subject-level prediction of disease category and total MDS-UPDRS III has been demonstrated in ten patients7, but when scaled to larger numbers, a model discriminating disease category—the simpler task—was only accurate at the cost of unstable features11.

This failure to generalise may have three fundamental causes. First, digital models trained with single human scores incorporate subjective bias within them, rather than removing it. Second, approaches that consist of only 5–7 digital subtests may be too blunt to capture the individual heterogeneity12 evident within the 33-item clinical MDS-UPDRS III. Third, previous studies are at high risk of providing over-optimistic results due to feature selection bias if a large number of post hoc candidate digital features or machine-learning algorithms are tested within a study of limited size.

We designed the CloudUPDRS Smartphone Software in Parkinson’s (CUSSP) study to address these concerns. We assessed the degree to which subject-level smartphone-based measures predicted subject-level MDS-UPDRS III subitems. A randomised crossover design and blinded assessment by three clinical raters mitigated concerns about subjective bias. We used a larger 16-item smartphone-based assessment to increase the capacity to capture individual heterogeneity, and methods were pre-published6 or pre-registered to reduce post hoc feature selection bias.

Results

Cohort details

Overall, 62 participants were recruited, with a minority (5) from the second site (Homerton University Hospital), which started recruitment later. Network faults during data capture, storage errors affecting video-recordings, or incorrect task performance (detected on review of the video-recordings) resulted in loss of data for two subjects, 18/1008 remaining smartphone tests and 6/882 remaining clinical MDS-UPDRS III items. Only three participants opted for an OFF/ON recording; the treatment status of the remaining participants was considered intermediate. Consequently, the final analysis included 60 patients, 63 sessions, 990 smartphone subtests, 876 MDS-UPDRS III item videos and 2628 human score ratings. Twenty participants were female and eight were left-handed. Demographic details are shown in Table 1 and confirm that patients in this cohort were wide-ranging in age, had prominent motor and non-motor symptoms, required moderate amounts of medication and tended to remain functionally independent without cognitive impairment (as selected by the inclusion criteria). The total MDS-UPDRS III scores were skewed towards the mild/moderate end, as expected given the inclusion criteria. The subitems showing the greatest variation were left finger tapping and left pronation/supination, whilst left and right leg tremor showed very little variation (see Fig. 1). There was significant inter-rater agreement between clinical raters in all subitems at the population level, with rates of agreement mostly ranging from moderate to substantial (apart from left pronation/supination, where agreement was only fair; see Table 2). Subject-level inter-rater agreement analysis showed that all three blinded clinical raters rarely disagreed completely (<5%; see Table 2).

Table 1 Demographic details of the cohort.
Fig. 1: Clinical MDS-UPDRS III score subitem distributions.

Distributions show a skew towards lower values, consistent with other studies of early and moderate disease. Some clinical features (notably Left and Right Leg Tremor) show little variation within this sample of 60 individuals with Parkinson’s.

Table 2 Population-level and subject-level inter-rater agreement for clinical MDS-UPDRS III subscores.

Overall predictive accuracy of smartphone assessments

The primary outcome was the overall leave-one-subject-out cross-validation (LOSO-CV) classification accuracy of the smartphone-based prediction of the MDS-UPDRS III subscores. A fully pre-specified analysis classified 70.3% (SEM 5.9%) of subjects into the same category as at least one clinical rater (see Fig. 2 and Supplementary Table 4). This was above a random baseline (36.7%; SEM 4.3%) and below the performance achieved with optimised classifier and feature selection (78.7%, SEM 5.1%). Classifiers generally predicted more than one category, with notable exceptions where there was an extreme degree of class imbalance within the clinical scores (left or right leg tremor, left hand kinetic tremor; see Supplementary Table 4). Smartphone scores were poorer at predicting the median MDS-UPDRS III subscores overall (57.0%, SEM 8.0%), but this improved with optimised classifier and feature selection (65.2%, SEM 7.5%; see Supplementary Table 5).

Fig. 2: Primary outcome (any-rater criterion).

The mean leave-one-subject-out cross-validation (LOSO-CV) classification accuracy of the smartphone-based prediction of the blinded MDS-UPDRS III. The accuracy of a number of approaches is compared to a random baseline (similar to rolling a die, with subjects randomly assigned to a clinical category). The fully pre-specified analysis (blue) relied on pre-published features and a standard multinomial regression model. The Best Classifier approach selected the best classifier from a range based on best performance but used only the pre-specified features. The Best Feature approach selected the best feature from a range but used only the pre-specified classifier. The Best Classifier and Feature approach selected the best combination of both. Approaches are graded according to the risk of selection bias: the pre-specified analysis has a very low risk, the Best Classifier or Best Feature analyses have low risk, whilst the combination approach has a moderate risk of over-optimistic accuracy. Error bars represent SEM.

Item-specific predictive accuracy of smartphone assessments

Item-specific LOSO-CV accuracy for each of the 16 smartphone tests is shown in Fig. 3 and Supplementary Table 4. Notably, although classifier performance for tremor was universally high, in the case of leg tremor (left leg tremor 97.0%, right leg tremor 97.0%) this was achieved by simply predicting the commonest category consistently (very few participants had leg tremor). On the other hand, bradykinesia scores had good variation across subjects, and the highest pre-specified analysis accuracies were achieved for bilateral pronation/supination movements (left 74.6%, right 73.0%) and bilateral leg agility (left 63.5%, right 69.8%) rather than variants of finger tapping (1-target variant: left 53.2%, right 62.9%). The median-rater item-specific agreement analysis followed a similar pattern but with generally lower accuracy values; see Supplementary Table 5. The best-performing classifiers for each subtest were as follows: Radial Basis Function Support Vector Machine for left hand rest tremor and right pronation/supination; AdaBoost for right hand rest tremor, left fingertap (1 and 2 targets) and right/left leg agility; Nearest Neighbours for left leg rest tremor; Decision Tree for right leg rest tremor, left hand postural tremor and left/right hand kinetic tremor; Random Forest for right hand postural tremor; Naïve Bayes for right fingertap (1 target); Linear Support Vector Machine for right fingertap (2 targets); and Multinomial Logistic Regression for left pronation/supination (see Supplementary Tables 2 and 3 for full details).

Fig. 3: Individual Test Prediction Accuracy (any-rater criterion).

The leave-one-subject-out cross-validation (LOSO-CV) classification accuracy of individual subtest smartphone-based prediction of the blinded MDS-UPDRS III subitems. Baselines and models are described in the legend to Fig. 2. Subtests relate to the corresponding MDS-UPDRS III subitems. Note that the Finger tapping MDS-UPDRS III subitem appears twice because it is predicted by two smartphone tasks (one- or two-target tapping).

Discussion

New measures of motor severity in PD must be robustly evaluated at the subject level prior to widespread use. Although digital assessments such as finger tapping speed are objective and more likely to be reproducible, they can only be meaningfully interpreted as severity scales either when matched to prima facie valid subject-level outcomes such as degree of functional dependence or, more commonly, when matched to well-established and familiar but subjective scales such as the MDS-UPDRS III. Consequently, most digital tools trained on single human classifications or ratings incorporate the subjective bias of that particular rater within them6,8,10. Here, we mitigate this bias by training on the median of three blinded human ratings with mostly moderate-to-substantial inter-rater agreement (see Table 2), reducing the potential idiosyncratic influence of any individual rater (i.e. over-fitting to a particular rater). This should improve the subsequent generalisability of the smartphone-based measures developed here. We report that even when generalised to out-of-sample subjects, classification of CloudUPDRS smartphone measures was reasonably equivalent to at least one of three MDS-UPDRS III human raters overall (LOSO-CV accuracy 70.3%; SEM 5.9%), and that individual subtests had variable degrees of correspondence (Fig. 3 and Supplementary Table 4). A more stringent analysis requiring smartphone-derived measures to predict the median of three clinical raters exactly (a requirement to perform better than any individual human rater) was, as expected, less accurate overall (57.0%, SEM 8.0%) but still better than a random baseline (28.5%, SEM 4.7%).

Any clinical study is limited by how well the study cohort represents the clinical population in question. We analysed data from 60 subjects, which cannot represent the entire population but compares favourably with other studies comparing supervised smartphone measures to MDS-UPDRS III scores7,8,10. Our selection criteria successfully captured those with mild to moderate disease who may have mild/moderate depression or mild cognitive impairment (MoCA > 20), but not of a severity that would impair understanding of the simple motor tasks required. The sample is, therefore, broader and more transparent than previous reports and more representative of a typical secondary care population, recruiting from two hospital sites and not excluding patients on the basis of, for example, poor compliance with home monitoring or phone availability. In spite of this, some clinical features (e.g. leg tremor) did not have enough sample variation to train a useful predictive model, and models were generally more sensitive to changes in mild to moderate (rather than severe) disease. This highlights an important limitation of most current digital studies that focus on mild/moderately affected individuals: if classifiers are not trained on data from more severely affected patients (or patients assessed in the ‘OFF’ state), they will not generalise well to these populations. In our study, only three subjects opted to be assessed after overnight withdrawal from dopaminergic medication. Although the effect of treatment is not the question we are addressing here, this may have limited the number of high MDS-UPDRS III scores in our cohort. Similarly to other digital tools, therefore, our smartphone software may be of most benefit to populations with mild/moderate disease undergoing interventions to modify disease progression or treat motor fluctuations. Future work should aim to address this by collecting more data or changing inclusion criteria to incorporate more severely affected patients. Because all assessments were performed under clinical supervision and reviewed by video, we can also have confidence that the smartphone tasks were performed adequately without undue influence from idiosyncratic factors (such as inadequate understanding of the tasks, concurrent disabling dyskinesias or distraction by a conversation). Although mild cognitive impairment and depression may correlate with disease severity and therefore slower movements, this is likely to affect the clinical MDS-UPDRS III and smartphone subtests equally and is therefore unlikely to be a significant confound to the primary outcome. Finally, the reproducibility of these findings also depends on the number of blinded clinical raters and their inter-rater agreement. Our three blinded raters were of similar clinical experience, and population-level and subject-level inter-rater agreement was reasonable. Future work may examine the effect of a larger and more varied pool of trained blinded raters.

PD is heterogeneous: the motor signs of tremor and bradykinesia typically vary within a patient across body parts and over time12. Most current digital assessments that focus on only 5–7 pre-selected subtests13,14,15,16,17 are likely to have reduced sensitivity compared to the 33-item MDS-UPDRS III. We rely on 16 independent smartphone subtests to evaluate individual heterogeneity across body parts and report results for individual tests in addition to a combined score (see Figs. 2 and 3 and Supplementary Table 4). This allows a closer match to the original MDS-UPDRS III, although some tests (notably of rigidity and axial features) currently remain outwith the scope of smartphone assessment. Motor features such as tremor amplitude can change over minutes, and so we explicitly randomised the order of smartphone and clinical assessments (performed within minutes of each other) across patients to avoid systematic biases in measurement present in other studies. We found that clinical tremor scores were predicted well by subject-level smartphone measures (LOSO-CV accuracy 46–97%), in keeping with the importance of tremor in previous studies18,19, but that in the case of leg tremor this was largely due to low variation in the sample of MDS-UPDRS III tremor scores. Without collecting additional functional outcome data, however, it remains unclear whether this apparent insensitivity of the MDS-UPDRS III tremor classification is clinically meaningful. Bradykinesia tests also showed variable performance, with pronation/supination measures standing out as having good sampling variance across the population and good subject-level correspondence between clinical and digital measures (mean LOSO-CV accuracy 73–74.6%). This is an important consideration for current smartphone assessments that typically focus on finger-tapping only13,14,15,16. Future analyses should determine whether a combination of subtests can be used to predict non-tested items (such as limb rigidity) or approximate the total MDS-UPDRS III score, but this is beyond the scope of the pre-specified analysis plan that we report here. Until this is established, we suggest ongoing assessment of a broader range of motor tests to improve the sensitivity of subject-level smartphone-based measures to diverse motor presentations of disease.

Conversion of a raw digital measurement into a useful digital biomarker typically requires specific data selection and transformation (feature selection) and the choice of an optimal machine-learning algorithm. The greater the pool of post hoc features and algorithms from which the selection takes place, the greater the risk of feature selection bias, where predictive accuracy is over-optimistic and relies on chance relations in the data (also known as ‘over-fitting’). This problem is magnified when the ratio of features to observations is high and may explain why previous models failed to generalise11. We mitigated this by pre-registering our design and features6 and using a standard linear classification algorithm. Our approach, based on a more restrictive pre-specified analysis and the consideration of individual features only, is conservative compared to the majority of prior studies in the literature. We additionally performed an exploratory graded feature and classifier selection process which maximally led to an improved overall LOSO-CV accuracy of 78.7% (SEM 5.1%). The combined results provide conservative and relatively unbiased benchmark accuracies suitable for clinical translation (full pre-specification) together with increasingly optimised results which are at risk of bias but facilitate comparison with other studies and suggest optimal features and classifiers to be tested in future research.

This study demonstrates how digital assessments for PD can be robustly validated within prospective clinical trials, a necessary step prior to widespread adoption. We mitigated prior concerns about subjective bias and feature selection bias whilst remaining sensitive to the diverse motor presentations of the condition and utilising open-source analysis software. Our evaluation provides a conservative but relatively unbiased benchmark of how smartphone-based measures correspond to reference-standard measures at the subject level. Future smartphone-based studies should consider similar precautions and incorporate more severely affected patients to improve the reliability and generalisability of their conclusions.

Methods

Study design

We designed and conducted a prospective, pre-registered (NCT02937324), dual-site crossover-randomised study comparing structured single time-point smartphone-based and blinded clinical rater assessments of motor severity in PD. The primary outcome was the degree to which subject-level smartphone-based measures predicted subject-level MDS-UPDRS III as calculated by three blinded clinical raters. This was quantified as the leave-one-subject-out cross-validation (LOSO-CV) predictive accuracy. This work has received ethical approval from the UK HRA and the local research ethics committee and written informed consent was obtained from all participants.

Participants

Participants were recruited from both the National Hospital for Neurology and Neurosurgery, London, UK, and Homerton University Hospital, London, UK, between 8 August 2017 and 1 March 2019. The inclusion criteria were (1) diagnosis of probable idiopathic PD according to Brain Bank criteria20, (2) age over 18 years, (3) a score >20/30 on the Montreal Cognitive Assessment (MoCA)21, (4) no anti-parkinsonian medication changes within the last week, (5) capacity to consent, (6) ability to understand English well enough to operate the phone software, follow its instructions and answer the study questions. Exclusion criteria were: (1) concurrent acute medical illness, (2) other co-morbidity that in the opinion of the Investigator may preclude participation in the study, (3) inability to consent. A target of 60 participants was pre-specified based on feasibility rather than a power calculation.

Clinical data collection

Patient data were collected over one or two visits, as individually convenient. Screening assessments performed after enrolment included the full MDS-UPDRS1, MoCA21, Beck Depression Inventory22, PDQ-3923 and Hoehn and Yahr stage24. Participants opted to perform the motor assessments immediately or at a second morning visit after overnight withdrawal of their anti-parkinsonian medications. Participants attending for a second visit performed the tests initially in a practically defined ‘OFF’ state, then took their usual medication and repeated the testing in the ‘ON’ state. The latter test was carried out no more than 1.5 h after the initial testing and only after both the examiner and participant agreed that the participant had entered the ‘ON’ state. Participants attending for only one visit were labelled as being in an ‘Intermediate’ state.

Each motor assessment included a standard video-recorded assessment of the 33-item MDS-UPDRS III1 by a member of the movement disorder clinical team and a 16-item smartphone-based assessment supervised by the clinician. Smartphone assessments were performed on either the participant’s own phone or on a smartphone supplied by the study team depending on availability. The smartphone and clinical assessments were performed one immediately after the other in a crossover design—their order randomised by software within the CloudUPDRS application at the point of enrolment for each participant.

Blinded clinical video evaluation

In addition to the live clinical MDS-UPDRS III ascertained by the examining clinician, we also obtained three blinded rater scores to mitigate subjective rater bias. Three neurologists, each with at least 12 months of specialist movement disorders training post certification and not involved with the initial data collection, separately rated the videoed examination of each MDS-UPDRS III subcomponent included in the study (3.4 Finger tapping, 3.6 Pronation/supination movements of hands, 3.8 Leg agility, 3.15 Postural tremor of hands, 3.16 Kinetic tremor of hands, 3.17 Rest tremor amplitude; see Supplementary Table 1 for further details), blinded to the medication status, crossover randomisation order and clinical details of the participant, and to each other’s scores.

The CloudUPDRS system

The CloudUPDRS system is a CE-marked software device consisting of the CloudUPDRS application for Android smartphones, a cloud-based scalable data-collection service and a data-mining toolkit that we developed (Manufacturer: Birkbeck College, University of London, MHRA Manufacturer’s Ref. number CA015327, Class 1a Medical Device, date: 29/2/16)6,25,26. The graphical user interface has been iteratively improved over a number of design cycles with the input of patient-led focus groups, clinicians and user-interface design experts to improve user experience in the target population, many of whom have impaired visual perception and dexterity. As far as possible, it is also designed to harmonise data collection across multiple device types (e.g. by preserving the distance and size of tapping targets)26. The software guides each participant through 17 subtests that correspond to subcomponents of the MDS-UPDRS III. Each subtest has specific written, visual and audio instructions (available at http://www.updrs.net/help/) and lasts between 60 and 90 s, during which time the relevant phone sensor is recorded at the maximum sampling rate the phone allows (minimum of 50 Hz). Raw sensor data are time-stamped, assigned a unique identification number and stored locally. On completion of the subtests, data are automatically uploaded to a remote secure server via an encrypted link. Full computational details have been published previously25,26.

Smartphone data collection

Smartphone subtests were designed to resemble the MDS-UPDRS III as far as possible, whilst also considering safety and practicality in the context of concurrent smartphone use. Three types of subtest were performed. Finger tapping required participants to tap one or two consistently spaced targets on the smartphone whilst screen sensors recorded the onset, duration, upwards or downwards movement and the coordinates of each touch. Tremor and proximal limb bradykinesia tests required the participant to hold or repeatedly move the phone in a stereotyped pattern whilst acceleration in three or six axes was recorded. Walking tests required the participant to place the phone in their pocket, walk 5 m, turn around and walk back (note, however, that gait analysis was not part of the pre-specification and so is excluded from the current report). Supplementary Table 1 shows the correspondence between each MDS-UPDRS III item and each CloudUPDRS subtest, and gives details of the sensors used, recording times and the features extracted. Note that the clinical MDS-UPDRS III subitem Finger tapping is assessed with two separate smartphone test items per hand (One Target Finger Tapping and Two Target Finger Tapping). Participants were not given any specific training and were asked to follow the onscreen instructions when performing smartphone tasks. In some cases, examiners emphasised aspects of the instructions to ensure the task was performed correctly (e.g. ‘with your left hand first’, ‘tap as fast as you can for 1 min’). All smartphone examinations were also video-recorded and reviewed to confirm correct task performance. Trials with substantial errors, such as unintended use of the wrong hand, were removed.

Smartphone data preprocessing and feature extraction

Raw data were stored in custom flat text files and processed offline using PDkit version 1.2.1 (https://github.com/pdkit/pdkit), an open-source data science toolkit for PD running in a Python 3 environment that was previously developed by some of the authors of this study. Data were ingested and converted to a standardised time-series data type. There are many ways in which subject-level acceleration time-series data and touch event data can be summarised by single numbers (features). For example, acceleration data can be characterised by the power at its peak frequency or the amplitude over a frequency range. Touch event data can be summarised as a function of the hold-time of a tap, the time between taps or the spatial proximity of the touch to the visual target. PDkit provides robust, transparent and automated extraction of hundreds of distinct features currently in use in the literature and is designed to support harmonisation and comparison of metrics within one fully transparent and non-proprietary platform. We calculated nine features for finger tapping subtests and 35 features for each other subtest. A persistent archive of feature names and the code used to calculate them is available at: https://doi.org/10.5281/zenodo.3632529.
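To make the touch-event summaries above concrete, a minimal pandas sketch is shown below. It is illustrative only and is not the PDkit implementation: the column names (‘down_time’, ‘up_time’, ‘x’, ‘y’), units and target coordinates are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def tapping_features(touch_events, target_xy):
    """Illustrative touch-event summaries of the kind described above.
    A sketch, not the PDkit implementation; column names and units are assumed.

    touch_events : DataFrame with one row per tap (times in seconds,
                   coordinates in screen pixels)
    target_xy    : (x, y) coordinates of the visual target
    """
    hold_time = (touch_events["up_time"] - touch_events["down_time"]).mean()
    inter_tap = touch_events["down_time"].diff().dropna().mean()
    distance = np.hypot(touch_events["x"] - target_xy[0],
                        touch_events["y"] - target_xy[1]).mean()
    return pd.Series({
        "mean_hold_time_s": hold_time,            # mean duration of each touch
        "mean_inter_tap_interval_s": inter_tap,   # mean time between successive taps
        "mean_distance_to_target_px": distance,   # mean spatial error relative to target
    })
```

Analogous single-number summaries (for example, the power at the peak frequency or the amplitude over a frequency band) can be computed from the acceleration time series; the pre-specified versions of these are sketched in the next subsection.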

Pre-specified features

We separately report analyses of pre-specified features (from our previously reported study6) and of the best-performing feature for each subtest (see below). Pre-specified features were calculated exactly as previously described. That is, for tremor subtests, the magnitude of the scalar sum acceleration in three axes was filtered with a high-pass second-order Butterworth filter at 2 Hz prior to fast Fourier transform. Tremor amplitude was calculated as the sum of the resultant power spectrum between 2 and 10 Hz. For bradykinesia assessments excluding tapping tests, the signal had DC removed prior to applying a low-pass second-order Butterworth filter at 4 Hz and subsequent fast Fourier transform. The amplitude of the remaining signal was calculated as the sum of the amplitude between 0 and 4 Hz. Note that in this context, amplitude relates to the ‘magnitude’ of tremor in the frequency domain rather than the measured distance travelled during movement. For single- or dual-target tapping tests, we calculated the tapping frequency as the total number of taps divided by the time period of the task (60 s).
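For illustration, the pre-specified features described above can be sketched with NumPy/SciPy as follows. This is a simplified sketch rather than the PDkit implementation; in particular, it assumes that the magnitude of the three-axis acceleration is its Euclidean norm and that the sampling rate fs is known and uniform.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def tremor_amplitude(acc_xyz, fs):
    """Pre-specified tremor feature (sketch): high-pass the 3-axis acceleration
    magnitude at 2 Hz (2nd-order Butterworth), then sum the power spectrum
    between 2 and 10 Hz."""
    mag = np.sqrt((np.asarray(acc_xyz) ** 2).sum(axis=1))   # assumed: Euclidean norm
    b, a = butter(2, 2.0, btype="highpass", fs=fs)
    filtered = filtfilt(b, a, mag)
    power = np.abs(np.fft.rfft(filtered)) ** 2
    freqs = np.fft.rfftfreq(filtered.size, d=1.0 / fs)
    return power[(freqs >= 2.0) & (freqs <= 10.0)].sum()

def bradykinesia_amplitude(acc_xyz, fs):
    """Pre-specified non-tapping bradykinesia feature (sketch): remove DC,
    low-pass at 4 Hz (2nd-order Butterworth), then sum the spectral amplitude
    between 0 and 4 Hz."""
    mag = np.sqrt((np.asarray(acc_xyz) ** 2).sum(axis=1))
    mag = mag - mag.mean()                                   # DC removal
    b, a = butter(2, 4.0, btype="lowpass", fs=fs)
    filtered = filtfilt(b, a, mag)
    amplitude = np.abs(np.fft.rfft(filtered))
    freqs = np.fft.rfftfreq(filtered.size, d=1.0 / fs)
    return amplitude[(freqs >= 0.0) & (freqs <= 4.0)].sum()

def tapping_frequency(n_taps, task_duration_s=60.0):
    """Pre-specified tapping feature: total taps divided by task duration."""
    return n_taps / task_duration_s
```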

Graded approach to mitigating feature and classifier selection bias

With many more features available than observations, post hoc selection of one or a few of these features is a commonly used strategy to improve model stability. It can, however, induce feature selection bias. Post hoc selection of a classifier can induce a similar bias, and so we adopted a graded approach to address both issues. At the conservative end, we used single pre-specified features from our previously reported study6 and a standard statistical classifier (multinomial logistic regression), maximally free from bias but likely to under-fit the data (under-optimistic accuracy). At the exploratory end, we selected both the feature and classifier with maximum out-of-sample predictive accuracy, which is moderately likely to over-fit the data (over-optimistic accuracy). Intermediate accuracy values, where either the classifier or the feature was selected based on performance, are also presented. A full description of the feature and classifier selection process and details of the pre-specified features are presented in Supplementary Note 2. Additionally, all available features and the accompanying software implementation can be viewed in the PDkit online documentation (https://pdkit.readthedocs.io/).
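A minimal sketch of how the four graded analyses might be organised for a single subtest is shown below. The candidate classifier list mirrors those named in the Results but is an illustrative assumption rather than the study code; loso_cv_accuracy stands for a helper of the kind sketched under ‘Statistical analysis’ below.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Candidate classifiers (an illustrative subset of those named in the Results).
CLASSIFIERS = {
    "multinomial_logreg": LogisticRegression(multi_class="multinomial", max_iter=1000),
    "linear_svm": SVC(kernel="linear"),
    "rbf_svm": SVC(kernel="rbf"),
    "nearest_neighbours": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
    "adaboost": AdaBoostClassifier(),
    "naive_bayes": GaussianNB(),
}

def graded_analyses(features, ratings, prespecified, loso_cv_accuracy):
    """Four graded analyses for one subtest (sketch). Selection-bias risk rises
    as more of the feature/classifier space is searched post hoc.

    features          : dict of feature name -> (n_subjects,) array
    ratings           : (n_subjects, 3) array of blinded MDS-UPDRS III ratings
    prespecified      : name of the pre-published feature for this subtest
    loso_cv_accuracy  : helper(feature_array, ratings, classifier) -> accuracy
    """
    logreg = CLASSIFIERS["multinomial_logreg"]
    return {
        # Very low risk: pre-specified feature and classifier only.
        "pre-specified": loso_cv_accuracy(features[prespecified], ratings, logreg),
        # Low risk: best classifier, pre-specified feature.
        "best classifier": max(loso_cv_accuracy(features[prespecified], ratings, c)
                               for c in CLASSIFIERS.values()),
        # Low risk: best feature, pre-specified classifier.
        "best feature": max(loso_cv_accuracy(f, ratings, logreg)
                            for f in features.values()),
        # Moderate risk: best combination of feature and classifier.
        "best classifier and feature": max(loso_cv_accuracy(f, ratings, c)
                                           for f in features.values()
                                           for c in CLASSIFIERS.values()),
    }
```

Only the ‘pre-specified’ entry is free of post hoc selection; the other three entries report the maximum accuracy found by searching, which is exactly the source of the graded selection-bias risk described above.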

Statistical analysis

The primary outcome was the overall LOSO-CV classification accuracy of the smartphone-based prediction of selected subitems of the MDS-UPDRS III (3.4 Finger tapping, 3.6 Pronation/Supination movements of hands, 3.8 Leg agility, 3.15 Postural tremor of hands, 3.16 Kinetic tremor of hands, 3.17 Rest tremor amplitude). For each subject, each item from the MDS-UPDRS III was blindly rated three times, and the median of these clinical ratings was taken forward as the dependent variable for model training. For each of the 16 smartphone subtests, a single (normalised) smartphone feature was entered into a classifier using the corresponding median MDS-UPDRS III subscore as the target variable. The target score maximally consisted of five categories, reflecting the ordinal 5-point scale of each MDS-UPDRS III item, and often consisted of fewer categories, being restricted to the scores observed in the study sample. To reduce the risk of over-fitting and to understand how classifiers would perform on new unseen subjects, we performed LOSO-CV: the out-of-sample prediction for each subject was made from a model trained only on the remaining data. Correct classification was defined as an individual prediction consistent with any of the three individual clinical raters (any-rater criterion). We additionally performed a more stringent analysis in which correct classification was defined as an individual prediction consistent with the median of the three individual clinical raters (median-rater criterion). Higher accuracy can reflect characteristics of the target distribution (the distribution of MDS-UPDRS III item scores), such as class imbalance, rather than the utility of the smartphone data, and so we performed two additional checks. First, we determined whether the pre-specified models were predicting a range of categories rather than consistently predicting one category. Second, we provide a reference ‘random baseline’ accuracy from a classifier that randomly assigns subjects to each available clinical category with uniform probability across categories. As well as using pre-specified features and the standard multinomial logistic regression model, we tested a number of other feature and classifier combinations using the same LOSO-CV procedure (see Supplementary Note 2).
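A minimal scikit-learn sketch of the pre-specified evaluation and the random baseline is given below. It assumes one normalised feature value and three blinded ratings per subject; with repeated sessions per subject, a grouped split (e.g. LeaveOneGroupOut keyed on subject) would be the appropriate equivalent. All names are illustrative and this is not the study code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import StandardScaler

def loso_cv_accuracy(feature, ratings, clf=None):
    """Any-rater LOSO-CV accuracy for one smartphone subtest (sketch).

    feature : (n_subjects,) array -- one pre-specified feature value per subject
    ratings : (n_subjects, 3) array -- the three blinded MDS-UPDRS III ratings
    clf     : scikit-learn classifier (defaults to multinomial logistic regression)
    """
    if clf is None:
        clf = LogisticRegression(multi_class="multinomial", max_iter=1000)
    X = feature.reshape(-1, 1)
    y = np.median(ratings, axis=1).round().astype(int)     # median rating as training target
    hits = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        scaler = StandardScaler().fit(X[train_idx])          # normalise within the training fold
        clf.fit(scaler.transform(X[train_idx]), y[train_idx])
        pred = clf.predict(scaler.transform(X[test_idx]))[0]
        hits.append(pred in ratings[test_idx[0]])            # correct if it matches ANY rater
    return float(np.mean(hits))

def random_baseline(ratings, n_draws=10_000, seed=0):
    """Reference accuracy of a classifier assigning each subject to one of the
    observed categories uniformly at random (any-rater criterion, sketch)."""
    rng = np.random.default_rng(seed)
    categories = np.unique(ratings)
    preds = rng.choice(categories, size=(n_draws, ratings.shape[0]))
    correct = (preds[..., None] == ratings[None, :, :]).any(axis=-1)
    return float(correct.mean())
```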

To ensure the three blinded clinical raters were consistent with each other, we additionally calculated inter-rater reliability for each MDS-UPDRS III subscore using Fleiss’ Kappa, which is a generalisation of Cohen’s Kappa to more than two raters27. Kappa ranges from −1 to 1, where 0 indicates chance agreement, 1 indicates complete agreement, and −1 indicates complete disagreement at the population level. We also provide a measure of subject-level inter-rater agreement. This was calculated for each MDS-UPDRS III subscore as the percentage of subjects where the blinded rating clinicians agreed completely (3 raters agreed), moderately (2 raters agreed) or disagreed (all 3 ratings were different). Inter-rater agreement was calculated in Matlab R2020a (The MathWorks Inc, Natick, MA, USA) using the Fleiss toolbox (https://www.github.com/dnafinder/Fleiss).
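For readers working in Python, an equivalent computation can be sketched with statsmodels as below; the study itself used the Matlab Fleiss toolbox cited above, so this is an illustrative re-implementation rather than the analysis code.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def rater_agreement(ratings):
    """Population-level Fleiss' kappa and subject-level agreement for one
    MDS-UPDRS III subitem (sketch).

    ratings : (n_subjects, 3) integer array of the three blinded ratings
    """
    # Population level: Fleiss' kappa (chance-corrected agreement across raters).
    counts, _ = aggregate_raters(ratings)        # subjects x categories count table
    kappa = fleiss_kappa(counts)

    # Subject level: how many of the three raters gave the same score.
    n_unique = np.array([len(np.unique(row)) for row in ratings])
    complete = np.mean(n_unique == 1)             # all three raters agreed
    moderate = np.mean(n_unique == 2)             # two of three agreed
    disagree = np.mean(n_unique == 3)             # all three ratings differed
    return kappa, complete, moderate, disagree
```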

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.