Introduction

Preterm birth is a major cause of long-term neurodevelopmental disability1. Preterm infants at highest risk for neurodevelopmental disorders are those born before 28 weeks of gestational age (GA; extremely preterm [EP]), in whom the prevalence of mild-to-severe neurodevelopmental disorders at 2 years of age is > 50%2. Thereafter, with increasing GA, the prevalence of neurodevelopmental disorders decreases to 10% in very-to-late preterm (V-LP) infants born at 28–36 weeks of GA3,4,5,6. These prognostic trends have led studies to focus on improving our understanding of neurodevelopmental disorders in EP infants and identifying the most vulnerable preterm infants7,8. However, there is increasing evidence of long-term neurodevelopmental delays in V-LP infants, and previous findings suggest that the neurodevelopmental outcomes of EP and V-LP infants differ9,10,11. The developing brain may be affected differently, depending on the level of prematurity12. Identifying and understanding the differential factors related to neurodevelopment between EP and V-LP infants is important for developing early intervention strategies for potentially vulnerable populations13.

Studies on neurological underpinnings of preterm birth have shown that the brains of preterm infants are characterized by macro- and micro-structural alterations, such as abnormal white matter (WM) integrity and brain connectivity, as well as morphological changes in the cerebral cortex14,15,16. Neuroimaging studies examining the relationship between magnetic resonance imaging (MRI) indices and neurodevelopment suggest that microstructural abnormalities without brain injury can affect late neurodevelopment12. A premature brain is easily exposed to various external stimuli and stress, causing demyelination, axon degeneration, and late migratory neuron reduction, potentially altering the microstructure of the WM17. These changes can disrupt the efficient transfer of information between brain regions, thereby affecting overall neurodevelopment18,19. Moreover, depending on GA, the normal scheme of myelination in a typical caudorostral pattern may be disrupted20. Graph theory-based WM connectome analysis has been used to quantify the efficiency of information transmission in brain regions altered by preterm labor and distinguish between normal and abnormal brain network characteristics21,22,23, demonstrating their relevance to various neurodevelopmental disorders24,25,26.

Recent advances in artificial intelligence have enabled the prediction and interpretation of neurodevelopmental outcomes in preterm infants by modeling complex, nonlinear relationships, and helping clinicians make decisions regarding early intervention and follow-up27,28. In classification studies for the identification of high-risk groups for neurodevelopment in preterm infants, a random forest (RF) showed the highest classification accuracy29. Another study using logistic regression showed high accuracies of 100 and 88%, respectively, for identifying cognitive and motor delays30. However, to provide detailed information on neurodevelopmental severity, prediction of continuous variables may be more effective than that of binary variables31. Recent demands in clinical practice emphasize the need for regression models that can predict developmental scores, and studies using convolutional neural network models have been recently conducted32,33,34.

Previous studies predicting neurodevelopmental outcomes in premature infants have used various structural and diffusion MRI measurements as key predictors, along with prenatal, perinatal, and environmental influences35,36,37; however, many of them have limited clinical interpretability because they rely on unimodal or bimodal predictions. Moreover, although the structural connectome contains descriptive information on the preterm brain, predicting developmental outcomes using a single predictor may provide incomplete clinical information38,39,40,41. Multimodal prediction has been shown to improve the predictive accuracy and clinical interpretation of attention deficit hyperactivity disorder42 and autism spectrum disorder43 classifications. These studies underscore the utility of predictive models that incorporate multiple variable sets, including volumetric, structural network, and clinical variables.

In this study, we applied a machine learning approach to model structural connectivity in the preterm brain. By selecting local connectivity variables through graphical network analysis (GNA) and combining multimodal and multivariate machine learning techniques, we tested the hypothesis that the predictions of structural connectivity change with GA. We aimed to quantify the local connectivity for EP and V-LP groups and identify the variables that contribute to their prediction.

Results

Demographics and clinical characteristics

The neonatal and maternal data collected in the neonatal intensive care unit, clinical information derived during follow-up, and Bayley Scales of Infant and Toddler Development, Third Edition (BSID-III) subscale results are presented in Table 1. The 193 preterm infants who participated in the study were divided into EP (n = 62) and V-LP (n = 131) groups.

Table 1 Clinical and maternal characteristics of preterm infants.

Local connectivity features with net effect on BSID-III subtests

Table 2 shows the GNA results for the partial relationships between the BSID-III subtest scores and the WM integrity indices. Factors that directly affect the BSID-III scores are marked as positive (+) or negative (−) correlations.

Table 2 Local connectivity features with partial correlation with each BSID-III subscale score extracted from graphical network analysis.

Machine learning performance of multimodal feature sets

Using multimodal feature sets, we determined the best performing linear and nonlinear models (Table 3; Fig. 1). Regarding the cognitive scores, the preterm (root mean squared error [RMSE], 13.352; variance explained, 17% on ElasticNet) and V-LP (RMSE, 11.205; variance explained, 17% on ElasticNet) groups exhibited the highest predictive performance for models that included local connectivity features. In the EP group, the RF model, including volumetric and global network feature sets, demonstrated the highest predictive performance (RMSE, 15.402; variance explained, 13%). For motor scores, the models that included local connectivity features demonstrated the highest predictive performance (EP group: RMSE, 11.363; variance explained, 15% on XGBoost; V-LP group: RMSE, 13.698; variance explained, 10% on RF). Regarding language scores, the preterm group demonstrated a high-performing prediction that included only local connectivity features (Preterm: RMSE, 11.792; variance explained, 15% on XGBoost). However, the other models exhibited relatively low-performance predictions (EP: RMSE, 11.674; variance explained, 3% on XGBoost; V-LP: RMSE, 12.425; variance explained, 6% on ElasticNet).

Table 3 Highest predictive models and best performing feature sets.
Figure 1

Scatter plot for best predictive model within BSID-III subset. Feature set (A) Local connectivity features; (B) Clinical characteristic features; (C) Volumetric features; (D) Global connectivity features. Abbreviations: EP, extremely preterm; V-LP, very-to-late preterm; BSID-III, Bayley Scales of Infant and Toddler Development, Third Edition.

Feature importance within the best performing model in each BSID-III subset

The top ten features that were important for prediction were selected, and the proportion contributed by each feature set is shown in Fig. 2. Each BSID-III subset showed a different feature importance, depending on the group. In terms of cognitive and language scores, local connectivity and volume predictors had the highest proportions, whereas in motor scores, clinical variables and volume predictors (e.g., cerebellum) had the highest proportions.

Figure 2

(A) Number of important features in the best performing model in each BSID-III subset; frequencies range between 1 and 10. (B) Feature importance for the best performing model in all preterm groups. For all brain region abbreviations, see table S2 in the supplementary materials. EP, extremely preterm; V-LP, very-to-late preterm; Freq, Frequencies; BC, betweenness centrality; NLp, nodal shortest path length; DC, degree centrality; Ne, local efficiency; GE, global efficiency; BSID-III, Bayley Scales of Infant and Toddler Development, Third Edition.

We present the brain lobe distribution of the local connectivity predictors (Fig. 3) and the frequencies of the top 10 predictors from the best-performing models for all BSID-III subsets (Table 4). Among the local connectivity predictors, the brain regions that appeared most frequently as important features were the left superior temporal gyrus (STG), thalamus, and inferior frontal gyrus (opercular); the remaining regions were counted once or twice.

Figure 3

Visualization of the hemispheric distribution of predictors presented by the best performing feature set. Each brain region is represented by a color assigned to its brain lobe. For all brain region abbreviations, see table S2 in the supplementary materials. These images were created using BrainNet Viewer (version 1.7)44.

Table 4 Frequencies of feature importance within nine best performing models.

Discussion

To the best of our knowledge, the present study is the first to apply linear and nonlinear machine learning methods to predict 2-year neurodevelopmental outcomes in preterm infants, utilizing a comprehensive set of multimodal features. The predictive performance improved notably when multimodal feature sets were used instead of single feature sets, with the local connectivity sets contributing most to the improvement. Feature importance in the best-performing models differed, depending on neurodevelopmental subsets, and was primarily ranked in the left STG and thalamus.

Differences in core WM developmental patterns in the EP and V-LP groups suggest that the affected brain regions may differ depending on the degree of prematurity at birth12. Preterm groups in this study might not have shared a common etiology, as indicated by differences in WM development patterns, depending on GA. Between 24 and 28 weeks of gestation, thalamocortical afferent axons develop in the frontal, temporal, and occipital areas, and initial synaptic connections and spatial reorganization in the frontal and occipital regions occur under the influence of sensory-sensitive cortical development45,46. In contrast, after 28 weeks of gestation, myelination becomes prominent, along with astrocyte and oligodendrocyte production, and the sensory-driven development of long-range association fibers of the thalamocortical, somatosensory, visual, and auditory processes occurs47,48.

Although the majority of preterm infants performed within the normal range of general cognitive functions49,50,51,52, previous studies have shown that prediction performance is poor for 2-year-olds born EP who perform 1–2 standard deviations below the expected cognitive function49,53,54. Similarly, the significant difference in cognitive scores between the EP (93.92 ± 13.91) and V-LP (102.53 ± 13.83) groups may cause performance differences in predicting cognitive outcomes. This suggests that the influence of prematurity level and latent variables acting early in EP infants may cause simultaneous changes in local connectivity and increase collinearity between variables. In this case, global connectivity variables, which have been validated in studies comparing preterm and full-term infants at term-equivalent age, may be effective for prediction21,22,55.

The emergence of cognitive function is largely influenced by the development of specific subnetworks rather than the development of the whole brain56. Differential brain regions identified in this study that predicted cognitive outcomes included the left cuneus, lingual gyrus, superior occipital gyrus, and putamen, which are consistent with the regions identified in previous studies57,58. They may be responsible for a series of cognitive processes in early brain development, such as the primary processing of visual and somatosensory information for cognition59,60 and sensory association for higher-order cognitive development61.

Predictors with significant partial correlations with motor outcomes were identified in the thalamus, cerebellum, and frontotemporal regions. The identified predictors shared key biomarkers found in previous neurodevelopmental prediction studies. The thalamus has been identified as an important feature of the preterm subgroup, suggesting that the relationship between the thalamus and motor outcomes may be stratified according to GA62. Thalamic development is linearly related to the degree of prematurity12,63, which could weaken the thalamocortical connections and lead to the disruption of connections within key brain structures in preterm infants26,55. Moreover, a study by Kline et al. showed that the thalamic volume was associated with motor outcomes at 2 years of age64, and further correlated with motor function at 7 and 11 years of age65,66, suggesting that early thalamic development may have effects that persist throughout childhood and adulthood. Additionally, the feature importance in the left insula and frontal and temporal brain regions may reflect the involvement of the high-level cerebellothalamic pathway in motor development67. The left superior frontal gyrus contains part of the premotor cortex and is an important predictor of motor outcomes33. Moreover, the insula is partially involved in controlling sustained intentional movements68, and the temporal pole is known to play a role in controlling visuomotor movements69, suggesting its potential as a key marker for later functional development.

The preterm subgroups exhibited the lowest predictive performance and unreliable feature importance in predicting language scores. These results may be due to the limited size of the subgroup cohorts, such that the relationships between multivariate variables and language scores could no longer be stratified. Moreover, language performance is more sensitive to potential environmental factors that cannot be completely explained by imaging70, and our study may not have considered complex clinical factors derived from the EP group. Nevertheless, the left STG identified in the preterm group was closely related to the language score, again highlighting the importance of this variable as a single predictor. A previous study that examined the relationship between language ability at 2 years of age and local connectivity in preterm infants showed that the left STG was negatively correlated with language scores, suggesting that the left STG is a key region for language development and has microstructural vulnerability71.

Although this study attempted to follow the quality assessment criteria provided in recent reviews on predicting pediatric development (Supplementary Table 1)72, the most prominent limitation of the current study is the limited amount of data. Generalization to other datasets might be limited due to the lack of external cross-validation. Also, the implementation of a more complex predictive model such as a non-linear SVM or artificial neural network was not possible because of the limited amount of data and poor interpretability of such models. Therefore, the complex relationships between multiple features and predictive value may not be fully represented in the model. A future collaboration between multiple sites can overcome the hurdle of a small sample size. Dependencies among global network metrics exhibiting similar patterns have been identified (Supplementary Fig. 1), which may potentially impact the model's predictive power and should be interpreted with caution73. Future research may perform analyses that limit collinearity among metrics while accommodating the inherent biological complexity within each network metric and measuring the uniqueness of network structures74,75. Quantification of the structural connectome should be performed with the cerebellum because structural connections within the cerebellum may be important for WM connections around the thalamus. The importance of the local connectivity was primarily identified in the left hemisphere. Given that early brain lateralization is a multivariate trait influenced by a variety of factors, such as stress and the external environment76,77,78, future studies could utilize lateralization indices in predictive models.

In conclusion, we found that the prediction performance and feature importance differed, depending on the preterm group for the BSID-III. Additionally, the STG and thalamus are important markers for predicting motor and language development. Machine learning approaches that leverage brain connectivity can improve individual risk stratification by improving our understanding of how alterations in the brain microstructure affect neurodevelopment.

Methods

Study populations

The participants of the present study included preterm infants born at < 37 weeks GA who were admitted to the neonatal intensive care unit of the Hanyang University Hospital and participated in a follow-up project at the Hanyang Inclusive Clinic for Developmental Disorders between 2017 and 2021. Of 218 eligible preterm infants assessed using the BSID-III79, eight with brain injury, one with hypothermia, and one with metabolic abnormalities were excluded from image processing. Fifteen of the remaining 208 participants were excluded from the WM analysis because of motion artifacts and poor image quality. Similarly, 38 participants were excluded from the volumetric analysis. A total of 193 participants in the WM analysis and 170 in the volumetric analysis were included; all had suitable MRI data obtained at near-term age (postmenstrual age, 35–44 weeks) without congenital brain abnormalities, congenital infections, cystic periventricular leukomalacia, diffuse ventriculomegaly, evidence of genetic disorders, focal abnormalities, intraventricular hemorrhage (IVH, grades II–IV), or punctate WM injury. The Institutional Review Board of the Hanyang University Hospital approved the study protocol, and informed consent was obtained from the infants’ parents prior to participation in the present study. All procedures were performed in compliance with the principles of the Declaration of Helsinki.

All the preterm infants were assessed at corrected ages between 18 and 24 months by certified examiners for cognitive, motor, and socioemotional development using the BSID-III, with subtests scaled based on age at test. For statistical analysis, the preterm infants were divided into two groups at a GA of 28 weeks (EP, GA < 28 weeks; V-LP, GA ≥ 28 weeks).

Clinical data collection

Detailed information on clinical data was gathered through a systematic and prospective chart review. Clinical variables adopted in this study have useful biological premises and potential associations with later neurodevelopment; we selected variables that had been validated, with improved variance explained, in previous papers predicting 2-year neurodevelopmental scores in preterm infants80. The present study followed nine predefined clinical factors (GA, postmenstrual age, male sex, maternal education, small for gestational age [SGA], IVH, 5-min Apgar score, and bronchopulmonary dysplasia) from the PENUT dataset that are expected to be associated with long-term outcomes80. Maternal education level was based on the years of schooling and categorized by educational level.

Data analysis: Clinical characteristics

The demographics of the preterm subgroups were statistically compared using SPSS 27.0 (SPSS, Chicago, IL) software. We used Student’s t-test and chi-square analysis to compare clinical characteristics between the preterm subgroups.

MRI acquisition

Individual preterm infants were scanned at near-term age (postmenstrual age [PMA], 35–44 weeks) using a whole-body 3 T magnetic resonance imaging (MRI) scanner (Philips, Achieva, 16-channel phase-array head coil, Best, Netherlands) during natural sleep, using a blanket to preserve body temperature. An experienced pediatrician monitored the pulse oximeter during the MRI to determine the heart and respiratory rates of each infant. Single-shot spin-echo three-dimensional echo planar images were obtained using diffusion tensor imaging (DTI). The DTI parameters were b-value = 800 s/mm2, echo time = 75 ms, repetition time = 4,800 ms, flip angle = 90°, field of view = 120 × 120 mm2, number of electrostatic gradient directions = 32, voxel sizes = 1.56 × 1.56 mm2, slice thickness = 2 mm, number of averages = 2, total acquisition time = 6 min 17 s, and water-fat shift = 4.68 Hz/pixel. The slices were axially parallel to the anterior–posterior commissure line, with 40–50 slices covering the entire hemisphere and brainstem. Estimates of motion artifacts of diffusion-weighted images were calculated for individual participants, including absolute and relative volume-to-volume motion and percentage of outliers using the EDDY QC tool81 in the FMRIB Software Library (FSL, https://fsl.fmrib.ox.ac.uk/fsl/fslwiki)82. Additionally, structural T2-weighted images were acquired for volumetric analysis and to exclude white matter (WM) abnormalities. The parameters for the T2-weighted images were echo time = 90 ms, repetition time = 4,800 ms, flip angle = 90°, field of view = 180 × 180 mm2, voxel sizes = 0.5 × 0.5 mm2, slice thickness = 3 mm, number of averages = 1, total acquisition time = 6 min 30 s, and water-fat shift = 4.68 Hz/pixel.

Image preprocessing

Imaging data were preprocessed using an eddy correction tool for eddy current distortions and motion artifacts83. A nondiffusion-weighted image (b0 image) was extracted from the raw image, including the skull and nonbrain tissues. To remove the effect of low-frequency intensity inhomogeneity on the b0 diffusion data, the bias field was estimated using N4 bias field correction in advanced normalization tools (ANTs)84. Subsequently, the principal eigenvalues of the diffusion tensor model were computed by simple least-squares fitting of the diffusion-weighted volume. Fractional anisotropy (FA), mean diffusivity (MD), axial diffusivity (AD), and radial diffusivity (RD) were calculated using the tensor eigenvalues. Quality control of the preprocessed images was performed via visual inspection by two independent reviewers. All procedures were performed using FSL.
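
For readers unfamiliar with the four diffusion indices, the following Python sketch shows how they follow from the three tensor eigenvalues under the standard textbook definitions. It is an illustration only, not the study's code (the indices in this work were computed with FSL); the function name `dti_scalars` and its argument names are hypothetical.

```python
import math


def dti_scalars(l1, l2, l3):
    """Compute FA, MD, AD, and RD from diffusion tensor eigenvalues.

    Assumes the eigenvalues are sorted so that l1 >= l2 >= l3.
    """
    md = (l1 + l2 + l3) / 3.0          # mean diffusivity: average eigenvalue
    ad = l1                            # axial diffusivity: principal eigenvalue
    rd = (l2 + l3) / 2.0               # radial diffusivity: mean of the two minor eigenvalues
    num = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    # Fractional anisotropy: normalized dispersion of the eigenvalues (0 = isotropic).
    fa = math.sqrt(1.5 * num / den) if den > 0 else 0.0
    return {"FA": fa, "MD": md, "AD": ad, "RD": rd}


if __name__ == "__main__":
    # Hypothetical eigenvalues in mm^2/s, for illustration only.
    print(dti_scalars(1.7e-3, 0.4e-3, 0.3e-3))
```

With equal eigenvalues the function returns FA = 0 (isotropic diffusion), and FA approaches 1 as diffusion becomes restricted to a single direction.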

Network construction

A 12-parameter affine transformation and the nonlinear symmetric normalization algorithm from ANTs were employed to transform the individual b0 images into T2-weighted images from the University of North Carolina (UNC) neonate atlas85, and vice versa. Inverse transformations were used to warp the automated anatomical labeling atlas from the UNC space to the native space. Discrete labeling values were preserved using nearest-neighbor interpolation. Using this procedure, we obtained 90 brain regions (each representing a node in the network) for the underlying structural network of each participant (Supplementary Table 2).

Whole-brain fiber tracking using probabilistic tractography was performed for each neonate using FSL. First, to prepare for probabilistic tracking, BEDPOSTX86 was used to model the directions of crossing fibers, and partial volume effects were corrected for thick slices (BEDPOSTX arguments: fiber 3 and Rician for uniform noise levels). Probabilistic tractography was then performed on individual diffusion images using PROBTRACKX87. The network matrix, assigned through the connectivity probabilities between brain regions i and j, was calculated as the total proportion of fibers sampled from all voxels in brain region i reaching all voxels in brain region j (PROBTRACKX arguments are sample tracts per seed voxel: 5,000; step length: 0.5 mm; curvature threshold: 0.2; and fractional anisotropic volumes: 0.01).

Probabilistic tractography depends on the seeding point; therefore, the probability from i to j can differ from the probability from j to i. We therefore defined the undirected connection probability Pij between regions i and j as the average of these two probabilities, which yielded a 90 × 90 symmetric matrix. Moreover, we performed pair-wise Pearson correlation for all 4,005 connections with nonzero probability values for all the participants and set r = 0.7 as the threshold to remove spurious connections with a small probability of connection. The weighted (W) network edges were calculated as Wij = Pij.
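
The averaging step can be illustrated with a short Python sketch. It assumes a square matrix of directed tractography probabilities stored as nested lists; the function name `symmetrize` and the toy values are hypothetical and are not taken from the study.

```python
N_REGIONS = 90
# Number of unique region pairs in an undirected 90-node network.
N_PAIRS = N_REGIONS * (N_REGIONS - 1) // 2  # 4005


def symmetrize(p):
    """Average the i->j and j->i probabilities into one symmetric matrix."""
    n = len(p)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            avg = 0.5 * (p[i][j] + p[j][i])
            w[i][j] = avg  # upper triangle
            w[j][i] = avg  # mirrored entry keeps the matrix symmetric
    return w


if __name__ == "__main__":
    # Toy 3-region example with direction-dependent probabilities.
    directed = [
        [0.0, 0.2, 0.0],
        [0.4, 0.0, 0.1],
        [0.0, 0.3, 0.0],
    ]
    for row in symmetrize(directed):
        print(row)
```

The diagonal is left at zero because self-connections are not part of the network.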

Global and local network analysis

Prior to brain network quantification, a sparsity threshold of 0.25 (i.e., the ratio of the number of actual edges to the maximum possible number of edges in a structural network)85 was applied to individual networks to remove the weakest connections subject to experimental noise88. The specific threshold selection procedure followed that of our previous network study23. Global and local network properties were analyzed using the Brain Connectivity Toolbox89 and GRETNA software (http://www.nitrc.org/projects/gretna/)90.
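
As a rough illustration of sparsity thresholding, the Python sketch below keeps only the strongest edges of a symmetric weighted matrix so that the retained edges make up a given fraction of all possible edges. This is an assumption-laden sketch, not the procedure implemented in the Brain Connectivity Toolbox or GRETNA: the function name `apply_sparsity`, the rounding rule, and the handling of tied weights are all hypothetical choices.

```python
def apply_sparsity(w, sparsity=0.25):
    """Keep the strongest edges so that kept / possible edges ~= sparsity.

    `w` is a symmetric weighted matrix (nested lists) with a zero diagonal.
    """
    n = len(w)
    max_edges = n * (n - 1) // 2
    keep = int(round(sparsity * max_edges))  # assumed rounding rule
    # Collect each undirected edge once, from the upper triangle.
    edges = [(w[i][j], i, j)
             for i in range(n) for j in range(i + 1, n) if w[i][j] > 0]
    edges.sort(reverse=True)  # strongest first
    out = [[0.0] * n for _ in range(n)]
    for weight, i, j in edges[:keep]:
        out[i][j] = weight
        out[j][i] = weight
    return out


if __name__ == "__main__":
    # Toy 4-node network with hypothetical weights.
    toy = [
        [0.0, 0.9, 0.1, 0.5],
        [0.9, 0.0, 0.7, 0.2],
        [0.1, 0.7, 0.0, 0.3],
        [0.5, 0.2, 0.3, 0.0],
    ]
    for row in apply_sparsity(toy, sparsity=0.5):
        print(row)
```

Fixing the sparsity rather than an absolute weight cutoff gives every participant's network the same number of edges, which keeps graph metrics comparable across participants.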

Graph metrics were used to quantify brain global (global efficiency, Eglob; local efficiency, Eloc; modularity, Q; small-worldness, S; normalized clustering coefficient, Cp; normalized shortest path length, Lp)89 and local (betweenness centrality, BC; degree centrality, DC; nodal clustering coefficient, NCp; nodal shortest path length, NLp; nodal efficiency, Le; nodal local efficiency, NLe)91,92 connectivity. Global metrics were computed for 1,000 random networks with conserved number of nodes, number of edges, and degree distribution at predefined sparsity thresholds23. Local network metrics have been used as indicators of neonatal and child brain development and employed to elucidate clinical implications23,93,94,95,96. Details of the described graph-theoretical measures can be found in supplementary text 1.

WM integrity analysis

We aligned the Johns Hopkins University (JHU) neonatal, probabilistic, WM pathway atlas to the FA images of individual diffusion spaces using a nonlinear symmetric normalization algorithm in the ANTs to compute the mean FA, MD, AD, and RD values for specific WM pathways97. This atlas provides 27 major WM pathways in a population-averaged neonatal template.

Volumetric analysis

To extract neonatal volumetric features, we used the morphologically adaptive neonatal tissue segmentation toolbox (MANTiS) to segment and measure neonatal brain tissue from T2-weighted images98. MANTiS extends the existing approach to tissue classification implemented in Statistical Parametric Mapping software to neonates, combining template adaptation and topological filtering and segmenting the neonatal brain into eight tissue classes: gray matter, WM, cerebrospinal fluid, deep gray matter, hippocampus, amygdala, cerebellum, and brainstem. These volumetric values were corrected by dividing them by the total brain volume without cerebrospinal fluid.

Local connectivity feature

Local connectivity features are inherently complex and high-dimensional and are difficult to describe linearly. Using all of them could cause overfitting because the number of variables exceeds the sample size. GNA is a useful technique for determining conditional effects between a set of observed variables. Additionally, GNA can identify collinear variables and provide estimates of the most transparent relationships among variables through nonlinear relationship modeling. Recently, this methodological approach has provided support for follow-up studies and has the potential to improve clinical care by identifying variables independently associated with clinical and maternal characteristics and neurodevelopmental outcomes in preterm infants80.

We employed GNA to identify the net effect of individual variables on each BSID-III subtest. Four WM integrity indices for the 27 pathways of the Johns Hopkins University atlas and six local network properties for 90 brain regions were used to determine whether each of the candidate predictors showed a partial effect, even after considering the clinical characteristics. This analysis uses the method described by Williams and Rast to identify significant correlations with a single variable by forming a precision matrix of the relationships between variables, which accounts for the relationships with all other variables99. The precision matrix was constructed using the maximum likelihood estimation method, and the Fisher Z-transformation (95% confidence intervals) was performed to establish a network with significant relationships between variables99.
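
The general idea of reading partial correlations off a precision matrix can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation or the software of Williams and Rast: it assumes NumPy is available, that there are fewer variables than participants so the covariance matrix is invertible, and that an edge is retained when its 95% Fisher Z interval excludes zero. The function names are hypothetical.

```python
import math

import numpy as np


def partial_correlations(data):
    """Partial correlations from the maximum likelihood precision matrix.

    `data` has shape (n_samples, n_variables).
    """
    x = np.asarray(data, dtype=float)
    cov = np.cov(x, rowvar=False, bias=True)  # ML covariance (divides by n)
    prec = np.linalg.inv(cov)                 # precision matrix
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)             # standardize and flip the sign
    np.fill_diagonal(pcor, 1.0)
    return pcor


def fisher_z_interval(r, n, n_controlled, z_crit=1.959964):
    """95% confidence interval for a partial correlation via Fisher's Z.

    `n_controlled` is the number of variables conditioned on.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - n_controlled - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


if __name__ == "__main__":
    # Simulated data: x and y are related only through a shared driver z.
    rng = np.random.default_rng(0)
    z = rng.normal(size=500)
    x = z + rng.normal(size=500)
    y = z + rng.normal(size=500)
    pcor = partial_correlations(np.column_stack([x, y, z]))
    low, high = fisher_z_interval(pcor[0, 1], n=500, n_controlled=1)
    print("partial r(x, y | z):", pcor[0, 1], "95% CI:", (low, high))
```

In the simulated example, x and y are marginally correlated, but their partial correlation given z is close to zero, which is the kind of indirect relationship that conditioning on all other variables is meant to remove.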

Linear and nonlinear models: using multimodal feature sets

Four predictor sets were defined: local connectivity features (feature set A), clinical characteristic features (feature set B, n = 12), volumetric features (feature set C, n = 8), and global connectivity features (feature set D, n = 5). The prediction models were fitted on all 15 possible non-empty combinations of the four feature sets.
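
The count of 15 follows from taking every non-empty subset of four sets (2^4 − 1). A minimal Python sketch of the enumeration is given below; the labels A to D follow the text, while the function name is hypothetical.

```python
from itertools import combinations

FEATURE_SETS = ("A", "B", "C", "D")  # local, clinical, volumetric, global


def feature_set_combinations(sets=FEATURE_SETS):
    """Return every non-empty combination of the given feature sets."""
    return [combo
            for k in range(1, len(sets) + 1)
            for combo in combinations(sets, k)]


if __name__ == "__main__":
    for combo in feature_set_combinations():
        print("+".join(combo))
```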

Linear (ElasticNet) and nonlinear regression (RF; XGBoost) analyses for predicting cognitive, language, and motor scores were performed using the presented combinations of feature sets. GA and postmenstrual age were included in all feature set combinations. All of these regressors were implemented using Python’s scikit-learn library (https://github.com/scikit-learn/scikit-learn)100,101 except for XGBoost102. A randomized hyperparameter optimization was performed with fivefold cross-validation to identify high-performance models with reduced computational costs. A randomly selected 30% of the data was used as the held-out test set. The predictive power of the regression models was evaluated by calculating the RMSE on this held-out test set. We also plotted actual BSID-III scores against predicted scores for all feature set combinations of all the regression models.
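
The evaluation step can be sketched in plain Python as follows. This is an illustrative sketch rather than the study's pipeline (which used scikit-learn and XGBoost): the function names and the seed are hypothetical, and "variance explained" is assumed here to mean the coefficient of determination (R²), which the text does not state explicitly.

```python
import math
import random


def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Randomly hold out a fraction of the n sample indices for testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(test_fraction * n))
    return idx[n_test:], idx[:n_test]  # (train indices, test indices)


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted scores."""
    sq = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))


def variance_explained(y_true, y_pred):
    """Coefficient of determination (assumed meaning of 'variance explained')."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    train_idx, test_idx = train_test_split_indices(193)
    print(len(train_idx), "training and", len(test_idx), "held-out samples")
    # Hypothetical observed and predicted scores, for illustration only.
    observed = [95.0, 100.0, 110.0, 85.0]
    predicted = [97.0, 101.0, 104.0, 90.0]
    print("RMSE:", rmse(observed, predicted))
    print("Variance explained:", variance_explained(observed, predicted))
```

Because the RMSE is expressed in BSID-III score units, it can be read directly against the scale of the scores, whereas the variance explained is unitless and allows comparison across subtests.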

Feature importance

In ElasticNet, feature importance was calculated as the absolute values of the coefficients, which are penalized through a combination of L1 and L2 regularization. In the RF, feature importance was calculated by evaluating the extent to which each feature reduced impurity. XGBoost computes feature importance by averaging the information gain over all decision-tree splits that use a particular predictor.