Main

Rapid advances in artificial intelligence (AI) and deep learning (DL) for the computational analysis of digital pathology images—known as computational pathology1,2—have led to substantial progress across diverse tasks such as cancer subtyping3,4, prognostication5,6,7 and mutation prediction8,9. Because large patient cohorts with associated clinical metadata are scarce, a common practice in computational pathology is to train algorithms on publicly available data from consortia such as The Cancer Genome Atlas (TCGA). These algorithms are then often tested on in-domain test sets from the same consortia7,10,11,12 or on smaller, independent cohorts from external institutions3,8,13,14. Recent works have shown that DL models can learn dataset-specific biases and artificially inflate performance when trained and tested on public datasets, even with careful dataset-splitting strategies to prevent data leakage15,16,17. However, it remains largely unexplored whether these issues persist, or whether new failure modes emerge, for computational pathology models trained on public data but tested on external cohorts, which likely have different demographic compositions and do not share the training set’s dataset biases. Given strong evidence from the broader machine learning community that DL models exhibit inadvertent biases when tested on independent datasets18,19,20,21, a systematic investigation of this question is needed.

One prominent form of bias evident within publicly available datasets used in computational pathology is the underrepresentation of patients from minority demographic groups. For instance, across 8,594 samples from 33 cancer types in TCGA, 82.0% of all patients are white, 10.1% are Black or African American, 7.5% are Asian, and 0.4% are Hispanic, Native American, or Native Hawaiian and other Pacific Islanders (denoted as ‘other’ in TCGA). However, these percentages are quite different from those of the general population in the United States, where non-Hispanic or Latino white individuals make up only 58.9% of the population22. This disproportionate representation is endemic in other public computational pathology datasets23,24,25, and it is highly probable that the demographic distribution of patients in any independent cohort or clinical setting will differ from that in public datasets. Such disproportionate representation becomes problematic in the context of well-established studies, which demonstrate that ethnicity and race-related risk factors26,27,28, along with social determinants of health29, lead to discernible variations in disease presentation30,31,32, molecular subtypes33,34, incidence35,36,37 and outcomes between distinct demographic groups38,39.

Therefore, it becomes paramount for stakeholders to carefully assess how differences in demographic composition between training and testing cohorts may influence the performance of DL models40,41,42,43. Yet computational pathology studies typically do not evaluate model performance across demographic subgroups8,13,44,45 (refer to Supplementary Data Table 1 for more examples). Although demographic shift and other biases have been extensively studied in radiology20,46,47,48,49,50,51,52,53 and other medical fields18,21,28,54,55,56,57,58,59, this question has not yet been fully explored in computational pathology, for a few reasons. First, demographic factors such as race are generally not incorporated into diagnostic or patient triaging processes in the clinical practice of pathology. Second, existing datasets are not curated in a race-stratified manner, making systematic evaluation more challenging.

To aid the advancement of accurate and fair methods in computational pathology, here we examined demographic disparities in two types of clinically important cancer diagnosis tasks: subtyping of breast and lung carcinomas and prediction of IDH1 mutations in gliomas. Cancer subtyping is critical for clinical triaging, and errors can result in inappropriate and harmful treatment regimens60. For instance, in lung carcinomas, the anti-vascular endothelial growth factor medication bevacizumab benefits only patients with adenocarcinoma and is not recommended for patients with squamous cell carcinoma. Conversely, the anti-epidermal growth factor receptor drug necitumumab is effective only in squamous cell carcinoma61. Similarly, accurate identification of IDH1 mutations is essential for diagnosis of gliomas, serving as an important prognostic indicator and informing the treatment strategy62. To assess the utility and equity of models in our study, we simultaneously compared performance and fairness metrics. To measure performance across demographic groups, we used demographic-stratified area under the receiver operating characteristic curve (ROC AUC) and F1 score. To measure fairness under the equalized opportunity framework63, we compared the true positive rate (TPR) of demographic groups with that of the overall population. Considering these metrics concurrently helped us examine whether models would have disparate results if deployed clinically47,64,65. Using these metrics, we investigated the impact of techniques to mitigate image acquisition differences and batch effects on the performance and fairness of downstream tasks. We also explored the consequences of common modeling choices in computational pathology and of bias mitigation strategies for the utility and equity of multiple instance learning (MIL) classification models. While our experiments focused on fairness with respect to self-reported race, other demographic factors can also be considered, which we explored both in isolation and in intersection with self-reported race.

Results

Dataset and study description

Our investigations considered subtyping breast and lung carcinomas and predicting IDH1 mutations in gliomas using TCGA, the EBRAINS brain tumor atlas and in-house patient data. For breast subtyping, we trained models on the TCGA breast invasive carcinoma (TCGA-BRCA) cohort (n = 1,049) to differentiate between invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) of the breast. For lung subtyping, models were trained on the TCGA-lung cohort (n = 1,043) to distinguish between lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). Subtyping models were then evaluated on independent test cohorts collected at Mass General Brigham (MGB) (breast, n = 1,265; lung, n = 1,960). For IDH1 mutation prediction, models were trained on the EBRAINS brain tumor atlas (n = 873) to differentiate between IDH1 wild-type and mutant cases and tested on the TCGA glioblastoma (TCGA-GBM) and low-grade glioma (TCGA-LGG) cohorts (collectively called the TCGA-GBMLGG cohort) (n = 1,123). TCGA, a publicly available data consortium collecting tissues and clinical metadata for 33 cancer types (2009–2013)66, is notably skewed towards white patients, with few examples from other underrepresented ethnicities (Fig. 1a). Numerous clinical sites have contributed to TCGA with site-specific differences in image acquisition, demographics and label distribution15. In contrast, the MGB cohorts, while also having a majority of white patients (Supplementary Data Tables 2 and 3), reflect the patient population at Massachusetts General Hospital and Brigham and Women’s Hospital in Boston. The base rates of classes differ among the cohorts from TCGA, MGB and the EBRAINS brain tumor atlas, as well as among races. White and Black patients generally exhibit a skew toward the IDH1 wild-type, IDC and LUAD classes, whereas the class distributions vary for Asian patients across cohorts (Supplementary Data Tables 2–4). Differences extend to sex, with TCGA-lung and TCGA-GBMLGG skewed toward male patients, MGB-lung having a majority of female patients, and both the TCGA-BRCA and MGB-breast cohorts including a small number of male patients (as is expected when dealing with breast carcinomas). TCGA lacks information on patient insurance or income, whereas only a few MGB cohort patients are uninsured. In our datasets, Black patients are often younger than or similar in age to white patients, whereas Asian patients are often the youngest. The EBRAINS brain tumor atlas (1995–2019) is also a public dataset collected at the Medical University of Vienna67. No information on patient race, insurance or income is provided, whereas patient age and sex are known; the cases are skewed toward IDH1 wild-type. Further details and the full data collection description are available in Supplementary Data Tables 2–4 and Methods.

Fig. 1: Dataset characteristics, fairness metrics and modeling choices investigated.

a, Composition in number of slides for TCGA, cohorts from MGB, and the EBRAINS brain tumor atlas, which were used to investigate demographic bias in MIL slide-level cancer diagnosis algorithms for breast and lung carcinoma subtyping and IDH1 mutation prediction. Disparities were investigated using race-stratified ROC AUC, TPR disparity and race prediction on the independent test cohorts. b, Different stages of the DL pipeline used in MIL computational pathology studies: tissue segmentation and patching, mapping to low-dimensional representation using a patch encoder, and classification. Techniques investigated with respect to fairness are shown (control batch effects and test set bias, modeling choices, and bias mitigation strategies). c,d, Common bias mitigation strategies investigated. c, IW samples patients from racial groups inversely to their population size to ensure equitable representation. d, AR mitigates bias by making embeddings agnostic to race. Loss from a secondary race classifier is maximized to achieve this. Example ROC curves show mean values (n = 10 folds) with 95% CI. Boxes indicate quartile values of TPR disparity (n = 10 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5 × the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. Detailed demographic distributions for each task are available in Supplementary Data Tables 2–4. Some illustrations were created with BioRender.com.

Due to the large size of digital histology slides, known as whole-slide images (WSIs), we used the MIL framework68 in this study to predict slide-level labels in a weakly supervised manner. The framework consists of customizable parts (Fig. 1b): segmentation of tissue from background and tessellation into patches, projection of patches into a low-dimensional space using a pretrained patch encoder, and aggregation of patches into slide-level representations, which are classified into the desired labels4. We considered various popular choices for all stages and studied their effects on fairness. For the patch encoder, we first considered a ResNet50 network pretrained on ~10⁶ natural images (ResNet50IN)69. Additionally, we considered a shifted-window (Swin) transformer pretrained on ~15 × 10⁶ histology images (CTransPath)70 and a large Vision Transformer (ViT-L) pretrained on ~100 × 10⁶ histology images (UNI)71. While UNI and ResNet50IN were not pretrained on the WSIs used in this study, CTransPath was trained on TCGA (without subtype or mutation labels), which could artificially inflate performance when testing on TCGA71. Next, we considered three common patch aggregator modules that differ in the assumptions they make about relations between patches: attention-based MIL (ABMIL), which treats patches as independent72; clustering-constrained attention MIL (CLAM)4; and transformer-based MIL (TransMIL), which learns patch interactions11,73. Finally, we investigated two common bias mitigation strategies, namely importance weighting (IW)74,75,76 (Fig. 1c) and adversarial regularization (AR)77,78,79 (Fig. 1d).
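
To make the aggregation stage concrete, the following is a minimal PyTorch sketch of attention-based pooling in the spirit of ABMIL72; the dimensions, module names and two-layer attention network are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class ABMIL(nn.Module):
    """Minimal attention-based MIL: patch embeddings in, slide logits out."""
    def __init__(self, in_dim=1024, hid_dim=256, n_classes=2):
        super().__init__()
        # A small network scores each patch; a softmax over the bag turns
        # the scores into attention weights.
        self.attention = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.Tanh(), nn.Linear(hid_dim, 1)
        )
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, patches):              # patches: (n_patches, in_dim)
        scores = self.attention(patches)     # (n_patches, 1)
        weights = torch.softmax(scores, dim=0)
        slide_embedding = (weights * patches).sum(dim=0)   # (in_dim,)
        return self.classifier(slide_embedding), weights

# bag = torch.randn(4096, 1024)   # e.g., patch features for one slide
# logits, attn = ABMIL()(bag)
```

Because the slide embedding is a convex combination of patch embeddings, the attention weights double as a per-patch relevance map.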

To assess the effect of modeling choices and data processing techniques on the performance and fairness of classification models, we compared several performance metrics, namely demographic-stratified ROC AUC and F1 score. We also computed the TPR disparity between different races and the entire cohort population under the framework of equalized opportunity63,64,65, which has been used in other medical and general fairness studies47,65,80,81. The AUC reflects the model’s ability to discriminate between binary classes, whereas the F1 score indicates the balance between the model’s precision and recall. However, the AUC and F1 score do not show class-specific error rates, which are crucial for understanding model weaknesses. Therefore, we examined TPR disparity to detect class-specific differences in sensitivity (refer to ‘Definition and quantification of the fairness metrics’ in Methods for more details). A TPR disparity of zero for a demographic group signifies that the group’s recall matches the population’s, whereas deviations signal a higher or lower TPR than the population, indicating signs of unfairness. Clinically, sensitivity is paramount, as it reflects the model’s success in identifying true cases. Thus, TPR disparity provides meaningful clinical insights into potential performance differences upon deployment. By evaluating race-stratified AUC, F1 scores and TPR disparities, we can understand both the performance and fairness of our model for each race, ensuring the model’s utility and equity in clinical settings.
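
As a concrete illustration of the fairness metric, the sketch below computes the TPR disparity of one demographic group for one class as the group’s recall minus the population recall; the function and variable names are ours and purely illustrative.

```python
import numpy as np

def tpr_disparity(y_true, y_pred, race, group, positive_class):
    """TPR (recall) of one demographic group minus the overall population
    TPR for a given class, following the equalized-opportunity framework."""
    y_true, y_pred, race = map(np.asarray, (y_true, y_pred, race))
    pos = y_true == positive_class                 # true cases of this class
    overall_tpr = np.mean(y_pred[pos] == positive_class)
    grp = pos & (race == group)                    # true cases in the group
    group_tpr = np.mean(y_pred[grp] == positive_class)
    return group_tpr - overall_tpr                 # zero indicates parity

# Example: recall gap for Black patients on the LUAD class.
# gap = tpr_disparity(labels, preds, races, "Black", "LUAD")
```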

Baseline race-stratified assessment

A standard study design for model development in computational pathology is to randomly split a large public dataset (such as TCGA) into training and validation folds, without accounting for different patient demographic subgroups. To first understand the degree to which this standard approach affects subgroup-specific performance in the subtyping and IDH1 mutation prediction tasks, we split the TCGA-BRCA, TCGA-lung and EBRAINS brain tumor atlas cohorts into 20 task label-stratified, Monte Carlo cross-validation (CV) folds and trained models using ABMIL72—a popular weakly supervised slide-level classification algorithm used across many computational pathology tasks4,72,82,83. We then assessed the performance on independent test cohorts, namely MGB-breast and MGB-lung for subtyping and TCGA-GBMLGG for IDH1 mutation prediction. To establish baselines to which different fairness-promoting strategies could be added progressively, we trained ABMIL with the ResNet50IN69,84, CTransPath70 and UNI71 patch encoders without any bias mitigation strategies.
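
For reference, label-stratified Monte Carlo folds of the kind used here can be generated with scikit-learn; the toy labels and the 80/20 train/validation ratio below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-ins for slide indices and task labels (illustrative only).
labels = np.random.randint(0, 2, size=1000)        # e.g., IDC vs. ILC

# Twenty label-stratified Monte Carlo folds; the split ratio is assumed.
splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
folds = [(train_idx, val_idx)
         for train_idx, val_idx in splitter.split(np.zeros(len(labels)), labels)]
```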

As measured by 20-fold average AUC on the independent test cohorts, ABMIL performs well in subtyping tasks, especially when paired with self-supervised patch encoders (Fig. 2a and Extended Data Figs. 1 and 2a,e,i); similar trends extend to IDH1 mutation prediction (Fig. 2a and Extended Data Fig. 3a,d,g). However, when using a less robust patch encoder such as ResNet50IN (Fig. 2b), we consistently observed a performance gap (in ROC AUC) between Black and white patients across all tasks (3.0% for breast subtyping, 10.9% for lung subtyping and 16.0% for IDH1 mutation prediction) (Extended Data Figs. 1–3a). This gap was further supported by lower F1 scores, particularly for Black patients in lung subtyping and IDH1 mutation prediction (Fig. 2e,f), whereas the F1 score for white patients was lower in breast subtyping (Fig. 2d). Importantly, these performance disparities were substantially reduced when we used self-supervised patch encoders trained on histology images, such as CTransPath and UNI. For instance, when using UNI, the AUC gap between white and Black patients decreased to 2.2% for lung subtyping and to 12.3% for IDH1 mutation prediction (Fig. 2e,f), whereas it remained at 3.8% for breast subtyping (Fig. 2d). However, a closer examination of the high AUC values achieved with UNI revealed imbalances in the F1 score for race groups. Notably, the F1 scores for Black patients in the lung subtyping and IDH1 mutation prediction tasks remained notably lower than those for white patients (Fig. 2e,f). The AUC and F1 score do not pinpoint the specific error types contributing to lower performance seen among Black patients. Analyzing sensitivity for different classes stratified by race revealed that Black patients with LUAD and LUSC, along with Black patients with an IDH1 mutation, had notably poorer recall rates than the overall population, also evident from their negative TPR disparity values (Fig. 2c and Supplementary Data Tables 6 and 7). Moreover, in breast subtyping, in which AUC remained consistently high with CTransPath across Black, white and Asian patients, the TPR disparity revealed that Black patients with ILC and white patients with IDC underperformed compared to the general patient population (Extended Data Fig. 1e and Supplementary Data Table 5). In summary, our analysis highlights racial discrepancies in performance across subtyping and mutation prediction tasks.

Fig. 2: Investigating bias from data characteristics.

a, Race-stratified and overall ROC AUC (averaged over n = 20 folds) using ABMIL with the ResNet50IN, CTransPath and UNI patch encoders for lung and breast subtyping and IDH1 mutation prediction tested using resampled test sets. b, Patch encoder pretraining scale. The number of patches is shown in log scale. NA, not applicable. c, Race-stratified TPR disparities for each task. ABMIL with the UNI encoder (no data processing, called ‘Base’) is contrasted with a variant with site-preserved training and stain normalization tested on resampled test cohorts. Site-stratified training not available for IDH1 mutation prediction. d–f, Comparison of race-stratified and overall macro-averaged performance metrics for the ResNet50IN and UNI encoders with addition of data processing techniques for breast subtyping (d), lung subtyping (e) and IDH1 mutation prediction (f). Race group color scheme from a used in c–f. For subtyping, ABMIL was trained on TCGA-BRCA (n = 1,049 slides) and TCGA-lung (n = 1,043 slides) and tested on MGB-breast (n = 1,265 slides) and MGB-lung (n = 1,960 slides), respectively. For IDH1 mutation prediction, ABMIL was trained on the EBRAINS brain tumor atlas (n = 873 slides) and tested on TCGA-GBMLGG (n = 1,123 slides). In resampled test sets, 500 slides from each race and class were sampled. Boxes indicate quartile values of TPR disparity and performance metrics as defined by the respective axis (n = 10 folds for site-stratified splits and n = 20 folds for Monte Carlo splits), with the center being the 50th percentile. Whiskers extend to data points within 1.5 × the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. Presented P values are from a nonparametric two-sided paired permutation test after multiple-hypothesis correction. Demographic distributions for each task are available in Supplementary Data Tables 2–4.

Bias from data characteristics

With our baseline indicating the existence of notable disparities between different races, even with strong patch encoders, we next assessed whether existing approaches to mitigate variability and bias in the training and testing datasets help reduce the racial disparities. We explored approaches that are not directly race-aware (that is, site stratification and stain normalization) and an approach that is race-aware (that is, test set resampling).

Impact of site stratification

The TCGA-BRCA and TCGA-lung cohorts comprise digital histology slides from various tissue-contributing hospital sites. Due to differences in tissue preparation protocols and patient demographics across sites, a ‘dataset shift’ issue inevitably arises, in which models are developed and deployed on mismatched data distributions, a common failure mode of machine learning applications in healthcare17,21,46,85,86,87. Site-stratified CV is a bias mitigation strategy that holds out a subset of sites to prevent models from learning spurious correlations between site-specific factors and diagnoses15. To test whether site-specific demographic variability contributes to the performance disparity, we trained the ABMIL model using a tenfold site-stratified CV and various patch encoders (Extended Data Figs. 1 and 2b,f,j). Site-stratified splits, when used with a less robust patch encoder such as ResNet50IN, were found to exacerbate existing disparities in lung subtyping. Specifically, as measured by AUC, these splits led to a 5.2% drop for Asian patients, a 2.8% drop for Black patients and a 2.7% drop in overall performance (Extended Data Fig. 2b). Correspondingly, F1 scores also decreased across races with the ResNet50IN encoder (Fig. 2e). Using site-stratified splits with self-supervised patch encoders such as UNI and CTransPath showed improvements in subtyping AUC values for Black patients, albeit small ones (Extended Data Figs. 1 and 2f,j). For example, in breast subtyping, using site-stratified splits with the CTransPath encoder led to equalizing the AUC for white and Black patients (Extended Data Fig. 1f). Despite these improvements, the F1 score for Black patients remained lower than that for the overall population with the UNI encoder in lung subtyping (Fig. 2e). Recall for Black patients with ILC in breast subtyping (Supplementary Data Table 5) and those with the LUAD subtype in lung subtyping (Supplementary Data Table 6) was also lower. In summary, the effect of site stratification on performance is contingent on the patch encoder used. While it could exacerbate disparities with weaker encoders, it offered some disparity reduction with self-supervised encoders. Nevertheless, gaps persisted in lung subtyping, as measured by AUC, F1 score and TPR disparity. A similar investigation for IDH1 mutation prediction could not be performed as the EBRAINS brain tumor atlas does not provide information on the tissue source site.
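
A minimal sketch of site-stratified splitting, in which entire tissue-contributing sites are held out of training, might look as follows (the site names and split ratio are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-ins: each slide has a label and a contributing hospital site.
labels = np.random.randint(0, 2, size=1000)
sites = np.random.choice(["site_A", "site_B", "site_C", "site_D"], size=1000)

# Held-out sites never appear in training, so models cannot exploit
# site-specific staining or scanner artifacts at evaluation time.
splitter = GroupShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
folds = [(train_idx, val_idx)
         for train_idx, val_idx in splitter.split(labels, groups=sites)]
```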

Impact of stain normalization

In addition to site-stratified CV, stain normalization is a common domain-adaptation technique for reducing differences in staining variability and is advocated in tandem with site-stratified CV15,88,89,90,91. We thus used stain normalization on both training and testing cohorts, with the tenfold site-stratified CV, to investigate whether the disparities are reduced. The impact of stain normalization on disparities was again dependent on the patch encoder used and the disease being investigated (Fig. 2d–f). Using ResNet50IN and stain normalization in IDH1 mutation prediction led to increases in the race-stratified and overall AUC (Extended Data Fig. 3b). However, this intervention led to notable drops of 6.1% and 1.9% in Black patients’ AUC for breast and lung subtyping, respectively (Extended Data Figs. 1 and 2c). Conversely, stain normalization with stronger encoders did not offer any substantial TPR disparity reduction and also led to performance drops (Extended Data Figs. 1 and 2g,k). For example, with the CTransPath encoder, using stain normalization decreased Black patients’ AUC by 2.5% and 3.0% in breast and lung subtyping, respectively, while not notably affecting white patients’ performance. We note that, even with stain normalization, large gaps persisted in recall between Black patients and the overall population. For example, when using the UNI encoder, there was a mean TPR disparity of −0.060 (95% confidence interval (CI) −0.114, −0.020) for Black patients with LUAD (Supplementary Data Table 6) and −0.284 (95% CI −0.482, −0.086) for Black patients with an IDH1 mutation (Supplementary Data Table 7). Demographic gaps also persisted when applying stain normalization without site stratification (Extended Data Fig. 4 and Supplementary Data Table 8). Overall, depending on the disease being studied, stain normalization may be beneficial when used with less robust patch encoders.
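
Stain normalization can be implemented in several ways; as one widely used example (not necessarily the exact variant used in our experiments), Reinhard color normalization matches each image’s LAB-space statistics to those of a reference tile:

```python
import numpy as np
from skimage.color import rgb2lab, lab2rgb

def reinhard_normalize(patch_rgb, reference_rgb):
    """Match the per-channel LAB mean/std of a patch to a reference tile
    (Reinhard color normalization). Expects float RGB arrays in [0, 1]."""
    patch_lab = rgb2lab(patch_rgb)
    ref_lab = rgb2lab(reference_rgb)
    for c in range(3):                             # L, a and b channels
        mu_p, sd_p = patch_lab[..., c].mean(), patch_lab[..., c].std()
        mu_r, sd_r = ref_lab[..., c].mean(), ref_lab[..., c].std()
        patch_lab[..., c] = (patch_lab[..., c] - mu_p) / (sd_p + 1e-8) * sd_r + mu_r
    return np.clip(lab2rgb(patch_lab), 0.0, 1.0)

# normalized = reinhard_normalize(patch, reference_tile)
```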

Impact of test set resampling

Lastly, we investigated whether a disproportionate demographic composition in the test cohorts causes performance disparity. To this end, we constructed unbiased test cohorts by resampling with replacement an equal number of patients from each racial group and each class within the test cohorts92. We evaluated the models on resampled test cohorts, with stain normalization and site stratification still in effect. Nevertheless, performance disparities persisted among race groups irrespective of the patch encoder (Fig. 2d–f and Extended Data Figs. 1, 2d,h,l and 3c,f,i). In breast subtyping, a notable 9.2% AUC gap between Black and white patients was evident with the ResNet50IN encoder (Fig. 2d). Although the gaps were smaller at 2.8% with the UNI encoder, low F1 scores and substantial TPR disparities among Asian and Black patients with ILC indicate lingering disparities in breast subtyping (Fig. 2c and Supplementary Data Table 5). In lung subtyping, both ResNet50IN and UNI showed lower AUC and F1 scores for Black patients compared to the overall population and white patients, supported by negative TPR disparities among Black patients (Fig. 2e and Supplementary Data Table 6). In the prediction of IDH1 mutations, substantial gaps persisted between white and Black patients. For instance, using UNI, the AUC for Black patients was 14.0% lower than that for white patients, which is reflected in lower F1 scores for Black patients and negative TPR disparities among Black patients with an IDH1 mutation (Fig. 2f and Supplementary Data Table 7). Overall, despite correcting for imbalances in prevalence in the independent test sets, performance disparities continued to persist.
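
A sketch of this resampling procedure, drawing with replacement an equal number of slides from every race-class cell (the pandas column names are hypothetical):

```python
import pandas as pd

def resample_test_set(slides, n_per_cell=500, seed=0):
    """Resample with replacement an equal number of slides from every
    (race, class) cell to build a demographically balanced test set."""
    return (slides.groupby(["race", "label"], group_keys=False)
                  .apply(lambda g: g.sample(n=n_per_cell, replace=True,
                                            random_state=seed)))

# balanced = resample_test_set(test_df)   # 500 slides per race x class cell
```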

In summary, performance disparities among different racial groups persist even when accounting for data-related sources of disparities and dataset shifts. It is important to recognize that both the site stratification and stain normalization techniques have their pros and cons15. Site-stratified splits can aid in learning features that are resilient to site-specific image acquisition variations, potentially enhancing performance. However, they may also lead to performance declines, as the exclusion of certain sites could result in the exclusion or underrepresentation of specific demographic groups during training. Likewise, stain normalization can alleviate dataset shifts but may inadvertently remove staining distinctions that arise from the diverse underlying biology of individual patients with cancer. In general, we observed that self-supervised patch encoders, such as UNI, tend to remain indifferent to site-specific artifacts and staining variations, whereas weaker encoders remain more amenable to such techniques.

Bias from MIL model architectures

The rapid progress in computational pathology has led to the improvement of components from all stages of the typical DL pipeline used (Fig. 1b). We thus investigated the effect of different modeling choices for patch encoders (ResNet50IN69,84, CTransPath70 and UNI71) and aggregators (ABMIL72, CLAM4 and TransMIL11) on the disparities between different racial groups for breast subtyping, lung subtyping and IDH1 mutation prediction. We additionally implemented commonly used fairness-aware strategies to study whether these are effective in mitigating demographic biases. Our choice of bias mitigation strategies was governed by the ongoing debate in fair machine learning on whether DL model embeddings should differ based on protected attributes or remain agnostic to them28,54. Specifically, we used IW74,75, which emphasizes examples from underrepresented groups in the model’s training, and AR77, which trains the model to be agnostic to information predictive of protected attributes.
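
Both strategies admit compact implementations. IW reduces to per-sample weights inversely proportional to group size, and AR is commonly realized with a gradient reversal layer feeding a secondary race classifier; the sketch below shows one standard construction under these assumptions, with illustrative names.

```python
import numpy as np
from torch.autograd import Function

def importance_weights(races):
    """IW: per-patient sampling weights inversely proportional to the size
    of the patient's racial group (usable with WeightedRandomSampler)."""
    races = np.asarray(races)
    _, inverse, counts = np.unique(races, return_inverse=True,
                                   return_counts=True)
    return 1.0 / counts[inverse]

class GradReverse(Function):
    """AR building block: identity on the forward pass, negated and scaled
    gradient on the backward pass (a gradient reversal layer)."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# During training, the slide embedding feeds both the task head (task loss
# minimized) and, through GradReverse.apply(embedding, lam), a secondary
# race classifier, pushing the aggregator to maximize the race classifier's
# loss and thereby discouraging race-predictive features.
```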

Among all the modeling choices we investigated, the use of self-supervised patch encoders had the largest impact on the performance of breast and lung subtyping as well as IDH1 mutation prediction, irrespective of the patch aggregator and bias mitigation strategy used (Fig. 3b,d,f and Supplementary Data Tables 9–11). For instance, when we replaced ResNet50IN with UNI in ABMIL, the F1 score for Black patients showed improvements of 2.4% in breast subtyping, 21.6% in lung subtyping and 10.3% in IDH1 mutation prediction (Supplementary Data Tables 9–11). Similar gains were observed in subtyping and mutation prediction AUC (Fig. 3b,d,f). Notably, any amount of pretraining on histology images rather than natural images helped improve race-stratified performance (Extended Data Fig. 5a–c). The performance of different patch aggregators exhibited task dependency. For instance, when using the ResNet50IN encoder, more complex patch aggregators led to a reduction in overall performance in lung and breast subtyping (Fig. 3b,d), but they proved beneficial for IDH1 mutation prediction (Fig. 3f). Transformer-based patch aggregators such as TransMIL11 capture patch relations, whereas ABMIL72 assumes patch independence; such model inductive biases could interact differently with various tasks and diseases.

Fig. 3: Investigating bias from MIL model architectures and bias mitigation strategies.

a–f, Combinations of patch aggregators (ABMIL, CLAM, TransMIL), bias mitigation strategies (no mitigation (No mit.), IW, AR) and patch encoders (ResNet50IN, CTransPath, UNI) were evaluated for breast subtyping (a, b), lung subtyping (c, d) and IDH1 mutation prediction (e, f). For each task, the TPR disparity for Black patients is visualized in a, c and e, whereas shifts in performance due to modeling choices are depicted using the mean race-stratified and overall ROC AUC (n = 20 folds) in b, d and f. For breast and lung subtyping, ABMIL was trained on the TCGA-BRCA (n = 1,049 slides) and TCGA-lung (n = 1,043 slides) cohorts and tested on the respective resampled MGB cohorts (nwhite = nBlack = nAsian = 1,000, with 500 slides per subtype for each race). For IDH1 mutation prediction, ABMIL was trained on EBRAINS (n = 873 slides) and tested on the resampled TCGA-GBMLGG cohort (nwhite = nBlack = nAsian = 1,000, with 500 slides per class for each race). Error bars in bar plots indicate the 95% CI, whereas the center is the mean value (n = 20 folds). Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5 × the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. Demographic distributions for each task are available in Supplementary Data Tables 2–4.

We observed that IW had an adverse impact on race-stratified performance across different patch encoders and tasks, as evident from reductions in both the AUC (Fig. 3b,d) and F1 score (Supplementary Data Tables 9 and 10). While some configurations showed an improvement in TPR disparities with IW, such as ABMIL with CTransPath in lung subtyping, this improvement came at the expense of lowering the race-stratified performance (Supplementary Data Table 10). In contrast, AR with self-supervised patch encoders offered marginal enhancements in race-stratified performance and reductions in TPR disparities. For instance, in breast subtyping, using CTransPath with TransMIL and AR resulted in high AUC values and F1 scores for Black and white patients and minimal TPR disparities for Black patients (Fig. 3a,b and Supplementary Data Table 9). However, in lung subtyping, the gains with AR were limited, and the standard ABMIL with the UNI encoder and no bias mitigation strategy emerged as an effective approach for reducing disparities while maintaining high performance (Fig. 3c,d and Supplementary Data Table 10).

Considering TPR disparity and AUC concurrently, we observed that, in breast and lung subtyping, improvements in performance for individual racial groups contributed to narrowing the performance gaps between them. This was evidenced by the TPR disparity for Black patients approaching zero with an increase in AUC (Fig. 3a–d). For example, in lung subtyping, the mean AUC for Black patients improved from 0.795 (95% CI 0.771, 0.823) for ABMIL with the ResNet50IN encoder to 0.954 (95% CI 0.941, 0.967) for ABMIL with UNI (Fig. 3d), and the mean TPR disparity improved from −0.127 (95% CI −0.172, −0.092) to 0.012 (95% CI −0.007, 0.029) for Black patients with LUSC (Supplementary Data Table 10). However, in IDH1 mutation prediction, increases in AUC for Black patients were not accompanied by large improvements in TPR disparity for Black patients with an IDH1 mutation (Fig. 3e,f and Supplementary Data Table 11). This suggests that, although more robust patch encoders enhance performance for Black patients in predicting IDH1 mutations, the performance gaps between Black patients and the overall population persist. Hence, concurrently considering fairness and performance metrics provides insights into whether performance gains reduce disparities between race groups.

While numerous sophisticated bias mitigation and patch aggregation methods have been proposed, our findings indicate that, when combined with weaker patch encoders such as ResNet50IN, these complex methods are not as effective in reducing disparities as simpler aggregators paired with strong self-supervised patch encoders. This underscores that, while patch aggregators and bias mitigation strategies have a valuable role, they cannot substitute for robust patch encoders. Instead, they can provide incremental performance enhancements, as evidenced by their successful application in breast subtyping with the CTransPath encoder and AR. Nevertheless, we note that, even with the UNI encoder and ABMIL without a bias mitigation strategy, disparities of 4.4% and 9.4% in the F1 score persisted between white and Black patients in lung subtyping and IDH1 mutation prediction, respectively (Supplementary Data Tables 10 and 11). The same observation held true even when performance was assessed by race-stratified AUC (Fig. 3d,f) and TPR disparity (Supplementary Data Tables 10 and 11), indicating that more work is needed to mitigate demographic disparities in computational pathology.

Race prediction from pretrained MIL models

We further investigated whether patients’ race can be predicted from the slide-level representations (equation (2)) of models used in subtyping and mutation prediction. Previous works have demonstrated that histology carries information about race in TCGA due to correlations of hematoxylin-and-eosin stain intensity with hospital site and, thus, demographic information15; here, we investigated whether slide embeddings used for clinical tasks can also be used for race prediction. We trained models for all possible combinations of patch encoders, patch aggregators and bias mitigation strategies on the TCGA-BRCA and TCGA-lung subtyping and EBRAINS IDH1 mutation prediction tasks; froze the patch aggregators; and used logistic regression to predict race on the task-associated independent test cohorts20. We found that slide-level representations used for the subtyping and mutation prediction tasks were highly predictive of race and that race prediction performance was positively correlated with task performance on the test cohorts (Fig. 4a–c). This is in line with reports that protected attributes can be predicted from other medical imaging modalities20,93,94. We found that slide representations learned with stronger patch encoders were able to predict race better. For example, in IDH1 mutation prediction, ABMIL with the ResNet50IN patch encoder had a mean race prediction AUC of 0.590 (95% CI 0.539, 0.619), whereas the same patch aggregator with UNI features had a mean race prediction AUC of 0.852 (95% CI 0.839, 0.865) (Supplementary Data Table 44). Additionally, we found that models trained with AR, which is intended to remove race-predictive information from deep embeddings, still had a high race prediction AUC. For example, ABMIL trained with AR on lung subtyping with UNI features had a mean race prediction AUC of 0.787 (95% CI 0.773, 0.795), which is comparable to the race prediction AUC with no mitigation strategy (0.784 (95% CI 0.773, 0.794)) (Supplementary Data Table 43). Overall, models showing high performance may encode more protected attribute information, and popular bias mitigation strategies may not successfully overcome this. We want to emphasize that encoding protected attribute information should not be misconstrued as that information being used for downstream tasks. Our analysis demonstrates that the primary task and the protected attribute information may be related, but it does not establish a causal relation92.
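
A sketch of this linear probing protocol, in which frozen slide-level embeddings are fed to a logistic regression classifier evaluated with fivefold CV (the embedding dimensionality and placeholder data are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data; in practice these are slide-level representations from
# a frozen patch aggregator paired with self-reported race labels.
embeddings = np.random.randn(3000, 512)
race = np.random.choice(["Asian", "Black", "white"], size=3000)

# Fivefold CV linear probe: multiclass one-vs-rest ROC AUC measures how
# linearly decodable race is from task-trained slide embeddings.
probe = LogisticRegression(max_iter=1000)
aucs = cross_val_score(probe, embeddings, race, cv=5, scoring="roc_auc_ovr")
print(f"race prediction AUC: {aucs.mean():.3f}")
```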

Fig. 4: Evaluating race information in embeddings.

All combinations of patch encoders (ResNet50IN, CTransPath and UNI), patch aggregators (ABMIL, CLAM and TransMIL) and bias mitigation strategies (IW and AR) were trained for breast subtyping, lung subtyping and IDH1 mutation prediction using TCGA-BRCA (n = 1,049 slides), TCGA-lung (n = 1,043 slides) and the EBRAINS brain tumor atlas (n = 873 slides) in 20 Monte Carlo folds and tested on resampled MGB-breast, MGB-lung and TCGA-GBMLGG, respectively (all resampled test sets had nwhite = nBlack = nAsian = 1,000, with 500 slides per class for each race). After freezing the patch aggregators in all trained models, the logistic regression model was trained to use the slide-level embedding to predict race on the respective resampled test cohorts in fivefold CV studies. a–c, Overall task AUC (n = 20 folds) versus race prediction AUC (n = 5) plotted for breast subtyping (a), lung subtyping (b) and IDH1 mutation prediction (c). Each point corresponds to a combination of patch encoder, patch aggregator and bias mitigation strategy. ρ is the Spearman correlation coefficient associated with the trend line shown in dashed red and is presented with 95% CI. The x and y error bars indicate the 95% CI, with the center as the mean value. Demographic distributions for each task are available in Supplementary Data Tables 2–4.

Bias from training data type, diversity and size

In computational pathology, it is common to train on the multisite TCGA data and test on data from independent institutions13,95. To understand demographic disparities better, we also tried the inverse approach; that is, we trained models on MGB data for breast and lung cancer subtyping and then tested them on TCGA cohorts. We additionally examined the effects of training set diversity and size on performance disparity, creating sets that vary by the number of samples (namely, 5 samples per subtype (referred to as k = 10) and 25 samples per subtype (referred to as k = 50)) and by racial composition (white only, Asian and Black only, and a combination of all races). This approach was also applied reciprocally, with training on TCGA and testing on MGB. For each dataset size and composition, sampling was done ten times to create ten training folds (except for the ‘k = all’ category, in which 20 Monte Carlo folds were used). The UNI patch encoder and ABMIL aggregator were used for all experiments, as CTransPath was pretrained on TCGA70, which may lead to data leakage. A similar investigation for IDH1 mutation prediction could not be performed as the EBRAINS brain tumor atlas does not provide patient race information.
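
The construction of training folds that vary in size and racial composition can be sketched as follows (the pandas column names are hypothetical):

```python
import pandas as pd

def sample_training_fold(slides, races, k_per_subtype, seed):
    """Draw one training fold of a fixed racial composition and size:
    k_per_subtype slides per class, restricted to the listed races."""
    pool = slides[slides["race"].isin(races)]
    return (pool.groupby("label", group_keys=False)
                .apply(lambda g: g.sample(n=k_per_subtype, random_state=seed)))

# Ten folds of the "white only", k = 50 configuration (25 per subtype):
# folds = [sample_training_fold(train_df, ["white"], 25, seed=s)
#          for s in range(10)]
```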

Training on MGB and testing on TCGA

Training on the MGB-breast and MGB-lung cohorts and testing on the corresponding TCGA cohorts revealed that, although the ABMIL model discriminated between subtypes of breast and lung carcinomas with high efficacy for both white and Black patients, aligning with prior findings96, the performance measured by AUC was notably lower for Asian patients. For instance, in breast subtyping trained on all patients from all races, the mean AUC for TCGA white patients was 0.956 (95% CI 0.945, 0.967) compared to 0.889 (95% CI 0.874, 0.905) for Asian patients (Fig. 5a). Furthermore, in examining recall for each subtype, Asian patients demonstrated lower recall for the ILC subtype in breast subtyping and the LUSC subtype in lung subtyping than white patients, corroborated by race-stratified F1 scores and TPR disparities per subtype (Supplementary Data Tables 13 and 15). Despite similar subtyping AUC results for white and Black patients in TCGA-lung and TCGA-BRCA, this uniformity did not extend to other TCGA cohorts. For instance, Black patients exhibited a lower AUC compared to white patients in the TCGA-GBMLGG cohort for IDH1 mutation prediction, whereas the AUC for Asian patients was generally comparable to that for white patients (Fig. 2f). Internal TCGA and MGB cohorts also showed demographic disparities (Extended Data Fig. 6 and Supplementary Data Tables 31–34). These results indicate that demographic discrepancies in computational pathology extend beyond MGB cohorts.

Fig. 5: Effect of training set diversity and size on disparities.

ABMIL models with the UNI patch encoder were trained using different compositions of the training set: white patients only, Asian and Black patients only, and a combination of all racial groups. For each type of composition, the training set’s size was increased from 5 samples per subtype (k = 10) to 25 samples per subtype (k = 50) to training on all patients pertaining to that composition type (k = all), with sampling done ten times to create ten folds. For k = all, 20 Monte Carlo splits were used and the actual training dataset size is indicated. a–d, Race-stratified and macro-averaged overall ROC AUC (n = 10 folds) presented for breast subtyping, with training sets derived from MGB-breast and testing done on resampled TCGA-BRCA (a); lung subtyping, with training sets derived from MGB-lung and testing done on resampled TCGA-lung (b); breast subtyping, with training sets derived from TCGA-BRCA and testing done on resampled MGB-breast (c); and lung subtyping, with training sets derived from TCGA-lung and testing done on resampled MGB-lung (d). All resampled test sets had nwhite = nBlack = nAsian = 1,000, with 500 slides per subtype for each race. Boxes indicate quartile values of TPR disparity (n = 10 folds for k = 10 and k = 50 and n = 20 folds for k = all), with the center being the 50th percentile. Whiskers extend to data points within 1.5 × the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. Demographic distributions for each task are available in Supplementary Data Tables 2 and 3.

Effect of training dataset size

When expanding the training cohort size from k = 10 to k = 50 while considering all races within the MGB cohorts and testing on TCGA data, we observed substantial enhancements in race-stratified AUC across all racial groups for both subtyping tasks (Fig. 5a,b). These improvements in AUC were consistently mirrored by similar enhancements in race-stratified F1 scores and reductions in TPR disparities (Supplementary Data Tables 13 and 15). Conversely, models trained on larger TCGA cohorts containing all races and tested on MGB cohorts also exhibited improved race-stratified performance (Fig. 5c,d). Notably, training with smaller cohorts resulted in demographic disparities within the AUC values. For instance, in lung subtyping trained on TCGA and tested on MGB with k = 10 patients from all races, white patients initially had a mean AUC of 0.840 (95% CI 0.831, 0.850), while Black patients had a lower mean AUC of 0.740 (95% CI 0.724, 0.757). However, after training on all available patients, these values improved to 0.985 (95% CI 0.979, 0.990) for white patients and 0.954 (95% CI 0.941, 0.967) for Black patients (Fig. 5d). Similarly, in breast subtyping, the mean AUC for Black patients improved from 0.636 (95% CI 0.619, 0.654) to 0.944 (95% CI 0.933, 0.944) when training size increased from k = 10 to all patients (Fig. 5c). Similar gains were seen in F1 scores and reductions in TPR disparities (Supplementary Data Tables 12 and 14). Thus, our findings suggest that training on large datasets is vital for disparity mitigation.

Effect of demographic group diversity in training datasets

Expanding the demographic diversity in large public consortia, typically used as training datasets, results in enhanced race-specific model performance on independent test cohorts. For example, in breast subtyping with k = 50, ABMIL trained on the TCGA dataset that combines white, Asian and Black patients had an improved mean AUC of 0.889 (95% CI 0.873, 0.905) for Black patients compared to 0.850 (95% CI 0.831, 0.869) achieved when training only on white patients (Fig. 5c). These gains are corroborated by similar increases in the F1 score for Black patients in breast subtyping with k = 50 over models trained only on white patients in a similar configuration (Supplementary Data Table 12). Similarly, in lung subtyping with k = all, ABMIL trained on the TCGA dataset that combines white, Asian and Black patients achieved a higher mean AUC for Black patients (Fig. 5d). These improvements in performance also led to reductions in TPR disparity. For example, in breast subtyping with k = 50, ABMIL trained on TCGA white patients had a mean TPR disparity of −0.092 (95% CI −0.124, −0.068) for Black patients with ILC, which decreased to −0.059 (95% CI −0.084, −0.039) when ABMIL was trained on patients from all races (Supplementary Data Table 12). Similar trends in TPR disparities were also seen for lung subtyping with k = 50, specifically for Black patients with LUSC (Supplementary Data Table 14). When training for breast and lung subtyping on MGB and testing on TCGA, we observed that the AUC and F1 score improved for Black patients when training on k = all patients from all races compared to training on white patients only or on Asian and Black patients only (Fig. 5a,b and Supplementary Data Tables 13 and 15). These enhancements were supported by improvements in TPR disparities for Black patients with IDC and LUAD in breast and lung subtyping, respectively.

Overall, we found that demographic disparities in subtyping exist when training on MGB and testing on TCGA. Further investigations into the training set (TCGA or MGB) showed that larger and demographically diverse training sets can help alleviate disparities, and efforts should be made to collect and curate demographically diverse public consortia.

Investigating disparities beyond race

For lung and breast subtyping, we investigated whether computational pathology models perform equally well for subgroups defined by other demographic variables, such as insurance type, postal code-inferred median household income (referred to as ‘income’ for simplicity) and age groups (refer to ‘Forming demographic subgroups’ in Methods for further details). In addition, we analyzed disparities within a subgroup by stratifying the constituents of the subgroup by other demographic factors, referred to as ‘intersectional analysis’ (refs. 47,97). Notably, this analysis used the ABMIL model with UNI features trained on 20 Monte Carlo folds and evaluated on the original test sets. For IDH1 mutation prediction, such investigations were limited to patient race and age.

We observed that discrepant performance extends beyond racial subgroups, primarily for postal code income groups in lung subtyping (Supplementary Data Table 21), insurance groups in both lung and breast subtyping (Supplementary Data Tables 22 and 23), and age groups in IDH1 mutation prediction (Supplementary Data Table 27). For example, in lung subtyping on MGB-lung, we found patients from middle- and high-income postal codes to have higher F1 scores than patients from low-income postal codes. Specifically, the F1 score of patients from low-income postal codes in lung subtyping was 2.4% and 2.1% lower than that of patients from middle-income and high-income postal codes, respectively (Supplementary Data Table 21). These trends were also reflected by TPR disparities (Fig. 6a). Using the F1 score as a measure of performance, we consistently found across breast and lung subtyping that patients without Medicare (the American federal health insurance program for people aged 65 or older) had lower performance, indicating higher misdiagnosis rates (Supplementary Data Tables 22 and 23). In the IDH1 mutation prediction task, we found that patients in the ‘>40 and ≤60’ years age group had higher performance than younger patients (≤40 years) and older patients (>60 years) (Extended Data Fig. 7c and Supplementary Data Table 27). Overall, we found that demographic disparities in AI models may extend beyond self-reported race, echoing the findings of numerous previous studies98,99,100.
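
Both the single-factor and the intersectional analyses reduce to computing recall within increasingly fine strata and comparing it to the population recall; a sketch with hypothetical column names:

```python
import pandas as pd

def stratified_tpr_disparity(results, strata, positive_class):
    """Recall for positive_class within each demographic stratum minus the
    overall recall, yielding a per-subgroup TPR disparity table."""
    pos = results[results["label"] == positive_class]
    overall = (pos["pred"] == positive_class).mean()
    per_group = pos.groupby(strata).apply(
        lambda g: (g["pred"] == positive_class).mean())
    return per_group - overall

# Single-factor analysis, then an intersectional slice:
# stratified_tpr_disparity(test_df, ["income"], "LUAD")
# stratified_tpr_disparity(test_df[test_df["income"] == "low"],
#                          ["race"], "LUAD")
```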

Fig. 6: Investigating lung subtyping disparities beyond race.

TPR disparity was assessed in various demographic subgroups of the MGB-lung test cohort (n = 1,960 slides) for an ABMIL model trained with UNI features on the TCGA-lung cohort (n = 1,043 slides) in a 20-fold study for lung subtyping. a, TPR disparity for different income groups. b–d, TPR disparity computed for subgroups of LUAD and LUSC patients from low-income postal codes (n = 615 slides), stratified by other demographic variables: racial groups (b), insurance groups (c) and age groups (d). e, TPR disparity for different racial groups. f–h, TPR disparity computed for subgroups of white patients with LUAD and LUSC (n = 1,630 slides), stratified by other demographic variables: insurance groups (f), income groups inferred from postal code (g) and age groups (h). i–k, TPR disparity computed for subgroups of Black patients with LUAD and LUSC (n = 128 slides), stratified by other demographic variables: insurance groups (i), income groups inferred from postal code (j) and age groups (k). Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5 × the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. Presented P values are from a nonparametric two-sided paired permutation test after multiple-hypothesis correction. Demographic distributions for each task are available in Supplementary Data Table 3.

Through intersectional analysis, we can observe disparities among patients in demographic subgroups even after adjusting for confounding factors such as age. For example, in lung subtyping, when considering patients within the same age group of ≤70 years, we found that Black patients had worse performance than the overall population (Supplementary Data Table 25). In breast subtyping, we conversely found white patients to have worse performance in the ≤62 years age category (Supplementary Data Table 24). In IDH1 mutation prediction, Black patients had worse performance than the other race groups in both the ≤40 years and >60 years age groups (Supplementary Data Table 27). A similar analysis can be done for other demographic factors, such as postal code-inferred income. When lung cancer patients from low-income postal codes were stratified by race, we found that Black patients from this subgroup had lower recall for the LUAD subtype than white patients from the same subgroup (Fig. 6b and Supplementary Data Table 21). Differences by insurance type (Fig. 6c) and age (Fig. 6d) were also found for patients from low-income postal codes. Intersectional analysis also revealed that aggregating diverse individuals within coarse race groups can mask disparities47,97. For example, when white lung cancer patients (Fig. 6e) are stratified by postal code-inferred income (Fig. 6g) and age (Fig. 6h), differences in recall between subgroups are found, but no significant differences are found when stratifying by insurance type (Fig. 6f). Similarly, in IDH1 mutation prediction, when white patients were stratified by age, patients ≤40 years old and those >60 years old had worse performance (Extended Data Fig. 7b and Supplementary Data Table 26), in contrast to white patients overall having consistently high performance. Such intersectional analysis is not limited to one race group: we also found TPR disparities within Black patients by insurance, age and postal code income (Fig. 6i–k and Supplementary Data Table 19), which extended to breast subtyping as well (Extended Data Fig. 8i–k and Supplementary Data Table 18).

Analysis of misclassified cases

To better understand the failure modes of one of the models, a board-certified pathologist analyzed misclassified cases. For this, we chose the lung subtyping task trained on TCGA and tested on MGB because (1) large racial disparities in lung subtyping were seen despite using the best data processing and modeling choices, and (2) the morphological and tissue architectural characteristics of the disease are well established. The pathologist examined the pathology reports and WSIs used in the study for cases that were misclassified on at least two folds for ABMIL with the UNI encoder trained on the TCGA-lung dataset in 20 Monte Carlo folds and tested on the MGB-lung cohort. For each case, the pathologist determined whether the histology on the slide matched that in the corresponding pathology report, what the reported grade of the case was, whether the case was a biopsy or resection specimen, and whether lineage-specific immunohistochemistry (IHC) or special stains were used to make the diagnosis (for example, thyroid transcription factor 1 IHC or mucicarmine staining to make a diagnosis of LUAD, but not IHC for programmed death ligand 1 or anaplastic lymphoma kinase). For all cases, the morphology observed on the slide matched that in the pathology report; no misdiagnoses by the original pathologists were found in the subsequent review. In general, the misclassified cases tended to be poorly differentiated (when grade was provided in the report), to show morphological overlap with the other class (for example, solid architecture in LUAD), to be biopsy specimens (and thus to have limited tissue area) or to require lineage-specific stains to make the diagnosis, indicating that these cases were also difficult for the pathologists who originally signed out the cases and corroborating the difficulty for the model. Among these misclassified cases, those from white patients tended to require lineage-specific stains more often than did cases from Asian or Black patients (68.4% for white patients versus 50.0% and 42.4% for Asian and Black patients, respectively), although they were less often biopsy specimens (56.5% for white patients versus 63.2% and 72.7% for Asian and Black patients, respectively) (Supplementary Data Table 28). Moreover, within the misclassified cases, Asian and Black patients tended to be younger than white patients. Therefore, the decreased performance on cases from Asian and Black patients in this experiment could be due to the smaller tissue area (and, thus, fewer patches that may be informative for the diagnosis) available to the model, as opposed to differences in grade or morphology, at least as evidenced by similar proportions of cases across grades and lower proportions of cases requiring stains. When stratifying performance by specimen type (resection or biopsy), biopsy specimens generally had lower performance in both subtyping tasks (Supplementary Data Tables 29 and 30). A similar analysis of misclassified cases in TCGA-lung subtyping using ABMIL and the UNI encoder also revealed that these cases were usually poorly differentiated; however, they were mainly resection cases, and the patient reports did not often describe IHC testing (Supplementary Data Table 28). While such analysis posits one potential failure mode, the observation of disparities in larger tissue resections in IDH1 mutation prediction suggests that the root causes of these differences are not fully understood and warrant further investigation.

Discussion

In this study, we assessed the performance of state-of-the-art computational pathology approaches across different demographic subgroups, including racial and income groups, for binary classification of subtypes of breast and lung carcinomas and for predicting IDH1 mutations in gliomas. We observed variations in the performance of current computational pathology methods among different demographic subgroups, even after accounting for statistical differences and using site-specific CV techniques15,92. Notably, these demographic disparities became more pronounced when we used weaker patch encoders, but they were reduced when self-supervised patch encoders were used. Additional disparity reduction was achieved when AR was used with self-supervised encoders, effectively mitigating disparities in breast subtyping. Hence, more robust patch encoders offer a promising avenue for mitigating disparities. However, the persistent gaps in lung subtyping and IDH1 mutation prediction, despite the use of state-of-the-art modeling choices, indicate that the issue of demographic disparities remains unresolved101. We additionally observed that increasing the diversity and size of demographic groups in the training dataset, rather than oversampling from underrepresented groups during training, such as through IW, can help reduce demographic disparities. This underscores the necessity for large and diverse public datasets rather than relying solely on post hoc modeling solutions to address biases. Our findings also indicate that models with higher performance on the test cohort often encode more protected attribute information. This occurred despite the use of AR, a technique aimed at actively removing race-predictive information from deep embeddings. Finally, we also show that demographic gaps can extend beyond racial categories to include variations by postal code income, age and insurance status. Our study, along with a previous work16, demonstrates that conventional, state-of-the-art computational algorithms can exhibit demographic biases on common diagnostic tasks in breast and lung carcinomas and gliomas.

We hope that our study underscores the importance of considering fairness and performance simultaneously when assessing algorithms for clinical deployment to avoid prioritizing one over the other, as improving performance at the cost of fairness, or vice versa, raises complex ethical considerations for scientists and clinicians. Notably, we found that IW often reduced TPR disparity in subtyping tasks, but at the cost of performance. Similar degradation of performance with the use of algorithmic fairness methods has been noted in other studies on fair machine learning for clinical use102. While self-supervised patch encoders increased fairness and performance in subtyping, we found that in predicting IDH1 mutations, increases in overall performance did not lead to large reductions in gaps between white and Black patients. Such considerations can help avoid the selection of unfair models that might be difficult to identify solely based on overall population performance, which poses a risk of exacerbating existing inequalities in healthcare. Thus, simultaneously measuring bias and performance must become standard practice for medical imaging AI algorithms deployed in clinical settings.

Our finding that AR, the active removal of features predictive of protected attributes, can affect performance in different ways for different diseases suggests an intricate interplay between demographic groups and phenotypic traits103,104. Recently, there has been mounting evidence in population genetics and cancer genomics that genetic ancestry is an important biological determinant in cancer health disparities, which may also manifest as demographic-specific histological phenotypes due to the correlations between ancestry and race105,106,107. For instance, in breast and lung cancers, innate immune variants and gene mutation frequencies are known to differ across people of European, African and Asian ancestry108,109,110. This is in contrast to a prominent view in fairness literature, which proposes learning invariant representations to protected attributes such as race54. While our findings suggest that AR combined with self-supervised encoders might reduce demographic disparities in breast subtyping, such techniques might preclude learning population-specific histological phenotypes in the cancer types we investigated. Nonetheless, further research is required to confirm such hypotheses with larger and more diverse cohorts27,54,111.

As the identification and mitigation of bias are known to be difficult, our study has a few limitations. In IDH1 mutation prediction, the EBRAINS brain tumor atlas does not provide site or patient race information, limiting our understanding of the composition of the dataset and its contributions to disparities. Our external test datasets for breast and lung subtyping contain relatively few images, are derived from a single hospital system and mainly include insured patients. Moreover, the datasets used in the study were largely collected in North America and Europe, often excluding geographic regions such as Asia and Africa. Although the slide diagnosis labels in our external test cohort were reviewed by several pathologists, inconsistencies in self-reported race can introduce label noise. Furthermore, coarse race groups, such as ‘Black’ or ‘African American’, might mask variations within demographic groups97, which can be heterogeneous112. Demographic labels can be influenced by numerous characteristics, such as age, socioeconomic status and levels of cultural assimilation103. Mitigating such label noise remains challenging as traditional strategies of bias mitigation may not provide effective corrections, leading to inherent biases embedded in data and models. While the development of individual fairness criteria to eliminate label noise remains an ongoing challenge113, our findings may have implications across multiple healthcare fields, including radiology, dermatology and genome-wide association studies20,46,92,114. However, because we investigated only binary classification problems in this study, caution should be taken when attempting to generalize our findings to other machine learning tasks such as regression115, ranking116 and generative models117.

As slide classification via MIL has largely been approached using out-of-domain natural image pretrained patch encoders4,7,11, upgrading to in-domain, histopathology encoders developed using self-supervised learning is an intuitive solution for improving the general performance of pathology AI algorithms. However, although we observed self-supervised pretrained encoders to have a large impact on mitigating disparities, we refrain from suggesting that they are a complete remedy for fairness. While self-supervised pretraining of encoders on extensive histology data continues to be a rapidly developing area, such models have limited public availability, primarily due to the proprietary patient data used during training. Hence, our choice of foundation models investigated was limited to Vision Transformer architectures trained on histology images, as these are commonly used and available in the field of computational pathology. We encourage future studies to investigate how convolutional neural networks and other architectures pretrained on large-scale histology data affect disparities. Furthermore, the foundation models considered are trained on extensive data from various disease types, as large-scale single-disease data are often limited. Future work should explore the effects of pretraining on large-scale data from a single disease, as opposed to diverse cancer types, on disparities. Finally, as more histology data are collected and organized, the training scale of foundation models considered here will inevitably be eclipsed118,119. Future work could investigate the effect of pretraining dataset size, demographic composition (which is often unavailable) and training strategies on disparities in healthcare.

Although our analysis used a variety of performance and fairness metrics to establish disparities, it is important to acknowledge the limitations of using TPR disparity. TPR can be influenced by population and prevalence biases51, and dataset resampling has been suggested to address these issues92,120. Our findings revealed that TPR disparities persist even after mitigating prevalence shifts through resampling, corroborated by other performance metrics. However, it is essential to recognize that resampling provides only an approximation of ideal data. We performed resampling using patient race, which may not account for other confounders such as age. Future research should explore how a causal understanding of data generation processes and clinical covariates can inform resampling to yield more realistic yet unbiased test datasets121,122. Additionally, group-specific threshold selection has been proposed to reduce TPR differences47,63,64, especially when dealing with prevalence shifts. However, this technique has notable drawbacks. It requires an intersection between ROC curves of different demographics63,64, which may not always be feasible, particularly with an increasing number of demographic groups. Furthermore, race is a social construct, and there can be noise in self-reported race and other demographic variables, making boundaries between groups ambiguous123. More importantly, using non-biological factors such as race to define clinical thresholds may lead to disparities in healthcare settings27,28,124,125,126. With respect to clinical deployment, implementing group-specific thresholds necessitates knowing demographic variables, such as race, at the time of deployment, which may not always be possible. Finally, selecting fairness metrics and definitions is crucial because simultaneously fulfilling them may not be possible64,127, and they can also often conflict128. Nevertheless, we believe that striving for an equally high TPR across groups is essential to ensure that DL-based solutions maintain high clinical sensitivity across all subgroups.

While our study notes variability in the performance of AI models across different demographic groups, the exact role of self-reported race as a potential causal factor for these variations is far from definitive. Current research indicates that medical outcomes can be influenced by social determinants of health129,130,131 and disparities in healthcare access132,133,134, which have complex interplays with race and other demographic variables such as education level and sex. The independent test cohorts used in our investigation encompass a broad spectrum of demographic characteristics, including insurance status, age, income and race. Performance disparities in such a diverse setting suggest that a complex combination of both social and biological factors, including but not limited to self-reported race, may contribute to the observed differences in model performance. While our intersectional analysis aims to decipher bias from different demographic factors, we currently consider broad demographic groups to ensure sufficiently large sample sizes. Moreover, controlling for one factor, such as age, may not account for other factors, such as sex. Future research on larger patient cohorts should explore intersectional groups involving multiple specific demographic factors simultaneously. Finally, while our findings show that there is persistent low performance in lung subtyping and IDH1 mutation prediction for Black patients, we caution against generalizing this finding as indicative of a universal trend for a particular group because one must recognize the substantial biological heterogeneity inherent within coarse racial categorizations52,106,135,136. While we do not claim that computational pathology systems consistently underperform in any single demographic group, we highlight the existence of notable demographic variances across numerous datasets. Such differences warrant careful rectification before clinical application to ensure equitable healthcare outcomes.

Recent investigations into demographic disparities within algorithms used in healthcare are more than theoretical inquiries; they carry profound implications, directly contributing to the enhancement of healthcare equity and quality across all populations18,27,28,54,137,138,139,140. However, algorithms are currently approved without necessitating the provision of test cohort demographics or the explicit reporting of their performance across different demographic groups (Supplementary Data Table 1). In upcoming years, frameworks for auditing AI algorithms will likely have an important role in clinical deployment. This study, with support from an extensive body of literature20,46,47,48,49,80,141, underscores that algorithms do not perform equally well across different demographic categories. If left unchecked, such failure modes may amplify existing healthcare inequities142,143,144. We encourage medical regulatory agencies to consider these findings and make reporting of demographic-stratified metrics necessary when evaluating models for clinical deployment and public use. This aligns with reporting guidelines such as CONSORT-AI (Consolidated Standards of Reporting Trials extension for AI interventions)145 and STARD-AI (AI-specific version of the Standards for Reporting of Diagnostic Accuracy Studies checklist)146, which advocate for transparent reporting of AI performance assessments. Such measures can increase the trust in medical imaging AI and lead to more effective adoption by clinicians. Overall, we hope that our study serves as an entry point for investigations into the complex entanglements between demographic factors, DL and the clinical practice of pathology and for the implementation of policies by relevant stakeholders to ensure that AI algorithms are developed to be safe and effective for patients across diverse demographic backgrounds.

Methods

Dataset description

The MGB institutional review board approved the retrospective analysis of pathology slides and corresponding pathology reports. Research conducted in this study involved a retrospective analysis of pathology slides, and the participants were not directly involved or recruited for the study. The requirement for informed consent for analyzing archival pathology slides was waived. Before scanning and digitization, all pathology slides were deidentified to ensure anonymity. Likewise, all digital data, which encompassed WSIs, pathology reports and electronic medical records, underwent deidentification before being subjected to computational analysis and model development. Sample sizes were determined by data availability, and all in-house data used in the research were dated between 2016 and 2022.

Our overall dataset comprised a total of 7,313 WSIs, including both publicly available and in-house datasets, amounting to 8.0 terabytes of raw data. The population demographics of each dataset are provided in Supplementary Data Tables 2–4. The data were drawn from the following sources.

The Cancer Genome Atlas

The TCGA dataset is a public and comprehensive collection of genomic and clinical data from various cancer types. In this study, we used the TCGA-BRCA (breast invasive carcinoma collection), TCGA-LUAD, TCGA-LUSC, TCGA-GBM and TCGA-LGG cohorts. We refer to the combined set of TCGA-LUAD and TCGA-LUSC as the TCGA-lung cohort. We refer to the combined set of TCGA-GBM and TCGA-LGG as the TCGA-GBMLGG cohort. We used 1,049 WSIs from the TCGA-BRCA cohort sourced from 40 different tissue-contributing sites. Of the 1,049 TCGA-BRCA WSIs, 838 are WSIs of IDC and 211 are WSIs of ILC. The TCGA-lung cohort comprised 1,043 lung WSIs from 73 distinct tissue-contributing sites in the TCGA dataset. The TCGA-lung cohort had 531 cases of LUAD and 512 cases of LUSC. We used 1,123 WSIs from TCGA-GBMLGG, which were collected from 37 tissue-contributing sites. TCGA-GBMLGG comprised 698 WSIs of IDH1 wild-type cancers and 425 WSIs of IDH1 mutant cancers. For TCGA-BRCA, TCGA-lung and TCGA-GBMLGG, only representative formalin-fixed, paraffin-embedded diagnostic slides that had tumor tissues present were included. No independent test cohorts contributed data to TCGA. The TCGA WSIs and associated clinical data can be accessed through the National Cancer Institute’s Genomics Data Commons portal (https://portal.gdc.cancer.gov/) and the cBioPortal (https://www.cbioportal.org). Any clinical data missing on cBioPortal were acquired from pathology reports provided by the Genomics Data Commons portal.

In-house data

The in-house data collected from MGB consisted of 3,225 WSIs corresponding to the same number of cases. To select patients for the study, we queried our in-house database of pathology slides for patients from 2016 to 2022. We selected all patient cases within this period with available slides that met the following inclusion criteria: (1) have a lower-magnification downsample for segmenting and processing the tissue image and (2) have nonzero tumor content. Cases with missing slides were excluded. Cases with blurry scans were rescanned and included in the study. This dataset consists of cases of invasive breast carcinoma, which we call the MGB-breast cohort, and cases of adenocarcinoma and squamous cell carcinoma of the lung, which we collectively refer to as the MGB-lung cohort. The MGB-breast cohort comprised 1,265 invasive breast cancer cases, including 982 cases of IDC (MGB-IDC) and 283 cases of ILC (MGB-ILC). The MGB-lung cohort comprised 1,960 cases, consisting of 1,626 cases of LUAD (MGB-LUAD) and 334 cases of LUSC (MGB-LUSC). These slides were scanned either at 20× magnification using an Aperio GT450 scanner or at 40× magnification using a Hamamatsu S210 scanner (and included 20× and 10× pyramid downsamples). For MGB-breast, 208 cases were scanned using the Hamamatsu S210 scanner (n_Asian = 92, n_Black = 116), whereas 1,057 cases were scanned using the Aperio GT450 scanner (n_white = 904, n_Asian = 50, n_Black = 48, n_nonreporting/other = 55). For the MGB-lung cohort, 134 cases were scanned with the Hamamatsu S210 scanner (n_Asian = 74, n_Black = 60), and 1,826 cases were scanned with the Aperio GT450 scanner (n_white = 1,630, n_Asian = 67, n_Black = 68, n_nonreporting/other = 61). Extended Data Fig. 9 compares the hue and saturation of slides corresponding to different races; we found no statistically significant differences in hue and saturation between the slides of different races. For in-house patients, the diagnosis assigned was based on a comprehensive review conducted by multiple pathologists. Protected patient information, including self-reported race categories (‘white’, ‘Asian’ and ‘Black’), age at diagnosis, postal code and patient insurance type, was collected from electronic medical records. No slides from MGB were contributed to TCGA’s data collection initiative. For data availability, refer to ‘Data availability’; for institutional review board approval, refer to ‘Dataset description’ above.

EBRAINS brain tumor atlas

The EBRAINS brain tumor atlas67 is a public dataset collected by digitizing a considerable portion of a large, dedicated brain tumor bank based at the Division of Neuropathology and Neurochemistry of the Medical University of Vienna, covering brain tumor cases from 1995 to 2019 (ref. 67). Slides were digitized using a Hamamatsu NanoZoomer 2.0 HT scanner at 40× magnification. At least two experienced neuropathologists checked each slide scan to ensure conformity of the diagnosis with the current revised 4th edition of the World Health Organization Classification of Tumours of the Central Nervous System and to ensure sufficient scan quality. Ambiguous cases were excluded, and WSIs of inferior quality were rescanned. We selected 873 cases with known IDH1 mutation status. There were 540 IDH1 wild-type cases and 333 IDH1 mutant cases. There were 508 GBM cases and 365 LGG cases. No information on patient race, insurance and income is provided, whereas the patients’ age and sex are known. The EBRAINS brain tumor atlas is available publicly from the official EBRAINS data portal (https://search.kg.ebrains.eu/instances/Dataset/8fc108ab-e2b4-406f-8999-60269dc1f994).

Processing of digital histology slides

As WSIs are exceptionally large (often spanning 150,000 × 150,000 pixels), it is computationally infeasible to use raw WSIs directly in DL pipelines147. In line with previous work4,95,148, we first segmented the tissue from the background, divided the tissue into smaller nonoverlapping tiles (referred to as patches, in which all patches from a WSI comprise a bag of patches) and encoded the patches into compact feature vectors using pretrained neural networks. The details of these steps are as follows.

Tissue segmentation

Tissue from WSIs was segmented using the CLAM4 library at 20× magnification for each slide. First, the image was converted from RGB (red, green, blue) to HSV (hue, saturation, value) color space. Binary thresholding was then applied to the ‘saturation’ channel of the HSV-space image to compute a binary mask for tissue regions. To refine further the segmented tissue contours and address potential artifacts such as small gaps and holes, we used a combination of median blurring and morphological closing techniques. To obtain the final tissue segmentation, we subjected the approximate contours of the detected tissue and tissue cavities to a filtering process.
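For illustration, a minimal sketch of this thresholding pipeline is shown below, assuming OpenCV; the threshold, kernel sizes and area cutoff are illustrative defaults rather than the exact values used by CLAM.

```python
import cv2
import numpy as np

def segment_tissue(rgb, sat_thresh=20, blur_ksize=7, close_ksize=7, min_area=1000):
    """Return a binary tissue mask from a downsampled RGB slide image.

    Parameter values are illustrative; CLAM exposes these as tunable
    segmentation parameters.
    """
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)
    saturation = cv2.medianBlur(hsv[:, :, 1], blur_ksize)  # suppress small artifacts
    # Binary threshold on the saturation channel: tissue is more saturated
    _, mask = cv2.threshold(saturation, sat_thresh, 255, cv2.THRESH_BINARY)
    # Morphological closing fills small gaps and holes in the mask
    kernel = np.ones((close_ksize, close_ksize), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # Filter contours: keep sufficiently large tissue regions, drop specks
    contours, _ = cv2.findContours(mask, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
    clean = np.zeros_like(mask)
    for contour in contours:
        if cv2.contourArea(contour) > min_area:
            cv2.drawContours(clean, [contour], -1, 255, thickness=cv2.FILLED)
    return clean
```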

Patching

The segmented tissue was cropped into 256 × 256 patches (no overlap). This was performed at 20× magnification if available in the image pyramid range; otherwise, 512 × 512 patches were cropped from the 40× magnification and resized to 256 × 256 (ref. 4).
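A minimal sketch of the grid-cropping logic, assuming openslide-python and that level 0 corresponds to 20× magnification; `mask_covers` is a hypothetical helper testing whether a tile overlaps the tissue mask, and the 40× fallback (512 × 512 crops resized to 256 × 256) is omitted.

```python
import openslide

def iter_patches(slide_path, tissue_mask, patch_size=256):
    # Yield nonoverlapping RGB tissue tiles from a WSI; assumes level 0 is 20x.
    slide = openslide.OpenSlide(slide_path)
    width, height = slide.level_dimensions[0]
    for y in range(0, height - patch_size + 1, patch_size):
        for x in range(0, width - patch_size + 1, patch_size):
            if mask_covers(tissue_mask, x, y, patch_size):  # hypothetical helper
                tile = slide.read_region((x, y), 0, (patch_size, patch_size))
                yield tile.convert('RGB')
```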

Feature extraction

As each bag of patches representing WSIs can be extremely large (for example, >10,000 patches149), we encoded each patch into a compact low-dimensional feature vector using a pretrained neural network, which is referred to as the ‘patch encoder’. Because the patch encoder is frozen during training, the choice of pretraining data and strategy is essential. In this study, we examined the use of three patch encoders: a ResNet50 convolutional neural network69 pretrained on natural images (ImageNet)84, a Swin transformer150 trained on approximately 15.6 million histology images (from TCGA and PAIP (Pathology AI Platform, http://wisepaip.org/paip)) using MoCo v3 (refs. 70,151), and a ViT-L/16 transformer152 trained on approximately 100 million histology images using a DINOv2 self-supervised pretraining scheme71,153. For the ResNet patch encoder, the adaptive mean-spatial pooling after the third residual block of the network was used to convert each patch into a compact 1,024-dimensional feature vector. For the Swin transformer and ViT-L patch encoders, each patch was first resized to a 224 × 224 image and then the provided model weights were loaded in the architecture to convert each patch into a 768-dimensional and 1,024-dimensional feature vector, respectively. The ResNet encoder is referred to as ResNet50IN; the Swin transformer is referred to as CTransPath; and the ViT-L/16 encoder is referred to as UNI. To increase the speed of the feature extraction step, we used three NVIDIA 3090Ti graphics processing units (GPUs) with a batch size of 384 per GPU. To test the effect of the pretraining scale on demographic disparities, we also used ViT-L/16 transformers152 trained on approximately 1 million and 16 million histology images using a DINOv2 self-supervised pretraining scheme71,153, naming the encoders UNI-1M and UNI-16M, respectively. While none of UNI, UNI-1M and UNI-16M was pretrained on any data used in this study, we note that CTransPath was originally pretrained on TCGA (without any subtype or mutation labels). Thus, MIL models using CTransPath trained on EBRAINS and evaluated on TCGA may unfairly inflate performance due to data leakage from pretraining. Finally, the demographic composition of the pretraining datasets for the encoders used is not available and is often challenging to collect, as in the case of CTransPath, which was pretrained on public datasets that organize histology images from worldwide institutions where demographic data may not have been collected (http://wisepaip.org/paip).
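The extraction step amounts to running a frozen encoder over each bag of patches. The sketch below uses a torchvision ImageNet-pretrained ResNet50 trunk as a stand-in (the study’s ResNet50IN pooled features after the third residual block to obtain 1,024-dimensional vectors, and CTransPath and UNI are loaded analogously from their released weights).

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet50, ResNet50_Weights

# Frozen ImageNet-pretrained ResNet50 trunk as a stand-in patch encoder.
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Identity()  # drop the classification head
model.eval()

transform = T.Compose([
    T.Resize(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_patches(patches):
    # Encode a list of PIL patches into an (N, d) bag of feature vectors;
    # d = 2,048 for this stand-in trunk (the study's encoders differ).
    batch = torch.stack([transform(p) for p in patches])
    return model(batch)
```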

Stain normalization

To extract stain-normalized features, we applied Macenko stain normalization to individual image patches before they were input into the patch encoder, using the implementation from https://slideflow.dev/slide_processing/ (ref. 154).
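For intuition, a compact NumPy sketch of the Macenko method is given below; it is an illustrative re-implementation, not the Slideflow code used in the study, and the reference stain matrix and concentration maxima are commonly used defaults rather than values from this work.

```python
import numpy as np

def macenko_normalize(img, Io=240, alpha=1, beta=0.15):
    """Minimal Macenko stain normalization of an RGB patch (uint8).

    Illustrative only; reference values below are common defaults.
    """
    HE_ref = np.array([[0.5626, 0.2159],
                       [0.7201, 0.8012],
                       [0.4062, 0.5581]])
    maxC_ref = np.array([1.9705, 1.0308])

    h, w, _ = img.shape
    od = -np.log((img.reshape(-1, 3).astype(float) + 1) / Io)  # optical density
    od_valid = od[~np.any(od < beta, axis=1)]                  # drop near-white pixels

    # Plane spanned by the two leading eigenvectors of the OD covariance
    _, eigvecs = np.linalg.eigh(np.cov(od_valid.T))
    proj = od_valid @ eigvecs[:, 1:3]
    phi = np.arctan2(proj[:, 1], proj[:, 0])
    v1 = eigvecs[:, 1:3] @ np.array([np.cos(np.percentile(phi, alpha)),
                                     np.sin(np.percentile(phi, alpha))])
    v2 = eigvecs[:, 1:3] @ np.array([np.cos(np.percentile(phi, 100 - alpha)),
                                     np.sin(np.percentile(phi, 100 - alpha))])
    HE = np.array([v1, v2]).T if v1[0] > v2[0] else np.array([v2, v1]).T

    # Stain concentrations by least squares, rescaled to the reference
    C = np.linalg.lstsq(HE, od.T, rcond=None)[0]
    C *= (maxC_ref / np.percentile(C, 99, axis=1))[:, None]
    out = Io * np.exp(-HE_ref @ C)
    return np.clip(out.T.reshape(h, w, 3), 0, 255).astype(np.uint8)
```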

Weakly supervised classification

In this study, we trained MIL-based weakly supervised WSI classification algorithms for three binary classification tasks: IDC versus ILC (referred to as ‘breast subtyping’), LUAD versus LUSC (referred to as ‘lung subtyping’) and IDH1 wild-type versus IDH1 mutant (referred to as ‘IDH1 mutation prediction’). DL models have been shown to perform with high accuracy on these tasks3,71,155,156,157,158,159, eliminating the need to optimize the models’ performance on the classification task and instead allowing us to focus on the models’ performance within demographic groups. For slide-level classification of pathology images, three MIL approaches were implemented: ABMIL72, CLAM4 and TransMIL11. These approaches were chosen because they can perform slide-level classification without any region-of-interest extraction or patch-level annotations, can be adapted to both tumor biopsy and resection cases, and have previously demonstrated strong performance on the TCGA dataset and independent test cohorts148. The implementation details of ABMIL, CLAM and TransMIL are covered next.

ABMIL and CLAM

To learn histology-specific feature representations, we passed the patch embeddings extracted by the patch encoder through three fully connected layers. These layers are respectively parameterized by \({W}_{1}\in {{\mathbb{R}}}^{768\times 1,024},{W}_{2}\in {{\mathbb{R}}}^{512\times 768}\) and \({W}_{3}\in {{\mathbb{R}}}^{512\times 512}\). The bias terms are implied and omitted for simplicity. Each fully connected layer is followed by rectified linear unit activation. Thus, each patch embedding \({z}_{k}\in {{\mathbb{R}}}^{1,024}\) is mapped to \({h}_{k}\in {{\mathbb{R}}}^{512}\), which serves as the input to downstream patch aggregation. ABMIL uses an attention module to learn to rank the relative importance of each image patch to classify the WSI. The attention module takes in each patch embedding hk and learns the weights \({V}_{a}\in {{\mathbb{R}}}^{256\times 512},{U}_{a}\in {{\mathbb{R}}}^{256\times 512}\) and \({W}_{a}\in {{\mathbb{R}}}^{1\times 256}\) to score the patch (ak):

$${a}_{k}=\frac{\exp \{{W}_{a}(\tanh ({V}_{a}\cdot {h}_{k})\odot \,{{\mbox{sigmoid}}}\,({U}_{a}\cdot {h}_{k}))\}}{\mathop{\sum }\nolimits_{j = 1}^{K}\exp \{{W}_{a}(\tanh ({V}_{a}\cdot {h}_{j})\odot \,{{\mbox{sigmoid}}}\,({U}_{a}\cdot {h}_{j}))\}}.$$
(1)

The slide-level representation, \({h}_{\textrm {slide}}\in {{\mathbb{R}}}^{512}\), is then the attention-weighted sum of the patch embeddings:

$${h}_{\textrm {slide}}=\mathop{\sum }\limits_{j=1}^{K}{a}_{j}\cdot {h}_{j}.$$
(2)

A dropout layer (P = 0.25) is used after each layer in the attention backbone for model regularization. The deep slide features hslide are then fed to a fully connected layer (defined by \({W}_{\textrm {c}}\in {{\mathbb{R}}}^{1\times 512}\)), which is followed by the softmax operator to produce slide-level binary class prediction probabilities, pslide:

$${p}_{\textrm {slide}}=\,{{\mathrm{softmax}}}\,\{{W}_{\textrm{c}}\cdot {h}_{\textrm {slide}}\}.$$
(3)

CLAM, in addition to slide-level classification, also performs instance-level clustering as additional supervision to constrain similar diagnostic image regions with similar importance weights4.
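For intuition, a minimal PyTorch sketch of the gated attention module in equations (1)–(3) is shown below; layer dimensions follow the parameterization above, while dropout and the CLAM instance-level branch are omitted.

```python
import torch
import torch.nn as nn

class GatedABMIL(nn.Module):
    """Minimal gated attention MIL head following equations (1)-(3)."""
    def __init__(self, in_dim=1024, attn_dim=256, n_classes=2):
        super().__init__()
        # Three fully connected layers (W1, W2, W3) with ReLU activations
        self.pre = nn.Sequential(nn.Linear(in_dim, 768), nn.ReLU(),
                                 nn.Linear(768, 512), nn.ReLU(),
                                 nn.Linear(512, 512), nn.ReLU())
        self.V = nn.Linear(512, attn_dim)  # tanh branch (V_a)
        self.U = nn.Linear(512, attn_dim)  # sigmoid gate (U_a)
        self.w = nn.Linear(attn_dim, 1)    # attention score (W_a)
        self.classifier = nn.Linear(512, n_classes)  # W_c

    def forward(self, bag):                # bag: (K, in_dim) patch features
        h = self.pre(bag)                  # (K, 512)
        scores = self.w(torch.tanh(self.V(h)) * torch.sigmoid(self.U(h)))
        a = torch.softmax(scores, dim=0)   # attention over patches, eq. (1)
        h_slide = (a * h).sum(dim=0)       # attention-weighted pooling, eq. (2)
        # In practice the raw logits feed a cross-entropy loss; the softmax
        # here mirrors equation (3).
        return torch.softmax(self.classifier(h_slide), dim=-1)
```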

TransMIL

TransMIL11 approximates self-attention with Nyström attention160 to perform self-attention-based pooling of individual patches in a WSI. Specifically, TransMIL first reshapes (‘squares’) a sequence of patch embeddings into a 2D grid, applies different-sized convolutions to encode spatial information, flattens the sequence of patches back, attaches a class token (called the CLS token) and then uses multihead self-attention (a linear approximation of self-attention provided by Nyström attention160) to learn the correlations between patches and the encoded spatial information.

Data-splitting strategies

Two different strategies were used for creating training and validation cohorts. We first used the common strategy of Monte Carlo CV, in which the dataset is randomly split into training and validation data in a fixed ratio (stratified by subtype or mutation label); this process is repeated for a fixed number of folds. In our study, we used 90% of the data for training and 10% for validation over 20 folds4,10. Previous work has shown that there are site-specific digital histology signatures in the TCGA dataset15, and different tissue-contributing sites have different racial compositions. Thus, if the submitting sites within a dataset are randomly split into equal-sized groups for Monte Carlo CV, it is likely that a feature of interest would not be evenly represented among these groups, resulting in biased estimates of accuracy. Hence, we used the quadratic programming solution from Howard et al.15 to generate tenfold site-stratified splits for the TCGA-BRCA and TCGA-lung cohorts, which led to approximately 90% of the data for training and 10% for validation. The public Python package for site-stratified split generation was accessed at https://github.com/fmhoward/PreservedSiteCV. Ten was the maximum number of folds that the quadratic programming solution provided by Howard et al.15 could converge on for the TCGA-lung cohort. Site-stratified splits could not be made for the EBRAINS brain tumor atlas as it does not provide information on tissue source sites. Multiple training folds were used for all tasks to avoid bias towards certain data. Validation splits were used for saving model checkpoints. Models from all folds were tested on task-specific independent test cohorts, and mean metrics are reported. Slides from the same case were not distributed between training and validation splits.
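The Monte Carlo splits can be sketched with scikit-learn’s StratifiedShuffleSplit, assuming `case_ids` and `labels` are arrays of case identifiers and subtype or mutation labels (site-stratified splits for TCGA were generated separately with the PreservedSiteCV package).

```python
from sklearn.model_selection import StratifiedShuffleSplit

# 20 Monte Carlo folds: random 90%/10% train/validation splits,
# stratified by the subtype or mutation label.
splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.1, random_state=0)
folds = [(train_idx, val_idx)
         for train_idx, val_idx in splitter.split(case_ids, labels)]
```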

Training details

The training of all models was done using the AdamW optimizer161. Following previous studies4,95,148, we trained the ABMIL and CLAM models using a learning rate of 1 × 10−4 and an L2 weight decay of 1 × 10−5. Following Shao et al.11, we trained TransMIL using a learning rate of 2 × 10−4 and an L2 weight decay of 1 × 10−5. ABMIL and TransMIL were trained by minimizing the cross-entropy loss. CLAM was trained using a weighted loss of the cross-entropy loss for slide classification and the smooth top-1 support vector machine loss162 for distinguishing high- and low-attention patches in instance-level clustering. Following CLAM, the weights were set to be 0.7 and 0.3 for the slide-level loss and instance-level loss, respectively, and the temperature scaling parameter α and the margin parameter τ were both set to 1.0. The weights and bias parameters of all the models were initialized randomly. During training, unless otherwise stated, slides were randomly sampled from the training cohort and provided to the model with a mini-batch size of 1. All models were trained for a maximum of 20 epochs. After the initial ten epochs, if the loss on the respective validation fold had not decreased for five consecutive epochs, early stopping was triggered and model weights were saved. For each fold, the model checkpoint with the lowest validation loss was used for evaluation on the respective independent test cohort.
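A sketch of this training loop is shown below; `train_one_epoch` and `evaluate` are hypothetical helpers standing in for the per-epoch training and validation passes.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
criterion = torch.nn.CrossEntropyLoss()

best_val_loss, epochs_without_improvement = float('inf'), 0
for epoch in range(20):                                 # maximum of 20 epochs
    train_one_epoch(model, optimizer, criterion)        # hypothetical helper
    val_loss = evaluate(model, criterion)               # hypothetical helper
    if val_loss < best_val_loss:
        best_val_loss, epochs_without_improvement = val_loss, 0
        torch.save(model.state_dict(), 'best_fold.pt')  # checkpoint on improvement
    elif epoch >= 10:                                   # only after the initial ten epochs
        epochs_without_improvement += 1
        if epochs_without_improvement >= 5:             # five epochs without improvement
            break                                       # early stopping
```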

Bias mitigation strategies

Common preprocessing and in-processing bias mitigation strategies were applied to investigate their ability to reduce differences between different demographic groups. These strategies included IW74,75,79,163,164 from preprocessing and AR77 from in-processing. While IW tries to improve a model’s performance on underrepresented samples by weighted sampling (inversely proportional to a group’s size), AR encourages the model not to use information correlated with protected attributes. The bias mitigation techniques, which need access to protected attributes, were applied only to the TCGA-BRCA and TCGA-lung cohorts during the training phase, and no mitigation technique was applied to the MGB independent test set. Both of the bias mitigation techniques are model agnostic. Bias mitigation strategies could not be applied to the IDH1 mutation prediction problem as race information for the EBRAINS brain tumor atlas is not provided. We now cover their implementation details.

Importance weighting

In IW (Fig. 1c), samples from the underrepresented groups in the dataset are shown more frequently to the model, giving them higher importance and thereby improving the performance of the model on such groups165. To implement IW, we first calculated the proportions of different races in the overall TCGA-lung and TCGA-BRCA datasets. We considered the ‘white’, ‘Asian’ and ‘Black’ race categories. As nonreporting patients account for approximately 10% of the overall dataset, we also considered the nonreporting group as a category so as not to reduce the training dataset size substantially. During the training of the subtyping models, each patient from the training dataset was randomly sampled with a probability that was inversely proportional to the representation of the patient’s race in the overall dataset. Thus, underrepresented patients, such as ‘Black’ and ‘Asian’ patients, were sampled more often and shown more frequently to the model than overrepresented ‘white’ patients. Such weighted random sampling was not applied to the validation and test splits; the model was evaluated on each sample from the validation split and independent test dataset only once.
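A sketch of this sampling scheme with PyTorch’s WeightedRandomSampler, assuming `race_labels` holds one group label per training case and `train_dataset` is the training split.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Per-case weights inversely proportional to the prevalence of the case's
# race group ('white', 'Asian', 'Black' or 'nonreporting') in the dataset.
groups, counts = np.unique(race_labels, return_counts=True)
frequency = dict(zip(groups, counts / counts.sum()))
weights = torch.tensor([1.0 / frequency[r] for r in race_labels],
                       dtype=torch.double)

# Underrepresented groups are drawn more often; applied to training only.
sampler = WeightedRandomSampler(weights, num_samples=len(race_labels),
                                replacement=True)
loader = DataLoader(train_dataset, batch_size=1, sampler=sampler)
```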

Adversarial regularization

To make the model agnostic to the information related to the sensitive attribute (that is, race), we first passed the slide-level representation (the same representation used to learn the main task of subtyping) through a fully connected layer to predict the attribute of the patient (Fig. 1d). Cross-entropy loss was used to calculate the attribute prediction discrepancy. As the aim was to make the model invariant to the sensitive attribute, the negative of the attribute prediction loss was back-propagated, making the model poor at predicting the attribute. The attribute classifier was trained with the same hyperparameters as the subtyping model, and its weights were updated with the same frequency as the subtyping model. The implementation of the attribute classifier and the training updates were adapted from https://github.com/ShenYanUSC/Multimodal_Fairness. In addition to the ‘white’, ‘Asian’ and ‘Black’ race categories, the nonreporting group was also considered so as not to reduce the training dataset size substantially.
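Back-propagating the negative of the attribute loss is equivalent in effect to inserting a gradient reversal layer before the attribute classifier; a minimal PyTorch sketch of this formulation is shown below, with an illustrative weighting factor `lamb`.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

attr_head = nn.Linear(512, 4)   # predicts {white, Asian, Black, nonreporting}
criterion = nn.CrossEntropyLoss()

def adversarial_loss(h_slide, race_target, lamb=1.0):
    # h_slide: slide-level embeddings of shape (batch, 512). Minimizing this
    # loss trains the attribute head while pushing the upstream model to
    # remove race-predictive information from h_slide.
    logits = attr_head(GradReverse.apply(h_slide, lamb))
    return criterion(logits, race_target)
```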

Evaluation

Evaluation metrics

Subtyping and mutation prediction tasks are binary classification tasks. These tasks were evaluated in both overall and race-stratified manners. In the overall evaluation, all samples of the dataset were considered, as patients from the nonreporting group make up a substantial portion of the dataset and the patient population. In the race-stratified evaluation, metrics were calculated for individual races, excluding the nonreporting group, as these patients could belong to any demographic group. For both the overall and race-stratified settings, we report the ROC AUC. The ROC curve plots the TPR against the false-positive rate as the classification threshold is varied. For individual classes, we also report the overall and race-stratified recall, which measures the proportion of positive instances (that is, true positives (TP) and false negatives (FN)) that the model correctly identified as positive, indicating the model’s ability to find all relevant positive samples:

$${\textrm {Recall}}=\frac{{\mathrm {TP}}}{{\mathrm{TP}}+{\mathrm{FN}}}.$$
(4)

We also report the macro-averaged F1 score for the overall and race-stratified settings. The macro-averaged F1 score is computed by calculating the F1 score (the harmonic mean of precision and recall; equation (5)) for each class independently. Then, these individual F1 scores are averaged together. Here, ‘FP’ stands for false positive.

$${\mathrm{F1}}=\frac{2\times {\mathrm{TP}}}{2\times {\mathrm{TP}}+{\mathrm{FP}}+{\mathrm{FN}}}.$$
(5)

For the multiclass classification problem of race prediction, the macro-averaged one-versus-rest (OVR) AUC is reported. The macro-averaged OVR AUC generalizes the AUC to the multiclass case by averaging over the ROC AUC of all pairwise combinations of classes.

Selection of cutoffs

When testing a model on any independent test cohort, we used the Youden J statistic method166 to find the optimal cutoffs. Specifically, for a fold, to convert the model’s predicted logits on the independent test cohort into positive and negative classes, we used the Youden J statistic from the model’s corresponding validation fold. The Youden J statistic finds the optimal balance between sensitivity and specificity on the validation fold. The same method was applied to both subtyping and mutation prediction167. When testing on the internal TCGA and MGB cohorts, we used the validation set to determine the threshold.
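This cutoff selection can be sketched with scikit-learn’s ROC utilities; `y_val` and `p_val` denote validation labels and predicted probabilities and are illustrative names.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_cutoff(y_true, y_score):
    # Youden's J = sensitivity + specificity - 1 = TPR - FPR;
    # the optimal cutoff maximizes J over all ROC thresholds.
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]

# Threshold chosen on the validation fold, applied to the test cohort:
# threshold = youden_cutoff(y_val, p_val)
# y_pred_test = (p_test >= threshold).astype(int)
```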

Definition and quantification of the fairness metrics

To characterize the fairness of AI oncology models, we followed the group fairness metrics and estimated algorithmic fairness as defined by the separation criterion64, also known as ‘equality of opportunity’ (equalized opportunity)63. Equalized opportunity is a condition for classification parity that suggests that TPR should be equalized across protected subgroups for model fairness and nondiscrimination.

As there do not exist obvious ‘positive’ or ‘negative’ classes in the tasks considered, we assessed equalized opportunity for each class. Formally, for a binary prediction Ŷ made on a sample X with a protected subgroup A (for example, race) and ground-truth outcome Y, Ŷ satisfies equalized opportunity if the TPR is equalized across all protected subgroups.

For example, for subgroups of white, Asian and Black patients in our race-stratified evaluation, equalized opportunity is satisfied if

$$\begin{array}{l}P(\hat{Y}=1| A=r,Y=1)=P(\hat{Y}=1| A={r}^{{\prime} },Y=1),\quad\\\qquad\qquad\qquad\forall r,{r}^{{\prime} }\in \{\,{{\textrm{white, Asian, Black}}}\,\},r\ne {r}^{{\prime} }.\end{array}$$

With the separation framework, we established our fairness metric as ‘TPR disparity’, which measures the difference in TPR between protected subgroups and the population TPR. Negative TPR disparity in a subgroup indicates that the model misdiagnoses patients in that subgroup at a greater rate than in the overall population. The same definition can be extended to the other protected attributes considered in the study, such as income, insurance and age groups. During intersectional analysis, we compared the sensitivity of the intersectional group to that of the general population.
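A minimal sketch of the metric, assuming NumPy arrays of ground-truth labels (`y_true`), binarized predictions (`y_pred`) and subgroup labels (`groups`).

```python
import numpy as np

def tpr(y_true, y_pred):
    positives = y_true == 1
    return (y_pred[positives] == 1).mean()

def tpr_disparity(y_true, y_pred, groups, group):
    # Subgroup TPR minus population TPR; negative values indicate the
    # subgroup is misdiagnosed at a greater rate than the overall population.
    mask = groups == group
    return tpr(y_true[mask], y_pred[mask]) - tpr(y_true, y_pred)
```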

Test set resampling

The goal of test set resampling is to create an unbiased test set. We did this by sampling (with replacement) 500 patients from each of the protected attribute subgroups for each subtype of lung and breast cancer. The same method was also used for creating unbiased test sets for the IDH1 mutation prediction task. Test set resampling was applied only to the independent test cohorts for all tasks and to none of the training and validation datasets. No considerable effect was found when varying the number of samples drawn per subgroup (Extended Data Fig. 10).

Review of Food and Drug Administration-approved medical imaging AI algorithms

Documentation (510(k) and de novo approvals) submitted by companies that develop medical imaging AI algorithms to the US Food and Drug Administration (FDA) between January 2017 and December 2020 was reviewed. Only algorithms that work with medical imaging modalities (computed tomography (CT), positron emission tomography–CT, magnetic resonance imaging, radiography, microscopy, autofluorescence imaging and ultrasonography) were considered. The list of devices and the FDA approval numbers for the algorithms were accessed at https://ericwu09.github.io/medical-ai-evaluation/ (ref. 168). For each algorithm, the FDA approval number was used to access the publicly available documentation used for approving the algorithm. The documentation, acquired as a single PDF file, was manually reviewed to determine whether the company recorded the exact demographics of their test set (age, sex, ethnicity and race). Next, the documentation was reviewed to determine whether the company reported any performance metrics (sensitivity, specificity, AUC or accuracy) for the different demographics of their test set. If the company reported that no significant differences were found by demographics, this was considered as reporting demographic-stratified performance metrics. In addition to the manual review of the documentation, a keyword search was done to ensure that the reporting of demographics or their metrics was not missed. Keywords included the following terms: age, old, young, sex, gender, male, female, race, ethnicity, ancestry, white, Caucasian, Black, African American, Asian, European, demographics and subgroups. Paige AI was approved in 2021; it was included in the analysis because its algorithm is highly relevant to the imaging modality (microscopy/histology) and application domain (oncology) of our study. Our analysis was based only on publicly available documentation, which can be accessed at https://www.fda.gov.

Forming demographic subgroups

To divide patients with breast and lung cancer into subgroups by age at diagnosis, we used the national average age at diagnosis for breast and lung cancer patients. The national average age at diagnosis is 62 years for breast cancer patients169 and 70 years for lung cancer patients170. To convert postal code information to median household income in the postal code (simply referred to as ‘income’ in the study), we used the database at https://pypi.org/project/uszipcode/. Specifically, we used income from the 2010 US census, as this was the most recent income at the postal code level available through the software at the time of the study. To create the three income groups (low, middle and high), we used the 33rd and 66th percentiles to divide the patients into approximately three equal subgroups. Whenever we refer to patients from an income group, this denotes the median household income of the patient’s self-reported postal code neighborhood and may not reflect the patient’s actual household income. Regarding insurance, we categorized patients into (1) those with no public insurance (we call this group ‘not on Medicare’) and (2) those with some form of public insurance, namely Medicare (we call this group ‘on Medicare’). The ‘on Medicare’ group included patients who were on Medicare only and those who were on some private insurance and Medicare. MGB-breast had seven white uninsured patients (six with IDC and one with ILC). MGB-lung had ten white patients with unknown insurance (five with LUAD and five with LUSC), nine Black patients with unknown insurance (five with LUAD and four with LUSC), and ten Asian patients with unknown insurance (five with LUAD and five with LUSC). As the number of patients with unknown insurance was small, this category was not considered. TCGA-GBMLGG provides the age at diagnosis for patients but not income or insurance information. As GBM and LGG have different prevalences by age171,172, we used the 33rd and 66th percentiles to divide the patients into approximately three equal subgroups (≤40 years, >40 and ≤60 years, and >60 years). Any demographic subgroup that had fewer than three patients or had patients from only one subtype or mutation class was excluded.

Predicting protected attributes from embeddings used for primary tasks

In this study, we predicted protected attributes (that is, race) from the embeddings used for breast and lung subtyping and IDH1 mutation prediction. After models were trained on a training dataset, we froze all layers of the model except the final, fully connected classification layer. We replaced the classification layer with a logistic regression model. Specifically, the slide-level representation from equation (2) was used for ABMIL and CLAM, whereas the CLS token was used for TransMIL. In a fivefold CV study (folds stratified by race and label), we trained the logistic regression model to predict the protected attribute on the independent test cohort for the task. The logistic regression model and the CV study were developed using sklearn173. The logistic regression model was trained for 1,000 iterations with lbfgs (limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm) solver, L2 penalty and C = 0.5 (the inverse of regularization strength). Macro-averaged OVR ROC AUC was used to measure the accuracy of protected attribute predictions. Note that we did not train weakly supervised AI models to predict race directly from WSIs. Rather, we only investigated whether the primary task and race prediction are related by predicting race from the slide-level embeddings learned for subtyping or mutation prediction (equation (2)).
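A sketch of this probing protocol, assuming `X` holds the frozen slide-level embeddings and `y` the self-reported race labels; unlike the study, which stratified folds by both race and task label, this sketch stratifies by race alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

clf = LogisticRegression(max_iter=1000, solver='lbfgs', penalty='l2', C=0.5)
aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    probs = clf.predict_proba(X[test_idx])
    # Macro-averaged one-versus-rest AUC over the race classes
    aucs.append(roc_auc_score(y[test_idx], probs, multi_class='ovr',
                              average='macro'))
print(np.mean(aucs))
```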

Analyzing the impact of training dataset size and composition on disparities

To understand the impact of the training dataset size and the demographic subgroups forming it, we systematically varied the dataset size and diversity on subtyping tasks. Specifically, we created training sets that vary by the number of samples (from 5 per subtype (referred to as k = 10) to 25 per subtype (referred to as k = 50)) and by racial composition (white only, Asian and Black only, and a combination of all races including the nonreporting group). We applied this approach to (1) creating training sets from TCGA and using MGB cohorts as the test sets and (2) creating training sets from MGB and using TCGA cohorts as the test sets. For each dataset size and composition, sampling was done ten times to create ten training folds, and the UNI patch encoder and ABMIL aggregator were used for all experiments. Samples that were not sampled for training for a fold were used for validation. When training with all patients of a specific subgroup composition (k = all), we created 20-fold Monte Carlo CV splits with 90% data for training to stay consistent with previous experiments using data from all race groups. To test for demographic disparities in internal TCGA and MGB cohorts, we used the k = 50 set composed of all races and reported the disparities on the associated validation set. This was done to ensure that sufficient samples of each subtype from all races were present in the training and validation sets.

Statistical analysis

Hypothesis testing

To compare the TPR disparity for different demographic groups within a class for a single experiment, we conducted two-sided paired nonparametric permutation tests71,174,175,176. Specifically, we tested the following null hypothesis: ‘a weakly supervised AI model performs equally well on demographic group 1 of a class relative to patients of demographic group 2 of that class in terms of their TPR disparity’. For example, consider the white and Black race groups for the LUAD subtype for the ABMIL aggregator with a UNI encoder trained on TCGA-lung and tested on MGB-lung without any bias mitigation strategy. To perform the permutation test, we first pooled the data from both race groups and then randomly assigned them to either the first or second sample. Then, the statistic (difference of means of the samples) was calculated. This process was repeated n_permutations = 10,000 times, generating a distribution of the statistic under the null hypothesis. The statistic of the original data was compared to this distribution to determine the P value. The raw P values for comparisons between demographic groups within a class for a single experiment were then corrected for multiple-hypothesis testing using Benjamini–Hochberg correction177 with a false discovery rate set at 0.05 (ref. 15). Groups varied based on the demographic variable (race, insurance type, income and age) or the intersection of demographic variables being considered. P > 0.05 was considered not significant.
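A sketch of the permutation test and the Benjamini–Hochberg adjustment in NumPy; `x` and `y` are illustrative arrays of the per-fold statistics of the two groups being compared (statsmodels’ `multipletests(p, method='fdr_bh')` provides an equivalent correction).

```python
import numpy as np

def permutation_test(x, y, n_permutations=10000, seed=0):
    # Two-sided test on the difference of group means under random relabeling.
    rng = np.random.default_rng(seed)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    null = np.empty(n_permutations)
    for i in range(n_permutations):
        perm = rng.permutation(pooled)
        null[i] = perm[:len(x)].mean() - perm[len(x):].mean()
    return (np.abs(null) >= np.abs(observed)).mean()

def benjamini_hochberg(pvals):
    # Step-up adjusted P values: p_adj(i) = min_{j >= i} p(j) * m / j.
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_min = 1.0
    for rank, idx in enumerate(order[::-1]):
        running_min = min(running_min, p[idx] * m / (m - rank))
        adjusted[idx] = running_min
    return adjusted
```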

Confidence intervals

While reporting mean metrics, we estimated 95% CIs using all of the folds. To calculate the 95% CI across all folds, we selected the model from each fold, resampled the test set with replacement to maintain its original size and evaluated the selected model on the resampled test set, repeating this procedure over all folds. The resulting metrics were averaged to represent one point in the bootstrap distribution. This process was repeated for 1,000 iterations (that is, 1,000 nonparametric bootstrap iterations), thus defining the bootstrap distribution for the metric. Subsequently, we calculated the 95% CI using this bootstrapped distribution71,148.
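A sketch of this procedure, where `score(model, test_subset)` is a hypothetical helper returning the metric of interest for one model on one resampled test set.

```python
import numpy as np

def bootstrap_ci(fold_models, test_set, n_boot=1000, seed=0):
    # 95% CI: resample the test set with replacement to its original size,
    # average the metric across per-fold models, and repeat n_boot times.
    rng = np.random.default_rng(seed)
    n = len(test_set)
    bootstrap_stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        resampled = [test_set[i] for i in idx]
        bootstrap_stats.append(np.mean([score(m, resampled)  # hypothetical helper
                                        for m in fold_models]))
    return np.percentile(bootstrap_stats, [2.5, 97.5])
```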

Correlation

To calculate correlations between variables, we used Spearman correlation coefficients implemented by SciPy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html)178. The 95% CI for the Spearman correlation coefficient was calculated by converting to z score using the method outlined by Lane179.

Computing hardware and software

We used Python (version 3.8.13) and PyTorch180 (version 2.0.0, CUDA 11.7) (pytorch.org) for all experiments and analyses in the study. All downstream experiments were performed on three 24-GB NVIDIA 3090 GPUs. All WSI processing was supported by OpenSlide (version 4.3.1), openslide-python (version 1.2.0) and CLAM (http://github.com/mahmoodlab/CLAM). Pillow (version 9.3.0) and OpenCV-python were used to perform basic image processing tasks. We used scikit-learn181 (version 1.2.1) for its implementation of logistic regression. We used SciPy178 (version 1.11.4) to calculate correlation coefficients. Implementations of other visual pretrained encoders benchmarked in the study are available at the following links: ResNet50IN with ImageNet transfer (https://github.com/mahmoodlab/CLAM), CTransPath (github.com/Xiyue-Wang/TransPath) and UNI (https://arxiv.org/pdf/2308.15474.pdf). For extracting features, multi-GPU code was implemented using PyTorch’s distributed data-parallel module. For training weakly supervised ABMIL models, we adapted the training code from the CLAM codebase (https://github.com/mahmoodlab/CLAM). Matplotlib (version 3.7.1) and Seaborn (version 0.12.2) were used to create plots and figures. NumPy (version 1.24.4) and pandas (version 1.5.3) were used for numerical operations. Stain normalization was performed using Slideflow154 (version 2.1.0). The code used for this study has been made publicly available at https://github.com/mahmoodlab/CPATH_demographics. Usage of other miscellaneous Python libraries is listed in the Reporting Summary.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.