Main

Rapid advances in artificial intelligence (AI) and deep learning (DL) for the computational analysis of digital pathology images—known as computational pathology1,2—have led to substantial progress across diverse tasks such as cancer subtyping3,4, prognostication5,6,7 and mutation prediction8,9. Because large patient cohorts with associated clinical metadata are scarce, a common practice in computational pathology is to train algorithms on publicly available data from consortia such as The Cancer Genome Atlas (TCGA). These algorithms are then often tested on in-domain test sets from the same consortia7,10,11,12 or on smaller, independent cohorts from external institutions3,8,13,14. Recent works have shown that DL models can learn dataset-specific biases and artificially inflate performance when trained and tested on public datasets, even with careful dataset-splitting strategies to prevent data leakage15,16,17. However, it remains largely unexplored whether these issues persist, or whether new failure modes emerge, for computational pathology models trained on public data but tested on external cohorts, which likely have different demographic compositions and do not share the training set’s dataset biases. Given strong evidence from the broader machine learning community that DL models exhibit inadvertent biases when tested on independent datasets18,19,20,21, a systematic investigation of this question is needed.

One prominent form of bias evident within publicly available datasets used in computational pathology is the underrepresentation of patients from minority demographic groups. For instance, across 8,594 samples from 33 cancer types in TCGA, 82.0% of all patients are white, 10.1% are Black or African American, 7.5% are Asian, and 0.4% are Hispanic, Native American, or Native Hawaiian and other Pacific Islanders (denoted as ‘other’ in TCGA). However, these percentages are quite different from those of the general population in the United States, where non-Hispanic or Latino white individuals make up only 58.9% of the population22. This disproportionate representation is endemic in other public computational pathology datasets23,24,25, and it is highly probable that the demographic distribution of patients in any independent cohort or clinical setting will differ from that in public datasets. Such disproportionate representation becomes problematic in the context of well-established studies, which demonstrate that ethnicity and race-related risk factors26,27,28, along with social determinants of health29, lead to discernible variations in disease presentation30,31,32, molecular subtypes33,34, incidence35,36,37 and outcomes between distinct demographic groups38,39.

Therefore, it becomes paramount for stakeholders to carefully assess how differences in demographic composition between training and testing cohorts may influence the performance of DL models40,41,42,43. Yet computational pathology studies typically do not evaluate model performance across demographic subgroups8,13,44,45 (refer to Supplementary Data Table 1 for more examples). Although demographic shift and other biases have been extensively studied in radiology20,46,47,48,49,50,51,52,53 and other medical fields18,21,28,54,55,56,57,58,59, this question has not yet been fully explored in computational pathology, for a few reasons. First, demographic factors such as race are generally not incorporated into diagnostic or patient triaging processes in the clinical practice of pathology. Second, existing datasets are not curated in a race-stratified manner, making systematic evaluation more challenging.

To aid the advancement of accurate and fair methods in computational pathology, here we examined demographic disparities in two types of clinically important cancer diagnosis tasks: subtyping of breast and lung carcinomas and prediction of IDH1 mutations in gliomas. Cancer subtyping is critical for clinical triaging, and errors can result in inappropriate and harmful treatment regimens60. For instance, in lung carcinomas, the anti-vascular endothelial growth factor medication bevacizumab benefits only patients with adenocarcinoma and is not recommended for patients with squamous cell carcinoma. Conversely, the anti-epidermal growth factor receptor drug necitumumab is effective only in squamous cell carcinoma61. Similarly, accurate identification of IDH1 mutations is essential for diagnosis of gliomas, serving as an important prognostic indicator and informing the treatment strategy62. To assess the utility and equity of models in our study, we simultaneously compared performance and fairness metrics. To measure performance across demographic groups, we used demographic-stratified area under the receiver operating characteristic curve (ROC AUC) and F1 score. To measure fairness under the equalized opportunity framework63, we compared the true positive rate (TPR) of demographic groups with that of the overall population. Considering these metrics concurrently helped us examine whether models would have disparate results if deployed clinically47,64,65. Using these metrics, we investigated the impact of techniques to mitigate image acquisition differences and batch effects on the performance and fairness of downstream tasks. We also explored the consequences of common modeling choices in computational pathology and of bias mitigation strategies for the utility and equity of multiple instance learning (MIL) classification models. While our experiments focused on fairness with respect to self-reported race, other demographic factors can also be considered, which we explored both in isolation and in intersection with self-reported race.

Results

Dataset and study description

Our investigations considered subtyping breast and lung carcinomas and predicting IDH1 mutations in gliomas using TCGA, the EBRAINS brain tumor atlas and in-house patient data. For breast subtyping, we trained models on the TCGA breast invasive carcinoma (TCGA-BRCA) cohort (n = 1,049) to differentiate between invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) of the breast. For lung subtyping, models were trained on the TCGA-lung cohort (n = 1,043) to distinguish between lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). Subtyping models were then evaluated on independent test cohorts collected at Mass General Brigham (MGB) (breast, n = 1,265; lung, n = 1,960). For IDH1 mutation prediction, models were trained on the EBRAINS brain tumor atlas (n = 873) to differentiate between IDH1 wild-type and mutant cases and tested on the TCGA glioblastoma (TCGA-GBM) and low-grade glioma (TCGA-LGG) cohorts (collectively called the TCGA-GBMLGG cohort) (n = 1,123). TCGA, a publicly available data consortium collecting tissues and clinical metadata for 33 cancer types (2009–2013)66, is notably skewed towards white patients, with few examples from other underrepresented ethnicities (Fig. 1a). Numerous clinical sites have contributed to TCGA with site-specific differences in image acquisition, demographics and label distribution15. In contrast, the MGB cohorts, while also having a majority of white patients (Supplementary Data Tables 2 and 3), reflect the patient population at Massachusetts General Hospital and Brigham and Women’s Hospital in Boston. The base rates of classes differ among the cohorts from TCGA, MGB and the EBRAINS brain tumor atlas, as well as among races. White and Black patients generally exhibit a skew toward the IDH1 wild-type, IDC and LUAD classes, whereas the class distributions vary for Asian patients across cohorts (Supplementary Data Tables 2–4). Differences extend to sex, with TCGA-lung and TCGA-GBMLGG skewed toward male patients, MGB-lung having a majority of female patients, and both the TCGA-BRCA and MGB-breast cohorts including a small number of male patients (as is expected when dealing with breast carcinomas). TCGA lacks information on patient insurance or income, whereas only a few MGB cohort patients are uninsured. In our datasets, Black patients are often younger than or similar in age to white patients, whereas Asian patients are often the youngest. The EBRAINS brain tumor atlas (1995–2019) is also a public dataset collected at the Medical University of Vienna67. No information on patient race, insurance or income is provided, whereas patient age and sex are known; the cases are skewed toward IDH1 wild-type. Further details and the full data collection description are available in Supplementary Data Tables 2–4 and Methods.

Fig. 1: Dataset characteristics, fairness metrics and modeling choices investigated.

a, Composition in number of slides for TCGA, cohorts from MGB, and the EBRAINS brain tumor atlas, which were used to investigate demographic bias in MIL slide-level cancer diagnosis algorithms for breast and lung carcinoma subtyping and IDH1 mutation prediction. Disparities were investigated using race-stratified ROC AUC, TPR disparity and race prediction on the independent test cohorts. b, Different stages of the DL pipeline used in MIL computational pathology studies: tissue segmentation and patching, mapping to low-dimensional representation using a patch encoder, and classification. Techniques investigated with respect to fairness are shown (control batch effects and test set bias, modeling choices, and bias mitigation strategies). c,d, Common bias mitigation strategies investigated. c, IW samples patients from racial groups inversely to their population size to ensure equitable representation. d, AR mitigates bias by making embeddings agnostic to race. Loss from a secondary race classifier is maximized to achieve this. Example ROC curves show mean values (n = 10 folds) with 95% CI. Boxes indicate quartile values of TPR disparity (n = 10 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5 × the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. Detailed demographic distributions for each task are available in Supplementary Data Tables 2–4. Some illustrations were created with BioRender.com.

Due to the large size of digital histology slides, known as whole-slide images (WSIs), we used the MIL framework68 in this study to predict slide-level labels in a weakly supervised manner. The framework consists of customizable parts (Fig. 1b): segmentation of tissue from background and tessellation into patches, projection of patches into a low-dimensional space using a pretrained patch encoder, and aggregation of patches into slide-level representations, which are classified into the desired labels4. We considered various popular choices for all stages and studied their effects on fairness. For the patch encoder, we first considered a ResNet50 network pretrained on ~10⁶ natural images (ResNet50IN)69. Additionally, we considered a shifted-window (Swin) transformer pretrained on ~15 × 10⁶ histology images (CTransPath)70 and a large Vision Transformer (ViT-L) pretrained on ~100 × 10⁶ histology images (UNI)71. While UNI and ResNet50IN were not pretrained on the WSIs used in this study, CTransPath was trained on TCGA (without subtype or mutation labels), which could artificially inflate performance when testing on TCGA71. Next, we considered three common patch aggregator modules that differ in the assumptions they make about relations between patches: attention-based MIL (ABMIL), which treats patches as independent72; clustering-constrained attention MIL (CLAM)4; and transformer-based MIL (TransMIL), which learns patch interactions11,73. Finally, we investigated two common bias mitigation strategies, namely importance weighting (IW)74,75,76 (Fig. 1c) and adversarial regularization (AR)77,78,79 (Fig. 1d).
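
To make the aggregation stage concrete, the following is a minimal PyTorch sketch of attention-based pooling in the spirit of ABMIL72; the dimensions, module names and two-layer attention network are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class ABMIL(nn.Module):
    """Minimal attention-based MIL: patch embeddings in, slide logits out."""
    def __init__(self, in_dim=1024, hid_dim=256, n_classes=2):
        super().__init__()
        # A small network scores each patch; a softmax over the bag turns
        # the scores into attention weights.
        self.attention = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.Tanh(), nn.Linear(hid_dim, 1)
        )
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, patches):              # patches: (n_patches, in_dim)
        scores = self.attention(patches)     # (n_patches, 1)
        weights = torch.softmax(scores, dim=0)
        slide_embedding = (weights * patches).sum(dim=0)   # (in_dim,)
        return self.classifier(slide_embedding), weights

# bag = torch.randn(4096, 1024)   # e.g., patch features for one slide
# logits, attn = ABMIL()(bag)
```

Because the slide embedding is a convex combination of patch embeddings, the attention weights double as a per-patch relevance map.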

To assess the effect of modeling choices and data processing techniques on the performance and fairness of classification models, we compared several performance metrics, namely demographic-stratified ROC AUC and F1 score. We also computed the TPR disparity between different races and the entire cohort population under the framework of equalized opportunity63,64,65, which has been used in other medical and general fairness studies47,65,80,81. The AUC reflects the model’s ability to discriminate between binary classes, whereas the F1 score indicates the balance between the model’s precision and recall. However, the AUC and F1 score do not show class-specific error rates, which are crucial for understanding model weaknesses. Therefore, we examined TPR disparity to detect class-specific differences in sensitivity (refer to ‘Definition and quantification of the fairness metrics’ in Methods for more details). A TPR disparity of zero for a demographic group signifies that the group’s recall matches the population’s, whereas deviations signal a higher or lower TPR than the population, indicating signs of unfairness. Clinically, sensitivity is paramount, as it reflects the model’s success in identifying true cases. Thus, TPR disparity provides meaningful clinical insights into potential performance differences upon deployment. By evaluating race-stratified AUC, F1 scores and TPR disparities, we can understand both the performance and fairness of our model for each race, ensuring the model’s utility and equity in clinical settings.
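
As a concrete illustration of the fairness metric, the sketch below computes the TPR disparity of one demographic group for one class as the group’s recall minus the population recall; the function and variable names are ours and purely illustrative.

```python
import numpy as np

def tpr_disparity(y_true, y_pred, race, group, positive_class):
    """TPR (recall) of one demographic group minus the overall population
    TPR for a given class, following the equalized-opportunity framework."""
    y_true, y_pred, race = map(np.asarray, (y_true, y_pred, race))
    pos = y_true == positive_class                 # true cases of this class
    overall_tpr = np.mean(y_pred[pos] == positive_class)
    grp = pos & (race == group)                    # true cases in the group
    group_tpr = np.mean(y_pred[grp] == positive_class)
    return group_tpr - overall_tpr                 # zero indicates parity

# Example: recall gap for Black patients on the LUAD class.
# gap = tpr_disparity(labels, preds, races, "Black", "LUAD")
```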

Baseline race-stratified assessment

A standard study design for model development in computational pathology is to randomly split a large public dataset (such as TCGA) into training and validation folds, without accounting for different patient demographic subgroups. To first understand the degree to which this standard approach affects subgroup-specific performance in the subtyping and IDH1 mutation prediction tasks, we split the TCGA-BRCA, TCGA-lung and EBRAINS brain tumor atlas cohorts into 20 task label-stratified, Monte Carlo cross-validation (CV) folds and trained models using ABMIL72—a popular weakly supervised slide-level classification algorithm used across many computational pathology tasks4,72,82,83. We then assessed the performance on independent test cohorts, namely MGB-breast and MGB-lung for subtyping and TCGA-GBMLGG for IDH1 mutation prediction. To establish baselines to which different fairness-promoting strategies could be added progressively, we trained ABMIL with the ResNet50IN69,84, CTransPath70 and UNI71 patch encoders without any bias mitigation strategies.
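
For reference, label-stratified Monte Carlo folds of the kind used here can be generated with scikit-learn; the toy labels and the 80/20 train/validation ratio below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-ins for slide indices and task labels (illustrative only).
labels = np.random.randint(0, 2, size=1000)        # e.g., IDC vs. ILC

# Twenty label-stratified Monte Carlo folds; the split ratio is assumed.
splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
folds = [(train_idx, val_idx)
         for train_idx, val_idx in splitter.split(np.zeros(len(labels)), labels)]
```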

As measured by 20-fold average AUC on the independent test cohorts, ABMIL performs well in subtyping tasks, especially when paired with self-supervised patch encoders (Fig. 2a and Extended Data Figs. 1 and 2a,e,i); similar trends extend to IDH1 mutation prediction (Fig. 2a and Extended Data Fig. 3a,d,g). However, when using a less robust patch encoder such as ResNet50IN (Fig. 2b), we consistently observed a performance gap (in ROC AUC) between Black and white patients across all tasks (3.0% for breast subtyping, 10.9% for lung subtyping and 16.0% for IDH1 mutation prediction) (Extended Data Figs. 1–3a). This gap was further supported by lower F1 scores, particularly for Black patients in lung subtyping and IDH1 mutation prediction (Fig. 2e,f), whereas the F1 score for white patients was lower in breast subtyping (Fig. 2d). Importantly, these performance disparities were substantially reduced when we used self-supervised patch encoders trained on histology images, such as CTransPath and UNI. For instance, when using UNI, the AUC gap between white and Black patients decreased to 2.2% for lung subtyping and to 12.3% for IDH1 mutation prediction (Fig. 2e,f), whereas it remained at 3.8% for breast subtyping (Fig. 2d). However, a closer examination of the high AUC values achieved with UNI revealed imbalances in the F1 score for race groups. Notably, the F1 scores for Black patients in the lung subtyping and IDH1 mutation prediction tasks remained notably lower than those for white patients (Fig. 2e,f). The AUC and F1 score do not pinpoint the specific error types contributing to lower performance seen among Black patients. Analyzing sensitivity for different classes stratified by race revealed that Black patients with LUAD and LUSC, along with Black patients with an IDH1 mutation, had notably poorer recall rates than the overall population, also evident from their negative TPR disparity values (Fig. 2c and Supplementary Data Tables 6 and 7). Moreover, in breast subtyping, in which AUC remained consistently high with CTransPath across Black, white and Asian patients, the TPR disparity revealed that Black patients with ILC and white patients with IDC underperformed compared to the general patient population (Extended Data Fig. 1e and Supplementary Data Table 5). In summary, our analysis highlights racial discrepancies in performance across subtyping and mutation prediction tasks.

Fig. 2: Investigating bias from data characteristics.

a, Race-stratified and overall ROC AUC (averaged over n = 20 folds) using ABMIL with the ResNet50IN, CTransPath and UNI patch encoders for lung and breast subtyping and IDH1 mutation prediction tested using resampled test sets. b, Patch encoder pretraining scale. The number of patches is shown in log scale. NA, not applicable. c, Race-stratified TPR disparities for each task. ABMIL with the UNI encoder (no data processing, called ‘Base’) is contrasted with a variant with site-preserved training and stain normalization tested on resampled test cohorts. Site-stratified training not available for IDH1 mutation prediction. d–f, Comparison of race-stratified and overall macro-averaged performance metrics for the ResNet50IN and UNI encoders with addition of data processing techniques for breast subtyping (d), lung subtyping (e) and IDH1 mutation prediction (f). Race group color scheme from a used in c–f. For subtyping, ABMIL was trained on TCGA-BRCA (n = 1,049 slides) and TCGA-lung (n = 1,043 slides) and tested on MGB-breast (n = 1,265 slides) and MGB-lung (n = 1,960 slides), respectively. For IDH1 mutation prediction, ABMIL was trained on the EBRAINS brain tumor atlas (n = 873 slides) and tested on TCGA-GBMLGG (n = 1,123 slides). In resampled test sets, 500 slides from each race and class were sampled. Boxes indicate quartile values of TPR disparity and performance metrics as defined by the respective axis (n = 10 folds for site-stratified splits and n = 20 folds for Monte Carlo splits), with the center being the 50th percentile. Whiskers extend to data points within 1.5 × the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. Presented P values are from a nonparametric two-sided paired permutation test after multiple-hypothesis correction. Demographic distributions for each task are available in Supplementary Data Tables 2–4.

Bias from data characteristics

With our baseline indicating the existence of notable disparities between different races, even with strong patch encoders, we next assessed whether existing approaches to mitigate variability and bias in the training and testing datasets help reduce the racial disparities. We explored approaches that are not directly race-aware (that is, site stratification and stain normalization) and an approach that is race-aware (that is, test set resampling).

Impact of site stratification

The TCGA-BRCA and TCGA-lung cohorts comprise digital histology slides from various tissue-contributing hospital sites. Due to differences in tissue preparation protocols and patient demographics across sites, a ‘dataset shift’ issue inevitably arises, in which models are developed and deployed on mismatched data distributions, a common failure mode of machine learning applications in healthcare17,21,46,85,86,87. Site-stratified CV is a bias mitigation strategy that holds out a subset of sites to prevent models from learning spurious correlations between site-specific factors and diagnoses15. To test whether site-specific demographic variability contributes to the performance disparity, we trained the ABMIL model using a tenfold site-stratified CV and various patch encoders (Extended Data Figs. 1 and 2b,f,j). Site-stratified splits, when used with a less robust patch encoder such as ResNet50IN, were found to exacerbate existing disparities in lung subtyping. Specifically, as measured by AUC, these splits led to a 5.2% drop for Asian patients, a 2.8% drop for Black patients and a 2.7% drop in overall performance (Extended Data Fig. 2b). Correspondingly, F1 scores also decreased across races with the ResNet50IN encoder (Fig. 2e). Using site-stratified splits with self-supervised patch encoders such as UNI and CTransPath showed improvements in subtyping AUC values for Black patients, albeit small ones (Extended Data Figs. 1 and 2f,j). For example, in breast subtyping, using site-stratified splits with the CTransPath encoder led to equalizing the AUC for white and Black patients (Extended Data Fig. 1f). Despite these improvements, the F1 score for Black patients remained lower than that for the overall population with the UNI encoder in lung subtyping (Fig. 2e). Recall for Black patients with ILC in breast subtyping (Supplementary Data Table 5) and those with the LUAD subtype in lung subtyping (Supplementary Data Table 6) was also lower. In summary, the effect of site stratification on performance is contingent on the patch encoder used. While it could exacerbate disparities with weaker encoders, it offered some disparity reduction with self-supervised encoders. Nevertheless, gaps persisted in lung subtyping, as measured by AUC, F1 score and TPR disparity. A similar investigation for IDH1 mutation prediction could not be performed as the EBRAINS brain tumor atlas does not provide information on the tissue source site.
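
A minimal sketch of site-stratified splitting, in which entire tissue-contributing sites are held out of training, might look as follows (the site names and split ratio are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-ins: each slide has a label and a contributing hospital site.
labels = np.random.randint(0, 2, size=1000)
sites = np.random.choice(["site_A", "site_B", "site_C", "site_D"], size=1000)

# Held-out sites never appear in training, so models cannot exploit
# site-specific staining or scanner artifacts at evaluation time.
splitter = GroupShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
folds = [(train_idx, val_idx)
         for train_idx, val_idx in splitter.split(labels, groups=sites)]
```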

Impact of stain normalization

In addition to site-stratified CV, stain normalization is a common domain-adaptation technique for reducing differences in staining variability and is advocated in tandem with site-stratified CV15,88,89,90,91. We thus used stain normalization on both training and testing cohorts, with the tenfold site-stratified CV, to investigate whether the disparities are reduced. The impact of stain normalization on disparities was again dependent on the patch encoder used and the disease being investigated (Fig. 2d–f). Using ResNet50IN and stain normalization in IDH1 mutation prediction led to increases in the race-stratified and overall AUC (Extended Data Fig. 3b). However, this intervention led to notable drops of 6.1% and 1.9% in Black patients’ AUC for breast and lung subtyping, respectively (Extended Data Figs. 1 and 2c). Conversely, stain normalization with stronger encoders did not offer any substantial TPR disparity reduction and also led to performance drops (Extended Data Figs. 1 and 2g,k). For example, with the CTransPath encoder, using stain normalization decreased Black patients’ AUC by 2.5% and 3.0% in breast and lung subtyping, respectively, while not notably affecting white patients’ performance. We note that, even with stain normalization, large gaps persisted in recall between Black patients and the overall population. For example, when using the UNI encoder, there was a mean TPR disparity of −0.060 (95% confidence interval (CI) −0.114, −0.020) for Black patients with LUAD (Supplementary Data Table 6) and −0.284 (95% CI −0.482, −0.086) for Black patients with an IDH1 mutation (Supplementary Data Table 7). Demographic gaps also persisted when applying stain normalization without site stratification (Extended Data Fig. 4 and Supplementary Data Table 8). Overall, depending on the disease being studied, stain normalization may be beneficial when used with less robust patch encoders.
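
Stain normalization can be implemented in several ways; as one widely used example (not necessarily the exact variant used in our experiments), Reinhard color normalization matches each image’s LAB-space statistics to those of a reference tile:

```python
import numpy as np
from skimage.color import rgb2lab, lab2rgb

def reinhard_normalize(patch_rgb, reference_rgb):
    """Match the per-channel LAB mean/std of a patch to a reference tile
    (Reinhard color normalization). Expects float RGB arrays in [0, 1]."""
    patch_lab = rgb2lab(patch_rgb)
    ref_lab = rgb2lab(reference_rgb)
    for c in range(3):                             # L, a and b channels
        mu_p, sd_p = patch_lab[..., c].mean(), patch_lab[..., c].std()
        mu_r, sd_r = ref_lab[..., c].mean(), ref_lab[..., c].std()
        patch_lab[..., c] = (patch_lab[..., c] - mu_p) / (sd_p + 1e-8) * sd_r + mu_r
    return np.clip(lab2rgb(patch_lab), 0.0, 1.0)

# normalized = reinhard_normalize(patch, reference_tile)
```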

Impact of test set resampling

Lastly, we investigated whether a disproportionate demographic composition in the test cohorts causes performance disparity. To this end, we constructed unbiased test cohorts by resampling with replacement an equal number of patients from each racial group and each class within the test cohorts92. We evaluated the models on resampled test cohorts, with stain normalization and site stratification still in effect. Nevertheless, performance disparities persisted among race groups irrespective of the patch encoder (Fig. 2d–f and Extended Data Figs. 1, 2d,h,l and 3c,f,i). In breast subtyping, a notable 9.2% AUC gap between Black and white patients was evident with the ResNet50IN encoder (Fig. 2d). Although the gaps were smaller at 2.8% with the UNI encoder, low F1 scores and substantial TPR disparities among Asian and Black patients with ILC indicate lingering disparities in breast subtyping (Fig. 2c and Supplementary Data Table 5). In lung subtyping, both ResNet50IN and UNI showed lower AUC and F1 scores for Black patients compared to the overall population and white patients, supported by negative TPR disparities among Black patients (Fig. 2e and Supplementary Data Table 6). In the prediction of IDH1 mutations, substantial gaps persisted between white and Black patients. For instance, using UNI, the AUC for Black patients was 14.0% lower than that for white patients, which is reflected in lower F1 scores for Black patients and negative TPR disparities among Black patients with an IDH1 mutation (Fig. 2f and Supplementary Data Table 7). Overall, despite correcting for imbalances in prevalence in the independent test sets, performance disparities continued to persist.
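
A sketch of this resampling procedure, drawing with replacement an equal number of slides from every race-class cell (the pandas column names are hypothetical):

```python
import pandas as pd

def resample_test_set(slides, n_per_cell=500, seed=0):
    """Resample with replacement an equal number of slides from every
    (race, class) cell to build a demographically balanced test set."""
    return (slides.groupby(["race", "label"], group_keys=False)
                  .apply(lambda g: g.sample(n=n_per_cell, replace=True,
                                            random_state=seed)))

# balanced = resample_test_set(test_df)   # 500 slides per race x class cell
```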

In summary, performance disparities among different racial groups persist even when accounting for data-related sources of disparities and dataset shifts. It is important to recognize that both the site stratification and stain normalization techniques have their pros and cons15. Site-stratified splits can aid in learning features that are resilient to site-specific image acquisition variations, potentially enhancing performance. However, they may also lead to performance declines, as the exclusion of certain sites could result in the exclusion or underrepresentation of specific demographic groups during training. Likewise, stain normalization can alleviate dataset shifts but may inadvertently remove staining distinctions that arise from the diverse underlying biology of individual patients with cancer. In general, we observed that self-supervised patch encoders, such as UNI, tend to remain indifferent to site-specific artifacts and staining variations, whereas weaker encoders remain more amenable to such techniques.

Bias from MIL model architectures

The rapid progress in computational pathology has led to the improvement of components from all stages of the typical DL pipeline used (Fig. 1b). We thus investigated the effect of different modeling choices for patch encoders (ResNet50IN69,84, CTransPath70 and UNI71) and aggregators (ABMIL72, CLAM4 and TransMIL11) on the disparities between different racial groups for breast subtyping, lung subtyping and IDH1 mutation prediction. We additionally implemented commonly used fairness-aware strategies to study whether these are effective in mitigating demographic biases. Our choice of bias mitigation strategies was governed by the ongoing debate in fair machine learning on whether DL model embeddings should differ based on protected attributes or remain agnostic to them28,54. Specifically, we used IW74,75, which emphasizes examples from underrepresented groups in the model’s training, and AR77, which trains the model to be agnostic to information predictive of protected attributes.
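
Both strategies admit compact implementations. IW reduces to per-sample weights inversely proportional to group size, and AR is commonly realized with a gradient reversal layer feeding a secondary race classifier; the sketch below shows one standard construction under these assumptions, with illustrative names.

```python
import numpy as np
from torch.autograd import Function

def importance_weights(races):
    """IW: per-patient sampling weights inversely proportional to the size
    of the patient's racial group (usable with WeightedRandomSampler)."""
    races = np.asarray(races)
    _, inverse, counts = np.unique(races, return_inverse=True,
                                   return_counts=True)
    return 1.0 / counts[inverse]

class GradReverse(Function):
    """AR building block: identity on the forward pass, negated and scaled
    gradient on the backward pass (a gradient reversal layer)."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# During training, the slide embedding feeds both the task head (task loss
# minimized) and, through GradReverse.apply(embedding, lam), a secondary
# race classifier, pushing the aggregator to maximize the race classifier's
# loss and thereby discouraging race-predictive features.
```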

Among all the modeling choices we investigated, the use of self-supervised patch encoders had the largest impact on the performance of breast and lung subtyping as well as IDH1 mutation prediction, irrespective of the patch aggregator and bias mitigation strategy used (Fig. 3b,d,f and Supplementary Data Tables 9–11). For instance, when we replaced ResNet50IN with UNI in ABMIL, the F1 score for Black patients showed improvements of 2.4% in breast subtyping, 21.6% in lung subtyping and 10.3% in IDH1 mutation prediction (Supplementary Data Tables 9–11). Similar gains were observed in subtyping and mutation prediction AUC (Fig. 3b,d,f). Notably, any amount of pretraining on histology images rather than natural images helped improve race-stratified performance (Extended Data Fig. 5a–c). The performance of different patch aggregators exhibited task dependency. For instance, when using the ResNet50IN encoder, more complex patch aggregators led to a reduction in overall performance in lung and breast subtyping (Fig. 3b,d), but they proved beneficial for IDH1 mutation prediction (Fig. 3f). Transformer-based patch aggregators such as TransMIL11 capture patch relations, whereas ABMIL72 assumes patch independence; such model inductive biases could interact differently with various tasks and diseases.

Fig. 3: Investigating bias from MIL model architectures and bias mitigation strategies.

a–f, Combinations of patch aggregators (ABMIL, CLAM, TransMIL), bias mitigation strategies (no mitigation (No mit.), IW, AR) and patch encoders (ResNet50IN, CTransPath, UNI) were evaluated for breast subtyping (a, b), lung subtyping (c, d) and IDH1 mutation prediction (e, f). For each task, the TPR disparity for Black patients is visualized in a, c and e, whereas shifts in performance due to modeling choices are depicted using the mean race-stratified and overall ROC AUC (n = 20 folds) in b, d and f. For breast and lung subtyping, ABMIL was trained on the TCGA-BRCA (n = 1,049 slides) and TCGA-lung (n = 1,043 slides) cohorts and tested on the respective resampled MGB cohorts (nwhite = nBlack = nAsian = 1,000, with 500 slides per subtype for each race). For IDH1 mutation prediction, ABMIL was trained on EBRAINS (n = 873 slides) and tested on the resampled TCGA-GBMLGG cohort (nwhite = nBlack = nAsian = 1,000, with 500 slides per class for each race). Error bars in bar plots indicate the 95% CI, whereas the center is the mean value (n = 20 folds). Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5 × the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. Demographic distributions for each task are available in Supplementary Data Tables 2–4.

We observed that IW had an adverse impact on race-stratified performance across different patch encoders and tasks, as evident from reductions in both the AUC (Fig. 3b,d) and F1 score (Supplementary Data Tables 9 and 10). While some configurations showed an improvement in TPR disparities with IW, such as ABMIL with CTransPath in lung subtyping, this improvement came at the expense of lowering the race-stratified performance (Supplementary Data Table 10). In contrast, AR with self-supervised patch encoders offered marginal enhancements in race-stratified performance and reductions in TPR disparities. For instance, in breast subtyping, using CTransPath with TransMIL and AR resulted in high AUC values and F1 scores for Black and white patients and minimal TPR disparities for Black patients (Fig. 3a,b and Supplementary Data Table 9). However, in lung subtyping, the gains with AR were limited, and the standard ABMIL with the UNI encoder and no bias mitigation strategy emerged as an effective approach for reducing disparities while maintaining high performance (Fig. 3c,d and Supplementary Data Table 10).

Considering TPR disparity and AUC concurrently, we observed that, in breast and lung subtyping, improvements in performance for individual racial groups contributed to narrowing the performance gaps between them. This was evidenced by the TPR disparity for Black patients approaching zero with an increase in AUC (Fig. 3a–d). For example, in lung subtyping, the mean AUC for Black patients improved from 0.795 (95% CI 0.771, 0.823) for ABMIL with the ResNet50IN encoder to 0.954 (95% CI 0.941, 0.967) for ABMIL with UNI (Fig. 3d), and the mean TPR disparity improved from −0.127 (95% CI −0.172, −0.092) to 0.012 (95% CI −0.007, 0.029) for Black patients with LUSC (Supplementary Data Table 10). However, in IDH1 mutation prediction, increases in AUC for Black patients were not accompanied by large improvements in TPR disparity for Black patients with an IDH1 mutation (Fig. 3e,f and Supplementary Data Table 11). This suggests that, although more robust patch encoders enhance performance for Black patients in predicting IDH1 mutations, the performance gaps between Black patients and the overall population persist. Hence, concurrently considering fairness and performance metrics provides insights into whether performance gains reduce disparities between race groups.

While numerous sophisticated bias mitigation and patch aggregation methods have been proposed, our findings indicate that, when combined with weaker patch encoders such as ResNet50IN, these complex methods are not as effective in reducing disparities as simpler aggregators paired with strong self-supervised patch encoders. This underscores that, while patch aggregators and bias mitigation strategies have a valuable role, they cannot substitute for robust patch encoders. Instead, they can provide incremental performance enhancements, as evidenced by their successful application in breast subtyping with the CTransPath encoder and AR. Nevertheless, we note that, even with the UNI encoder and ABMIL without a bias mitigation strategy, disparities of 4.4% and 9.4% in the F1 score persisted between white and Black patients in lung subtyping and IDH1 mutation prediction, respectively (Supplementary Data Tables 10 and 11). The same observation held true even when performance was assessed by race-stratified AUC (Fig. 3d,f) and TPR disparity (Supplementary Data Tables 10 and 11), indicating that more work is needed to mitigate demographic disparities in computational pathology.

Race prediction from pretrained MIL models

We further investigated whether patients’ race can be predicted from the slide-level representations (equation (2)) of models used in subtyping and mutation prediction. Previous works have demonstrated that histology carries information about race in TCGA due to correlations of hematoxylin-and-eosin stain intensity with hospital site and, thus, demographic information15; here, we investigated whether slide embeddings used for clinical tasks can also be used for race prediction. We trained models for all possible combinations of patch encoders, patch aggregators and bias mitigation strategies on the TCGA-BRCA and TCGA-lung subtyping and EBRAINS IDH1 mutation prediction tasks; froze the patch aggregators; and used logistic regression to predict race on the task-associated independent test cohorts20. We found that slide-level representations used for the subtyping and mutation prediction tasks were highly predictive of race and that race prediction performance was positively correlated with task performance on the test cohorts (Fig. 4a–c). This is in line with reports that protected attributes can be predicted from other medical imaging modalities20,93,94. We found that slide representations learned with stronger patch encoders were able to predict race better. For example, in IDH1 mutation prediction, ABMIL with the ResNet50IN patch encoder had a mean race prediction AUC of 0.590 (95% CI 0.539, 0.619), whereas the same patch aggregator with UNI features had a mean race prediction AUC of 0.852 (95% CI 0.839, 0.865) (Supplementary Data Table 44). Additionally, we found that models trained with AR, which is intended to remove race-predictive information from deep embeddings, still had a high race prediction AUC. For example, ABMIL trained with AR on lung subtyping with UNI features had a mean race prediction AUC of 0.787 (95% CI 0.773, 0.795), which is comparable to the race prediction AUC with no mitigation strategy (0.784 (95% CI 0.773, 0.794)) (Supplementary Data Table 43). Overall, models showing high performance may encode more protected attribute information, and popular bias mitigation strategies may not successfully overcome this. We want to emphasize that encoding protected attribute information should not be misconstrued as that information being used for downstream tasks. Our analysis demonstrates that the primary task and the protected attribute information may be related, but it does not establish a causal relation92.
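
A sketch of this linear probing protocol, in which frozen slide-level embeddings are fed to a logistic regression classifier evaluated with fivefold CV (the embedding dimensionality and placeholder data are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data; in practice these are slide-level representations from
# a frozen patch aggregator paired with self-reported race labels.
embeddings = np.random.randn(3000, 512)
race = np.random.choice(["Asian", "Black", "white"], size=3000)

# Fivefold CV linear probe: multiclass one-vs-rest ROC AUC measures how
# linearly decodable race is from task-trained slide embeddings.
probe = LogisticRegression(max_iter=1000)
aucs = cross_val_score(probe, embeddings, race, cv=5, scoring="roc_auc_ovr")
print(f"race prediction AUC: {aucs.mean():.3f}")
```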

Fig. 4: Evaluating race information in embeddings.

All combinations of patch encoders (ResNet50IN, CTransPath and UNI), patch aggregators (ABMIL, CLAM and TransMIL) and bias mitigation strategies (IW and AR) were trained for breast subtyping, lung subtyping and IDH1 mutation prediction using TCGA-BRCA (n = 1,049 slides), TCGA-lung (n = 1,043 slides) and the EBRAINS brain tumor atlas (n = 873 slides) in 20 Monte Carlo folds and tested on resampled MGB-breast, MGB-lung and TCGA-GBMLGG, respectively (all resampled test sets had nwhite = nBlack = nAsian = 1,000, with 500 slides per class for each race). After freezing the patch aggregators in all trained models, the logistic regression model was trained to use the slide-level embedding to predict race on the respective resampled test cohorts in fivefold CV studies. a–c, Overall task AUC (n = 20 folds) versus race prediction AUC (n = 5) plotted for breast subtyping (a), lung subtyping (b) and IDH1 mutation prediction (c). Each point corresponds to a combination of patch encoder, patch aggregator and bias mitigation strategy. ρ is the Spearman correlation coefficient associated with the trend line shown in dashed red and is presented with 95% CI. The x and y error bars indicate the 95% CI, with the center as the mean value. Demographic distributions for each task are available in Supplementary Data Tables 2–4.

Bias from training data type, diversity and size

In computational pathology, it is common to train on the multisite TCGA data and test on data from independent institutions13,95. To understand demographic disparities better, we also tried the inverse approach; that is, we trained models on MGB data for breast and lung cancer subtyping and then tested them on TCGA cohorts. We additionally examined the effects of training set diversity and size on performance disparity, creating sets that vary by the number of samples (namely, 5 samples per subtype (referred to as k = 10) and 25 samples per subtype (referred to as k = 50)) and by racial composition (white only, Asian and Black only, and a combination of all races). This approach was also applied reciprocally, with training on TCGA and testing on MGB. For each dataset size and composition, sampling was done ten times to create ten training folds (except for the ‘k = all’ category, in which 20 Monte Carlo folds were used). The UNI patch encoder and ABMIL aggregator were used for all experiments, as CTransPath was pretrained on TCGA70, which may lead to data leakage. A similar investigation for IDH1 mutation prediction could not be performed as the EBRAINS brain tumor atlas does not provide patient race information.
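
The construction of training folds that vary in size and racial composition can be sketched as follows (the pandas column names are hypothetical):

```python
import pandas as pd

def sample_training_fold(slides, races, k_per_subtype, seed):
    """Draw one training fold of a fixed racial composition and size:
    k_per_subtype slides per class, restricted to the listed races."""
    pool = slides[slides["race"].isin(races)]
    return (pool.groupby("label", group_keys=False)
                .apply(lambda g: g.sample(n=k_per_subtype, random_state=seed)))

# Ten folds of the "white only", k = 50 configuration (25 per subtype):
# folds = [sample_training_fold(train_df, ["white"], 25, seed=s)
#          for s in range(10)]
```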

Training on MGB and testing on TCGA

Training on the MGB-breast and MGB-lung cohorts and testing on the corresponding TCGA cohorts revealed that, although the ABMIL model discriminated between subtypes of breast and lung carcinomas with high efficacy for both white and Black patients, aligning with prior findings96, the performance measured by AUC was notably lower for Asian patients. For instance, in breast subtyping trained on all patients from all races, the mean AUC for TCGA white patients was 0.956 (95% CI 0.945, 0.967) compared to 0.889 (95% CI 0.874, 0.905) for Asian patients (Fig. 5a). Furthermore, in examining recall for each subtype, Asian patients demonstrated lower recall for the ILC subtype in breast subtyping and the LUSC subtype in lung subtyping than white patients, corroborated by race-stratified F1 scores and TPR disparities per subtype (Supplementary Data Tables 13 and 15). Despite similar subtyping AUC results for white and Black patients in TCGA-lung and TCGA-BRCA, this uniformity did not extend to other TCGA cohorts. For instance, Black patients exhibited a lower AUC compared to white patients in the TCGA-GBMLGG cohort for IDH1 mutation prediction, whereas the AUC for Asian patients was generally comparable to that for white patients (Fig. 2f). Internal TCGA and MGB cohorts also showed demographic disparities (Extended Data Fig. 6 and Supplementary Data Tables 31–34). These results indicate that demographic discrepancies in computational pathology extend beyond MGB cohorts.

Fig. 5: Effect of training set diversity and size on disparities.

ABMIL models with the UNI patch encoder were trained using different compositions of the training set: white patients only, Asian and Black patients only, and a combination of all racial groups. For each type of composition, the training set’s size was increased from 5 samples per subtype (k = 10) to 25 samples per subtype (k = 50) to training on all patients pertaining to that composition type (k = all), with sampling done ten times to create ten folds. For k = all, 20 Monte Carlo splits were used and the actual training dataset size is indicated. a–d, Race-stratified and macro-averaged overall ROC AUC (n = 10 folds) presented for breast subtyping, with training sets derived from MGB-breast and testing done on resampled TCGA-BRCA (a); lung subtyping, with training sets derived from MGB-lung and testing done on resampled TCGA-lung (b); breast subtyping, with training sets derived from TCGA-BRCA and testing done on resampled MGB-breast (c); and lung subtyping, with training sets derived from TCGA-lung and testing done on resampled MGB-lung (d). All resampled test sets had nwhite = nBlack = nAsian = 1,000, with 500 slides per subtype for each race. Boxes indicate quartile values of TPR disparity (n = 10 folds for k = 10 and k = 50 and n = 20 folds for k = all), with the center being the 50th percentile. Whiskers extend to data points within 1.5 × the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. Demographic distributions for each task are available in Supplementary Data Tables 2 and 3.

Effect of training dataset size

When expanding the training cohort size from k = 10 to k = 50 while considering all races within the MGB cohorts and testing on TCGA data, we observed substantial enhancements in race-stratified AUC across all racial groups for both subtyping tasks (Fig. 5a,b). These improvements in AUC were consistently mirrored by similar enhancements in race-stratified F1 scores and reductions in TPR disparities (Supplementary Data Tables 13 and 15). Conversely, models trained on larger TCGA cohorts containing all races and tested on MGB cohorts also exhibited improved race-stratified performance (Fig. 5c,d). Notably, training with smaller cohorts resulted in demographic disparities within the AUC values. For instance, in lung subtyping trained on TCGA and tested on MGB with k = 10 patients from all races, white patients initially had a mean AUC of 0.840 (95% CI 0.831, 0.850), while Black patients had a lower mean AUC of 0.740 (95% CI 0.724, 0.757). However, after training on all available patients, these values improved to 0.985 (95% CI 0.979, 0.990) for white patients and 0.954 (95% CI 0.941, 0.967) for Black patients (Fig. 5d). Similarly, in breast subtyping, the mean AUC for Black patients improved from 0.636 (95% CI 0.619, 0.654) to 0.944 (95% CI 0.933, 0.944) when training size increased from k = 10 to all patients (Fig. 5c). Similar gains were seen in F1 scores and reductions in TPR disparities (Supplementary Data Tables 12 and 14). Thus, our findings suggest that training on large datasets is vital for disparity mitigation.

Effect of demographic group diversity in training datasets

Expanding the demographic diversity in large public consortia, typically used as training datasets, results in enhanced race-specific model performance on independent test cohorts. For example, in breast subtyping with k = 50, ABMIL trained on the TCGA dataset that combines white, Asian and Black patients had an improved mean AUC of 0.889 (95% CI 0.873, 0.905) for Black patients compared to 0.850 (95% CI 0.831, 0.869) achieved when training only on white patients (Fig. 5c). These gains are corroborated by similar increases in the F1 score for Black patients in breast subtyping with k = 50 over models trained only on white patients in a similar configuration (Supplementary Data Table 12). Similarly, in lung subtyping with k = all, ABMIL trained on the TCGA dataset that combines white, Asian and Black patients achieved a higher mean AUC for Black patients (Fig. 5d). These improvements in performance also led to reductions in TPR disparity. For example, in breast subtyping with k = 50, ABMIL trained on TCGA white patients had a mean TPR disparity of −0.092 (95% CI −0.124, −0.068) for Black patients with ILC, which decreased to −0.059 (95% CI −0.084, −0.039) when ABMIL was trained on patients from all races (Supplementary Data Table 12). Similar trends in TPR disparities were also seen for lung subtyping with k = 50, specifically for Black patients with LUSC (Supplementary Data Table 14). When training for breast and lung subtyping on MGB and testing on TCGA, we observed that the AUC and F1 score improved for Black patients when training on k = all patients from all races compared to training on white patients only or on Asian and Black patients only (Fig. 5a,b and Supplementary Data Tables 13 and 15). These enhancements were supported by improvements in TPR disparities for Black patients with IDC and LUAD in breast and lung subtyping, respectively.

Overall, we found that demographic disparities in subtyping exist when training on MGB and testing on TCGA. Further investigations into the training set (TCGA or MGB) showed that larger and demographically diverse training sets can help alleviate disparities, and efforts should be made to collect and curate demographically diverse public consortia.

Investigating disparities beyond race

For lung and breast subtyping, we investigated whether computational pathology models perform equally well for subgroups defined by other demographic variables, such as insurance type, postal code-inferred median household income (referred to as ‘income’ for simplicity) and age groups (refer to ‘Forming demographic subgroups’ in Methods for further details). In addition, we analyzed disparities within a subgroup by stratifying the constituents of the subgroup by other demographic factors, referred to as ‘intersectional analysis’ (refs. 47,97). Notably, this analysis used the ABMIL model with UNI features trained on 20 Monte Carlo folds and evaluated on the original test sets. For IDH1 mutation prediction, such investigations were limited to patient race and age.

We observed that discrepant performance extends beyond racial subgroups, primarily for postal code income groups in lung subtyping (Supplementary Data Table 21), insurance groups in both lung and breast subtyping (Supplementary Data Tables 22 and 23), and age groups in IDH1 mutation prediction (Supplementary Data Table 27). For example, in lung subtyping on MGB-lung, we found patients from middle- and high-income postal codes to have higher F1 scores than patients from low-income postal codes. Specifically, the F1 score of patients from low-income postal codes in lung subtyping was 2.4% and 2.1% lower than that of patients from middle-income and high-income postal codes, respectively (Supplementary Data Table 21). These trends were also reflected by TPR disparities (Fig. 6a). Using the F1 score as a measure of performance, we consistently found across breast and lung subtyping that patients without Medicare (the American federal health insurance program for people aged 65 or older) had lower performance, indicating higher misdiagnosis rates (Supplementary Data Tables 22 and 23). In the IDH1 mutation prediction task, we found that patients in the ‘>40 and ≤60’ years age group had higher performance than younger patients (≤40 years) and older patients (>60 years) (Extended Data Fig. 7c and Supplementary Data Table 27). Overall, we found that demographic disparities in AI models may extend beyond self-reported race, echoing the findings of numerous previous studies98,99,100.
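
Both the single-factor and the intersectional analyses reduce to computing recall within increasingly fine strata and comparing it to the population recall; a sketch with hypothetical column names:

```python
import pandas as pd

def stratified_tpr_disparity(results, strata, positive_class):
    """Recall for positive_class within each demographic stratum minus the
    overall recall, yielding a per-subgroup TPR disparity table."""
    pos = results[results["label"] == positive_class]
    overall = (pos["pred"] == positive_class).mean()
    per_group = pos.groupby(strata).apply(
        lambda g: (g["pred"] == positive_class).mean())
    return per_group - overall

# Single-factor analysis, then an intersectional slice:
# stratified_tpr_disparity(test_df, ["income"], "LUAD")
# stratified_tpr_disparity(test_df[test_df["income"] == "low"],
#                          ["race"], "LUAD")
```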

Fig. 6: Investigating lung subtyping disparities beyond race.

TPR disparity was assessed in various demographic subgroups of the MGB-lung test cohort (n = 1,960 slides) for an ABMIL model trained with UNI features on the TCGA-lung cohort (n = 1,043 slides) in a 20-fold study for lung subtyping. a, TPR disparity for different income groups. b–d, TPR disparity computed for subgroups of LUAD and LUSC patients from low-income postal codes (n = 615 slides), stratified by other demographic variables: racial groups (b), insurance groups (c) and age groups (d). e, TPR disparity for different racial groups. f–h, TPR disparity computed for subgroups of white patients with LUAD and LUSC (n = 1,630 slides), stratified by other demographic variables: insurance groups (f), income groups inferred from postal code (g) and age groups (h). i–k, TPR disparity computed for subgroups of Black patients with LUAD and LUSC (n = 128 slides), stratified by other demographic variables: insurance groups (i), income groups inferred from postal code (j) and age groups (k). Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5 × the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. Presented P values are from a nonparametric two-sided paired permutation test after multiple-hypothesis correction. Demographic distributions for each task are available in Supplementary Data Table 3.

Through intersectional analysis, we can observe disparities among patients in demographic subgroups even after adjusting for confounding factors such as age. For example, in lung subtyping, when considering patients within the same age group of ≤70 years, we found that Black patients had worse performance than the overall population (Supplementary Data Table 25). In breast subtyping, we conversely found white patients to have worse performance in the ≤62 years age category (Supplementary Data Table 24). In IDH1 mutation prediction, Black patients had worse performance than the other race groups in both the ≤40 years and >60 years age groups (Supplementary Data Table 27). A similar analysis can be done for other demographic factors, such as postal code-inferred income. When lung cancer patients from low-income postal codes were stratified by race, we found that Black patients from this subgroup had lower recall for the LUAD subtype than white patients from the same subgroup (Fig. 6b and Supplementary Data Table 21). Differences by insurance type (Fig. 6c) and age (Fig. 6d) were also found for patients from low-income postal codes. Intersectional analysis also revealed that aggregating diverse individuals within coarse race groups can mask disparities47,97. For example, when white lung cancer patients (Fig. 6e) are stratified by postal code-inferred income (Fig. 6g) and age (Fig. 6h), differences in recall between subgroups are found, but no significant differences are found when stratifying by insurance type (Fig. 6f). Similarly, in IDH1 mutation prediction, when white patients were stratified by age, patients ≤40 years old and those >60 years old had worse performance (Extended Data Fig. 7b and Supplementary Data Table 26), in contrast to white patients overall having consistently high performance. Such intersectional analysis is not limited to one race group: we also found TPR disparities within Black patients by insurance, age and postal code income (Fig. 6i–k and Supplementary Data Table 19), which extended to breast subtyping as well (Extended Data Fig. 8i–k and Supplementary Data Table 18).

Analysis of misclassified cases

To better understand the failure modes of one of the models, a board-certified pathologist analyzed misclassified cases. For this, we chose the lung subtyping task trained on TCGA and tested on MGB because (1) large racial disparities in lung subtyping were seen despite using the best data processing and modeling choices, and (2) the morphological and tissue architectural characteristics of the disease are well established. The pathologist examined the pathology reports and WSIs used in the study for cases that were misclassified on at least two folds for ABMIL with the UNI encoder trained on the TCGA-lung dataset in 20 Monte Carlo folds and tested on the MGB-lung cohort. For each case, the pathologist determined whether the histology on the slide matched that in the corresponding pathology report, what the reported grade of the case was, whether the case was a biopsy or resection specimen, and whether lineage-specific immunohistochemistry (IHC) or special stains were used to make the diagnosis (for example, thyroid transcription factor 1 IHC or mucicarmine staining to make a diagnosis of LUAD, but not IHC for programmed death ligand 1 or anaplastic lymphoma kinase). For all cases, the morphology observed on the slide matched that in the pathology report; no misdiagnoses by the original pathologists were found in the subsequent review. In general, the misclassified cases tended to be poorly differentiated (when grade was provided in the report), to show morphological overlap with the other class (for example, solid architecture in LUAD), to be biopsy specimens (and thus to have limited tissue area) or to require lineage-specific stains to make the diagnosis, indicating that these cases were also difficult for the pathologists who originally signed out the cases and corroborating the difficulty for the model. Among these misclassified cases, those from white patients tended to require lineage-specific stains more often than did cases from Asian or Black patients (68.4% for white patients versus 50.0% and 42.4% for Asian and Black patients, respectively), although they were less often biopsy specimens (56.5% for white patients versus 63.2% and 72.7% for Asian and Black patients, respectively) (Supplementary Data Table 28). Moreover, within the misclassified cases, Asian and Black patients tended to be younger than white patients. Therefore, the decreased performance on cases from Asian and Black patients in this experiment could be due to the smaller tissue area (and, thus, fewer patches that may be informative for the diagnosis) available to the model, as opposed to differences in grade or morphology, at least as evidenced by similar proportions of cases across grades and lower proportions of cases requiring stains. When stratifying performance by specimen type (resection or biopsy), biopsy specimens generally had lower performance in both subtyping tasks (Supplementary Data Tables 29 and 30). A similar analysis of misclassified cases in TCGA-lung subtyping using ABMIL and the UNI encoder also revealed that these cases were usually poorly differentiated; however, they were mainly resection cases, and the patient reports did not often describe IHC testing (Supplementary Data Table 28). While such analysis posits one potential failure mode, the observation of disparities in larger tissue resections in IDH1 mutation prediction suggests that the root causes of these differences are not fully understood and warrant further investigation.

Discussion

In this study, we assessed the performance of state-of-the-art computational pathology approaches across different demographic subgroups, including racial and income groups, for binary classification of subtypes of breast and lung carcinomas and for predicting IDH1 mutations in gliomas. We observed variations in the performance of current computational pathology methods among different demographic subgroups, even after accounting for statistical differences and using site-specific CV techniques15,92. Notably, these demographic disparities became more pronounced when we used weaker patch encoders, but they were reduced when self-supervised patch encoders were used. Additional disparity reduction was achieved when AR was used with self-supervised encoders, effectively mitigating disparities in breast subtyping. Hence, more robust patch encoders offer a promising avenue for mitigating disparities. However, the persistent gaps in lung subtyping and IDH1 mutation prediction, despite the use of state-of-the-art modeling choices, indicate that the issue of demographic disparities remains unresolved101. We additionally observed that increasing the diversity and size of demographic groups in the training dataset, rather than oversampling from underrepresented groups during training, such as through IW, can help reduce demographic disparities. This underscores the necessity for large and diverse public datasets rather than relying solely on post hoc modeling solutions to address biases. Our findings also indicate that models with higher performance on the test cohort often encode more protected attribute information. This occurred despite the use of AR, a technique aimed at actively removing race-predictive information from deep embeddings. Finally, we also show that demographic gaps can extend beyond racial categories to include variations by postal code income, age and insurance status. Our study, along with a previous work16, demonstrates that conventional, state-of-the-art computational algorithms can exhibit demographic biases on common diagnostic tasks in breast and lung carcinomas and gliomas.

We hope that our study underscores the importance of considering fairness and performance simultaneously when assessing algorithms for clinical deployment to avoid prioritizing one over the other, as improving performance at the cost of fairness, or vice versa, raises complex ethical considerations for scientists and clinicians. Notably, we found that IW often reduced TPR disparity in subtyping tasks, but at the cost of performance. Similar degradation of performance with the use of algorithmic fairness methods has been noted in other studies on fair machine learning for clinical use102. While self-supervised patch encoders increased fairness and performance in subtyping, we found that in predicting IDH1 mutations, increases in overall performance did not lead to large reductions in gaps between white and Black patients. Such considerations can help avoid the selection of unfair models that might be difficult to identify solely based on overall population performance, which poses a risk of exacerbating existing inequalities in healthcare. Thus, simultaneously measuring bias and performance must become standard practice for medical imaging AI algorithms deployed in clinical settings.

Our finding that AR, the active removal of features predictive of protected attributes, can affect performance in different ways for different diseases suggests an intricate interplay between demographic groups and phenotypic traits103,104. Recently, there has been mounting evidence in population genetics and cancer genomics that genetic ancestry is an important biological determinant in cancer health disparities, which may also manifest as demographic-specific histological phenotypes due to the correlations between ancestry and race105,106,107. For instance, in breast and lung cancers, innate immune variants and gene mutation frequencies are known to differ across people of European, African and Asian ancestry108,109,110. This is in contrast to a prominent view in fairness literature, which proposes learning invariant representations to protected attributes such as race54. While our findings suggest that AR combined with self-supervised encoders might reduce demographic disparities in breast subtyping, such techniques might preclude learning population-specific histological phenotypes in the cancer types we investigated. Nonetheless, further research is required to confirm such hypotheses with larger and more diverse cohorts27,54,111.

As the identification and mitigation of bias are known to be difficult, our study has a few limitations. In IDH1 mutation prediction, the EBRAINS brain tumor atlas does not provide site or patient race information, limiting our understanding of the composition of the dataset and its contributions to disparities. Our external test datasets for breast and lung subtyping contain relatively few images, are derived from a single hospital system and mainly include insured patients. Moreover, the datasets used in the study were largely collected in North America and Europe, often excluding geographic regions such as Asia and Africa. Although the slide diagnosis labels in our external test cohort were reviewed by several pathologists, inconsistencies in self-reported race can introduce label noise. Furthermore, coarse race groups, such as ‘Black’ or ‘African American’, might mask variations within demographic groups97, which can be heterogeneous112. Demographic labels can be influenced by numerous characteristics, such as age, socioeconomic status and levels of cultural assimilation103. Mitigating such label noise remains challenging as traditional strategies of bias mitigation may not provide effective corrections, leading to inherent biases embedded in data and models. While the development of individual fairness criteria to eliminate label noise remains an ongoing challenge113, our findings may have implications across multiple healthcare fields, including radiology, dermatology and genome-wide association studies20,46,92,114. However, because we investigated only binary classification problems in this study, caution should be taken when attempting to generalize our findings to other machine learning tasks such as regression115, ranking116 and generative models117.

As slide classification via MIL has largely been approached using out-of-domain natural image pretrained patch encoders4,7,11, upgrading to in-domain, histopathology encoders developed using self-supervised learning is an intuitive solution for improving the general performance of pathology AI algorithms. However, although we observed self-supervised pretrained encoders to have a large impact on mitigating disparities, we refrain from suggesting that they are a complete remedy for fairness. While self-supervised pretraining of encoders on extensive histology data continues to be a rapidly developing area, such models have limited public availability, primarily due to the proprietary patient data used during training. Hence, our choice of foundation models investigated was limited to Vision Transformer architectures trained on histology images, as these are commonly used and available in the field of computational pathology. We encourage future studies to investigate how convolutional neural networks and other architectures pretrained on large-scale histology data affect disparities. Furthermore, the foundation models considered are trained on extensive data from various disease types, as large-scale single-disease data are often limited. Future work should explore the effects of pretraining on large-scale data from a single disease, as opposed to diverse cancer types, on disparities. Finally, as more histology data are collected and organized, the training scale of foundation models considered here will inevitably be eclipsed118,119. Future work could investigate the effect of pretraining dataset size, demographic composition (which is often unavailable) and training strategies on disparities in healthcare.

Although our analysis used a variety of performance and fairness metrics to establish disparities, it is important to acknowledge the limitations of using TPR disparity. TPR can be influenced by population and prevalence biases51, and dataset resampling has been suggested to address these issues92,120. Our findings revealed that TPR disparities persist even after mitigating prevalence shifts through resampling, corroborated by other performance metrics. However, it is essential to recognize that resampling provides only an approximation of ideal data. We performed resampling using patient race, which may not account for other confounders such as age. Future research should explore how a causal understanding of data generation processes and clinical covariates can inform resampling to yield more realistic yet unbiased test datasets121,122. Additionally, group-specific threshold selection has been proposed to reduce TPR differences47,63,64, especially when dealing with prevalence shifts. However, this technique has notable drawbacks. It requires an intersection between ROC curves of different demographics63,64, which may not always be feasible, particularly with an increasing number of demographic groups. Furthermore, race is a social construct, and there can be noise in self-reported race and other demographic variables, making boundaries between groups ambiguous123. More importantly, using non-biological factors such as race to define clinical thresholds may lead to disparities in healthcare settings27,28,124,125,126. With respect to clinical deployment, implementing group-specific thresholds necessitates knowing demographic variables, such as race, at the time of deployment, which may not always be possible. Finally, selecting fairness metrics and definitions is crucial because simultaneously fulfilling them may not be possible64,127, and they can also often conflict128. Nevertheless, we believe that striving for an equally high TPR across groups is essential to ensure that DL-based solutions maintain high clinical sensitivity across all subgroups.

While our study notes variability in the performance of AI models across different demographic groups, the exact role of self-reported race as a potential causal factor for these variations is far from definitive. Current research indicates that medical outcomes can be influenced by social determinants of health129,130,131 and disparities in healthcare access132,133,134, which have complex interplays with race and other demographic variables such as education level and sex. The independent test cohorts used in our investigation encompass a broad spectrum of demographic characteristics, including insurance status, age, income and race. Performance disparities in such a diverse setting suggest that a complex combination of both social and biological factors, including but not limited to self-reported race, may contribute to the observed differences in model performance. While our intersectional analysis aims to decipher bias from different demographic factors, we currently consider broad demographic groups to ensure sufficiently large sample sizes. Moreover, controlling for one factor, such as age, may not account for other factors, such as sex. Future research on larger patient cohorts should explore intersectional groups involving multiple specific demographic factors simultaneously. Finally, while our findings show that there is persistent low performance in lung subtyping and IDH1 mutation prediction for Black patients, we caution against generalizing this finding as indicative of a universal trend for a particular group because one must recognize the substantial biological heterogeneity inherent within coarse racial categorizations52,106,135,136. While we do not claim that computational pathology systems consistently underperform in any single demographic group, we highlight the existence of notable demographic variances across numerous datasets. Such differences warrant careful rectification before clinical application to ensure equitable healthcare outcomes.

Recent investigations into demographic disparities within algorithms used in healthcare are more than theoretical inquiries; they carry profound implications, directly contributing to the enhancement of healthcare equity and quality across all populations18,27,28,54,137,138,139,140. However, algorithms are currently approved without necessitating the provision of test cohort demographics or the explicit reporting of their performance across different demographic groups (Supplementary Data Table 1). In upcoming years, frameworks for auditing AI algorithms will likely have an important role in clinical deployment. This study, with support from an extensive body of literature20,46,47,48,49,80,141, underscores that algorithms do not perform equally well across different demographic categories. If left unchecked, such failure modes may amplify existing healthcare inequities142,143,144. We encourage medical regulatory agencies to consider these findings and make reporting of demographic-stratified metrics necessary when evaluating models for clinical deployment and public use. This aligns with reporting guidelines such as CONSORT-AI (Consolidated Standards of Reporting Trials extension for AI interventions)145 and STARD-AI (AI-specific version of the Standards for Reporting of Diagnostic Accuracy Studies checklist)146, which advocate for transparent reporting of AI performance assessments. Such measures can increase the trust in medical imaging AI and lead to more effective adoption by clinicians. Overall, we hope that our study serves as an entry point for investigations into the complex entanglements between demographic factors, DL and the clinical practice of pathology and for the implementation of policies by relevant stakeholders to ensure that AI algorithms are developed to be safe and effective for patients across diverse demographic backgrounds.

Methods

Dataset description

The MGB institutional review board approved the retrospective analysis of pathology slides and corresponding pathology reports. Research conducted in this study involved a retrospective analysis of pathology slides, and the participants were not directly involved or recruited for the study. The requirement for informed consent for analyzing archival pathology slides was waived. Before scanning and digitization, all pathology slides were deidentified to ensure anonymity. Likewise, all digital data, which encompassed WSIs, pathology reports and electronic medical records, underwent deidentification before being subjected to computational analysis and model development. Sample sizes were determined by data availability, and all in-house data used in the research were dated between 2016 and 2022.

Our overall dataset comprised a total of 7,313 WSIs, including both publicly available and in-house datasets, amounting to 8.0 terabytes of raw data. The population demographics of each dataset are provided in Supplementary Data Tables 2–4. The data were drawn from the following sources.

The Cancer Genome Atlas

The TCGA dataset is a public and comprehensive collection of genomic and clinical data from various cancer types. In this study, we used the TCGA-BRCA (breast invasive carcinoma collection), TCGA-LUAD, TCGA-LUSC, TCGA-GBM and TCGA-LGG cohorts. We refer to the combined set of TCGA-LUAD and TCGA-LUSC as the TCGA-lung cohort. We refer to the combined set of TCGA-GBM and TCGA-LGG as the TCGA-GBMLGG cohort. We used 1,049 WSIs from the TCGA-BRCA cohort sourced from 40 different tissue-contributing sites. Of the 1,049 TCGA-BRCA WSIs, 838 are WSIs of IDC and 211 are WSIs of ILC. The TCGA-lung cohort comprised 1,043 lung WSIs from 73 distinct tissue-contributing sites in the TCGA dataset. The TCGA-lung cohort had 531 cases of LUAD and 512 cases of LUSC. We used 1,123 WSIs from TCGA-GBMLGG, which were collected from 37 tissue-contributing sites. TCGA-GBMLGG comprised 698 WSIs of IDH1 wild-type cancers and 425 WSIs of IDH1 mutant cancers. For TCGA-BRCA, TCGA-lung and TCGA-GBMLGG, only representative formalin-fixed, paraffin-embedded diagnostic slides that had tumor tissues present were included. No independent test cohorts contributed data to TCGA. The TCGA WSIs and associated clinical data can be accessed through the National Cancer Institute’s Genomics Data Commons portal (https://portal.gdc.cancer.gov/) and the cBioPortal (https://www.cbioportal.org). Any clinical data missing on cBioPortal were acquired from pathology reports provided by the Genomics Data Commons portal.

In-house data

The in-house data collected from MGB consisted of 3,225 WSIs corresponding to the same number of cases. To select patients for the study, we queried our in-house database of pathology slides for patients from 2016 to 2022. We selected all patient cases within this period with available slides that met the following inclusion criteria: (1) have a lower-magnification downsample for segmenting and processing the tissue image and (2) have nonzero tumor content. Cases with missing slides were excluded. Cases with blurry scans were rescanned and included in the study. This dataset consists of cases of invasive breast carcinoma, which we call the MGB-breast cohort, and cases of adenocarcinoma and squamous cell carcinoma of the lung, which we collectively refer to as the MGB-lung cohort. The MGB-breast cohort comprised 1,265 invasive breast cancer cases, including 982 cases of IDC (MGB-IDC) and 283 cases of ILC (MGB-ILC). The MGB-lung cohort comprised 1,960 cases, consisting of 1,626 cases of LUAD (MGB-LUAD) and 334 cases of LUSC (MGB-LUSC). These slides were scanned either at 20× magnification using an Aperio GT450 scanner or at 40× magnification using a Hamamatsu S210 scanner (and included 20× and 10× pyramid downsamples). For MGB-breast, 208 cases were scanned using the Hamamatsu S210 scanner (n_Asian = 92, n_Black = 116), whereas 1,057 cases were scanned using the Aperio GT450 scanner (n_white = 904, n_Asian = 50, n_Black = 48, n_nonreporting/other = 55). For the MGB-lung cohort, 134 cases were scanned with the Hamamatsu S210 scanner (n_Asian = 74, n_Black = 60), and 1,826 cases were scanned with the Aperio GT450 scanner (n_white = 1,630, n_Asian = 67, n_Black = 68, n_nonreporting/other = 61). Extended Data Fig. 9 compares the hue and saturation of slides corresponding to different races; we found no statistically significant differences in hue and saturation between the slides of different races. For in-house patients, the diagnosis assigned was based on a comprehensive review conducted by multiple pathologists. Protected patient information, including self-reported race categories (‘white’, ‘Asian’ and ‘Black’), age at diagnosis, postal code and patient insurance type, was collected from electronic medical records. No slides from MGB were contributed to TCGA’s data collection initiative. For data availability, refer to ‘Data availability’; for institutional review board approval, refer to ‘Dataset description’ above.

EBRAINS brain tumor atlas

The EBRAINS brain tumor atlas67 is a public dataset collected by digitizing a considerable portion of a large, dedicated brain tumor bank based at the Division of Neuropathology and Neurochemistry of the Medical University of Vienna, covering brain tumor cases from 1995 to 2019 (ref. 67). Slides were digitized using a Hamamatsu NanoZoomer 2.0 HT scanner at 40× magnification. At least two experienced neuropathologists checked each slide scan to ensure conformity of the diagnosis with the current revised 4th edition of the World Health Organization Classification of Tumours of the Central Nervous System and to ensure sufficient scan quality. Ambiguous cases were excluded, and WSIs of inferior quality were rescanned. We selected 873 cases with known IDH1 mutation status. There were 540 IDH1 wild-type cases and 333 IDH1 mutant cases. There were 508 GBM cases and 365 LGG cases. No information on patient race, insurance and income is provided, whereas the patients’ age and sex are known. The EBRAINS brain tumor atlas is available publicly from the official EBRAINS data portal (https://search.kg.ebrains.eu/instances/Dataset/8fc108ab-e2b4-406f-8999-60269dc1f994).

Processing of digital histology slides

As WSIs are exceptionally large (often spanning 150,000 × 150,000 pixels), it is computationally infeasible to use raw WSIs directly in DL pipelines147. In line with previous work4,95,148, we first segmented the tissue from the background, divided the tissue into smaller nonoverlapping tiles (referred to as patches, in which all patches from a WSI comprise a bag of patches) and encoded the patches into compact feature vectors using pretrained neural networks. The details of these steps are as follows.

Tissue segmentation

Tissue from WSIs was segmented using the CLAM4 library at 20× magnification for each slide. First, the image was converted from RGB (red, green, blue) to HSV (hue, saturation, value) color space. Binary thresholding was then applied to the ‘saturation’ channel of the HSV-space image to compute a binary mask for tissue regions. To refine further the segmented tissue contours and address potential artifacts such as small gaps and holes, we used a combination of median blurring and morphological closing techniques. To obtain the final tissue segmentation, we subjected the approximate contours of the detected tissue and tissue cavities to a filtering process.
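For illustration, a minimal sketch of this thresholding pipeline is shown below, assuming OpenCV; the threshold, kernel sizes and area cutoff are illustrative defaults rather than the exact values used by CLAM.

```python
import cv2
import numpy as np

def segment_tissue(rgb, sat_thresh=20, blur_ksize=7, close_ksize=7, min_area=1000):
    """Return a binary tissue mask from a downsampled RGB slide image.

    Parameter values are illustrative; CLAM exposes these as tunable
    segmentation parameters.
    """
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)
    saturation = cv2.medianBlur(hsv[:, :, 1], blur_ksize)  # suppress small artifacts
    # Binary threshold on the saturation channel: tissue is more saturated
    _, mask = cv2.threshold(saturation, sat_thresh, 255, cv2.THRESH_BINARY)
    # Morphological closing fills small gaps and holes in the mask
    kernel = np.ones((close_ksize, close_ksize), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # Filter contours: keep sufficiently large tissue regions, drop specks
    contours, _ = cv2.findContours(mask, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
    clean = np.zeros_like(mask)
    for contour in contours:
        if cv2.contourArea(contour) > min_area:
            cv2.drawContours(clean, [contour], -1, 255, thickness=cv2.FILLED)
    return clean
```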

Patching

The segmented tissue was cropped into 256 × 256 patches (no overlap). This was performed at 20× magnification if available in the image pyramid range; otherwise, 512 × 512 patches were cropped from the 40× magnification and resized to 256 × 256 (ref. 4).
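A minimal sketch of the grid-cropping logic, assuming openslide-python and that level 0 corresponds to 20× magnification; `mask_covers` is a hypothetical helper testing whether a tile overlaps the tissue mask, and the 40× fallback (512 × 512 crops resized to 256 × 256) is omitted.

```python
import openslide

def iter_patches(slide_path, tissue_mask, patch_size=256):
    # Yield nonoverlapping RGB tissue tiles from a WSI; assumes level 0 is 20x.
    slide = openslide.OpenSlide(slide_path)
    width, height = slide.level_dimensions[0]
    for y in range(0, height - patch_size + 1, patch_size):
        for x in range(0, width - patch_size + 1, patch_size):
            if mask_covers(tissue_mask, x, y, patch_size):  # hypothetical helper
                tile = slide.read_region((x, y), 0, (patch_size, patch_size))
                yield tile.convert('RGB')
```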

Feature extraction

As each bag of patches representing WSIs can be extremely large (for example, >10,000 patches149), we encoded each patch into a compact low-dimensional feature vector using a pretrained neural network, which is referred to as the ‘patch encoder’. Because the patch encoder is frozen during training, the choice of pretraining data and strategy is essential. In this study, we examined the use of three patch encoders: a ResNet50 convolutional neural network69 pretrained on natural images (ImageNet)84, a Swin transformer150 trained on approximately 15.6 million histology images (from TCGA and PAIP (Pathology AI Platform, http://wisepaip.org/paip)) using MoCo v3 (refs. 70,151), and a ViT-L/16 transformer152 trained on approximately 100 million histology images using a DINOv2 self-supervised pretraining scheme71,153. For the ResNet patch encoder, the adaptive mean-spatial pooling after the third residual block of the network was used to convert each patch into a compact 1,024-dimensional feature vector. For the Swin transformer and ViT-L patch encoders, each patch was first resized to a 224 × 224 image and then the provided model weights were loaded in the architecture to convert each patch into a 768-dimensional and 1,024-dimensional feature vector, respectively. The ResNet encoder is referred to as ResNet50IN; the Swin transformer is referred to as CTransPath; and the ViT-L/16 encoder is referred to as UNI. To increase the speed of the feature extraction step, we used three NVIDIA 3090Ti graphics processing units (GPUs) with a batch size of 384 per GPU. To test the effect of the pretraining scale on demographic disparities, we also used ViT-L/16 transformers152 trained on approximately 1 million and 16 million histology images using a DINOv2 self-supervised pretraining scheme71,153, naming the encoders UNI-1M and UNI-16M, respectively. While none of UNI, UNI-1M and UNI-16M was pretrained on any data used in this study, we note that CTransPath was originally pretrained on TCGA (without any subtype or mutation labels). Thus, MIL models using CTransPath trained on EBRAINS and evaluated on TCGA may unfairly inflate performance due to data leakage from pretraining. Finally, the demographic composition of the pretraining datasets for the encoders used is not available and is often challenging to collect, as in the case of CTransPath, which was pretrained on public datasets that organize histology images from worldwide institutions where demographic data may not have been collected (http://wisepaip.org/paip).
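The extraction step amounts to running a frozen encoder over each bag of patches. The sketch below uses a torchvision ImageNet-pretrained ResNet50 trunk as a stand-in (the study’s ResNet50IN pooled features after the third residual block to obtain 1,024-dimensional vectors, and CTransPath and UNI are loaded analogously from their released weights).

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet50, ResNet50_Weights

# Frozen ImageNet-pretrained ResNet50 trunk as a stand-in patch encoder.
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Identity()  # drop the classification head
model.eval()

transform = T.Compose([
    T.Resize(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_patches(patches):
    # Encode a list of PIL patches into an (N, d) bag of feature vectors;
    # d = 2,048 for this stand-in trunk (the study's encoders differ).
    batch = torch.stack([transform(p) for p in patches])
    return model(batch)
```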

Stain normalization

To extract stain-normalized features, we applied Macenko stain normalization to individual image patches before they were input into the patch encoder, using the implementation from https://slideflow.dev/slide_processing/ (ref. 154).
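For intuition, a compact NumPy sketch of the Macenko method is given below; it is an illustrative re-implementation, not the Slideflow code used in the study, and the reference stain matrix and concentration maxima are commonly used defaults rather than values from this work.

```python
import numpy as np

def macenko_normalize(img, Io=240, alpha=1, beta=0.15):
    """Minimal Macenko stain normalization of an RGB patch (uint8).

    Illustrative only; reference values below are common defaults.
    """
    HE_ref = np.array([[0.5626, 0.2159],
                       [0.7201, 0.8012],
                       [0.4062, 0.5581]])
    maxC_ref = np.array([1.9705, 1.0308])

    h, w, _ = img.shape
    od = -np.log((img.reshape(-1, 3).astype(float) + 1) / Io)  # optical density
    od_valid = od[~np.any(od < beta, axis=1)]                  # drop near-white pixels

    # Plane spanned by the two leading eigenvectors of the OD covariance
    _, eigvecs = np.linalg.eigh(np.cov(od_valid.T))
    proj = od_valid @ eigvecs[:, 1:3]
    phi = np.arctan2(proj[:, 1], proj[:, 0])
    v1 = eigvecs[:, 1:3] @ np.array([np.cos(np.percentile(phi, alpha)),
                                     np.sin(np.percentile(phi, alpha))])
    v2 = eigvecs[:, 1:3] @ np.array([np.cos(np.percentile(phi, 100 - alpha)),
                                     np.sin(np.percentile(phi, 100 - alpha))])
    HE = np.array([v1, v2]).T if v1[0] > v2[0] else np.array([v2, v1]).T

    # Stain concentrations by least squares, rescaled to the reference
    C = np.linalg.lstsq(HE, od.T, rcond=None)[0]
    C *= (maxC_ref / np.percentile(C, 99, axis=1))[:, None]
    out = Io * np.exp(-HE_ref @ C)
    return np.clip(out.T.reshape(h, w, 3), 0, 255).astype(np.uint8)
```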

Weakly supervised classification

In this study, we trained MIL-based weakly supervised WSI classification algorithms for three binary classification tasks: IDC versus ILC (referred to as ‘breast subtyping’), LUAD versus LUSC (referred to as ‘lung subtyping’) and IDH1 wild-type versus IDH1 mutant (referred to as ‘IDH1 mutation prediction’). DL models have been shown to perform with high accuracy on these tasks3,71,155,156,157,158,159, eliminating the need to optimize the models’ performance on the classification task and instead allowing us to focus on the models’ performance within demographic groups. For slide-level classification of pathology images, three MIL approaches were implemented: ABMIL72, CLAM4 and TransMIL11. These approaches were chosen because they can perform slide-level classification without any region-of-interest extraction or patch-level annotations, can be adapted to both tumor biopsy and resection cases, and have previously demonstrated strong performance on the TCGA dataset and independent test cohorts148. The implementation details of ABMIL, CLAM and TransMIL are covered next.

ABMIL and CLAM

To learn histology-specific feature representations, we passed the patch embeddings extracted by the patch encoder through three fully connected layers. These layers are respectively parameterized by \({W}_{1}\in {{\mathbb{R}}}^{768\times 1,024},{W}_{2}\in {{\mathbb{R}}}^{512\times 768}\) and \({W}_{3}\in {{\mathbb{R}}}^{512\times 512}\). The bias terms are implied and omitted for simplicity. Each fully connected layer is followed by rectified linear unit activation. Thus, each patch embedding \({z}_{k}\in {{\mathbb{R}}}^{1,024}\) is mapped to \({h}_{k}\in {{\mathbb{R}}}^{512}\), which serves as the input to downstream patch aggregation. ABMIL uses an attention module to learn to rank the relative importance of each image patch to classify the WSI. The attention module takes in each patch embedding hk and learns the weights \({V}_{a}\in {{\mathbb{R}}}^{256\times 512},{U}_{a}\in {{\mathbb{R}}}^{256\times 512}\) and \({W}_{a}\in {{\mathbb{R}}}^{1\times 256}\) to score the patch (ak):

$${a}_{k}=\frac{\exp \{{W}_{a}(\tanh ({V}_{a}\cdot {h}_{k})\odot \,{{\mbox{sigmoid}}}\,({U}_{a}\cdot {h}_{k}))\}}{\mathop{\sum }\nolimits_{j = 1}^{K}\exp \{{W}_{a}(\tanh ({V}_{a}\cdot {h}_{j})\odot \,{{\mbox{sigmoid}}}\,({U}_{a}\cdot {h}_{j}))\}}.$$
(1)

The slide-level representation, \({h}_{\textrm {slide}}\in {{\mathbb{R}}}^{512}\), is then the attention-weighted sum of the patch embeddings:

$${h}_{\textrm {slide}}=\mathop{\sum }\limits_{j=1}^{K}{a}_{j}\cdot {h}_{j}.$$
(2)

A dropout layer (P = 0.25) is used after each layer in the attention backbone for model regularization. The deep slide features hslide are then fed to a fully connected layer (defined by \({W}_{\textrm {c}}\in {{\mathbb{R}}}^{1\times 512}\)), which is followed by the softmax operator to produce slide-level binary class prediction probabilities, pslide:

$${p}_{\textrm {slide}}=\,{{\mathrm{softmax}}}\,\{{W}_{\textrm{c}}\cdot {h}_{\textrm {slide}}\}.$$
(3)

CLAM, in addition to slide-level classification, also performs instance-level clustering as additional supervision to constrain similar diagnostic image regions with similar importance weights4.
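For intuition, a minimal PyTorch sketch of the gated attention module in equations (1)–(3) is shown below; layer dimensions follow the parameterization above, while dropout and the CLAM instance-level branch are omitted.

```python
import torch
import torch.nn as nn

class GatedABMIL(nn.Module):
    """Minimal gated attention MIL head following equations (1)-(3)."""
    def __init__(self, in_dim=1024, attn_dim=256, n_classes=2):
        super().__init__()
        # Three fully connected layers (W1, W2, W3) with ReLU activations
        self.pre = nn.Sequential(nn.Linear(in_dim, 768), nn.ReLU(),
                                 nn.Linear(768, 512), nn.ReLU(),
                                 nn.Linear(512, 512), nn.ReLU())
        self.V = nn.Linear(512, attn_dim)  # tanh branch (V_a)
        self.U = nn.Linear(512, attn_dim)  # sigmoid gate (U_a)
        self.w = nn.Linear(attn_dim, 1)    # attention score (W_a)
        self.classifier = nn.Linear(512, n_classes)  # W_c

    def forward(self, bag):                # bag: (K, in_dim) patch features
        h = self.pre(bag)                  # (K, 512)
        scores = self.w(torch.tanh(self.V(h)) * torch.sigmoid(self.U(h)))
        a = torch.softmax(scores, dim=0)   # attention over patches, eq. (1)
        h_slide = (a * h).sum(dim=0)       # attention-weighted pooling, eq. (2)
        # In practice the raw logits feed a cross-entropy loss; the softmax
        # here mirrors equation (3).
        return torch.softmax(self.classifier(h_slide), dim=-1)
```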

TransMIL

TransMIL11 approximates self-attention with Nyström attention160 to perform self-attention-based pooling of individual patches in a WSI. Specifically, TransMIL first reshapes (‘squares’) a sequence of patch embeddings into a 2D grid, applies different-sized convolutions to encode spatial information, flattens the sequence of patches back, attaches a class token (called the CLS token) and then uses multihead self-attention (a linear approximation of self-attention provided by Nyström attention160) to learn the correlations between patches and the encoded spatial information.

Data-splitting strategies

Two different strategies were used for creating training and validation cohorts. We first used the common strategy of Monte Carlo CV, in which the dataset is randomly split into training and validation data in a fixed ratio (stratified by subtype or mutation label); this process is repeated for a fixed number of folds. In our study, we used 90% of the data for training and 10% for validation over 20 folds4,10. Previous work has shown that there are site-specific digital histology signatures in the TCGA dataset15, and different tissue-contributing sites have different racial compositions. Thus, if the submitting sites within a dataset are randomly split into equal-sized groups for Monte Carlo CV, it is likely that a feature of interest would not be evenly represented among these groups, resulting in biased estimates of accuracy. Hence, we used the quadratic programming solution from Howard et al.15 to generate tenfold site-stratified splits for the TCGA-BRCA and TCGA-lung cohorts, which led to approximately 90% of the data for training and 10% for validation. The public Python package for site-stratified split generation was accessed at https://github.com/fmhoward/PreservedSiteCV. Ten was the maximum number of folds that the quadratic programming solution provided by Howard et al.15 could converge on for the TCGA-lung cohort. Site-stratified splits could not be made for the EBRAINS brain tumor atlas as it does not provide information on tissue source sites. Multiple training folds were used for all tasks to avoid bias towards certain data. Validation splits were used for saving model checkpoints. Models from all folds were tested on task-specific independent test cohorts, and mean metrics are reported. Slides from the same case were not distributed between training and validation splits.
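The Monte Carlo splits can be sketched with scikit-learn’s StratifiedShuffleSplit, assuming `case_ids` and `labels` are arrays of case identifiers and subtype or mutation labels (site-stratified splits for TCGA were generated separately with the PreservedSiteCV package).

```python
from sklearn.model_selection import StratifiedShuffleSplit

# 20 Monte Carlo folds: random 90%/10% train/validation splits,
# stratified by the subtype or mutation label.
splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.1, random_state=0)
folds = [(train_idx, val_idx)
         for train_idx, val_idx in splitter.split(case_ids, labels)]
```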

Training details

The training of all models was done using the AdamW optimizer161. Following previous studies4,95,148, we trained the ABMIL and CLAM models using a learning rate of 1 × 10−4 and an L2 weight decay of 1 × 10−5. Following Shao et al.11, we trained TransMIL using a learning rate of 2 × 10−4 and an L2 weight decay of 1 × 10−5. ABMIL and TransMIL were trained by minimizing the cross-entropy loss. CLAM was trained using a weighted loss of the cross-entropy loss for slide classification and the smooth top-1 support vector machine loss162 for distinguishing high- and low-attention patches in instance-level clustering. Following CLAM, the weights were set to be 0.7 and 0.3 for the slide-level loss and instance-level loss, respectively, and the temperature scaling parameter α and the margin parameter τ were both set to 1.0. The weights and bias parameters of all the models were initialized randomly. During training, unless otherwise stated, slides were randomly sampled from the training cohort and provided to the model with a mini-batch size of 1. All models were trained for a maximum of 20 epochs. After the initial ten epochs, if the loss on the respective validation fold had not decreased for five consecutive epochs, early stopping was triggered and model weights were saved. For each fold, the model checkpoint with the lowest validation loss was used for evaluation on the respective independent test cohort.
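A sketch of this training loop is shown below; `train_one_epoch` and `evaluate` are hypothetical helpers standing in for the per-epoch training and validation passes.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
criterion = torch.nn.CrossEntropyLoss()

best_val_loss, epochs_without_improvement = float('inf'), 0
for epoch in range(20):                                 # maximum of 20 epochs
    train_one_epoch(model, optimizer, criterion)        # hypothetical helper
    val_loss = evaluate(model, criterion)               # hypothetical helper
    if val_loss < best_val_loss:
        best_val_loss, epochs_without_improvement = val_loss, 0
        torch.save(model.state_dict(), 'best_fold.pt')  # checkpoint on improvement
    elif epoch >= 10:                                   # only after the initial ten epochs
        epochs_without_improvement += 1
        if epochs_without_improvement >= 5:             # five epochs without improvement
            break                                       # early stopping
```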

Bias mitigation strategies

Common preprocessing and in-processing bias mitigation strategies were applied to investigate their ability to reduce differences between different demographic groups. These strategies included IW74,75,79,163,164 from preprocessing and AR77 from in-processing. While IW tries to improve a model’s performance on underrepresented samples by weighted sampling (inversely proportional to a group’s size), AR encourages the model not to use information correlated with protected attributes. The bias mitigation techniques, which need access to protected attributes, were applied only to the TCGA-BRCA and TCGA-lung cohorts during the training phase, and no mitigation technique was applied to the MGB independent test set. Both of the bias mitigation techniques are model agnostic. Bias mitigation strategies could not be applied to the IDH1 mutation prediction problem as race information for the EBRAINS brain tumor atlas is not provided. We now cover their implementation details.

Importance weighting

In IW (Fig. 1c), samples from the underrepresented groups in the dataset are shown more frequently to the model, giving them higher importance and thereby improving the performance of the model on such groups165. To implement IW, we first calculated the proportions of different races in the overall TCGA-lung and TCGA-BRCA datasets. We considered the ‘white’, ‘Asian’ and ‘Black’ race categories. As nonreporting patients account for approximately 10% of the overall dataset, we also considered the nonreporting group as a category so as not to reduce the training dataset size substantially. During the training of the subtyping models, each patient from the training dataset was randomly sampled with a probability that was inversely proportional to the representation of the patient’s race in the overall dataset. Thus, underrepresented patients, such as ‘Black’ and ‘Asian’ patients, were sampled more often and shown more frequently to the model than overrepresented ‘white’ patients. Such weighted random sampling was not applied to the validation and test splits; the model was evaluated on each sample from the validation split and independent test dataset only once.
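A sketch of this sampling scheme with PyTorch’s WeightedRandomSampler, assuming `race_labels` holds one group label per training case and `train_dataset` is the training split.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Per-case weights inversely proportional to the prevalence of the case's
# race group ('white', 'Asian', 'Black' or 'nonreporting') in the dataset.
groups, counts = np.unique(race_labels, return_counts=True)
frequency = dict(zip(groups, counts / counts.sum()))
weights = torch.tensor([1.0 / frequency[r] for r in race_labels],
                       dtype=torch.double)

# Underrepresented groups are drawn more often; applied to training only.
sampler = WeightedRandomSampler(weights, num_samples=len(race_labels),
                                replacement=True)
loader = DataLoader(train_dataset, batch_size=1, sampler=sampler)
```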

Adversarial regularization

To make the model agnostic to the information related to the sensitive attribute (that is, race), we first passed the slide-level representation (the same representation used to learn the main task of subtyping) through a fully connected layer to predict the attribute of the patient (Fig. 1d). Cross-entropy loss was used to calculate the attribute prediction discrepancy. As the aim was to make the model invariant to the sensitive attribute, the negative of the attribute prediction loss was back-propagated, making the model poor at predicting the attribute. The attribute classifier was trained with the same hyperparameters as the subtyping model, and its weights were updated with the same frequency as the subtyping model. The implementation of the attribute classifier and the training updates were adapted from https://github.com/ShenYanUSC/Multimodal_Fairness. In addition to the ‘white’, ‘Asian’ and ‘Black’ race categories, the nonreporting group was also considered so as not to reduce the training dataset size substantially.
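Back-propagating the negative of the attribute loss is equivalent in effect to inserting a gradient reversal layer before the attribute classifier; a minimal PyTorch sketch of this formulation is shown below, with an illustrative weighting factor `lamb`.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

attr_head = nn.Linear(512, 4)   # predicts {white, Asian, Black, nonreporting}
criterion = nn.CrossEntropyLoss()

def adversarial_loss(h_slide, race_target, lamb=1.0):
    # h_slide: slide-level embeddings of shape (batch, 512). Minimizing this
    # loss trains the attribute head while pushing the upstream model to
    # remove race-predictive information from h_slide.
    logits = attr_head(GradReverse.apply(h_slide, lamb))
    return criterion(logits, race_target)
```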

Evaluation

Evaluation metrics

Subtyping and mutation prediction tasks are binary classification tasks. These tasks were evaluated in both overall and race-stratified manners. In the overall evaluation, all samples of the dataset were considered, as patients from the nonreporting group make up a substantial portion of the dataset and the patient population. In the race-stratified evaluation, metrics were calculated for individual races, excluding the nonreporting group, as these patients could belong to any demographic group. For both the overall and race-stratified settings, we report the ROC AUC. The ROC curve plots the TPR against the false-positive rate as the classification threshold is varied. For individual classes, we also report the overall and race-stratified recall, which measures the proportion of positive instances (that is, true positives (TP) and false negatives (FN)) that the model correctly identified as positive, indicating the model’s ability to find all relevant positive samples:

$${\textrm {Recall}}=\frac{{\mathrm {TP}}}{{\mathrm{TP}}+{\mathrm{FN}}}.$$
(4)

We also report the macro-averaged F1 score for the overall and race-stratified settings. The macro-averaged F1 score is computed by calculating the F1 score (the harmonic mean of precision and recall; equation (5)) for each class independently. Then, these individual F1 scores are averaged together. Here, ‘FP’ stands for false positive.

$${\mathrm{F1}}=\frac{2\times {\mathrm{TP}}}{2\times {\mathrm{TP}}+{\mathrm{FP}}+{\mathrm{FN}}}.$$
(5)

For the multiclass classification problem of race prediction, the macro-averaged one-versus-rest (OVR) AUC is reported. The macro-averaged OVR AUC generalizes the AUC to the multiclass case by averaging over the ROC AUC of all pairwise combinations of classes.

Selection of cutoffs

When testing a model on any independent test cohort, we used the Youden J statistic method166 to find the optimal cutoffs. Specifically, for a fold, to convert the model’s predicted logits on the independent test cohort into positive and negative classes, we used the Youden J statistic from the model’s corresponding validation fold. The Youden J statistic finds the optimal balance between sensitivity and specificity on the validation fold. The same method was applied to both subtyping and mutation prediction167. When testing on the internal TCGA and MGB cohorts, we used the validation set to determine the threshold.
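This cutoff selection can be sketched with scikit-learn’s ROC utilities; `y_val` and `p_val` denote validation labels and predicted probabilities and are illustrative names.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_cutoff(y_true, y_score):
    # Youden's J = sensitivity + specificity - 1 = TPR - FPR;
    # the optimal cutoff maximizes J over all ROC thresholds.
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]

# Threshold chosen on the validation fold, applied to the test cohort:
# threshold = youden_cutoff(y_val, p_val)
# y_pred_test = (p_test >= threshold).astype(int)
```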

Definition and quantification of the fairness metrics

To characterize the fairness of AI oncology models, we followed the group fairness metrics and estimated algorithmic fairness as defined by the separation criterion64, also known as ‘equality of opportunity’ (equalized opportunity)63. Equalized opportunity is a condition for classification parity that suggests that TPR should be equalized across protected subgroups for model fairness and nondiscrimination.

As there do not exist obvious ‘positive’ or ‘negative’ classes in the tasks considered, we assessed equalized opportunity for each class. Formally, for a binary prediction Ŷ made on a sample X with a protected subgroup A (for example, race) and ground-truth outcome Y, Ŷ satisfies equalized opportunity if the TPR is equalized across all protected subgroups.

For example, for subgroups of white, Asian and Black patients in our race-stratified evaluation, equalized opportunity is satisfied if

$$\begin{array}{l}P(\hat{Y}=1| A=r,Y=1)=P(\hat{Y}=1| A={r}^{{\prime} },Y=1),\quad\\\qquad\qquad\qquad\forall r,{r}^{{\prime} }\in \{\,{{\textrm{white, Asian, Black}}}\,\},r\ne {r}^{{\prime} }.\end{array}$$

With the separation framework, we established our fairness metric as ‘TPR disparity’, which measures the difference in TPR between protected subgroups and the population TPR. Negative TPR disparity in a subgroup indicates that the model misdiagnoses patients in that subgroup at a greater rate than in the overall population. The same definition can be extended to the other protected attributes considered in the study, such as income, insurance and age groups. During intersectional analysis, we compared the sensitivity of the intersectional group to that of the general population.
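A minimal sketch of the metric, assuming NumPy arrays of ground-truth labels (`y_true`), binarized predictions (`y_pred`) and subgroup labels (`groups`).

```python
import numpy as np

def tpr(y_true, y_pred):
    positives = y_true == 1
    return (y_pred[positives] == 1).mean()

def tpr_disparity(y_true, y_pred, groups, group):
    # Subgroup TPR minus population TPR; negative values indicate the
    # subgroup is misdiagnosed at a greater rate than the overall population.
    mask = groups == group
    return tpr(y_true[mask], y_pred[mask]) - tpr(y_true, y_pred)
```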

Test set resampling

The goal of test set resampling is to create an unbiased test set. We did this by sampling (with replacement) 500 patients from each of the protected attribute subgroups for each subtype of lung and breast cancer. The same method was also used for creating unbiased test sets for the IDH1 mutation prediction task. Test set resampling was applied only to the independent test cohorts for all tasks and to none of the training and validation datasets. No considerable effect was found when varying the number of samples drawn per subgroup (Extended Data Fig. 10).

Review of Food and Drug Administration-approved medical imaging AI algorithms

Documentation (510(k) and de novo approvals) submitted by companies that develop medical imaging AI algorithms to the US Food and Drug Administration (FDA) between January 2017 and December 2020 was reviewed. Only algorithms that work with medical imaging modalities (computed tomography (CT), positron emission tomography–CT, magnetic resonance imaging, radiography, microscopy, autofluorescence imaging and ultrasonography) were considered. The list of devices and the FDA approval numbers for the algorithms were accessed at https://ericwu09.github.io/medical-ai-evaluation/ (ref. 168). For each algorithm, the FDA approval number was used to access the publicly available documentation used for approving the algorithm. The documentation, acquired as a single PDF file, was manually reviewed to determine whether the company recorded the exact demographics of their test set (age, sex, ethnicity and race). Next, the documentation was reviewed to determine whether the company reported any performance metrics (sensitivity, specificity, AUC or accuracy) for the different demographics of their test set. If the company reported that no significant differences were found by demographics, this was considered as reporting demographic-stratified performance metrics. In addition to the manual review of the documentation, a keyword search was done to ensure that the reporting of demographics or their metrics was not missed. Keywords included the following terms: age, old, young, sex, gender, male, female, race, ethnicity, ancestry, white, Caucasian, Black, African American, Asian, European, demographics and subgroups. Paige AI was approved in 2021; it was included in the analysis because its algorithm is highly relevant to the imaging modality (microscopy/histology) and application domain (oncology) of our study. Our analysis was based only on publicly available documentation, which can be accessed at https://www.fda.gov.

Forming demographic subgroups

To divide patients with breast and lung cancer into subgroups by age at diagnosis, we used the national average age at diagnosis for breast and lung cancer patients. The national average age at diagnosis is 62 years for breast cancer patients169 and 70 years for lung cancer patients170. To convert postal code information to median household income in the postal code (simply referred to as ‘income’ in the study), we used the database at https://pypi.org/project/uszipcode/. Specifically, we used income from the 2010 US census, as this was the most recent income at the postal code level available through the software at the time of the study. To create the three income groups (low, middle and high), we used the 33rd and 66th percentiles to divide the patients into approximately three equal subgroups. Whenever we refer to patients from an income group, this denotes the median household income of the patient’s self-reported postal code neighborhood and may not reflect the patient’s actual household income. Regarding insurance, we categorized patients into (1) those with no public insurance (we call this group ‘not on Medicare’) and (2) those with some form of public insurance, namely Medicare (we call this group ‘on Medicare’). The ‘on Medicare’ group included patients who were on Medicare only and those who were on some private insurance and Medicare. MGB-breast had seven white uninsured patients (six with IDC and one with ILC). MGB-lung had ten white patients with unknown insurance (five with LUAD and five with LUSC), nine Black patients with unknown insurance (five with LUAD and four with LUSC), and ten Asian patients with unknown insurance (five with LUAD and five with LUSC). As the number of patients with unknown insurance was small, this category was not considered. TCGA-GBMLGG provides the age at diagnosis for patients but not income or insurance information. As GBM and LGG have different prevalences by age171,172, we used the 33rd and 66th percentiles to divide the patients into approximately three equal subgroups (≤40 years, >40 and ≤60 years, and >60 years). Any demographic subgroup that had fewer than three patients or had patients from only one subtype or mutation class was excluded.

Predicting protected attributes from embeddings used for primary tasks

In this study, we predicted protected attributes (that is, race) from the embeddings used for breast and lung subtyping and IDH1 mutation prediction. After models were trained on a training dataset, we froze all layers of the model except the final, fully connected classification layer. We replaced the classification layer with a logistic regression model. Specifically, the slide-level representation from equation (2) was used for ABMIL and CLAM, whereas the CLS token was used for TransMIL. In a fivefold CV study (folds stratified by race and label), we trained the logistic regression model to predict the protected attribute on the independent test cohort for the task. The logistic regression model and the CV study were developed using sklearn173. The logistic regression model was trained for 1,000 iterations with lbfgs (limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm) solver, L2 penalty and C = 0.5 (the inverse of regularization strength). Macro-averaged OVR ROC AUC was used to measure the accuracy of protected attribute predictions. Note that we did not train weakly supervised AI models to predict race directly from WSIs. Rather, we only investigated whether the primary task and race prediction are related by predicting race from the slide-level embeddings learned for subtyping or mutation prediction (equation (2)).
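A sketch of this probing protocol, assuming `X` holds the frozen slide-level embeddings and `y` the self-reported race labels; unlike the study, which stratified folds by both race and task label, this sketch stratifies by race alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

clf = LogisticRegression(max_iter=1000, solver='lbfgs', penalty='l2', C=0.5)
aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    probs = clf.predict_proba(X[test_idx])
    # Macro-averaged one-versus-rest AUC over the race classes
    aucs.append(roc_auc_score(y[test_idx], probs, multi_class='ovr',
                              average='macro'))
print(np.mean(aucs))
```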

Analyzing the impact of training dataset size and composition on disparities

To understand the impact of the training dataset size and the demographic subgroups forming it, we systematically varied the dataset size and diversity on subtyping tasks. Specifically, we created training sets that vary by the number of samples (from 5 per subtype (referred to as k = 10) to 25 per subtype (referred to as k = 50)) and by racial composition (white only, Asian and Black only, and a combination of all races including the nonreporting group). We applied this approach to (1) creating training sets from TCGA and using MGB cohorts as the test sets and (2) creating training sets from MGB and using TCGA cohorts as the test sets. For each dataset size and composition, sampling was done ten times to create ten training folds, and the UNI patch encoder and ABMIL aggregator were used for all experiments. Samples that were not sampled for training for a fold were used for validation. When training with all patients of a specific subgroup composition (k = all), we created 20-fold Monte Carlo CV splits with 90% data for training to stay consistent with previous experiments using data from all race groups. To test for demographic disparities in internal TCGA and MGB cohorts, we used the k = 50 set composed of all races and reported the disparities on the associated validation set. This was done to ensure that sufficient samples of each subtype from all races were present in the training and validation sets.

Statistical analysis

Hypothesis testing

To compare the TPR disparity for different demographic groups within a class for a single experiment, we conducted two-sided paired nonparametric permutation tests71,174,175,176. Specifically, we tested the following null hypothesis: ‘a weakly supervised AI model performs equally well on demographic group 1 of a class relative to patients of demographic group 2 of that class in terms of their TPR disparity’. For example, consider the white and Black race groups for the LUAD subtype for the ABMIL aggregator with a UNI encoder trained on TCGA-lung and tested on MGB-lung without any bias mitigation strategy. To perform the permutation test, we first pooled the data from both race groups and then randomly assigned them to either the first or second sample. Then, the statistic (difference of means of the samples) was calculated. This process was repeated n_permutations = 10,000 times, generating a distribution of the statistic under the null hypothesis. The statistic of the original data was compared to this distribution to determine the P value. The raw P values for comparisons between demographic groups within a class for a single experiment were then corrected for multiple-hypothesis testing using Benjamini–Hochberg correction177 with a false discovery rate set at 0.05 (ref. 15). Groups varied based on the demographic variable (race, insurance type, income and age) or the intersection of demographic variables being considered. P > 0.05 was considered not significant.
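A sketch of the permutation test and the Benjamini–Hochberg adjustment in NumPy; `x` and `y` are illustrative arrays of the per-fold statistics of the two groups being compared (statsmodels’ `multipletests(p, method='fdr_bh')` provides an equivalent correction).

```python
import numpy as np

def permutation_test(x, y, n_permutations=10000, seed=0):
    # Two-sided test on the difference of group means under random relabeling.
    rng = np.random.default_rng(seed)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    null = np.empty(n_permutations)
    for i in range(n_permutations):
        perm = rng.permutation(pooled)
        null[i] = perm[:len(x)].mean() - perm[len(x):].mean()
    return (np.abs(null) >= np.abs(observed)).mean()

def benjamini_hochberg(pvals):
    # Step-up adjusted P values: p_adj(i) = min_{j >= i} p(j) * m / j.
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_min = 1.0
    for rank, idx in enumerate(order[::-1]):
        running_min = min(running_min, p[idx] * m / (m - rank))
        adjusted[idx] = running_min
    return adjusted
```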

Confidence intervals

While reporting mean metrics, we estimated 95% CIs using all of the folds. To calculate the 95% CI across all folds, we selected the model from each fold, resampled the test set with replacement to maintain its original size and evaluated the selected model on the resampled test set, repeating this procedure over all folds. The resulting metrics were averaged to represent one point in the bootstrap distribution. This process was repeated for 1,000 iterations (that is, 1,000 nonparametric bootstrap iterations), thus defining the bootstrap distribution for the metric. Subsequently, we calculated the 95% CI using this bootstrapped distribution71,148.
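A sketch of this procedure, where `score(model, test_subset)` is a hypothetical helper returning the metric of interest for one model on one resampled test set.

```python
import numpy as np

def bootstrap_ci(fold_models, test_set, n_boot=1000, seed=0):
    # 95% CI: resample the test set with replacement to its original size,
    # average the metric across per-fold models, and repeat n_boot times.
    rng = np.random.default_rng(seed)
    n = len(test_set)
    bootstrap_stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        resampled = [test_set[i] for i in idx]
        bootstrap_stats.append(np.mean([score(m, resampled)  # hypothetical helper
                                        for m in fold_models]))
    return np.percentile(bootstrap_stats, [2.5, 97.5])
```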

Correlation

To calculate correlations between variables, we used Spearman correlation coefficients implemented by SciPy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html)178. The 95% CI for the Spearman correlation coefficient was calculated by converting to z score using the method outlined by Lane179.

Computing hardware and software

We used Python (version 3.8.13) and PyTorch180 (version 2.0.0, CUDA 11.7) (pytorch.org) for all experiments and analyses in the study. All downstream experiments were performed on three 24-GB NVIDIA 3090 GPUs. All WSI processing was supported by OpenSlide (version 4.3.1), openslide-python (version 1.2.0) and CLAM (http://github.com/mahmoodlab/CLAM). Pillow (version 9.3.0) and OpenCV-python were used to perform basic image processing tasks. We used scikit-learn181 (version 1.2.1) for its implementation of logistic regression. We used SciPy178 (version 1.11.4) to calculate correlation coefficients. Implementations of other visual pretrained encoders benchmarked in the study are available at the following links: ResNet50IN with ImageNet transfer (https://github.com/mahmoodlab/CLAM), CTransPath (github.com/Xiyue-Wang/TransPath) and UNI (https://arxiv.org/pdf/2308.15474.pdf). For extracting features, multi-GPU code was implemented using PyTorch’s distributed data-parallel module. For training weakly supervised ABMIL models, we adapted the training code from the CLAM codebase (https://github.com/mahmoodlab/CLAM). Matplotlib (version 3.7.1) and Seaborn (version 0.12.2) were used to create plots and figures. NumPy (version 1.24.4) and pandas (version 1.5.3) were used for numerical operations. Stain normalization was performed using Slideflow154 (version 2.1.0). The code used for this study has been made publicly available at https://github.com/mahmoodlab/CPATH_demographics. Usage of other miscellaneous Python libraries is listed in the Reporting Summary.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.