Introduction

The breast cancer susceptibility genes 1 (BRCA1) and 2 (BRCA2) are tumour suppressor genes that act in pathways responsible for repairing damaged DNA, regulating transcription, and maintaining genomic stability, mechanisms that are crucial for cells to avoid apoptosis and chromosomal rearrangement1. Consequently, variants in these genes can predispose to multiple types of cancer2.

Genetic testing is widely used in the clinic to identify individuals at high risk of developing breast, ovarian, and other types of cancer. These individuals are frequently carriers of germline pathogenic variants that disrupt BRCA1 and BRCA2 DNA repair function3.

Germline variants in BRCA1 and BRCA2 contribute to 20–25% of hereditary breast and ovarian cancers4, while somatic BRCA1/2 variants account for 5–7% of ovarian cancers5 and up to 10% of breast cancers6. Individuals with BRCA1/2 variants have an increased risk of developing both breast (84% increased risk) and ovarian (45% increased risk) cancers6,7. Pathogenic variants of BRCA1/2 are associated with approximately 15–40% of hereditary breast cancers8. Individuals carrying BRCA1 pathogenic variants have a 59% elevated risk of developing breast cancer and a 34% risk of developing ovarian cancer by age 70. In contrast, carriers of BRCA2 pathogenic variants have a 51% risk of breast cancer and an 11% risk of ovarian cancer by the age of 80 years9. Although definitively characterising the pathogenicity status of a missense variant can better inform treatment, prevention and clinical management4, most missense variants identified by clinical genetic testing and reported in public databases are listed as variants of uncertain significance (VUS)10. Thus, there is a need for accurate approaches to establish and predict variant pathogenicity and its impact on protein function.

Failure to precisely predict the consequences of missense variants in BRCA1 and BRCA2 genes confounds our understanding of sequencing data and impacts clinical care. To date, as only a limited number of missense variants have been functionally evaluated experimentally, the interpretation of variant pathogenicity has relied on applying in silico tools for predicting functional effects together with family-based data11.

Despite significant effort dedicated over the years to developing accurate and general computational methods capable of identifying deleterious variants at genomic scale12,13,14,15, these methods have shown variable performance and reliability at the gene level12,16,17,18. For BRCA1/2 in particular, Ernst et al. evaluated the performance of Align-GVGD19,20, SIFT12, PolyPhen-215 and MutationTaster221 on a set of well-characterized BRCA1/2 variants and concluded that the results obtained using in silico tools are insufficient to serve as stand-alone evidence in clinical diagnostics18. The availability of experimentally characterized variant effects would therefore help to overcome this limitation by allowing gene-specific predictive methods to be tailored to uncover mutation-structure–function relationships.

With advances in bioinformatics and computational biology, several computational attempts have been made to explore the functional impacts of missense variants in BRCA1 and BRCA2 genes. Hart et al. implemented an in silico model BRCA-ML for understanding the functional impact of missense variants in BRCA1 and BRCA2 genes and VUS classification11. In addition, Arshad and colleagues investigated the structural and functional consequences of BRCA1 variants on cellular mechanisms by applying well-established in silico approaches22. Finally, Ernst et al. evaluated the reliability of employing computational tools to predict the pathogenicity of BRCA1 and BRCA2 missense variants as the basis for clinical decision-making18. They analysed performance improvement effects by combining various in silico prediction approaches on a data set of well-characterized BRCA1/2 missense variants in comparison to stand-alone tools.

Here we have developed a new machine learning method capable of accurately predicting the functional effect of missense variants in the BRCA1 and BRCA2 genes and implemented a computational saturation mutagenesis approach to classify all VUSs within these genes. We believe that our predictive models could be valuable for interpreting BRCA1 and BRCA2 variants and overcoming the challenge of classifying variants of uncertain significance, in addition to improving the clinical utility of genetic testing on these genes.

Results

Variant distribution in BRCA1 and BRCA2

To visualize the distributions of BRCA1 and BRCA2 missense variants curated from ClinVar10, lollipop plots were generated (Fig. 1). Most pathogenic variants were concentrated in well-known functional domains of both genes (the BRCT and RING domains of BRCA1 and the DNA binding domain of BRCA2), consistent with previous findings4. Benign variants were uniformly distributed across both genes, covering 62% and 74% of BRCA1 and BRCA2 residues, respectively.

Figure 1

The distributions of BRCA1 and BRCA2 missense variants shown as lollipop plots. Benign and likely benign variants are represented by blue circles, and pathogenic and likely pathogenic variants by red circles. The mapped BRCA1 and BRCA2 missense variants are ranked for their impact at the protein level, particularly nonsynonymous missense variants.

Exploring the functional consequences of BRCA variants using statistical analysis and feature engineering

To distinguish between pathogenic and benign variants, we performed a qualitative analysis to investigate the relationship between different molecular properties and variant consequences. These properties included protein stability effects upon mutation, amino acid biophysical properties, effects on post-translational modifications and evolutionary conservation. A total of 197 features were calculated (Suppl. Table 1).

We conducted a Welch two-sample t-test to identify features that could differentiate between the two classes, pathogenic and benign, in both BRCA1 (Suppl. Figure 1) and BRCA2 (Suppl. Figure 2). For BRCA1, one of the most descriptive attributes was sequence conservation given as ConSurf scores23 (p < 2.2e-16), indicating that pathogenic variants tend to occur in conserved regions and consequently impair function, in agreement with previous studies24. Other features highlighting the molecular differences between the two classes include amino acid physicochemical properties25. In particular, features representing statistical potentials (KESO980102: p < 6.6e-06, MIRL960101: p < 1.1e-05 and MIYT790101: p < 1.1e-05) differed significantly between benign and pathogenic variants.

For BRCA2, highly discriminating features included sequence evolutionary conservation properties (PANTHER26: p < 6.9e-13, ConSurf23: p < 2.3e-15), suggesting that pathogenic variants tend to occur in conserved positions, as previously observed24. The stability analysis with the SAAFEC-SEQ27 tool (p < 0.007) revealed that pathogenic variants were likely to be highly destabilizing, as shown before24. Furthermore, pathogenic variants displayed different patterns of amino acid physicochemical properties25 compared with benign variants (MUET020101: p < 0.003). These results highlight the importance of considering a range of properties when assessing the functional impact of variants.

For model optimization, Welch’s t-test was also conducted on all the features used in the final model (BRCA1/2 combined) to provide biological insight into which distinct features characterize functional consequences of BRCA1 and BRCA2 upon single amino acid substitutions (Fig. 2). Among the most differentiating attributes were sequence-based conservation scores (PolyphenScore28) and amino acids physicochemical properties22: HENS920101 (represents the BLOSUM45 substitution matrix), WEIL970101 (represents amino acid comparative profiles) and LUTR910107 (represents mutation matrices for the various protein secondary structure classes22).

Figure 2

Distribution of the top discriminative features between the pathogenic and benign variants. The selected features incorporated sequence conservation and amino acid physicochemical properties (PolyphenScore, HENS920101, WEIL970101 and LUTR910107). The selected features are significantly different between the two classes (p < 0.001). Statistical significance was measured using the Welch two-sample t-test.

Following the elimination of redundant features, a greedy feature selection approach was performed, based on the Matthews correlation coefficient (MCC). Our final optimal model included 15 features (Suppl. Table 2). These features represent the different classes considered and include conservation scores from Provean29 and PolyphenScore28. In addition, MetaSVM_score, MPC-rankscore, MutationTaster_score, ClinPred-score28, and physicochemical amino acid properties (AA-index)22 were included, as well as functional annotation scores from the AWESOME tool (which predicts the effect of an SNP on post-translational modification levels), including ubiquitination, acetylation and the AWESOME Score30.

Notably, while AA-index22 provides a measure of numerical indices that represent different physicochemical properties of amino acids, only six of these features were selected by the greedy feature selection approach: MIYS990107 and THOP960101 are representations of the amino acid pair-wise contact potentials, while LUTR910107, HENS920101, WEIL970102, and WEIL970101 denote amino acid mutation matrices.

Developing BRCA1 and BRCA2 gene-specific pathogenicity predictors

Different supervised learning algorithms were assessed to build gene-specific predictive models that can accurately identify pathogenic variants in BRCA1 and BRCA2 genes.

After greedy feature selection, the best performing models were obtained using the Random Forest classifier (n_estimators = 300) for both genes. For BRCA1-combined and BRCA2-combined (where pathogenic or likely pathogenic variants were grouped as pathogenic, and benign or likely benign variants were grouped as benign), the best performing models were the ensemble classifiers Extra Trees (n_estimators = 300) and Gradient Boosting (n_estimators = 300), respectively.

The BRCA1 and BRCA2 gene-specific predictors achieved Matthews correlation coefficient (MCC) values ranging from 0.89 to 0.98 across tenfold cross-validation and comparable performance of up to 0.89 across independent, non-redundant blind tests (Table 1). Furthermore, the final classification models achieved an AUC of up to 0.99 across tenfold cross-validation (Fig. 3) and comparable performance of up to 0.98 on independent, non-redundant blind tests.

Table 1 Comparative performance of BRCA1/2 models across cross-validation and non-redundant blind test sets.
Figure 3

Receiver Operating Characteristic (ROC) curves for BRCA1 (top) and BRCA2 (bottom). Our predictive models accurately identified pathogenic variants with AUC > 0.92 on cross-validation and blind tests.

When comparing the predictions made by the BRCA1 and BRCA2 gene-specific models, the BRCA1 model correctly identified 94 out of 97 pathogenic variants and wrongly classified 5 out of 150 benign missense variants. In contrast, the BRCA2 model was more accurate in classifying benign variants; it misclassified only one benign variant as pathogenic.

Predicting the clinical significance of BRCA1/2 variants using ENIGMA data

To increase reliability and evaluate the clinical impact of BRCA1/2 missense variants, we also built gene-specific predictive models using data from the Evidence-based Network for the Interpretation of Germline Mutant Alleles (ENIGMA)31. We assessed different supervised learning algorithms to train a binary classifier and optimise the predictive capability of each model in classifying pathogenic variants in BRCA1 and BRCA2.

After greedy feature selection, the best performing models were obtained using the ensemble classifier Gradient Boosting (n_estimators = 300) for both genes.

The BRCA1 and BRCA2 gene-specific predictors achieved Matthews correlation coefficient (MCC) values ranging from 0.82 to 0.96 across tenfold cross-validation and comparable performance of up to 1.00 across independent, non-redundant blind tests (Table 1). Similarly, the final classification models achieved an AUC of up to 0.99 across tenfold cross-validation (Suppl. Figure 3) and equivalent performance of up to 1.00 on independent, non-redundant blind tests.

When looking closely at the predictions made by BRCA1 and BRCA2 (ENIGMA) gene-specific models, the BRCA1 model accurately categorised 28 out of 29 pathogenic variants, and it incorrectly classified 1 out of 112 benign missense variants.

The misclassified variant, S1715R, is located in the BRCT domain of BRCA1 and has previously been shown to disrupt BRCA1 interaction with Abraxas, BRIP1 and CtIP29,32. It was also misclassified by other tools, including PolyPhen-215 and Align-GVGD19,20, suggesting that including structural information in these predictions could further improve accuracy by capturing additional molecular consequences of variants.

Developing a general BRCA1/2 pathogenicity predictor

We investigated whether a general predictive tool could be developed to accurately classify pathogenic variants in BRCA1 and BRCA2 genes by combining all missense variants of both genes.

For the general BRCA1/2 predictor (where all variants of both genes were combined), the final model with the best performance was obtained using the Random Forest classifier (n_estimators = 300). It achieved an accuracy of 0.96 on tenfold cross-validation, with an AUC of 0.96, MCC of 0.91, and precision of 0.96. This was comparable with the performance on the non-redundant blind test, which achieved an AUC of 0.95, MCC of 0.76, and precision of 0.93, providing confidence in the generalizability of the final model (Table 1 and Suppl. Figure 4). When tested on all BRCA1/2 variants in the BRCA1/2-combined training dataset (n = 406), our initial model correctly classified 91% of pathogenic and 98% of benign variants.

Table 1 shows the performance of the classification models across tenfold cross-validation and blind test sets. The performance of our BRCA1 and BRCA2 gene-specific and general pathogenicity predictors was consistent on both tenfold cross-validation and blind test sets, highlighting the robustness of the predictive models and their capability to accurately differentiate between pathogenic and benign variants.

To better guide the interpretation of novel variants, we tested the applicability of our general model to predict the likelihood of pathogenicity of the variants of unknown significance (VUS, n = 5716) in BRCA1 and BRCA2. Our model predicted 13% of these as pathogenic and 87% as benign. According to the BRCA1/2-general model, the total proportion of all potential pathogenic variants in BRCA1 and BRCA2 is ~3% (891 out of 30,616). Nearly all of them are located in well-known functional domains (the BRCT and RING domains of BRCA1 and the DNA binding domain of BRCA2), consistent with previous findings4.

Interestingly, our model predicted the variant W31S, located in the PALB2-binding domain of BRCA2, as pathogenic, which is consistent with the findings of a recent study33. The tryptophan residue at position 31 of BRCA2 is one of the essential residues for BRCA2 interaction with PALB2, as it is known to create a polar bridge with Ser1065 of PALB234. Consequently, changing tryptophan to serine would abolish BRCA2 binding to PALB2, as demonstrated previously in vitro34,35.

BRCA1/2-combined predicted scores for all possible single-nucleotide variants (SNVs) are provided in Supplementary Data Set 1.
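A computational saturation mutagenesis of this kind amounts to enumerating every single-nucleotide substitution in a coding sequence and keeping those that change the encoded amino acid. The sketch below illustrates that enumeration under the standard genetic code; the function name `missense_snvs` and the toy sequence are illustrative assumptions and are not taken from the study's pipeline.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def missense_snvs(cds):
    """Return sorted unique (position, ref_aa, alt_aa) changes reachable by one SNV.

    `cds` is a coding DNA sequence whose length is a multiple of three.
    Synonymous and stop-gain (nonsense) changes are skipped.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    changes = set()
    for start in range(0, len(cds), 3):
        codon = cds[start:start + 3]
        ref_aa = CODON_TABLE[codon]
        if ref_aa == "*":
            continue  # do not mutate a stop codon
        for offset, alt_base in product(range(3), BASES):
            if alt_base == codon[offset]:
                continue
            alt_codon = codon[:offset] + alt_base + codon[offset + 1:]
            alt_aa = CODON_TABLE[alt_codon]
            if alt_aa not in (ref_aa, "*"):
                changes.add((start // 3 + 1, ref_aa, alt_aa))
    return sorted(changes)


# Toy two-codon sequence (Met-Trp), not a real BRCA fragment. Tryptophan (TGG)
# can reach serine through TGG>TCG, the same type of change as W31S.
variants = missense_snvs("ATGTGG")
```

Each resulting (position, reference, alternate) triple would then be annotated with the model's features and scored; that step depends on the external tools listed in the Methods and is not shown here.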

Using the molecular consequences of BRCA variants to identify distinguishing features

The main purpose of this study was to build an accurate and efficient model that can predict BRCA1/2 pathogenic variants. Therefore, identifying a set of informative features is crucial for adequate model performance and improving our understanding of the molecular basis of variant pathogenicity.

The final features acquired via greedy feature selection resembled the initial results of the qualitative analysis. To assess how each of the feature categories contributed to the final model, we trained predictive models using different feature subsets: evolutionary conservation, missense variant prediction models from dbNSFP28, physicochemical properties, and changes in post-translational modifications.

MCC values representing the blind test performance of each subset model were compared (Suppl. Table 3). Notably, the physicochemical properties WEIL970102 and HENS92010125 (MCC = 0.76) were the main contributing features to the final model (BRCA1/2 combined), followed by other features that contributed to a moderate extent: changes in post-translational modifications30 (MCC = 0.75), and ClinPred_score and MutationTaster_score28 (MCC = 0.74).

As a final analysis, we explored the feature importance of the combined BRCA1/2 model. Suppl. Figure 5 shows that the sequence conservation feature PANTHER26 contributes the most, followed by PolyphenScore28 (a deleteriousness scoring method). On the other hand, most measures of conservation (SIFT10, SNAP236, and Provean29) contributed to a moderate extent to the final model.

Validation of BRCA1/2 general pathogenicity predictor using Functional Data

To evaluate the robustness of the BRCA1/2-general model, several types of functional data reported by Hart4,11, Starita37, and Findlay38, comprising BRCA1 and BRCA2 variants and their functional scores (with previously established cut-off points for pathogenic variants), were applied as an independent blind test set to validate our model. The combined experimental functional data contained 2,882 BRCA1 SNVs from the RING and BRCT domains evaluated using different functional assays4,37,38 and 229 BRCA2 SNVs from the DNA binding domain assessed using the HDR assay4,11. Of the 3,135 BRCA1/2 variants reported in these studies, 2,906 were not present in our training dataset.

Our model achieved an accuracy of 92% and F1-score of 0.93 for those variants not incorporated in the training data (2,906 variants), highlighting the robustness of our predictive model, and providing confidence in the generalizability of the final model. Figure 4 shows the confidence scores distribution of the functionally evaluated pathogenic and benign VUSs in BRCA1/2, demonstrating a good separation between classes.

Figure 4

Distribution of probability scores predicted by our final model for functionally assessed VUSs in BRCA1 and BRCA2.

To showcase the performance of our method, we assessed two variants in detail. P34L and T1684P are currently classified as VUSs and were predicted as pathogenic with very high probabilities (0.88 and 0.91, respectively). Consistent with these predictions, a previous study designated both variants as non-functional, based on functional scores obtained by a saturation genome editing assay38. The P34L and T1684P variants are located in the RING and BRCT domains of BRCA1, respectively. The P34L variant is predicted to destabilise the structure (-0.84 kcal/mol, mCSM-Stability39), with the conversion to leucine (Leu) altering the backbone structure, leading to loss of rigidity and steric clashes to accommodate Leu (Suppl. Figure 6a). The T1684P variant was also predicted to destabilise the protein (-0.23 kcal/mol, mCSM-Stability39). The proline substitution could disturb the α-helical conformation through loss of the main-chain to side-chain hydrogen bond and of flexibility, as proline lacks the amide hydrogen required for hydrogen bonding (Suppl. Figure 6b). Suppl. Figure 6 illustrates the interatomic interactions of the P34L and T1684P variants.

A previous study using a multiplex HDR reporter assay showed that the region spanning amino acids 2–96 tended to have the highest proportion of non-functional variants, as the RING domain is encoded almost entirely by these positions, which are involved in the stability, folding, and function of the full-length protein37,38. Additionally, BRCA1 missense variants that are known to predispose to cancer map to either the RING or BRCT domain37.

Comparison with other available methods

We compared the performance of our model (on both cross-validation and blind test sets) with well-established predictors designed to predict the functional effects of missense variants (PolyPhen-215, SIFT12, Align-GVGD19,20, REVEL13 and CADD40). Additionally, we compared the performance of our models with other available studies11,38,41,42. The comparative prediction performance of the classification models on cross-validation is shown in Table 2. Our models significantly outperformed alternative approaches, with the BRCA1 model obtaining an accuracy of 0.96 compared to 0.75 for MLR-CAGI42, while the BRCA2 model achieved 0.97. Table 3 shows the comparative performance of the classification models on blind test sets. Our BRCA1/2 general model obtained an AUC of 0.96 and 0.95 on cross-validation and blind test sets, respectively, outperforming PolyPhen-215 (0.66, 0.77), SIFT12 (0.78, 0.79), Align-GVGD19,20 (0.53, 0.59), REVEL13 (0.79, 0.86) and CADD40 (0.84, 0.79). The predictive models show a significant improvement in robustness and predictive power compared to previous methods in both data sets (Tables 2 and 3).

Table 2 Comparative performance on cross-validation between BRCA1/2 classification models and other available approaches.
Table 3 Comparative performance on blind test sets between BRCA1/2 classification models and other alternative predictors.

Comparison with alternative approaches that rely on genetic data

As in our study we aim at predicting the molecular consequences of coding variants (missense variants) in BRCA1 and BRCA2, we compared the performance of our BRCA1 and BRCA2 models with other studies that solely rely on genetic data and likelihood ratios to identify pathogenic variants.

Easton et al.43 built a logistic regression model to evaluate the clinical significance of 1,433 sequence variants of unknown significance (VUSs) in BRCA1 and BRCA2, reporting an AUC of 0.80 and 0.70 for their BRCA1 and BRCA2 models, respectively. Similarly, previous studies (Lindor, 20119; Parsons, 201931) employed a multifactorial probability-based model (posterior probability model) for classifying VUSs in BRCA1 and BRCA2 that incorporates different forms of genetic evidence. For instance, Parsons et al.31 achieved an AUC of 0.78 and an accuracy of 0.80 with their BRCA1/2 model. In comparison, Lindor et al. (2011)9 obtained an AUC of up to 0.93 and an accuracy of up to 0.92 with their BRCA1 and BRCA2 models.

Similarly, MS et al. (2020)3 employed logistic regression to indicate carrier level based on personal and family history of cancer and calculate likelihood ratios denoting pathogenicity. By analysing ~ 138,000 individuals carrying 2,383 BRCA1/2 variants tested by multigene panel testing (MGPT), their model achieved an AUC of up to 0.83 for BRCA1 and up to 0.70 for BRCA2.

Our models significantly outperformed these alternative approaches, with the BRCA1 model obtaining an AUC of 0.98 and an accuracy of 0.96, and the BRCA2 model achieving an AUC of 0.97 and an accuracy of 0.98. The considerably higher performance of our method highlights the need to consider protein information when predicting pathogenic variants in BRCA1/2.

Comparison of BRCA1/2-general predictor with ACMG/AMP classification

To demonstrate the robustness of our final model (BRCA1/2 general) in classifying VUSs, we compared our final model classification results with the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) scoring44, by applying the bioinformatics tool InterVar45.

It was possible to compare most of the BRCA1/2 missense variants with InterVar45. Among the BRCA1 and BRCA2 VUSs classified as pathogenic by our final model, none were categorised into this class by InterVar45. Instead, these variants were categorised as either likely pathogenic or likely benign by InterVar, or remained VUSs.

Notably, only ~2% of BRCA1/2 missense variants (VUSs) classified as benign by our final model were categorised as likely pathogenic by InterVar45. The majority of the other missense variants classified as benign remained VUSs or were categorised as likely benign by InterVar45.

We observed many differences between our final model predictions and the ACMG/AMP variant scoring of the InterVar tool. This observation is in line with a recent study33 that revealed differences between a classification based on a multifactorial model and ACMG/AMP scoring.

Discussion

Achieving reliable estimates of cancer risk and of the functional consequences of BRCA1 and BRCA2 sequence variants has the potential to improve the management, diagnosis, and clinical decision-making for inherited breast and ovarian cancers38,46, and computational approaches can enable and support these estimates.

Our study aims to classify and comprehensively estimate the functional consequences of missense variants in BRCA1/2 genes. We have shown that incorporating machine learning approaches with general pathogenicity scoring systems and mutation physicochemical properties is an effective strategy to obtain accurate predictive models for identifying deleterious missense variants in BRCA1 and BRCA2, which might lead to assisting classification of variants of uncertain significance that currently restrict the interpretation of genomic testing data. The final models obtained for each gene presented statistically significant improvements in comparison with other available approaches.

Wide-scale experimental mutational scanning methods, such as those reported by Starita37 and Findlay38, have provided a broader view of the mutational landscape in BRCA1/2. Although these studies functionally classified thousands of variants (1,056 and 3,893, respectively), there are still over 12,520 and 22,772 possible unclassified missense variants in BRCA1 and BRCA29, which can be investigated efficiently using computational tools.

There are, however, still many limitations to applying these models. The number of experimentally validated deleterious variants in BRCA1 and BRCA2, necessary for model development, is limited, which poses a challenge for machine learning methods and restrains their generalization capabilities. In addition, training data are restricted to defined variants in protein regions known to be involved in impaired DNA repair. For instance, the only BRCA2 missense variants known to be disease-causing are in the DNA-binding domain. It is not understood whether variants located in other domains, which our model and others predict to be disease-causing, can repress DNA repair.

Nevertheless, the BRCA1/2 combined model used for predicting the functional impact of all possible missense variants in BRCA1 and BRCA2 demonstrated a sensitivity of 96% and 98% specificity, implying that extrapolation beyond the identified domains could be possible. Employing additional pathogenic and neutral measures could determine whether other components of these genes reflect pathogenicity as well as predict their functional impacts.

Here we demonstrate that our final model (BRCA1/2 combined) is a reliable approach to classify thousands of missense variants in a clinically actionable gene. We anticipate that the in-silico saturation mutagenesis methods would become applicable and reliable for interpreting variants of unknown significance, as well as for providing direct functional estimations for newly observed variants. Moreover, the improved performance in our predictive models could assist researchers in prioritising potential SNVs in BRCA1 and BRCA2 for further exploration and validation. The results of the computational saturation mutagenesis were made available to researchers (see Supplementary Data Set 1 for all potential SNVs in both genes).

Methods

Data sets

To build gene-specific models as well as a generic model for predicting the functional impact of missense variants in BRCA1 and BRCA2, variants of both genes that had been reviewed by an expert panel (3 stars) and had no conflicting interpretations were curated from the ClinVar10 database. Two different datasets were used for each gene to build and train the predictive models. The primary datasets comprised 247 missense variants (pathogenic: 97; benign: 150) for BRCA1 and 189 missense variants (pathogenic: 43; benign: 146) for BRCA2. In the combined datasets, benign or likely benign variants retrieved from ClinVar (with no conflicting interpretations) were grouped into the benign category, and variants interpreted as pathogenic or likely pathogenic were grouped as pathogenic. The combined datasets consisted of 335 missense variants for BRCA1 and 297 missense variants for BRCA2.
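The grouping of ClinVar interpretations into two classes can be sketched as a simple mapping. The label strings below follow ClinVar's clinical-significance vocabulary, but the function name `binary_label`, the `primary` switch, and the reading that the primary datasets keep only the definite "pathogenic" and "benign" classes are assumptions made for illustration.

```python
# Label sets modelled on ClinVar clinical-significance terms (lower-cased).
PATHOGENIC = {"pathogenic", "likely pathogenic", "pathogenic/likely pathogenic"}
BENIGN = {"benign", "likely benign", "benign/likely benign"}


def binary_label(significance, primary=False):
    """Map a clinical-significance string to 1 (pathogenic), 0 (benign) or None.

    With primary=True only the definite classes are kept (assumed to mirror
    the primary datasets); otherwise the "likely" classes are merged in, as
    described for the combined datasets. Any other term, such as a VUS or a
    conflicting interpretation, is dropped by returning None.
    """
    term = significance.strip().lower()
    if primary:
        return {"pathogenic": 1, "benign": 0}.get(term)
    if term in PATHOGENIC:
        return 1
    if term in BENIGN:
        return 0
    return None


# Hypothetical records: (variant, ClinVar interpretation).
records = [("V1", "Pathogenic"), ("V2", "Likely benign"), ("V3", "Uncertain significance")]
combined = {name: binary_label(sig) for name, sig in records}
```

Variants mapped to None would be excluded from training and treated as candidates for prediction instead.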

Furthermore, we utilised BRCA1/2 missense variants that ENIGMA31 quantitatively and qualitatively classified as pathogenic or benign to increase the reliability of our gene-specific models. The classification of these variants was originally derived from a multifactorial model and causality score ranking to assess their clinical significance. We included missense variants if they fulfilled the following criteria: pathogenic or benign labels, posterior probability score from multifactorial likelihood analysis ≥ 0.99 (pathogenic) or < 0.99 (benign), or International Agency for Research on Cancer (IARC) class 1 (benign) or class 5 (pathogenic)47 (see Supplementary Data Set 2 for more details on the variants used and analysed in the calculations).

The ENIGMA datasets used comprised 141 missense variants (pathogenic:29; benign:112) for BRCA1 and a total of 118 missense variants (pathogenic:11; benign:107) for the BRCA2 gene. The functional validation datasets used in our study were from Hart4,9, Starita37, and Findlay38. Notably, we have only kept the variants that had a functional impact at the protein level, i.e., nonsynonymous missense variants, excluding splicing variants.

All datasets were divided into training (85%) and blind test (15%) sets to train the predictive models and evaluate their generalisation performance on the classification task.
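An 85/15 division of this kind can be sketched as follows. The text does not state whether the split was stratified by class, so the per-class sampling here is an assumption, as are the function name `stratified_split` and the random seed.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.15, seed=0):
    """Return (train_indices, test_indices) preserving class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Class sizes of the BRCA1 primary dataset: 97 pathogenic (1), 150 benign (0).
labels = [1] * 97 + [0] * 150
train_idx, test_idx = stratified_split(labels)
```

Sampling within each class keeps the pathogenic-to-benign ratio similar in both partitions, which matters when the minority class is small, as it is for BRCA2.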

Feature engineering and selection

In this study, a range of features was calculated using different in silico tools to evaluate and predict the molecular and functional consequences of missense variants in BRCA1 and BRCA2. These features covered distinct categories, including evolutionary conservation, changes in protein post-translational modifications (PTMs), sequence properties, biophysical characterization, and evaluation of variant deleteriousness and pathogenicity. Supplementary Table 1 summarises the list of investigated features.

  1. Conservation and sequence-based: We estimated the degree of residue conservation using ConSurf23. Substitution matrices (PAMs, BLOSUMs)48 and AAindex25 were used to account for evolutionary conservation and the physicochemical attributes of amino acids, respectively. Sequence-based scores from SAAFEC-SEQ27 were used to evaluate the impact of single point mutations on protein stability and thermodynamics. Additionally, we applied the Missense Tolerance Ratio (MTR)49 to measure the deleteriousness of a missense variant by considering its surrounding regional intolerance.

  2. Protein post-translational modification (PTM) changes: We used the AWESOME30 tool to systematically assess the functional mechanisms underlying missense variants and their impact on PTMs, including ubiquitination, phosphorylation, glycosylation, methylation, and acetylation.

  3. Biophysical characterization: We used Align-GVGD19,20 (the version applied can be found at http://agvgd.hci.utah.edu/agvgd_input.php), which classifies missense substitutions as neutral or deleterious by combining the biophysical properties of amino acids with protein multiple sequence alignments and does not incorporate splicing.

  4. Prediction based on deleteriousness and pathogenicity scoring methods: Deleteriousness scoring methods from dbNSFP28 (Suppl. Table 1) were employed to quantify the deleterious effects of missense variants. We estimated the functional consequences of each variant using the pathogenicity-based features SNAP236, PANTHER26, SIFT12, and Provean29.

Selecting the best set of features to train predictive models is known to be a challenging problem. A bottom-up greedy feature selection method was employed to reduce dimensionality and noise. This approach evaluates each feature independently and iteratively, keeping only the set of features with the best performance50.
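A bottom-up greedy selection can be sketched as a loop that adds one feature at a time and keeps it only if the score improves. This is a generic forward-selection sketch rather than the study's implementation; the `score` callable stands in for a cross-validated MCC, and the feature names and weights in the toy scorer are hypothetical.

```python
def greedy_forward_selection(features, score):
    """Bottom-up greedy selection.

    `features` is a list of candidate feature names and `score` is a callable
    that takes a list of feature names and returns a value to maximise (for
    example, cross-validated MCC). Features are added one at a time; selection
    stops when no remaining feature improves the score.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining:
        candidate_scores = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidate_scores)
        if top_score <= best_score:
            break  # no candidate improves on the current subset
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Toy scorer with hypothetical additive contributions per feature.
TOY_WEIGHTS = {"conservation": 0.5, "stability": 0.3, "noise": -0.1}


def toy_score(subset):
    return sum(TOY_WEIGHTS[name] for name in subset)


chosen, chosen_score = greedy_forward_selection(list(TOY_WEIGHTS), toy_score)
```

In practice the scorer would retrain and cross-validate a classifier for every candidate subset, so the cost grows roughly with the square of the number of features; removing redundant features beforehand keeps this manageable.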

Qualitative analysis

To statistically identify features that differentiate between the two classes (pathogenic and benign), a two-sided Welch two-sample t-test was carried out on the primary and combined datasets, applying a p-value cutoff of < 0.05 and using the ggsignif package in RStudio.
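For reference, Welch's test compares two group means without assuming equal variances. The study ran the test in R; the Python sketch below only illustrates how the statistic and the Welch-Satterthwaite degrees of freedom are computed, and the score values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance uses n - 1).
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical conservation scores for two small groups of variants.
pathogenic_scores = [8.0, 9.0, 9.0, 7.0, 8.0]
benign_scores = [3.0, 5.0, 4.0, 6.0, 2.0, 4.0]
t_stat, dof = welch_t(pathogenic_scores, benign_scores)
```

The two-sided p-value is then read from a t distribution with `dof` degrees of freedom; in SciPy, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the whole test in one call.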

Machine learning approaches

To obtain predictive classification models, we first evaluated several classification algorithms, including Random Forest, Extremely Randomized Trees, Gradient Boosting, and AdaBoost, using the implementations available in the scikit-learn library51. The predictive models were trained using stratified tenfold cross-validation and evaluated on non-redundant blind tests.
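Stratified tenfold cross-validation assigns samples to ten folds so that each fold keeps roughly the same class ratio as the full training set. The sketch below shows one way to build such folds with the standard library only; the function names and seed are assumptions, and in scikit-learn the same role is played by `sklearn.model_selection.StratifiedKFold`.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each sample to one of n_folds folds, balancing classes."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of this class round-robin across the folds.
        for position, index in enumerate(indices):
            fold_of[index] = (position + offset) % n_folds
        offset += len(indices)  # start the next class where this one ended
    return fold_of


def cross_validation_splits(labels, n_folds=10, seed=0):
    """Yield (train_indices, validation_indices) for each fold."""
    fold_of = stratified_folds(labels, n_folds, seed)
    for fold in range(n_folds):
        train = [i for i, f in enumerate(fold_of) if f != fold]
        valid = [i for i, f in enumerate(fold_of) if f == fold]
        yield train, valid


# Class sizes of the BRCA2 primary dataset: 43 pathogenic (1), 146 benign (0).
labels = [1] * 43 + [0] * 146
splits = list(cross_validation_splits(labels))
```

Each classifier would be fitted on the training indices of a fold and scored on its validation indices, with the ten scores averaged to give the cross-validation estimate.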

Model evaluation metrics

The performance of classification models was evaluated using well-established evaluation metrics, including the Area Under the ROC curve (AUC), Matthews correlation coefficient (MCC), Precision, F1 Score, Sensitivity, and Specificity. AUC is an effective measure of a model's performance in a classification task across threshold settings. A higher AUC means that the model is better able to distinguish between the two classes, pathogenic and benign. AUC ranges between 0 and 1: a perfect model has an AUC of 1, and an AUC of 0.5 indicates that the model performs no better than a random classifier. MCC is a balanced metric for assessing a classifier's performance. It returns values between -1 and 1, where -1 represents total disagreement between predictions and observations, and 1 indicates a perfect prediction. The F1 score is the harmonic mean of the Precision and Recall (Sensitivity) of a classifier. Precision is the proportion of correctly classified positives among all observations predicted as positive. Recall is the proportion of correctly predicted positive observations among all actual positives (pathogenic) in a dataset. Sensitivity (True Positive Rate) and specificity (True Negative Rate) are statistical measures used to estimate the proportion of positive (pathogenic) and negative (benign) classes that are correctly classified, respectively.
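The threshold-dependent metrics above all follow from the four confusion-matrix counts. The sketch below computes them for the BRCA1 counts quoted in the Results (94 of 97 pathogenic variants identified, 5 of 150 benign variants misclassified); the function name is illustrative, and AUC is omitted because it requires predicted scores rather than counts.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Compute metrics from confusion-matrix counts.

    Assumes at least one actual positive, one actual negative and one
    predicted positive, so that no ratio divides by zero.
    """
    sensitivity = tp / (tp + fn)  # recall, true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)    # share of predicted positives that are correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Pathogenic is the positive class: TP = 94, FN = 3, FP = 5, TN = 145.
brca1 = classification_metrics(tp=94, fp=5, tn=145, fn=3)
```

Because MCC uses all four cells of the confusion matrix, it stays informative when the classes are imbalanced, which is why it is a reasonable target for feature selection on datasets where benign variants outnumber pathogenic ones.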