Introduction

It has long been acknowledged that the immunohistochemical (IHC) detection of Ki67-positive tumor cells provides important clinical information in breast cancer1. More recently, Ki67 gained clinical utility in the T1-2, N0-1, estrogen receptor (ER)-positive and HER2-negative patient group by identifying those patients who are unlikely to benefit from adjuvant chemotherapy2. However, Ki67 has not been consistently adopted for clinical care because of its unacceptably poor reproducibility across laboratories3,4,5.

Therefore, the International Ki67 in Breast Cancer Working Group (IKWG) published consensus recommendations in 2011 for best practices in the application of Ki67 IHC in breast cancer6. According to this consensus, the parameters that predominantly influence Ki67 IHC results can be grouped into pre-analytical (type of biopsy, tissue handling), analytical (IHC protocol), interpretation and scoring, and data analysis steps6. As the scoring method was the largest contributor to test variability7, the IKWG has undertaken substantial efforts to standardize Ki67 scoring by pathologists8,9. Although standardized Ki67 scoring methods reached pre-defined thresholds for adequate reproducibility in multi-institutional studies9,10, this was achieved only after calibration training and with tedious counting methods. In this context, recently updated IKWG guidelines now recommend Ki67 IHC for clinical adoption in specific situations, including the identification of very low (<5%) or very high (>30%) proliferation indices, which render more expensive gene expression tests unnecessary2.

An important additional issue that can cause variability in Ki67 measurements is the type of specimen (core biopsy vs excision) and its effect on Ki67 scoring in a multi-center setting2. Indeed, the IKWG recommended the use of core biopsies (CB), based on apparently superior results for visual Ki67 evaluation on CB compared with whole sections (WS).

In this multi-observer and multi-institutional study, we aimed to investigate the comparability of Ki67 measurements across corresponding core biopsy and resection specimens from the same breast cancer cases, when evaluated using a calibrated, automated reading system. Furthermore, we assessed differences in Ki67 scoring between consecutive sections, because the absence of such differences would simplify the selection of the tumor block on which to perform the IHC staining.

Materials and methods

Patients

This study used 30 cases of ER-positive breast cancer from phase 3 of the IKWG initiatives, comprising 15 cases from the UK and 15 cases from Japan that were selected to cover a range of Ki67 scores9. No outcome data were collected for this cohort. Patients were selected irrespective of age at diagnosis, grade, tumor size or lymph node status. The clinicopathological characteristics of these 30 cases can be found in our previous publications9,10.

Tissue preparation and immunohistochemistry (IHC)

Tissues from UK patients, both core biopsies and surgical resections, were collected according to ASCO/CAP guidelines, while tissues from patients in Japan were collected following ISO (International Organization for Standardization) 15189, as approved by the Japan Accreditation Board. Preparation of the Ki67 slides of the first cohort has been described previously9. Briefly, the corresponding core-cut biopsy and surgical resection blocks were centrally cut and stained for Ki67, resulting in 60 Ki67 slides from 30 cases. IHC was performed using the monoclonal antibody MIB-1 at a dilution of 1:50 (DAKO UK, Cambridgeshire, UK) on an automated staining system (Ventana Medical Systems, Tucson, AZ, USA) according to the consensus criteria established by the International Ki67 Working Group6. Sections from the same block were stained in a single immunohistochemistry run, except for four cases in which the staining was performed in two different runs. This approach limits technical variation in staining.

Sample distribution

Twenty volunteer pathologists from 15 countries, most of whom had participated in the previous Phase 3A study, were invited to participate. Four adjacent sections from each of the 60 blocks were centrally stained as follows: the first section with haematoxylin and eosin (H&E), the second with p63 (a myoepithelial marker, to assist in distinguishing DCIS from invasive breast cancer) and the third and fourth with Ki67 (designated as slide sets 1–2).

The Aperio ScanScope XT platform was used at 20× magnification to digitize the slides (pixel size: 0.4987 µm × 0.4987 µm), which were uploaded to a server and distributed as digital images. Seventeen pathologists successfully completed the study (Fig. 1).

Fig. 1: Study design.

Thirty patients with ER-positive breast cancer were enrolled, comprising 15 cases from the UK and 15 cases from Japan. From the corresponding core-cut biopsy and surgical resection blocks, two adjacent sections per case were centrally cut and stained for Ki67. Seventeen pathologists from 15 countries were given 60 Ki67 slides (30 core-cut biopsy slides and 30 surgical resection specimen slides) to score.

Digital image analysis (DIA)

The QuPath open-source software platform was used to build automated Ki67 scoring algorithms for breast cancer11. A detailed guideline for setting up and building an automated Ki67 scoring algorithm was sent to the participating labs, each of which was asked to build its own algorithm following the instructions and apply it to these 60 slides. The complete step-by-step instructions are available in Supplementary File 1. We asked each lab to build its own algorithm, rather than use a single pre-trained and locked-down Ki67 algorithm, in order to mimic clinical practice. At the time of the study, no generalizable Ki67 scoring algorithm that provides whole slide scoring was available. Thus, in theory, every lab would need to adjust and optimize any such DIA approach to its own characteristics (different fixation, antibodies, IHC protocols, etc.), necessitating a lab-specific DIA approach. The DIA method/guideline was calibrated in our previous studies, which demonstrated very good reproducibility among users12,13.

Briefly, after the whole invasive cancer area on a digitized slide was annotated, hematoxylin and DAB stain estimates for each case were refined using the “estimate stain vectors” command. We used watershed cell detection14 to segment the cells in the image with the following settings: detection image: optical density sum; requested pixel size: 0.5 µm; background radius: 8 µm; median filter radius: 0 µm; sigma: 1.5 µm; minimum cell area: 10 µm²; maximum cell area: 400 µm²; threshold: 0.1; maximum background intensity: 2. To classify detected cells as tumor cells, immune cells, stromal cells, necrosis or others (false detections, background) (Supplementary File 1), we used random trees as a supervised machine-learning method. The features used in the classification are described in Supplementary Table 1.

After the optimal color deconvolution and cell segmentation had been set, two independent classifiers were trained: one on a randomly selected, pre-defined core biopsy slide (CB classifier) and one on a resection specimen slide (WSI classifier). Both classifiers were run on both CB slides and resection specimen slides in order to adjust for potentially different characteristics of the two specimen types (Fig. 2).
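As an illustration of how classified detections translate into a Ki67 score, the sketch below computes the percentage of Ki67-positive tumor cells among all tumor cells from a list of per-cell class labels. It is a minimal example and not the QuPath workflow used by the participating labs; the label strings and the function name `ki67_index` are hypothetical, and non-tumor detections (stromal, immune, necrosis, other) are assumed to be excluded from the denominator.

```python
from collections import Counter

# Hypothetical class labels; the actual class names used in QuPath may differ.
POSITIVE = "tumor_ki67_positive"
NEGATIVE = "tumor_ki67_negative"


def ki67_index(cell_classes):
    """Return the Ki67 index (%) from an iterable of per-cell class labels.

    Only tumor cells enter the calculation; every other label
    (stroma, immune, necrosis, false detections) is ignored.
    """
    counts = Counter(cell_classes)
    positive = counts[POSITIVE]
    negative = counts[NEGATIVE]
    tumor_total = positive + negative
    if tumor_total == 0:
        raise ValueError("no tumor cells detected; Ki67 index is undefined")
    return 100.0 * positive / tumor_total


if __name__ == "__main__":
    # Made-up detections for one slide: 3 positive and 9 negative tumor cells.
    cells = [POSITIVE] * 3 + [NEGATIVE] * 9 + ["stroma"] * 5 + ["immune"] * 2
    print(f"Ki67 index: {ki67_index(cells):.1f}%")
```

Restricting the denominator to tumor cells is why the cell classification step matters: misclassified immune or stromal cells would otherwise shift the score.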

Fig. 2: Digital Image Analysis.

Representative images of digital image analysis (DIA) masks on a resection specimen (A, B) and a core biopsy case (C, D). Blue corresponds to Ki67-negative tumor cells, red indicates Ki67-positive tumor cells, green indicates stromal cells and purple marks immune cells. Black corresponds to necrosis and yellow marks other detections (false cell detections, noise).

Statistical analysis

For statistical analysis, SPSS 22 software (IBM, Armonk, USA) was used. The degree of agreement was evaluated by Bland–Altman plots and linear regression. To assess differences between specimen types, the Wilcoxon signed-rank test was applied, since the data were not normally distributed. Data were visualized using boxplots, spaghetti plots and dot plots.
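The agreement measures described above can be sketched in a few lines of Python. This is not the SPSS procedure used in the study; it is a minimal illustration that assumes the conventional Bland–Altman definitions (bias as the mean paired difference, limits of agreement as bias ± 1.96 SD of the differences) and estimates proportional error as the least-squares slope of the differences on the pairwise means. The function name `bland_altman` and the example values are hypothetical.

```python
import statistics


def bland_altman(scores_a, scores_b):
    """Summarise agreement between two paired series of Ki67 scores (%).

    Returns the bias (mean difference a - b), the lower and upper limits
    of agreement (bias +/- 1.96 SD), and the slope of the differences
    regressed on the pairwise means (a simple proportional-error estimate).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two paired series with at least two values")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]

    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the paired differences

    # Ordinary least-squares slope of differences on means.
    mean_of_means = statistics.mean(means)
    sxx = sum((m - mean_of_means) ** 2 for m in means)
    sxy = sum((m - mean_of_means) * (d - bias) for m, d in zip(means, diffs))
    slope = sxy / sxx if sxx else 0.0

    return {
        "bias": bias,
        "lower_loa": bias - 1.96 * sd,
        "upper_loa": bias + 1.96 * sd,
        "proportional_slope": slope,
    }


if __name__ == "__main__":
    # Made-up paired scores, for illustration only.
    core_biopsy = [10.0, 20.0, 30.0, 40.0]
    resection = [8.0, 17.0, 26.0, 35.0]
    for name, value in bland_altman(core_biopsy, resection).items():
        print(f"{name}: {value:.3f}")
```

A positive slope means that the difference between the two measurements grows with the magnitude of the score. The sketch reports estimates only; the significance tests for bias and slope quoted in the Results are not reproduced here.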

Results

Between-(consecutive) section difference in Ki67 scoring

A very high correlation and no systematic error (bias: −0.6%; p = 0.08) were found between the two consecutive (serial) sections regarding Ki67 scores. The difference between the sections tended to be greater in cases with higher Ki67 scores (proportional error p = 0.002; Fig. 3); however, this difference (0.6% mean difference) does not reach clinical relevance.

Fig. 3: Between-(consecutive) section difference in Ki67 scoring.

Bland–Altman plot comparing Ki67 scores between consecutive sections (A). Orange dashed line corresponds to the expected mean zero difference between Ki67 scores of the two sections. Red line represents the observed mean difference between Ki67 scores of the two sections, namely the observed bias (red dashed lines are the CI of the observed mean difference). Blue lines illustrate the range of agreement (lower and upper limit of agreement) based on 95% of differences (blue dashed lines are the CI of the limits of agreement). Black line is the fitted regression line to detect potential proportional error (black dashed lines are the CI of the regression line). B represents the scatter plot with fitted regression between the Ki67 scores of the two consecutive sections.

Specimen type (CB vs resection specimen) difference in Ki67 scoring

A low correlation was found between core biopsy and whole section excision images (Fig. 4). Ki67 scores were higher when determined on core biopsy slides than on paired whole sections (p ≤ 0.001; median difference: 5.31%; IQR: 11.50%) from subsequent surgical excisions of the same tumor. Systematic error occurred between specimens from the same patient, as core biopsy Ki67 scores were greater, with a clinically relevant mean difference of 6.6% (bias p = 0.001). The limits of agreement must also be considered wide from a clinical perspective (−13.7% to 27%). Furthermore, the excess of CB over WS scores was larger in cases with higher Ki67 scores (proportional error p = 0.001). Moreover, the variability of the differences in Ki67 scores between CB and WS showed an increasing trend, proportional to the magnitude of the Ki67 score (Fig. 4). The same results were found irrespective of the origin of the specimens (CB vs WS p < 0.001 for both UK and Japan cases; Fig. 5).
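As a rough consistency check on these figures, and assuming the limits of agreement were computed in the conventional way as bias ± 1.96 SD of the paired differences (an assumption, since the formula is not stated here), the reported limits imply the following:

```latex
\begin{align*}
\text{bias} &\approx \frac{-13.7 + 27}{2} = 6.65 \\[4pt]
\text{SD of differences} &\approx \frac{27 - (-13.7)}{2 \times 1.96}
  = \frac{40.7}{3.92} \approx 10.4
\end{align*}
```

Both values are in Ki67 percentage points. The midpoint agrees with the reported mean difference of 6.6% up to rounding; the SD of about 10.4 is derived here for illustration and is not a value reported in the study.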

Fig. 4: Between-specimen (CB vs resection specimen) difference in Ki67 scoring.

Bland–Altman plot comparing Ki67 scores between specimens (A). Orange dashed line corresponds to the expected mean zero difference between Ki67 scores of the two specimens. Red line represents the observed mean difference between Ki67 scores of the two specimens, namely the observed bias (red dashed lines are the CI of the observed mean difference). Blue lines illustrate the range of agreement (lower and upper limit of agreement) based on 95% of differences (blue dashed lines are the CI of the limits of agreement). Black line is the fitted regression line to detect potential proportional error (black dashed lines are the CI of the regression line). B shows the distributions of Ki67 scores of the two specimens. The bottom/top of the boxes represent the first (Q1)/third (Q3) quartiles, the bold line inside the box represents the median and the two bars outside the box represent the lowest/highest datum still within 1.5× the interquartile range (Q3–Q1). C represents the scatter plot with fitted regression between the Ki67 scores of the two specimens.

Fig. 5: Between-specimen (CB vs resection specimen) difference in Ki67 scoring by case and by origin of the cases.

A represents cases collected in the United Kingdom with representative Ki67 IHC images of corresponding CB and resection specimens. B represents cases collected in Japan with representative Ki67 IHC images of corresponding CB and resection specimens. The bottom/top of the boxes represent the first (Q1)/third (Q3) quartiles, the bold line inside the box represents the median and the two bars outside the box represent the lowest/highest datum still within 1.5× the interquartile range (Q3–Q1). Outliers are represented with circles, extreme outliers with asterisks.

Discussion

In this study, we observed that clinically relevant and systematic discrepancies occurred in Ki67 scores between core biopsy and corresponding surgical specimens when evaluated with an automated reading system. Overall, Ki67 scores were higher on CB compared to WS samples. Furthermore, this discrepancy was even more pronounced in tumors that expressed higher levels of Ki67 in general.

Ki67 is one of the most promising yet controversial biomarkers in breast cancer, with limited adoption into clinical practice due to its high inter- and intra-laboratory variability3,15. Although Ki67 is widely used in many countries, there is wide variability in how it is used (to distinguish luminal A-like from B-like tumors; to decide whether to order gene-expression profiling; as an adjunct to mitotic counts, etc.), and there is still no uniformity among clinicians on how to use this biomarker, let alone on which cut-off to use. Although the IKWG set up a guideline in 2011 to improve pre-analytical and analytical performance, inter-laboratory protocols still demonstrated low reproducibility related to different sampling, fixation, antigen retrieval, staining and scoring methods6,7. As scoring was the largest single contributor to assay variability, the IKWG has undertaken multi-institution efforts that have standardized visual scoring of Ki67, but in a manner that requires on-line calibration tools and careful scoring of several hundred cells, which may not be practical for pathologists working under time constraints in daily practice8,9. This suggests that digital solutions may still be required to address this issue.

The rise of digital image analysis (DIA) platforms has improved capacity and automation in biomarker evaluation16,17. DIA platforms are able to assess nuclear IHC biomarkers such as Ki67, and numerous studies have been conducted to compare human visual scoring with DIA platforms12,18,19,20,21,22,23,24,25,26,27,28. Although the latest guideline of IKWG recommends Ki67 for clinical practice in specific situations, the type of specimen as a potential pre-analytical factor contributing to Ki67 variability was not specifically investigated in a multi-operator/multi-center setting. In this study we aimed to address these biospecimen questions including assessment by specimen type and between serial sections.

One explanation for our finding would be the presence of tumor heterogeneity and the broader field of review in a whole section from a resection specimen. However, one would expect this cause of discrepancy to result in random discordance, not the consistent finding that Ki67 scores on core biopsies are higher than those on resection specimens. Rather, we conclude that lower Ki67 in resection specimens is more easily explained by pre-analytical factors. For example, since times to fixation are longer for resection specimens than for CB, persistent cell division will occur even in an unfixed, hypoxic environment. Further, epitope degradation also occurs with prolonged time to fixation29,30,31.

In addition, one might expect hot spot scoring to lead to less discrepancy between CB and WS, because it considers only the hottest area of Ki67 positivity (the highest percentiles of the Ki67 distribution) on both specimen types, while global assessment evaluates the total Ki67 distribution, which can be variable10. However, there remains a fundamental issue of how exactly a hot spot is defined and where pathologists set its boundaries. Moreover, the International Ki67 Working Group has recommended global scoring over hot spot scoring because it showed a consistent trend towards increased reproducibility in both core biopsy9 and excision10 specimens.

Additional support for the conclusion that the difference in Ki67 between CB and WS is systematic is provided by the observation of clinically relevant differences between specimens in cases from different institutions used in this study, independently scored multiple times by 17 pathologists. Although many studies have focused on assessing the level of agreement between CB and resection samples in Ki67 scoring, consensus was not possible due to lack of standardization32.

Our results are consistent with previous results showing poor/moderate concordance (κ = 0.195–0.814) between CB and resection specimens in Ki67 scoring1,33,34,35,36,37,38,39,40,41,42,43,44,45,46. However, some studies showed higher Ki67 scores on resection samples35,36,38. This discrepancy among studies may be due to a lack of standardization in methodology leading to different scoring methods, which we have previously demonstrated to be highly variable2. Moreover, inter-institutional discrepancies could also result from the different antibodies and protocols used to detect Ki67, from different tissue handling/fixation protocols and, to some extent, from intratumoral Ki67 heterogeneity6. Thus, our findings provide further support for the latest IKWG recommendations and for the consensus that Ki67 should ideally be tested on CB samples, because this minimizes many fixation problems; Ki67 IHC is more sensitive than ER or HER2 to variability in fixation2. Since pre-analytical factors are critical in diagnostic pathology, the IKWG recommends that breast cancer samples for Ki67 testing be processed in line with ASCO/CAP guidelines2.

There are a number of limitations in this study. It focused only on analytical and pre-analytical questions; therefore, we cannot demonstrate the clinical validation of the calibrated tool. Many other studies address the prognostic or predictive value of this test, and that goal was beyond the scope of this effort. For the same reason, further clinical studies are needed to show how this consistent difference in Ki67 between corresponding core-cuts and resection specimens affects prognostic value and the assessment of benefit from neoadjuvant endocrine therapy. Furthermore, the low correlation suggests a critical difference between a core biopsy score and a whole section excision score, which can undermine the use of outcome data, derived predominantly from resection samples, to identify patients at high risk using a score derived from a core biopsy. This study therefore suggests caution in this approach, given that a clinically relevant change in Ki67 may occur even without intervening therapy. Further, the Ki67 assessments were based on biospecimens from only 2 central sites. While the participating pathologists within the IKWG represented 15 countries, specimens were centrally acquired and stained. Whereas other investigators have compared specimens from multiple different sites5,7,47, we limited the number of sites to remove the variables associated with the technical aspect of the stain. Finally, while the core-cut biopsy and resection are from the same case, only a single core was assessed. Thus, we could be missing heterogeneity seen in larger resection specimens. The effect of heterogeneity could be decreased by taking multiple core cuts when the clinical situation allows. However, since examination of a single core cut represents the clinical standard of care in several countries, we did not pursue multiple cores.

In conclusion, while we find no significant difference in digitally assessed Ki67 index between serial sections, we do find a systematic discrepancy between core biopsies and corresponding whole sections: core biopsy samples yield higher scores (likely owing to pre-analytical factors, including more standardized and prompt tissue handling and fixation). Therefore, this work suggests that Ki67 IHC tested on core biopsy samples should be preferred to testing on excision specimens in clinical decision-making, because doing so avoids many pre-analytical sources of variability.