Independent validation of machine learning in diagnosing breast cancer on magnetic resonance imaging within a single institution
Cancer Imaging volume 19, Article number: 64 (2019)
As artificial intelligence methods for the diagnosis of disease advance, we aimed to evaluate machine learning in the predictive task of distinguishing between malignant and benign breast lesions on an independent clinical magnetic resonance imaging (MRI) dataset within a single institution for subsequent use as a computer aid for radiologists.
Computer analysis was conducted on consecutive dynamic contrast-enhanced MRI (DCE-MRI) studies from 1483 breast cancer and 496 benign patients who underwent MRI examinations between February 2015 and October 2017; the age ranges of the cancer and benign patients were 19 to 77 and 16 to 76 years, respectively. Cases were separated into a training dataset (years 2015 & 2016; 1444 cases) and an independent testing dataset (year 2017; 535 cases) based solely on MRI examination date. After radiologist indication of the lesion, the computer automatically segmented the lesion and extracted radiomic features, which were subsequently merged with a support-vector machine (SVM) to yield a lesion signature. Area under the receiver operating characteristic (ROC) curve (AUC) with 95% confidence intervals (CI) served as the primary figure of merit in the statistical evaluation for this clinical classification task.
In the task of distinguishing malignant and benign breast lesions on DCE-MRI, the trained predictive model yielded an AUC value of 0.89 (95% CI: 0.858, 0.922) on the independent image set. AUC values of 0.88 (95% CI: 0.845, 0.926) and 0.90 (95% CI: 0.837, 0.940) were obtained for mass lesions only and non-mass lesions only, respectively. Compared with actual clinical management decisions, the predictive model achieved 99.5% sensitivity while recommending biopsy for 9.6% fewer benign lesions.
On an independent, consecutive clinical dataset within a single institution, a trained machine learning system yielded promising performance in distinguishing between malignant and benign breast lesions.
Breast cancer is the most common cancer and the second leading cause of cancer death in women in western countries [1]. In Chinese women, breast cancer is the most common cancer diagnosed, and it alone is expected to account for 15% of all new cancers in women [2]. Dynamic contrast enhanced (DCE) magnetic resonance imaging (MRI) of the breast is being used increasingly for a variety of clinical purposes, including screening of women at high risk for developing breast cancer, evaluation of the extent of malignant disease, and post-treatment evaluation [3,4,5]. DCE-MRI has emerged as a modality that is complementary to mammography and ultrasonography because of the additional three-dimensional spatial and temporal information about the lesion that it yields.
While DCE-MRI characterization has diagnostic value in differentiating malignant from benign lesions [6], the MRI assessment of breast cancer cases may be hindered by inter-observer and intra-observer variations, labor-intensive interpretation methods, and limited clinical interpretation guidelines [7, 8]. To aid radiologists in diagnostic classification, various investigators are developing computerized image analysis methods for characterization, i.e., computer-aided diagnosis (CADx)/radiomics [9,10,11,12,13,14,15]. The purpose of this study was to evaluate the potential of quantitative MRI radiomics and machine learning in the task of distinguishing between malignant and benign breast lesions on an independent, consecutive clinical dataset within a single institution for ultimate use as a computer aid to radiologists in the workup of breast lesions. To our knowledge, our study is the largest such independent study in the field.
Breast DCE-MRI database
Our study initially involved 4704 patients presenting for breast DCE-MRI examinations as recorded in the Department of Breast Imaging of the Tianjin Medical University Cancer Institute and Hospital. As this study was a retrospective and anonymized machine learning study, informed consent was waived and the study was deemed exempt. Patients’ MRIs and clinical data were collected consecutively for our study within the years of 2015–2017. Patients were excluded if they had undergone previous surgical excision, systemic hormone therapy, or chemotherapy, or if they lacked final pathology results. A total of 1979 patients were ultimately included in our study (Fig. 1).
We conducted a retrospective review of the breast MRI images from the 1483 histopathology-proven breast cancer patients and the 496 histopathology-proven benign patients who had undergone diagnostic breast MRI examinations between February 2015 and October 2017. All histopathology was based on surgical specimens. The cancer patients ranged in age from 19 to 77 years (mean 48.1 years, standard deviation 9.9 years, median 47 years). The benign patients ranged in age from 16 to 76 years (mean 42.1 years, standard deviation 9.8 years, median 43 years). The breast MRI databases consisted of 1494 lesions from the 1483 cancer patients, including 8 bilateral breast cancer patients and 3 bifocal breast cancer patients, and 496 primary lesions from 496 benign patients.
MR images had been obtained with a 3 T GE system using a dedicated 8-channel phased-array breast coil (Discovery 750, GE Medical Systems, Milwaukee, WI). Sagittal dynamic contrast-enhanced MRI (DCE-MRI) was obtained with the volume imaging for breast assessment (VIBRANT) bilateral breast imaging technique, with TR = 6.1 ms, TE = 2.9 ms, flip angle = 15°, matrix size = 256 × 128, field of view = 26 cm × 26 cm, NEX = 1, slice thickness = 1.8 mm. The temporal resolution for each dynamic acquisition was 90 s. Before injection of the contrast agent, serial mask images were obtained. Subsequently, the contrast agent (Gd-DTPA, 0.1 mmol/kg body weight, flow rate 2.0 ml/s) was injected using an automatic MR-compatible power injector, followed by flushing with the same total dose of saline solution. Dynamic MRI acquisitions were started immediately after the injection. The acquisition was repeated five times, and each phase took 90 s.
In order to not incur bias in case selection as well as to mimic a development-then-clinical-use scenario, our database was divided into a training dataset and a testing dataset based solely on the date of the MRI examinations. The training data set included the breast MRIs acquired within February 2015 through December 2016, and the test dataset included the breast MRIs acquired within January 2017 through October 2017. Note that the cases were unique in that no patients were within both the training and testing sets.
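The date-based split described above can be sketched in a few lines. This is a minimal illustration only: the record layout, the case identifiers, and the function name `split_by_exam_date` are assumptions made for the example, and the dates are invented rather than taken from the study data.

```python
from datetime import date

# Hypothetical exam records: (case_id, exam_date). Identifiers and dates are
# invented for illustration and are not study data.
exams = [
    ("case-001", date(2015, 2, 14)),
    ("case-002", date(2016, 12, 30)),
    ("case-003", date(2017, 1, 3)),
    ("case-004", date(2017, 10, 20)),
]

# Cutoff taken from the text: examinations through December 2016 go to
# training, examinations from January 2017 onward go to testing.
TEST_START = date(2017, 1, 1)


def split_by_exam_date(records, test_start=TEST_START):
    """Return (train_ids, test_ids) using only the examination date."""
    train = [case_id for case_id, exam_date in records if exam_date < test_start]
    test = [case_id for case_id, exam_date in records if exam_date >= test_start]
    return train, test


train_ids, test_ids = split_by_exam_date(exams)

# The text states that no patient appears in both sets; this check makes
# that requirement explicit.
assert not set(train_ids) & set(test_ids)
```

Splitting on the calendar alone means no image content or outcome information can influence which cases end up in the test set.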
The clinicopathological characteristics of the breast cancer and benign patients of the two datasets are shown in Table 1, including the BI-RADS classifications. Invasive ductal carcinomas made up the majority of malignant lesions, whereas fibroadenomas were the most common benign lesion (Fig. 2). During the patients’ clinical workup, BI-RADS ratings had been recorded by the MRI radiologist using the Breast Imaging Reporting and Data System (BI-RADS) [16]. Note that all of the patients in this study underwent pathological examination, even those with MRI-BI-RADS categories 1, 2, or 3, when their mammographic or sonographic findings were judged to be suspicious or highly suggestive for cancer, and the actual clinical decisions were made according to the multimodality medical imaging interpretations.
Computerized analysis of breast lesions on MRI images
We analyzed the DCE-MRIs using an existing quantitative radiomics machine learning workstation from the University of Chicago, which had been previously developed to characterize suspicious breast lesions on MRI as benign or malignant (Fig. 3) [11, 17,18,19]. With the workstation, each breast lesion was first manually located on the MRI by the study radiologist (YJ), a breast radiologist with 5 years of experience in breast DCE-MRIs. The computer then automatically conducted three-dimensional segmentation of the tumor and extraction of radiomic features from six categories: size, shape, morphology, enhancement texture, kinetics, and enhancement-variance kinetics.
The output from this established workstation was subsequently used for the machine learning predictive model to perform classification—that is, calculation of a malignancy score related to the likelihood of malignancy for each lesion.
During training of the predictive model on the training set, stepwise feature selection using linear discriminant analysis with a Wilks lambda cost function was conducted in order to identify the subset of features that performed effectively in the classification of malignant and benign lesions. Then a support-vector machine (SVM) classifier [22] was trained, yielding a lesion score related to the likelihood of malignancy.
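To give a feel for the Wilks lambda criterion, the sketch below computes it for one feature at a time as the ratio of within-class to total sum of squares (values near 0 indicate good class separation, values near 1 indicate none). This is a deliberately simplified, univariate illustration and not the stepwise multivariate procedure used by the workstation; the function name, feature names, and feature values are all hypothetical.

```python
def wilks_lambda(benign_values, malignant_values):
    """Univariate Wilks' lambda: within-class sum of squares / total sum of squares.

    Assumes the pooled values are not all identical (total sum of squares > 0).
    """
    def sum_sq(values, center):
        return sum((v - center) ** 2 for v in values)

    all_values = list(benign_values) + list(malignant_values)
    grand_mean = sum(all_values) / len(all_values)
    mean_benign = sum(benign_values) / len(benign_values)
    mean_malignant = sum(malignant_values) / len(malignant_values)

    within = sum_sq(benign_values, mean_benign) + sum_sq(malignant_values, mean_malignant)
    total = sum_sq(all_values, grand_mean)
    return within / total


# Hypothetical feature values, invented for illustration:
# feature name -> (values in benign lesions, values in malignant lesions).
features = {
    "separating_feature": ([1.0, 1.2, 0.8], [3.0, 3.2, 2.8]),
    "uninformative_feature": ([1.0, 3.0, 2.0], [2.0, 1.0, 3.0]),
}

lambdas = {name: wilks_lambda(b, m) for name, (b, m) in features.items()}

# Rank features from most to least discriminating (smallest lambda first).
ranking = sorted(lambdas, key=lambdas.get)
```

A stepwise procedure would repeat a multivariate version of this calculation, adding or removing one feature per step according to how much it lowers lambda given the features already selected.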
The diagnostic performance was evaluated using the trained predictive model on the independent test set – for (a) all cases, both mass and non-mass lesions, (b) only mass lesions, and (c) only non-mass lesions. In order to assess the robustness of the trained system, only the one trained system was used in all three evaluations. Such evaluations were deemed to mimic the clinical situation where the mass/non-mass status of a lesion is unknown.
Performance evaluation and statistical analyses
Receiver operating characteristic (ROC) analysis was used to assess overall classification performance on the independent test set for the task of differentiating between malignant and benign lesions: (a) for all lesions, i.e., both mass and non-mass lesions, (b) only mass lesions, and (c) only non-mass lesions. Area under the ROC curve (AUC) served as the primary figure of merit in these tasks [23, 24]. Secondary performance metrics calculated were sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) [25].
Note that BI-RADS had been used by the radiologist during the actual clinical interpretation, in which all available MR images were used. Although BI-RADS categories 1, 2 and 3 are considered benign and categories 4 and 5 are considered malignant, clinically, all lesions had been sent to biopsy. Therefore, the clinical performance could be characterized as having 100% sensitivity and 0% specificity.
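For readers unfamiliar with the AUC, the following sketch computes its empirical (nonparametric) form: the probability that a randomly chosen malignant lesion receives a higher score than a randomly chosen benign one, with ties counted as one half. This is an illustrative stand-in, not the ROC software used in the study, which may fit the curve differently; the function name and the scores are hypothetical.

```python
def empirical_auc(malignant_scores, benign_scores):
    """Fraction of (malignant, benign) pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for m in malignant_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(malignant_scores) * len(benign_scores))


# Hypothetical malignancy scores, invented for illustration.
malignant = [0.9, 0.8, 0.4]
benign = [0.7, 0.3, 0.4]

auc = empirical_auc(malignant, benign)  # 7.5 correctly ordered pairs out of 9
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two classes.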
Thus, for comparison of the machine learning system to the actual clinical findings, the threshold value of the computer-generated malignancy score that resulted in 100% sensitivity on the training set was determined and subsequently applied to the testing set to obtain sensitivity, specificity, PPV, and NPV values. Resulting performance values at different threshold values were also calculated. PPV is calculated as the percentage of true positives over all lesions that had been classified as positive (i.e., malignant) by the trained predictive model, i.e., the probability that a case with a malignant computer output actually has cancer. NPV is the percentage of true negatives over all lesions that had been classified as negative (i.e., benign) by the trained predictive model, i.e., the likelihood that a case with a benign computer output actually is cancer free.
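The threshold procedure described above can be sketched as follows: choose the cutoff on the training scores so that every malignant training lesion is called positive, then apply that fixed cutoff to the test scores. The function names, the convention that a score at or above the cutoff is positive, and all scores and labels are assumptions made for this example.

```python
def threshold_for_full_sensitivity(scores, labels):
    """Largest cutoff at which every malignant (label 1) lesion scores >= cutoff."""
    return min(s for s, y in zip(scores, labels) if y == 1)


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    return numerator / denominator if denominator else None


def confusion_metrics(scores, labels, threshold):
    """Sensitivity, specificity, PPV and NPV with 'score >= threshold' as positive."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        if s >= threshold:
            if y == 1:
                tp += 1
            else:
                fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical scores and labels (1 = malignant, 0 = benign), for illustration.
train_scores, train_labels = [0.2, 0.35, 0.5, 0.9], [0, 0, 1, 1]
test_scores, test_labels = [0.1, 0.6, 0.45, 0.8, 0.7], [0, 0, 1, 1, 1]

cutoff = threshold_for_full_sensitivity(train_scores, train_labels)
test_metrics = confusion_metrics(test_scores, test_labels, cutoff)
```

In this toy example one malignant test lesion scores below the training-derived cutoff, which shows why 100% sensitivity on the training set does not guarantee 100% sensitivity on the test set.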
All statistical analyses were performed using SPSS software (version 19.0, SPSS). The reported p-values were two-sided. A p-value less than 0.05 was set as the threshold for statistical significance given that a single performance evaluation was conducted. In addition, confidence intervals were calculated using ROC software.
The radiomic features that had been selected and merged into the lesion signature during training included 2 shape phenotypes, 1 morphological phenotype, 3 enhancement texture phenotypes, and 4 kinetic curve assessments (Table 2).
On the independent test dataset including both mass and non-mass lesions, the trained machine learning system yielded an AUC value of 0.89 (95% CI: 0.858, 0.922) in the task of distinguishing between malignant and benign lesions (Fig. 4).
For mass lesions in the test dataset, the trained system yielded an AUC value of 0.88 (95% CI: 0.845, 0.926). For non-mass lesions in the test dataset, the trained system yielded an AUC value of 0.90 (95% CI: 0.837, 0.940).
A summary of sensitivity, specificity, PPV, and NPV values at different threshold values of the malignancy score in the test set is given in Table 3. At the threshold value that had yielded 100% sensitivity on the training set, the machine learning system on both mass and non-mass lesions demonstrated on the test set a higher PPV (i.e., 80.3%, 419/522) than the actual clinical decisions (78.7%, 421/535) (P > 0.05); that is, it suggested eleven fewer unnecessary benign biopsies (i.e., 9.6%, 11/114). However, it would erroneously not have recommended biopsy of two cancers (i.e., 0.5%, 2/421). These two cases were both invasive ductal carcinomas and were initially classified by the radiologist as BI-RADS 5, and thus would have gone to biopsy.
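The percentages quoted above follow directly from the reported counts, as the short check below shows. Only the four input counts come from the text; the remaining confusion-matrix cells and the NPV are derived here by subtraction and division and are not figures quoted in the paragraph above.

```python
# Counts reported in the text for the independent test set.
malignant_total = 421   # cancers in the test set
benign_total = 114      # benign lesions in the test set
model_positive = 522    # lesions the model would send to biopsy
true_positive = 419     # cancers among those 522

# Remaining confusion-matrix cells, derived by subtraction.
false_negative = malignant_total - true_positive   # cancers missed by the model
false_positive = model_positive - true_positive    # benign lesions still biopsied
true_negative = benign_total - false_positive      # benign biopsies avoided

sensitivity = true_positive / malignant_total
specificity = true_negative / benign_total
ppv_model = true_positive / model_positive
# Clinically every lesion went to biopsy, so the clinical PPV is the prevalence.
ppv_clinical = malignant_total / (malignant_total + benign_total)
missed_cancer_rate = false_negative / malignant_total
# Derived here, not quoted in the paragraph above.
npv_model = true_negative / (true_negative + false_negative)

for name, value in [
    ("sensitivity", sensitivity),
    ("specificity (benign biopsies avoided)", specificity),
    ("PPV, model", ppv_model),
    ("PPV, clinical", ppv_clinical),
    ("missed cancers", missed_cancer_rate),
]:
    print(f"{name}: {100 * value:.1f}%")
```

Note that the 9.6% figure is the fraction of benign lesions spared biopsy (11 of 114), which is the model's specificity at this threshold.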
Compared with non-mass lesions, the machine learning system demonstrated a lower sensitivity (P > 0.05) and higher specificity on mass lesions (P > 0.05).
Some representative breast DCE-MRI studies from the independent consecutive test set as classified by the trained MRI machine learning system are presented in Fig. 5.
Our results demonstrated that a computer workstation, initially developed with datasets from the US for automatic 3D lesion segmentation and radiomic feature extraction, has the potential to distinguish between malignant and benign breast lesions from Chinese populations. It is important to note that the statistical power of the current study was limited by the modest size of the database, even though, to our knowledge, this is the largest database of this type in this breast radiomics field. Our results demonstrate that machine learning analysis of DCE-MRI may potentially provide clinically-useful information to distinguish benign and malignant lesions in Chinese databases obtained from a single institution.
While we cannot compare directly to the reported results from others due to the use of different databases, we can note that the performance level of the computer workstation was similar to, and often higher than, other reported AUCs in this diagnostic task [26,27,28]. We also note that our performance was higher than that reported in Shimauchi et al. [29], which indicated that use of the computer aid resulted in a statistically significant improvement in radiologists’ performances.
The American College of Radiology (ACR) BI-RADS MRI lexicon [16] is used worldwide for describing the morphologic and kinetic features of breast lesions. It allows for standardization of the terminology used in describing the findings and categorization of the study. Subsequent descriptors of other lesion features, such as shape, distribution, margins, and enhancement pattern, are also used; these differ depending on the type of enhancement, i.e., mass enhancement or non-mass enhancement. Most previous investigations have reported on masses and rarely on lesions presenting as non-mass enhancement, primarily because of the challenges in defining the lesion extent for computer-based analysis. In our study, in order to mimic clinical practice, a single and independently-trained machine learning model was used for all the lesion types (masses and non-mass enhancements), and our results demonstrated that the classification model was stable in the task of distinguishing between malignant and benign lesions for both mass and non-mass lesions.
Note that in clinical practice, radiologists’ performance is based on multiparametric breast MR images, including DCE, T2-weighted, and diffusion-weighted images, as well as mammography and ultrasound. In our study, the computer only analyzed dynamic contrast-enhanced MR images to yield the predictive lesion signature. One would expect improved performance by using multiparametric breast MR images and multimodality medical images; thus, we will analyze those in the future.
The imaging technique used in our study involves acquisition of one pre-contrast and a series of post-contrast images of both breasts at a temporal resolution of roughly 90 s. This type of breast MRI acquisition sequence has the advantage of being able to provide both morphological and kinetic information from one MRI examination, and was representative of early dynamic MRI protocols [30]. In addition, our large clinical database came from a single institution, which avoids the problem that image acquisition protocols across breast MRIs might not be standardized. However, that also limits statements on generalizability of the findings.
Patient motion during image acquisition may introduce inaccuracies in the computer-extracted kinetic features [31, 32]. Cases with abrupt and large patient movements between dynamic series had been clinically treated as acquisition failure and were clinically excluded from our datasets. In our datasets, only patient respiratory motion was observed. The motion mostly resulted in additional blurring rather than actual displacement of image structure. However, it is important to note that image alignment of breast volumes at different time frames may improve the accuracy of our analyses.
There are some limitations of this study. First, this was a retrospective analysis of images from a single vendor acquired at a single institution, although the analysis was conducted with independent training and testing sets with unique patients. It will be critical to evaluate whether the present findings generalize to other vendor images and external data. A future multicenter study may help address this question. Second, all the cases had gone to biopsy; thus, we could not assess the system on benign lesions that were deemed benign solely by follow-up. Also, the study findings cannot be used to determine whether the radiologists’ performances with the computer aid system are significantly improved in comparison with their performances without computer aid, even though we analyzed the DCE-MRI diagnostic results by the clinical radiologists. A clinical observer study is necessary. We note that we previously demonstrated in an observer study that use of computer-aided diagnosis with MRI improves the performance of radiologists in the task of differentiating malignant and benign lesions [29].
In conclusion, we have validated a machine-learning radiomics method for DCE-MRI on an independent, consecutive patient test set, suggesting a potentially useful aid for radiologists in the task of distinguishing between malignant and benign breast lesions during diagnostic workup of breast lesions.
Availability of data and materials
Abbreviations
AUC: Area under the receiver operating characteristic curve
BI-RADS: Breast Imaging Reporting and Data System
DCE-MRI: Dynamic contrast-enhanced magnetic resonance imaging
DCIS: Ductal carcinoma in situ
IDC: Invasive ductal carcinoma
NPV: Negative predictive value
PPV: Positive predictive value
ROC: Receiver operating characteristic
Siegel RL, Miller KD, Jemal A. Cancer statistics, 2018. CA Cancer J Clin. 2018;68(1):7–30.
Chen W, Zheng R, Baade PD, et al. Cancer statistics in China, 2015. CA Cancer J Clin. 2016;66(2):115–32.
Newstead GM. MR imaging in the management of patients with breast cancer. Semin Ultrasound CT MR. 2006;27(4):320–32.
Demartini W, Lehman C. A review of current evidence-based clinical applications for breast magnetic resonance imaging. Top Magn Reson Imaging. 2008;19(3):143–50.
Saslow D, Boetes C, Burke W, et al. American Cancer Society breast Cancer advisory group. American Cancer Society guidelines for breast screening with MRI as an adjunct to mammography. CA Cancer J Clin. 2007;57(2):75–89.
Wiener JI, Schilling KJ, Adami C, et al. Assessment of suspected breast cancer by MRI: a prospective clinical trial using a combined kinetic and morphologic analysis. AJR Am J Roentgenol. 2005;184(3):878–86.
Mussurakis S, Buckley DL, Coady AM, et al. Observer variability in the interpretation of contrast enhanced MRI of the breast. Br J Radiol. 1996;69(827):1009–16.
Kim SJ, Morris EA, Liberman L, et al. Observer variability and applicability of BI-RADS terminology for breast MR imaging: invasive carcinomas as focal masses. AJR Am J Roentgenol. 2001;177(3):551–7.
Giger ML, Chan HP, Boone J. Anniversary paper: history and status of CAD and quantitative image analysis: the role of medical physics and AAPM. Med Phys. 2008;35(12):5799–820.
Giger ML, Karssemeijer N, Schnabel JA. Breast image analysis for risk assessment, detection, diagnosis, and treatment of Cancer. Annu Rev Biomed Eng. 2013;15(1):327–57.
Chen W, Giger ML, Lan L, et al. Computerized interpretation of breast MRI: investigation of enhancement-variance dynamics. Med Phys. 2004;31(5):1076–82.
Li H, Zhu Y, Burnside ES, et al. Quantitative MRI radiomics in the prediction of molecular classifications of breast cancer subtypes in the TCGA/TCIA data set. NPJ Breast Cancer. 2016;2:16012.
Antropova N, Abe H, Giger ML. Use of clinical MRI maximum intensity projections for improved breast lesion classification with deep CNNs. J Med Imaging (Bellingham). 2018;5(1):014503.
Gallego-Ortiz C, Martel AL. A graph-based lesion characterization and deep embedding approach for improved computer-aided diagnosis of nonmass breast MRI lesions. Med Image Anal. 2019;51:116–24.
Parekh VS, Jacobs MA. Integrated radiomic framework for breast cancer and tumor biology using advanced machine learning and multiparametric MRI. NPJ Breast Cancer. 2017;3:43.
Morris EA, Comstock CE, Lee CH, et al. ACR BI-RADS magnetic resonance imaging. In: D’Orsi CJ, Sickles EA, Mendelson EB, et al., editors. ACR BI-RADS atlas, breast imaging reporting and data system. Reston, Va: American College of Radiology; 2013. p. 125–43.
Gilhuijs KG, Giger ML, Bick U. Automated analysis of breast lesions in three dimensions using dynamic magnetic resonance imaging. Med Phys. 1998;25(9):1647–54.
Chen W, Giger ML, Bick U, et al. Automatic identification and classification of characteristic kinetic curves of breast lesions on DCE-MRI. Med Phys. 2006;33(8):2878–87.
Chen W, Giger ML, Li H, et al. Volumetric texture analysis of breast lesions on contrast-enhanced magnetic resonance images. Magn Reson Med. 2007;58(3):562–71.
Bhooshan N, Giger ML, Jansen SA, et al. Cancerous breast lesions on dynamic contrast-enhanced MR images: computerized characterization for image-based prognostic markers. Radiology. 2010;254(3):680–90.
Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: International joint conference on artificial intelligence. Los Altos: Morgan Kaufmann Publishers Inc; 1995. p. 1137–43.
Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
Metz CE. Some practical issues of experimental design and data analysis in radiological ROC studies. Investig Radiol. 1989;24(3):234–45.
Metz CE, Herman BA, Roe CA. Statistical comparison of two ROC-curve estimates obtained from partially-paired datasets. Med Decis Mak. 1998;18(1):110–21.
Altman DG, Bland JM. Diagnostic tests. 1: sensitivity and specificity. BMJ. 1994;308(6943):1552.
Reig B, Heacock L, Geras KJ, Moy L. Machine learning in breast MRI. J Magn Reson Imaging. 2019.
Sheth D, Giger ML. Artificial intelligence in the interpretation of breast cancer on MRI. J Magn Reson Imaging. 2019.
Truhn D, Schrading S, Haarburger C, Schneider H, Merhof D, Kuhl C. Radiomic versus convolutional neural networks analysis for classification of contrast-enhancing lesions at multiparametric breast MRI. Radiology. 2019;290(2):290–7.
Shimauchi A, Giger ML, Bhooshan N, et al. Evaluation of clinical breast MR imaging performed with prototype computer-aided diagnosis breast MR imaging workstation: reader study. Radiology. 2012;23(1):696–704.
Kuhl CK, Schild HH. Dynamic image interpretation of MRI of the breast. J Magn Reson Imaging. 2000;12(6):965–74.
Schultz CL, Alfidi RJ, Nelson AD, et al. The effect of motion on two-dimensional Fourier transformation magnetic resonance images. Radiology. 1984;152(1):117–21.
Ehman RL, McNamara MT, Brasch RC, et al. Influence of physiologic motion on the appearance of tissue in MR images. Radiology. 1986;159(3):777–82.
This work was partially funded by National Institutes of Health (NIH) CA195564, and National Natural Science Foundation of China (81801781). MLG is a stockholder in R2 Technology/Hologic and was a cofounder, equity holder, and scientific advisor in Quantitative Insights, which is now Qlarity Imaging. MLG receives royalties from Hologic, GE Medical Systems, MEDIAN Technologies, Riverain Medical, Mitsubishi, and Toshiba. It is the University of Chicago Conflict of Interest Policy that investigators disclose publicly actual or potential significant financial interest that would reasonably appear to be directly and significantly affected by the research activities.
Ethics approval and consent to participate
Our study initially involved 4704 patients who presented clinically for breast DCE-MRI examinations as recorded in the Department of Breast Imaging of the Tianjin Medical University Cancer Institute and Hospital. As this was an anonymized machine learning study based on retrospectively collected images and data, informed consent was waived and the study was deemed exempt by the IRB at the University of Chicago.
Consent for publication
We consent for publication.
MLG is a stockholder in R2 Technology/Hologic and a cofounder, equity holder, and scientific advisor in Quantitative Insights. MLG receives royalties from Hologic, GE Medical Systems, MEDIAN Technologies, Riverain Medical, Mitsubishi, and Toshiba. It is the University of Chicago Conflict of Interest Policy that investigators disclose publicly actual or potential significant financial interest that would reasonably appear to be directly and significantly affected by the research activities.
Keywords
- Computer-aided diagnosis
- Breast cancer
- Quantitative MRI
- Machine learning
- Artificial intelligence (AI)
- Independent statistical testing