Independent validation of machine learning in diagnosing breast cancer on magnetic resonance imaging within a single institution

Background: As artificial intelligence methods for the diagnosis of disease advance, we aimed to evaluate machine learning in the predictive task of distinguishing between malignant and benign breast lesions on an independent clinical magnetic resonance imaging (MRI) dataset within a single institution for subsequent use as a computer aid for radiologists. Methods: Computer analysis was conducted on consecutive dynamic contrast-enhanced MRI (DCE-MRI) studies from 1483 breast cancer and 496 benign patients who underwent MRI examinations between February 2015 and October 2017; the age ranges of the cancer and benign patients were 19 to 77 and 16 to 76 years, respectively. Cases were separated into a training dataset (years 2015 and 2016; 1444 cases) and an independent testing dataset (year 2017; 535 cases) based solely on MRI examination date. After radiologist indication of the lesion, the computer automatically segmented the lesion and extracted radiomic features, which were subsequently merged with a support-vector machine (SVM) to yield a lesion signature. Area under the receiver operating characteristic (ROC) curve (AUC) with 95% confidence intervals (CI) served as the primary figure of merit in the statistical evaluation for this clinical classification task. Results: In the task of distinguishing malignant and benign breast lesions on DCE-MRI, the trained predictive model yielded an AUC value of 0.89 (95% CI: 0.858, 0.922) on the independent image set. AUC values of 0.88 (95% CI: 0.845, 0.926) and 0.90 (95% CI: 0.837, 0.940) were obtained for mass lesions only and non-mass lesions only, respectively. Compared with actual clinical management decisions, the predictive model achieved 99.5% sensitivity with 9.6% fewer biopsies of benign lesions. Conclusion: On an independent, consecutive clinical dataset within a single institution, a trained machine learning system yielded promising performance in distinguishing between malignant and benign breast lesions.


Background
Breast cancer is the most common cancer and the second leading cause of cancer death in women in western countries [1]. In Chinese women, breast cancer is the most common cancer diagnosed, and it alone is expected to account for 15% of all new cancers in women [2]. Dynamic contrast-enhanced (DCE) magnetic resonance imaging (MRI) of the breast is being used increasingly for a variety of clinical purposes, including screening of women at high risk for developing breast cancer, evaluating the extent of malignant disease, and post-treatment evaluation [3][4][5]. DCE-MRI has emerged as a modality that is complementary to mammography and ultrasonography because of the additional three-dimensional spatial and temporal information about the lesion that it yields.
While there is diagnostic value of DCE-MRI characterization in the differentiation of malignant from benign lesions [6], the MRI assessment of breast cancer cases may be hindered by inter-observer and intra-observer variations, labor-intensive interpretation methods, and limited clinical interpretation guidelines [7,8]. To aid radiologists in diagnostic classification, various investigators are developing computerized image analysis methods for characterization, i.e., computer-aided diagnosis (CADx)/radiomics [9][10][11][12][13][14][15]. The purpose of this study was to evaluate the potential of quantitative MRI radiomics and machine learning in the task of distinguishing between malignant and benign breast lesions on an independent, consecutive clinical dataset within a single institution for ultimate use as a computer aid to radiologists in the workup of breast lesions. To our knowledge, our study is the largest such independent study in the field.

Breast DCE-MRI database
Our study initially involved 4704 patients presenting for breast DCE-MRI examinations as recorded in the Department of Breast Imaging of the Tianjin Medical University Cancer Institute and Hospital. As this study was a retrospective and anonymized machine learning study, informed consent was waived and the study was deemed exempt. Patients' MRIs and clinical data were collected consecutively for our study within the years 2015-2017. Patients were excluded if they had previous surgical excision, systemic hormone therapy, or chemotherapy, or if they lacked final pathology results. A total of 1979 patients were ultimately included in our study (Fig. 1).
We conducted a retrospective review of the breast MRI images from the 1483 histopathology-proven breast cancer patients and the 496 histopathology-proven benign patients who had undergone diagnostic breast MRI examinations between February 2015 and October 2017. All histopathology was based on surgical specimens. The age range of the cancer patients was 19 to 77 years, with a mean of 48.1 years, a standard deviation of 9.9 years, and a median of 47 years. The age range of the benign patients was 16 to 76 years, with a mean of 42.1 years and a standard deviation of 9.8 years. The breast MRI database consisted of 1494 lesions from the 1483 cancer patients, including 8 bilateral breast cancer patients and 3 bifocal breast cancer patients, and 496 primary lesions from 496 benign patients. MR images had been obtained with a 3 T GE system using a dedicated 8-channel phased-array breast coil (Discovery 750, GE Medical Systems, Milwaukee, WI). Sagittal dynamic contrast-enhanced MRI (DCE-MRI) was obtained with the volume imaging for breast assessment (VIBRANT) bilateral breast imaging technique, with TR = 6.1 ms, TE = 2.9 ms, flip angle = 15°, matrix size = 256 × 128, field of view = 26 cm × 26 cm, NEX = 1, slice thickness = 1.8 mm. The temporal resolution for each dynamic acquisition was 90 s. Before injection of the contrast agent, serial mask images were obtained. Subsequently, the contrast agent (Gd-DTPA, 0.1 mmol/kg body weight, flow rate 2.0 ml/s) was injected using an automatic MR-compatible power injector, followed by flushing with the same total dose of saline solution. Dynamic MRI acquisitions were started immediately after the injection. The acquisition was repeated five times, and each phase took 90 s.
To avoid bias in case selection and to mimic a development-then-clinical-use scenario, our database was divided into a training dataset and a testing dataset based solely on the date of the MRI examinations. The training dataset included the breast MRIs acquired from February 2015 through December 2016, and the testing dataset included the breast MRIs acquired from January 2017 through October 2017. Note that no patient appeared in both the training and testing sets.
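The date-based partition can be illustrated with a minimal sketch. The record layout (`patient_id`, `exam_date`), the helper name `split_by_exam_date`, and the four example cases are hypothetical and serve only to show the rule; they are not the study's data or software.

```python
from datetime import date


def split_by_exam_date(cases, first_test_date):
    """Assign cases to training or testing using only the examination date."""
    train = [c for c in cases if c["exam_date"] < first_test_date]
    test = [c for c in cases if c["exam_date"] >= first_test_date]
    # Check that no patient contributes to both sets.
    overlap = {c["patient_id"] for c in train} & {c["patient_id"] for c in test}
    if overlap:
        raise ValueError(f"patients in both sets: {sorted(overlap)}")
    return train, test


# Hypothetical cases for illustration only.
cases = [
    {"patient_id": "P001", "exam_date": date(2015, 2, 10)},
    {"patient_id": "P002", "exam_date": date(2016, 12, 30)},
    {"patient_id": "P003", "exam_date": date(2017, 1, 3)},
    {"patient_id": "P004", "exam_date": date(2017, 10, 20)},
]
train_cases, test_cases = split_by_exam_date(cases, date(2017, 1, 1))
```

Because the cutoff depends only on the calendar, the split involves no case-by-case selection.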
The clinicopathological characteristics of the breast cancer and benign patients of the two datasets are shown in Table 1, including the BI-RADS classifications. Invasive ductal carcinomas constituted the majority of malignant lesions, whereas fibroadenomas were the most common benign lesions (Fig. 2). During the patients' clinical workup, BI-RADS ratings had been recorded by the MRI radiologist using the Breast Imaging Reporting and Data System (BI-RADS) [16]. Note that all of the patients in this study underwent pathological examination, even those with MRI BI-RADS categories 1, 2, or 3 when their mammographic or sonographic findings were judged to be suspicious or highly suggestive of cancer, and the actual clinical decisions were made according to the multimodality medical imaging interpretations.

Computerized analysis of breast lesions on MRI images
We analyzed the DCE-MRIs using an existing quantitative radiomics machine learning workstation from the University of Chicago, which had been previously developed to characterize suspicious breast lesions on MRI as benign or malignant (Fig. 3) [11][17][18][19]. With the workstation, a breast lesion was first manually located on the MRI by the study radiologist (YJ), a breast radiologist with 5 years of experience in breast DCE-MRI. The computer then automatically conducted three-dimensional segmentation of the tumor and extraction of radiomic features from six categories: size, shape, morphology, enhancement texture, kinetics, and enhancement-variance kinetics.
The output from this established workstation was subsequently used by the machine learning predictive model to perform classification, that is, calculation of a malignancy score related to the likelihood of malignancy for each lesion.
During training of the predictive model on the training set, stepwise feature selection using linear discriminant analysis with a Wilks lambda cost function [20] was conducted in order to identify the subset of features that performed effectively in the classification of malignant and benign lesions [21]. A support-vector machine (SVM) classifier [22] was then trained on the selected features, yielding a lesion score related to the likelihood of malignancy.
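For orientation, the Wilks lambda criterion can be written in its standard textbook form. The text here does not spell out the formula, so the notation below (W, B, T) is an assumption introduced for illustration rather than the authors' own.

```latex
\Lambda = \frac{\det(\mathbf{W})}{\det(\mathbf{T})}, \qquad \mathbf{T} = \mathbf{W} + \mathbf{B}
```

Here W is the within-class scatter matrix of the candidate feature subset, B is the between-class scatter matrix, and T is the total scatter matrix. Λ lies between 0 and 1, and smaller values indicate better separation of the malignant and benign classes, so a stepwise procedure would typically add, at each step, the feature that lowers Λ the most.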
The diagnostic performance was evaluated using the trained predictive model on the independent test set for (a) all cases, both mass and non-mass lesions, (b) only mass lesions, and (c) only non-mass lesions. In order to assess the robustness of the trained system, only the one trained system was used in all three evaluations. Such evaluations were deemed to mimic the clinical situation where the mass/non-mass status of a lesion is unknown.

Performance evaluation and statistical analyses
Receiver operating characteristic (ROC) analysis was used to assess overall classification performance on the independent test set for the task of differentiating between malignant and benign lesions: (a) for all lesions; i.e., both mass and non-mass lesions, (b) only mass lesions, and (c) only non-mass lesions. Area under the ROC curve (AUC) served as the primary figure of merit in these tasks [23,24]. Secondary performance metrics calculated were sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) [25].
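As a minimal sketch of what the AUC measures, the empirical (nonparametric) AUC equals the fraction of malignant-benign lesion pairs in which the malignant lesion receives the higher score, with ties counted as one half. The function name and the scores below are hypothetical, and this estimator is not necessarily the one implemented in the ROC software used in the study.

```python
def auc_mann_whitney(malignant_scores, benign_scores):
    """Empirical AUC: share of (malignant, benign) pairs ranked correctly."""
    if not malignant_scores or not benign_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for m in malignant_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0   # malignant lesion scored higher: correct ordering
            elif m == b:
                wins += 0.5   # tie: counted as half
    return wins / (len(malignant_scores) * len(benign_scores))


# Hypothetical malignancy scores for illustration only.
malignant = [0.9, 0.8, 0.6, 0.4]
benign = [0.7, 0.3, 0.2]
auc = auc_mann_whitney(malignant, benign)  # 10 of 12 pairs ordered correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two classes.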
Note that BI-RADS had been used by the radiologist during the actual clinical interpretation, in which all available MR images were used. Although BI-RADS categories 1, 2 and 3 are considered benign and categories 4 and 5 are considered malignant, clinically, all lesions had been sent to biopsy. Therefore, the clinical performance could be characterized as having 100% sensitivity and 0% specificity. Thus, for comparison of the machine learning system to the actual clinical findings, the threshold value of the computer-generated malignancy score that resulted in 100% sensitivity on the training set was determined and subsequently applied to the testing set to obtain sensitivity, specificity, PPV, and NPV values.
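The threshold rule can be sketched as follows, assuming that higher scores indicate higher likelihood of malignancy and that a lesion is called positive when its score is at or above the threshold. The function names and all scores are hypothetical; the sketch only shows why a cutoff fixed on the training set need not retain 100% sensitivity on the testing set.

```python
def threshold_for_full_sensitivity(train_scores, train_labels):
    """Largest cutoff at which every malignant training lesion is still called positive."""
    malignant = [s for s, y in zip(train_scores, train_labels) if y == 1]
    if not malignant:
        raise ValueError("training set contains no malignant lesions")
    return min(malignant)


def recommend_biopsy(scores, threshold):
    """Call a lesion positive (biopsy recommended) when its score reaches the threshold."""
    return [s >= threshold for s in scores]


# Hypothetical scores; label 1 = malignant, 0 = benign.
train_scores = [0.95, 0.80, 0.35, 0.60, 0.20, 0.10]
train_labels = [1, 1, 1, 0, 0, 0]
threshold = threshold_for_full_sensitivity(train_scores, train_labels)

# The fixed threshold is then applied unchanged to unseen test lesions.
test_scores = [0.90, 0.30, 0.50, 0.15]
test_labels = [1, 1, 0, 0]
calls = recommend_biopsy(test_scores, threshold)
```

In this toy example the second test lesion is malignant but scores below the training-derived cutoff, so it would be missed.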
Resulting performance values at different threshold values were also calculated. PPV is calculated as the percentage of true positives over all lesions that had been classified as positive (i.e., malignant) by the trained predictive model, i.e., the probability that a case with a malignant computer output actually has cancer. NPV is the percentage of true negatives over all lesions that had been classified as negative (i.e., benign) by the trained predictive model, i.e., the likelihood that a case with a benign computer output actually is cancer free.
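In terms of confusion-matrix counts, these definitions can be summarized as below. The symbols TP, FP, TN, and FN (true/false positives and negatives at a given threshold) are notation introduced here for illustration.

```latex
\begin{aligned}
\text{Sensitivity} &= \frac{TP}{TP + FN}, &\qquad \text{Specificity} &= \frac{TN}{TN + FP},\\
\text{PPV} &= \frac{TP}{TP + FP}, &\qquad \text{NPV} &= \frac{TN}{TN + FN}.
\end{aligned}
```

Unlike sensitivity and specificity, PPV and NPV depend on the proportion of malignant lesions in the dataset being evaluated.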
All statistical analyses were performed using SPSS software (version 19.0, SPSS). The reported p-values were two-sided. A p-value less than 0.05 was set as the threshold for statistical significance given that a single performance evaluation was conducted. In addition, confidence intervals were calculated using ROC software.

Results
Radiomic features that had been selected and merged into the lesion signature during training included 2 shape phenotypes, 1 morphological phenotype, 3 enhancement texture phenotypes, and 4 kinetic curve assessments (Table 2). On the independent test dataset including both mass and non-mass lesions, the trained machine learning system yielded an AUC value of 0.89 (95% CI: 0.858, 0.922) in the task of distinguishing between malignant and benign lesions (Fig. 4).
For mass lesions in the test dataset, the trained system yielded an AUC value of 0.88 (95% CI: 0.845, 0.926). For non-mass lesions in the test dataset, the trained system yielded an AUC value of 0.90 (95% CI: 0.837, 0.940).
A summary of the sensitivity, specificity, PPV, and NPV values at different threshold values of the malignancy score in the test set is given in Table 3. At the threshold value that had yielded 100% sensitivity on the training set, the machine learning system on both mass and non-mass lesions demonstrated on the test set a higher PPV (i.e., 80.3%, 419/522) than the actual clinical decisions (78.7%, 421/535) (P > 0.05); that is, it suggested eleven fewer unnecessary benign biopsies (i.e., 9.6%, 11/114). However, it would erroneously not have recommended biopsy of two cancers (i.e., 0.5%, 2/421). These two cases were both invasive ductal carcinomas and were initially classified by the radiologist as BI-RADS 5, and thus would have gone to biopsy. Compared with non-mass lesions, the machine learning system demonstrated a lower sensitivity (P > 0.05) and higher specificity on mass lesions (P > 0.05).
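The fractions quoted in this paragraph can be cross-checked with simple arithmetic. The sketch below uses only the counts stated in the text (535 test lesions, 421 malignant, 522 called positive, 419 true positives); the variable names are introduced here for illustration.

```python
# Counts taken from the fractions reported in the text.
n_test = 535                 # all test lesions (all went to biopsy clinically)
n_malignant = 421            # malignant test lesions
n_called_positive = 522      # lesions the model would send to biopsy
tp = 419                     # malignant lesions among those called positive

n_benign = n_test - n_malignant   # benign test lesions
fn = n_malignant - tp             # cancers the model would not send to biopsy
fp = n_called_positive - tp       # benign lesions still sent to biopsy
tn = n_benign - fp                # benign biopsies avoided

ppv_model = tp / (tp + fp)            # PPV of the model at this threshold
ppv_clinical = n_malignant / n_test   # PPV when every lesion is biopsied
sensitivity = tp / n_malignant
benign_biopsies_avoided = tn / n_benign
cancers_missed = fn / n_malignant

for name, value in [
    ("PPV (model)", ppv_model),
    ("PPV (clinical)", ppv_clinical),
    ("Sensitivity", sensitivity),
    ("Benign biopsies avoided", benign_biopsies_avoided),
    ("Cancers missed", cancers_missed),
]:
    print(f"{name}: {100 * value:.1f}%")
```

The derived counts (2 missed cancers and 11 avoided benign biopsies out of 114 benign lesions) agree with the figures reported in the text.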
Some representative breast DCE-MRI studies from the independent consecutive test set as classified by the trained MRI machine learning system are presented in Fig. 5.

Discussion
Our results demonstrated that a computer workstation, initially developed with datasets from the US for automatic 3D lesion segmentation and radiomic feature extraction, has the potential to distinguish between malignant and benign breast lesions from Chinese populations. It is important to note that the statistical power of the current study was limited by the modest size of the database, even though, to our knowledge, this is the largest database of this type in this breast radiomics field. Our results demonstrate that machine learning analysis of DCE-MRI may potentially provide clinically-useful information to distinguish benign and malignant lesions in Chinese databases obtained from a single institution.
While we cannot compare directly to the reported results from others due to the use of different databases, we can note that the performance level of the computer workstation was similar to, and often higher than, other reported AUCs in this diagnostic task [26][27][28]. We also note that our performance was higher than that reported in Shimauchi et al. [29], which indicated that use of the computer aid resulted in a statistically significant improvement in radiologists' performances.
The American College of Radiology (ACR) BI-RADS MRI lexicon [16] is used worldwide for describing the morphologic and kinetic features of breast lesions. It allows for standardization of the terminology used in describing the findings and categorization of the study. Subsequent descriptors of other lesion features, such as shape, distribution, margins, and enhancement pattern, are also used; these differ depending on the type of enhancement, i.e., mass enhancement or non-mass enhancement. Most previous investigations have reported on masses and rarely on lesions presenting as non-mass enhancement, primarily because of the challenges in defining the lesion extent for computer-based analysis. In our study, in order to mimic clinical practice, a single, independently trained machine learning model was used for all lesion types (masses and non-mass enhancements), and our results demonstrated that the classification model was stable in the task of distinguishing between malignant and benign lesions for both mass and non-mass lesions.
Note that in clinical practice, radiologists' performance is based on multiparametric breast MR images, including DCE, T2-weighted, and diffusion-weighted images, as well as mammography and ultrasound. In our study, the computer only analyzed dynamic contrast-enhanced MR images to yield the predictive lesion signature. One would expect improved performance by using multiparametric breast MR images and multimodality medical images; thus, we will analyze those in the future.
The imaging technique used in our study involves acquisition of one pre-contrast and a series of post-contrast images of both breasts at a temporal resolution of roughly 90 s. This type of breast MRI acquisition sequence has the advantage of being able to provide both morphological and kinetic information from one MRI examination, and was representative of early dynamic MRI protocols [30]. In addition, our large clinical database came from a single institution, which avoids the problem that image acquisition protocols across breast MRIs might not be standardized. However, that also limits statements on the generalizability of the findings.
Patient motion during image acquisition may introduce inaccuracies in the computer-extracted kinetic features [31,32]. Cases with abrupt and large patient movements between dynamic series had been clinically treated as acquisition failure and were clinically excluded from our datasets. In our datasets, only patient respiratory motion was observed. The motion mostly resulted in additional blurring rather than actual displacement of image structure. However, it is important to note that image alignment of breast volumes at different time frames may improve the accuracy of our analyses.
There are some limitations of this study. First, this was a retrospective analysis of images from a single vendor acquired at a single institution, although the analysis was conducted with independent training and testing sets with unique patients. It will be critical to evaluate whether the present findings generalize to images from other vendors and to external data. A future multicenter study may help address this question. Second, all the cases had gone to biopsy; thus, we could not assess the system on benign lesions that were deemed benign solely by follow-up. Also, the study findings cannot be used to determine whether the radiologists' performances with the computer aid system are significantly improved in comparison with their performances without the computer aid, even though we analyzed the DCE-MRI diagnostic results by the clinical radiologists. A clinical observer study is necessary. We note that we previously demonstrated in an observer study that use of computer-aided diagnosis with MRI improves the performance of radiologists in the task of differentiating malignant and benign lesions [29].

Conclusions
In conclusion, we have validated a machine-learning radiomics method for DCE-MRI on an independent, consecutive patient test set, suggesting a potentially useful aid for radiologists in the task of distinguishing between malignant and benign breast lesions during diagnostic workup of breast lesions.