Radiological semantics discriminate clinically significant grade prostate cancer

Background Identification of imaging traits that discriminate clinically significant prostate cancer is challenging because of the multifocal nature of the disease. The difficulty in reaching consensus on Prostate Imaging Reporting and Data System (PI-RADS) scores, coupled with disagreements in interpreting multi-parametric magnetic resonance imaging (mpMRI), has increased variability in reporting findings and in evaluating the utility of this imaging modality for detecting clinically significant prostate cancer. This study assesses the ability of radiological traits (semantics) observed on mpMRI to discriminate clinically significant prostate cancer. Methods We obtained mpMRI studies from 103 prostate cancer patients with 167 targeted biopsies from a single institution. The study was approved by our Institutional Review Board (IRB) for retrospective analysis. The biopsy location had been identified and marked by a clinical radiologist for targeted biopsy based on the initial study interpretation. Using the target locations, two study radiologists independently re-evaluated the scans and scored 16 semantic traits on a point scale (up to 5 levels) based on the mpMRI images. The semantic traits describe the size, shape, and border characteristics of the prostate lesion, as well as the presence of disease around lymph nodes (lymphadenopathy). We built a linear classifier model on these semantic traits and related it to pathological outcome to identify clinically significant tumors (Gleason score ≥ 7). The discriminatory ability of the predictors was tested using cross-validation with random repeats, and ensemble values were reported. We then compared the performance of the semantic predictors with that of the PI-RADS predictors.
Results We found that several semantic features individually discriminated high-grade Gleason score (ADC intensity, homogeneity, early enhancement, T2 intensity, and extraprostatic extension); these univariate predictors had an average area under the receiver operating characteristic curve (AUROC) ranging from 0.54 to 0.68. Multivariable semantic predictors with three features (ADC intensity, T2 intensity, enhancement homogeneity) had an average AUROC of 0.7 [0.43, 0.94]. The PI-RADS-based predictor had an average AUROC of 0.6 [0.47, 0.75]. Conclusion We find that semantic traits are related to pathological findings, with relatively higher reproducibility between radiologists. Multivariable predictors formed from these traits show higher discriminatory ability than PI-RADS scores.


Background
Prostate cancer is the most prevalent cancer and the second leading cause of cancer deaths among men in the USA [1]. A reliable prostate cancer screening approach that provides accurate risk assessment for targeting, diagnosis and treatment remains a critical need. The European Randomized Study of Screening for Prostate Cancer (ERSPC) reported that screening based on prostate-specific antigen (PSA) reduced the rate of death from prostate cancer by 20% [2], but it is limited by low specificity, leading to overdiagnosis at an estimated rate of 23 to 42% [3].
Transrectal ultrasound (TRUS)-guided needle biopsy of the prostate is recommended for patients with elevated serum PSA levels, an abnormal-feeling prostate on digital rectal examination, or both. Given the heterogeneous and multifocal nature of prostate cancer, both indolent and clinically significant tumors may be found in the same gland. It is also known that tumors located in certain regions of the prostate are undersampled, so dominant or high-grade tumors in these regions can be missed. In addition, prostate cancer stage upgrading or downgrading frequently occurs following repeat biopsies [4]. More recently, ultrasound-MRI fusion guided needle biopsies have been shown to improve precision in identifying, targeting and sampling prostate lesions of interest [5,6].
Multi-parametric magnetic resonance imaging (mpMRI) has shown great promise as a non-invasive approach for prostate cancer detection [7], but the lack of uniform interpretation and reporting has led to high variability among radiologists [8].
It is nonetheless generally agreed that radiological appearance and the resulting interpretable descriptions are related to cancer progression [9,10]. Radiologist training in the performance, interpretation and reporting of prostate imaging studies plays a major role in improving prostate cancer detection [11]. Various groups have developed radiology-based reporting scales for prostate cancer [12][13][14]. For example, a Likert reporting scale has been recommended by the Prostate Diagnostic Imaging Consensus Meeting (PREDICT) panel; it quantifies the radiologist's opinion on a simplified 5-point scale [15].
The European Society of Urogenital Radiology (ESUR) first proposed the Prostate Imaging Reporting and Data System (PI-RADS) as a way to standardize the reporting of imaging consensus criteria. PI-RADS was later adopted by the American College of Radiology (ACR), and jointly proposed changes were formulated in a revision of the criteria [16,17]. Findings on mpMRI are assessed on a 5-point categorical scale, based on the expert's observational probability that a combination of findings on T2-weighted (T2WI) sequences, diffusion-weighted MRI (DWI) and dynamic contrast-enhanced MRI (DCE-MRI) correlates with the presence of clinically significant prostate cancer at the specific location. The overall PI-RADS score considers a combination of multiple features obtained for the modality/sequences, such as nodule shape, margin and intensity. The PI-RADS assessment categories range from 1 to 5, with 5 being most likely to represent clinically significant prostate cancer. Previous studies [18,19] have shown moderate inter-reader agreement with PI-RADS. The major pitfall in the clinical use of PI-RADS has been the degree of subjectivity in radiologists' study interpretations, leading to large variability in reported findings and a suboptimal ability to characterize the nature and/or degree of malignancy in a lesion of interest [20].
Locating clinically significant cancers and discriminating them from insignificant ones remains a challenge in prostate cancer screening. Current validation is primarily based on the pathologic Gleason score. Patients with a Gleason score ≥ 7 ((3 + 4) or (4 + 3)) are considered to have clinically significant cancer, with aggressiveness increasing as the score rises to 8, 9 and 10 [20]. Recently, there have been numerous efforts to develop quantitative metrics for medical imaging to identify and describe abnormalities in radiological studies [21][22][23][24].
In this study, we propose to describe radiological traits independently for each mpMRI sequence on a numerical point scale. These traits were then taken in combination and related to pathological outcome of cancer aggressiveness (Gleason score) using a linear classifier approach. These combinations of traits were rigorously evaluated in a cross validation setting with multiple repeats. The semantic-based feature model was then compared to PI-RADS based predictors at different cutoffs to find clinically significant grade cancer.

Patient cohort
The study was approved by the Institutional Review Board (IRB) at the University of South Florida, and patient informed consent was waived for the retrospective analysis. All patients were referred to the Radiology department for multiparametric MRI and targeted prostate lesion biopsy planning using the UroNav Ultrasound-MR Fusion Biopsy System (Invivo Corporation) at the H. Lee Moffitt Cancer Center. The patients were scanned using a Siemens 1.5 T MRI scanner with endorectal coil placement (Table 1). The inclusion criteria were as follows: a) availability of at least one targeted biopsy by the UroNav fusion system identified on the original interpretation, b) availability of mpMRI sequences (T2WI, DCE, DWI/ADC) suitable for PI-RADS (version 2) scoring, and c) no image-related limitations, e.g., post-biopsy hemorrhage or motion artifacts.
The exclusion criteria were prior localized treatment such as external beam radiation therapy, brachytherapy or cryoablation. Data curation excluded 24 patients from the initial list, leaving 103 patients (167 biopsies) in the study cohort. Of these, 90 biopsies (65 unique patients) had Gleason scores ≥ 6: 33 had a Gleason score of 6 and 57 had Gleason scores ≥ 7. The remaining 77 biopsies were negative for cancer (benign). Data extracted included age, race, smoking status, other cancer history, family cancer history and PSA level; a board-certified pathologist evaluated the cancer status and Gleason scores of the slides. Multi-parametric MRI scans (T2WI, DCE, DWI, ADC) were downloaded from the picture archiving and communication system (PACS). Semantics were scored using offline DICOM (Digital Imaging and Communications in Medicine) viewers with prostate-specific window settings.
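The cohort counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers given in the text; the size of the initial patient list (final cohort plus exclusions) and the prevalence figures are derived here and are not reported by the authors.

```python
# Counts as stated in the cohort description.
excluded_patients = 24
final_patients = 103
total_biopsies = 167

gs6_biopsies = 33        # Gleason score = 6 (indolent)
gs7plus_biopsies = 57    # Gleason score >= 7 (clinically significant)
benign_biopsies = 77     # negative for cancer

# Derived quantities (not reported in the text).
cancer_biopsies = gs6_biopsies + gs7plus_biopsies
initial_patients = final_patients + excluded_patients

# Internal consistency of the reported counts.
assert cancer_biopsies == 90
assert cancer_biopsies + benign_biopsies == total_biopsies

print(f"initial patient list: {initial_patients}")
print(f"GS >= 7 share of all biopsies: {gs7plus_biopsies / total_biopsies:.3f}")
print(f"GS >= 7 share of cancer biopsies: {gs7plus_biopsies / cancer_biopsies:.3f}")
```

The class shares matter later when reading PPV and NPV, since both depend on how common the positive class is in the evaluated set.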

Radiologist marked biopsy targets
The clinical radiologist marked the most aggressive target locations on the mpMRI scans, and the markings were finalized by consensus reading with a fellow radiologist on duty. The markings were made using a commercial prostate biopsy system (UroNav/DynaCAD, Invivo Inc., FL) that integrates the software modules with the biopsy hardware, including a real-time transrectal ultrasound (TRUS) location system. Patient preparation and endorectal coil placement followed the standard procedure. Targeted core biopsies were obtained using automatic spring-loaded biopsy needles. Additionally, standard extended-pattern 12-core biopsies (sextants) were obtained in accordance with the NCCN guidelines. The core targets were separately labeled and processed.

Semantics and PI-RADS-version2 scoring
Semantic descriptors were derived from lesions targeted for UroNav fusion biopsy. The semantics were scored for each modality (T2WI, DCE, DWI, ADC) independently on a point scale (1 to 5). A total of 24 semantic features were developed, of which 16 were used in this study. Specifically, these features described the location, size, shape, margin, intensity and extra-prostatic extension of the lesion, the organ volume, and the presence of either benign prostatic hyperplasia or lymphadenopathy (detailed explanation in Table 2). Figure 1 shows example patient MRI scans with semantic scores. Figure 1a shows scores for nodule shape characteristics: an oval nodule was scored as 1, an irregular nodule as 2, and an amorphous nodule as 3. Figure 1b shows examples of semantic scores on ADC images: the nodule at upper left was hyperintense, upper right was iso-intense, lower left was hypointense, and lower right was 'marked hypointense'. Figure 1c shows contrast-enhanced images: the first panel (left) shows no early enhancement and received a score of 1, followed by light enhancement (score = 2), moderate enhancement (score = 3), and [...]
Feature definitions (table fragment): The intensity of the nodule on T2WI compared with the intensity of the normal peripheral zone. The ADC value of the nodule compared with that of the normal peripheral zone (ADC nodule / ADC peri). T2 intensity: the T2 signal intensity of the lesion was compared with surrounding tissues and defined as "marked hypointensity" if the lesion showed signal intensity similar to that of the back muscles, "hypointensity" if the lesion was brighter than the back muscles but darker than adjacent prostate tissue, "iso" if the lesion was similar to adjacent prostate tissue, and "hyperintensity" if the lesion was brighter than the adjacent prostate tissue.

Statistical analysis
Agreement between the radiologists (Q.L. and J.C.) was measured by the (weighted) kappa index [25] for binary or ordinal variables. The kappa value was interpreted as follows: < 0, less than chance agreement; 0.01 to 0.2, slight agreement; 0.21 to 0.4, fair agreement; 0.41 to 0.6, moderate agreement; 0.61 to 0.8, substantial agreement; > 0.8, almost perfect agreement [26]. In our analysis, the radiologists scored 16 semantic features. Of these, 4 features had kappa ≥ 0.7, 4 features had kappa ≥ 0.6 and < 0.7, 4 features had kappa ≥ 0.5 and < 0.6, and for 4 features kappa could not be computed because of the limited range of the semantic characteristics (see Table 3). We built a linear classifier model to find discriminant features that distinguish clinically significant cancers from indolent cases (GS ≥ 7 vs GS ≤ 6), and indolent cases from benign (GS = 6 vs benign). We selected the best 3 semantic features by evaluating all possible feature combinations and ranking them by Youden's index [27,28] to select highly predictive discriminators. The statistics were estimated using cross-validation (hold-out, 10-fold), randomly repeated over 100 times [29]. We also computed the area under the receiver operating characteristic curve (AUROC) along with sensitivity, specificity, positive predictive value, and negative predictive value for the multivariable feature combinations of interest. The reported statistics are the ensemble values obtained over the random repeats, together with their 95% confidence limits.
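The selection and resampling scheme described above can be sketched as scaffolding code. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, plain 10-fold splitting is assumed for the "hold-out, 10-fold" description, and the percentile-based 95% limits are an assumed way of summarizing the repeats, since the text does not say how the limits were obtained. The classifier itself is left out.

```python
import itertools
import random
import statistics


def youden_index(y_true, y_pred):
    """Youden's J = sensitivity + specificity - 1 (labels: 1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity + specificity - 1.0


def feature_triplets(n_features):
    """All unordered 3-feature combinations, as tuples of feature indices."""
    return list(itertools.combinations(range(n_features), 3))


def repeated_kfold(n_samples, k=10, repeats=100, seed=0):
    """Yield (train_idx, test_idx) pairs from `repeats` reshuffled k-fold partitions."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for held_out in range(k):
            test_idx = folds[held_out]
            train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            yield train_idx, test_idx


def ensemble_summary(values):
    """Mean plus empirical 2.5th and 97.5th percentile limits (assumed CI method)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return statistics.mean(ordered), ordered[last * 25 // 1000], ordered[last * 975 // 1000]


triplets = feature_triplets(16)     # 16 scored semantic features
splits = list(repeated_kfold(90))   # 57 + 33 biopsies in the GS >= 7 vs GS <= 6 task
print(len(triplets), "candidate triplets;", len(splits), "train/test splits")
```

In a full pipeline, each triplet would be fitted on every training split, scored on the matching test split, and ranked by the ensemble value of Youden's index.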

Results
The final cohort used for the study had 167 biopsies (103 patients), with 57 biopsies considered clinically significant tumors (GS ≥ 7), 33 biopsies that were indolent tumors (GS ≤ 6), and 77 biopsies that were benign. Patient age ranged from 46 to 75 years at diagnosis. PSA levels ranged from 0.8 to 44.7 ng/ml. Figure 2 shows the distribution of semantic values in a box plot. Among the semantic features, the kappa scores for capsule status (presence or absence), homogeneity, shape, T2 intensity and ADC intensity showed moderate agreement. Enhancement degree, extra-prostatic extension, enhancement homogeneity, and border showed substantial agreement between readers, while early enhancement and cyst (presence or absence) showed almost perfect agreement. The kappa scores for four features (seminal vesicle involvement, distal sphincter involvement, bladder neck involvement and lymphadenopathy) could not be computed because of a lack of examples.
We found ADC intensity, homogeneity, and early enhancement to be the univariate semantic predictors with the highest average AUROC (0.57 to 0.68). The positive predictive value (PPV) and sensitivity for these markers were relatively high, with average values of 0.62 to 0.69 and 0.82 to 0.96, respectively, for finding clinically significant prostate cancers (GS ≥ 7).
When these features were combined, the combination of ADC intensity, T2 intensity, and enhancement homogeneity showed the highest average AUROC of 0.70, with average sensitivity and PPV for detecting aggressive cancer of 0.79 and 0.72, respectively. The next best feature combination was based on early ADC intensity, border, and enhancement homogeneity, which had an average AUROC of 0.71, with average sensitivity and PPV for detecting aggressive cancer of 0.82 and 0.68, respectively. For comparison, we characterized predictors that discriminate aggressive from indolent prostate cancers using overall PI-RADS (version 2) scores. We repeated the predictive analysis with different cutoffs on the PI-RADS scores to discriminate aggressive cancer (i.e., PI-RADS ≥ 5, PI-RADS ≥ 4, PI-RADS ≥ 3). We found that a moderate cutoff (PI-RADS ≥ 4) showed the highest AUROC of 0.6, with sensitivity and PPV of 0.98 and 0.69, respectively. Table 4 shows discriminant semantic features with their predictive statistics. We also found that the receiver operating characteristic of the top semantic predictors (ADC intensity, T2 intensity, enhancement homogeneity) was significantly different from that of the PI-RADS ≥ 3 (p = 0.0022) and PI-RADS ≥ 5 (p = 0.0048) based predictors of malignancy defined by Gleason score (GS ≥ 7), whereas the difference from the PI-RADS ≥ 4 predictor was not significant (p = 0.0724). Significance was computed using the nonparametric DeLong statistic [30].
We then built models to find semantic predictors that differentiate indolent (GS = 6) from benign cases. We found extra-prostatic extension, early enhancement, and ADC intensity to be univariate discriminators, with an average AUROC of 0.58 to 0.61. When these features were combined, the combination of ADC intensity, early enhancement and extra-prostatic extension showed the highest average AUROC of 0.63, with an average sensitivity and PPV of 0.16 and 0.51, respectively. The next feature combination of homogeneity, early enhancement degree, and extra-prostatic extension had an [...] Fig. 3. Adding semantics to PI-RADS increases the average AUC from 0.63 to 0.64 for GS = 6 vs benign and lowers it from 0.7 to 0.66 for GS ≥ 7 vs GS = 6.
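To make the agreement analysis concrete, the following sketch computes Cohen's weighted kappa for two readers and maps it onto the interpretation bands quoted in the statistical analysis. It is an illustration under stated assumptions: linear weights are assumed (the text does not specify the weighting), the function names are hypothetical, and the reader scores are invented toy values. The function returns `None` when kappa is undefined, which is what happens when a trait shows no variation across lesions.

```python
def weighted_kappa(rater1, rater2, categories, weights="linear"):
    """Cohen's weighted kappa for two raters over ordered `categories`.

    Returns None when kappa is undefined, e.g. when both raters used a
    single category for every lesion (no expected disagreement).
    """
    k = len(categories)
    n = len(rater1)
    if k < 2 or n == 0:
        return None
    index = {c: i for i, c in enumerate(categories)}
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(rater1, rater2):
        counts[index[a]][index[b]] += 1
    row = [sum(counts[i]) for i in range(k)]
    col = [sum(counts[i][j] for i in range(k)) for j in range(k)]

    def penalty(i, j):
        # Disagreement weight: 0 on the diagonal, 1 for the most distant categories.
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    observed = sum(penalty(i, j) * counts[i][j] for i in range(k) for j in range(k)) / n
    expected = sum(penalty(i, j) * row[i] * col[j] for i in range(k) for j in range(k)) / (n * n)
    if expected == 0:
        return None
    return 1.0 - observed / expected


def interpret_kappa(kappa):
    """Map kappa to the bands quoted in the text (values in 0 to 0.2 are treated as slight)."""
    if kappa is None:
        return "not computable"
    if kappa < 0:
        return "less than chance agreement"
    for upper, label in [(0.2, "slight"), (0.4, "fair"), (0.6, "moderate"), (0.8, "substantial")]:
        if kappa <= upper:
            return label + " agreement"
    return "almost perfect agreement"


# Hypothetical scores from two readers for one trait on a 1-to-5 scale.
reader_a = [1, 2, 2, 3, 4, 5, 3, 2]
reader_b = [1, 2, 3, 3, 4, 4, 3, 1]
kappa = weighted_kappa(reader_a, reader_b, categories=[1, 2, 3, 4, 5])
print(interpret_kappa(kappa))
```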
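For a single ordinal trait, the AUROC has a direct rank interpretation: it is the probability that a randomly chosen positive lesion receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below implements that definition. The function name and the toy scores are hypothetical, and the score direction (higher means more suspicious) is an assumption made for the example.

```python
def auroc(scores, labels):
    """AUROC as P(score_pos > score_neg) + 0.5 * P(tie) over all positive/negative pairs."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical ordinal trait scores; label 1 = GS >= 7, label 0 = GS <= 6.
trait_scores = [3, 2, 2, 1]
gs7_labels = [1, 1, 0, 0]
print(f"univariate AUROC: {auroc(trait_scores, gs7_labels):.3f}")
```

Because ordinal traits take only a few distinct values, ties are frequent, which is one reason univariate AUROCs for such traits tend to stay modest.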
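The PI-RADS cutoff comparison amounts to turning the 1-to-5 category into a binary call at each threshold and tabulating the confusion matrix. The sketch below shows that step on invented toy data; the function names and the scores are hypothetical and do not reproduce the study's results.

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and NPV from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    def ratio(num, den):
        # NaN marks a metric that is undefined for this threshold.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def metrics_at_cutoff(pirads_scores, y_true, cutoff):
    """Call a lesion positive when its PI-RADS category is >= cutoff."""
    predictions = [1 if s >= cutoff else 0 for s in pirads_scores]
    return binary_metrics(y_true, predictions)


# Hypothetical PI-RADS categories and GS >= 7 labels for eight lesions.
pirads = [5, 4, 4, 3, 2, 5, 3, 1]
gs7 = [1, 1, 0, 0, 0, 1, 1, 0]

for cutoff in (3, 4, 5):
    m = metrics_at_cutoff(pirads, gs7, cutoff)
    print(cutoff, {name: round(value, 2) for name, value in m.items()})
```

Lowering the cutoff trades specificity for sensitivity, which is why a permissive threshold can show very high sensitivity alongside a modest AUROC.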

Discussion
In this study, we propose a radiological semantic scheme that captures traits on a point scale independently on the different modalities of mpMRI. We used the semantic descriptors in combination to build linear discriminant functions that identify clinically significant prostate cancers. These semantic predictors were then compared with PI-RADS v2 based discriminators. The American College of Radiology has adopted the PI-RADS (version 2) system to report standardized prostate cancer findings on mpMRI [17,31]. We found that semantics demonstrated better predictability of pathological outcome than PI-RADS-based predictors. We believe semantic traits may help reduce the variability in image interpretation between radiologists, because each observational score is made for one trait, independently within a modality (T2w, ADC, DCE). Semantic scoring is specifically defined to obtain an expert opinion about a radiological trait, such as the presence or absence of a trait, or the multi-level appearance of a trait in the scan. A recent report compared the PI-RADS version 1 and version 2 scoring schemes and found a PPV of 75% for both versions for finding clinically significant cancers; the negative predictive value (NPV) was 46% for PI-RADS version 1 and 43% for PI-RADS version 2 in a cohort of 66 patients [32].
Tissue cell density has been well characterized and is reflective of molecular movement; in prostate carcinoma, it is characterized by reduced ADC values [33]. Further, the ADC value in the prostate has been shown to be inversely related to Gleason score [34,35]. It is useful in differentiating carcinoma from benign hyperplasia [36] and high-risk patients from those at low and intermediate risk [37], and is helpful for transitional zone (TZ) lesion detection [38]. We also find that ADC is a critical marker for identifying clinically significant cancer and is capable of distinguishing indolent from benign cases. Because of the interpretational variability of dynamic contrast-enhanced images, they do not contribute to the overall clinical assessment of prostate lesions, especially in PI-RADS v2 (with the exception of a PI-RADS score of 3).
In our study, however, early enhancement and enhancement degree were effective predictors; cancerous nodules usually present early enhancement and a higher enhancement degree. When combined with ADC intensity and extra-prostatic extension, they form better predictors of clinically significant cancers.
Clinically, any non-binary point scale can cause some unnecessary confusion for practitioners, eventually leading to variability in diagnosis that affects patient care [39]. In our study, we used discriminant functions and formed different multivariable models by agnostically combining traits across modalities, with each trait equally likely to be part of the predictor model. We limited the size of the predictors to three semantic traits because of the limited sample size. This approach allows information to be combined across modalities to find clinically significant prostate cancers.
We find that the PI-RADS-based predictor with a cutoff of ≥ 4 showed slightly lower discriminatory ability for finding clinically significant cancers (AUROC of 0.6, sensitivity and PPV of 0.98 and 0.68, respectively) compared with its ability to differentiate indolent from benign (AUROC of 0.62, PPV of 0.38 and sensitivity of 0.77). Semantics-based predictors showed better performance, with AUROCs of 0.70 and 0.63 for discriminating clinically significant versus indolent tumors and indolent tumors versus benign, respectively (see Tables 4 and 5). We also find that adding semantics to PI-RADS (overall score) improves predictor performance, both in discriminating clinically significant lesions (GS ≥ 7) from indolent (GS = 6) and benign from indolent (GS = 6).
There is a high level of subjectivity among radiologists in scoring PI-RADS (v2) [40]; in a recent review, these shortcomings were categorized into clinical indications and technical/physiological artifacts [41,42]. The clinical consequence for disease identification has been an impact on patient care, through over-detection in some cases and missed diagnosis of aggressive cancer in others. We believe that evaluating semantic traits on mpMRI images will reduce subjectivity in tumor detection.
In our study, trained radiologists were asked to describe observed traits on a point scale following the semantic descriptors, and these scores were then related to pathological outcome. The use of semantic discriminant functions may provide an alternative real-valued risk score to help the oncologist decide on an appropriate management plan for the patient. We understand that there is a further need to train such predictors on a larger cohort to obtain balanced coefficients based on the radiological traits. We believe semantic predictors can discriminate clinically significant cancers and provide valuable risk assessment to aid clinical decisions, both in targeting lesions and in planning treatment for the disease.
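A three-trait linear discriminant of the kind described here can be illustrated with Fisher's linear discriminant analysis. The paper specifies only a "linear classifier", so the choice of Fisher's LDA with a pooled covariance and equal class priors is an assumption, as are the function names and the toy trait scores below.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature rows."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def pooled_covariance(class0, class1):
    """Within-class covariance pooled over both classes."""
    d = len(class0[0])
    scatter = [[0.0] * d for _ in range(d)]
    for rows in (class0, class1):
        m = mean_vector(rows)
        for r in rows:
            for i in range(d):
                for j in range(d):
                    scatter[i][j] += (r[i] - m[i]) * (r[j] - m[j])
    dof = len(class0) + len(class1) - 2
    return [[v / dof for v in row] for row in scatter]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            raise ValueError("singular covariance; traits may lack variation")
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_lda(class0, class1):
    """Return (weights, bias) so that score > 0 favors class1 (equal priors assumed)."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    w = solve(pooled_covariance(class0, class1), [a - b for a, b in zip(m1, m0)])
    bias = -0.5 * sum(wi * (a + b) for wi, a, b in zip(w, m0, m1))
    return w, bias


def risk_score(w, bias, x):
    """Real-valued discriminant score for one lesion."""
    return sum(wi * xi for wi, xi in zip(w, x)) + bias


# Hypothetical scores for three traits; class1 stands for GS >= 7 lesions.
indolent = [[1, 2, 1], [2, 1, 1], [1, 1, 2], [2, 2, 2]]
significant = [[3, 4, 2], [4, 3, 2], [3, 3, 3], [4, 4, 3]]
weights, bias = fit_lda(indolent, significant)
print(risk_score(weights, bias, [4, 4, 3]) > 0)
```

With only three traits the covariance matrix stays small and easy to invert, which fits the decision to cap the predictor size for a limited sample.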

Limitations
We assembled 103 patients (167 biopsies); all of the data were obtained at a single institution with a diverse cohort and were used to train the model in a cross-validation setting. The data at our center were obtained from a couple of clinical locations, and biopsies were carried out by multiple urologists. Data from multiple institutions would improve the diversity of the cohort and would offer a better chance of obtaining a stable model with independent test and validation cohorts. We acknowledge the absence of such a dataset. We used the lesions on mpMRI scans to make the semantic assessment, and pathological validation was obtained by TRUS/MRI biopsy. It is possible that the biopsy core missed the leading tumor, resulting in a lower Gleason grade and consequently reduced classifier performance.

Conclusions
The proposed radiological semantic schema for describing prostate lesions on mpMRI shows promise in quantifying tumor imaging traits. A model-based approach to these traits provides a computational means of relating these findings to pathological outcome. These methods show potential for discriminating prostate cancer lesions with better accuracy than currently practiced risk assessment.