The effects of volume of interest delineation on MRI-based radiomics analysis: evaluation with two disease groups

Background Manual delineation of volume of interest (VOI) is widely used in current radiomics analysis, suffering from high variability. The tolerance of delineation differences and possible influence on each step of radiomics analysis are not clear, requiring quantitative assessment. The purpose of our study was to investigate the effects of delineation of VOIs on radiomics analysis for the preoperative prediction of metastasis in nasopharyngeal carcinoma (NPC) and sentinel lymph node (SLN) metastasis in breast cancer. Methods This study retrospectively enrolled two datasets (NPC group: 238 cases; SLN group: 146 cases). Three operations, namely, erosion, smoothing, and dilation, were implemented on the VOIs accurately delineated by radiologists to generate diverse VOI variations. Then, we extracted 2068 radiomics features and evaluated the effects of VOI differences on feature values by the intra-class correlation coefficient (ICC). Feature selection was conducted by Maximum Relevance Minimum Redundancy combined with 0.632+ bootstrap algorithms. The prediction performance of radiomics models with random forest classifier were tested on an independent validation cohort by the area under the receive operating characteristic curve (AUC). Results The larger the VOIs changed, the fewer features with high ICCs. Under any variation, SLN group showed fewer features with ICC ≥ 0.9 compared with NPC group. Not more than 15% top-predictive features identical to the accurate VOIs were observed across feature selection. The differences of AUCs of models derived from VOIs across smoothing or dilation with 3 pixels were not statistically significant compared with the accurate VOIs (p > 0.05) except for T2-weighted fat suppression images (smoothing: 0.845 vs. 0.725, p = 0.001; dilation: 0.800 vs. 0.725, p = 0.042). Dilation with 5 and 7 pixels contributed to remarkable AUCs in SLN group but the opposite in NPC group. The radiomics models did not perform well when tested by data from other delineations. Conclusions Differences in delineation of VOIs affected radiomics analysis, related to specific disease and MRI sequences. Differences from smooth delineation or expansion with 3 pixels width around the tumors or lesions were acceptable. The delineation for radiomics analysis should follow a predefined and unified standard.


Background
As an emerging non-invasive tool, radiomics has shown gratifying performance in phenotype diagnosis and classification [1,2], tumor prognosis [3,4], treatment decision [5,6], and molecular marker estimation [7,8] by permitting comprehensive quantification tumor heterogeneity on radiographic imaging [9][10][11]. The process mainly consists of six consecutive steps including image acquisition, image preprocessing, tumor segmentation, feature extraction, feature selection, and radiomics model development. Each step can be an uncertain factor contributing to an unreasonable result due to a lack of standardization in radiomics analysis. Recent studies have focused on identifying the factors that affect radiomics analysis. Vallieres et al. investigated the impact of six parameters of feature extraction on the prediction of lung metastases in soft-tissue sarcomas of the extremities [12]. Lu [13]. In the process of image preprocessing for patients with head and neck cancers, Bagher-Ebadian et al. evaluated changes in radiomics features from images subject to smoothing, sharpening, and noise relative to baseline datasets [14]. In a recent study, Shiri et al. considered the need of reliable feature values against image reconstruction and assessed the variability of radiomics features extracted from multi-scanner phantom and patient PET/CT images over a wide range of different reconstruction settings [15].
Among the factors that affect radiomics analysis, delineation of tumors or lesions occupy an important position, as the volume of interest (VOI) is directly used to extract quantitative features [9]. The accuracy may affect subsequent radiomics analysis. Usually, VOIs are manually outlined by radiologists with labor intensive as well as time-consuming. The work in [16] showed that the delineation of VOIs for radiotherapy currently was imprecise with high inter-operator variability, even for experienced observers. Most prior studies have focused solely on the effects of inter-observer variability in manual tumor delineation to identify radiomics features with high robustness [17,18]. In fact, quantification of tumor delineation and tolerance assessment of the differences are likely more important in developing standardized research. Recently, Kocal et al. [19] determined the influence of segmentation with margin shrinkage of 2 mm on CT-based radiomics analysis for distinguishing low and high nuclear grade renal clear cell carcinomas (RcCCs). However, in most cases, delineation tends to overestimate the lesion volume to ensure that the entire lesion is identified [20]. The delineation differences that can be accepted and possible influence on radiomics analysis have not been unexplored, requiring quantitative assessment.
The aim of this work is to investigate the effects of delineation of VOIs on each step of radiomics analysis in detail, including feature extraction, feature selection, and prediction performance of radiomics models. Simultaneously, the tolerance of delineation differences of VOIs was assessed for reference in radiomics analysis.

Patients
Two datasets were collected to investigate the effects of delineation of VOIs on radiomics analysis. The first problem is to distinguish whether metastasis occurs in patients with NPC before radiotherapy. In clinical practice, the majority of NPC patients with metastasis before radiotherapy suffer from poor prognosis [21][22][23]. Hence, it will be beneficial to improve prognosis if the risk of transfer before treatment can be distinguished accurately and take timely intervention on patients with high risk of metastasis. Our study retrospectively recruited 238 patients with NPC who had been diagnosed by histopathology between August 2009 and January 2013. All patients were divided into two groups in accordance with the metastasis status: (i) metastasizing (TM) group with 126 patients; (ii) non-metastasizing (NM) group with 112 patients.
The second problem is the prediction of sentinel lymph node (SLN) metastasis in patients with breast cancer, as described in [24]. It is of great significance using radiomics analysis to predict SLN metastasis for treatment decision making in breast cancer. A total of 146 consecutive patients with histologically confirmed breast cancer between March 2014 and June 2016 were retrospectively enrolled in this work. The patients consisted of two groups on the basis of SLN metastasis: (i) TM group with 55 patients; (ii) NM group with 91 patients. The inclusion criteria are available in Additional file 1: Note S1.
The patients were divided into a strictly training cohort for radiomics model building and an independent validation cohort (25% in NPC group and 33% in SLN group) for evaluating the final prediction performance. The detailed demographic characteristics and clinical information are summarized in Table 1.

Image acquisition protocol
All analyses were carried out in accordance with the relevant guidelines and regulations, and the requirement to obtain informed consent was waived. This retrospective study was approved by the local institutional review board.

Image pre-processing
As for the subjects in two datasets enrolled in our study, multi-sequence MR images are required from several MR scanners with different protocols, hence image standardization are essential for all images to avoid the inhomogeneity. Prior to analyzing MR images, additional image standardization involving bias field correction and intensity normalization were conducted to avoid inhomogeneity. First, the N4ITK algorithm [25] was applied to remove the bias field artifacts in the MR images. Subsequently, intensity normalization [26] was utilized to reduce the variability across image acquisitions from different manufactures. Fig. 1 illustrates the schematic framework of the radiomics analysis in this work.

Volume of interest segmentation
All MR images were imported into the ITK-SNAP software designed by Yushkevich et al. [27] to define the VOI of each tumor. The tumor contours were individually first outlined slice-by-slice by two radiologists (Z.L., 4 years of experience, and Z.B., 6 years of experience) and then reviewed by a senior radiologist (Z.S., 12 years of experience). Any disagreement between the readers was discussed until a final consensus was generated. During the session 30 cases randomly selected from each dataset were used for the inter-observer analysis of the segmentation. For each selected region of interest (ROI), the smallest rectangle that best fits the tumor region was used to calculate margin distance of two kinds of manual segmentation in four directions (up, down, left, and right), resulting in multiple calculated values (number of selected ROIs × 4) for analysis together.

Changes of volume of interest
On the basis of the original segmented regions, erosion, dilation, and smoothing were performed on the VOIs slice-by-slice to generate diverse VOIs. For the dilation operation, the radius sizes (number of pixels) of the circular structural element to dilate the VOIs were separately set as 3, 5, and 7. Given that certain tumors were extremely small, the size for the erosion operation was only set to 3. Image smoothing for VOIs was implemented by a Gaussian smoothing filter configured with correlation operator, where sigma was set as 3 and the template size was 7 × 7. Pixel values outside the bounds of the region of interest were set to the value of the nearest border. The three types of operations were respectively implemented using the functions imdilate, imerode, and imfilter of MATLAB version 8.5 (MathWorks, R2015a). No additional processing was implemented on the contours. For the sake of analysis, five operations on VOIs were abbreviated as Erosion, Smoothing, Dilation, Dilation5, Dilation7, respectively. The VOIs accurately delineated by radiologists were denoted as Baseline. Fig. 2 exemplifies the VOI in a single slice of the original tumor and presents the corresponding drawing of partial enlargement under different operations simultaneously. The degree of tumor volume change of diverse delineations in relation to the accurate delineation is summarized in Table 2.

Feature extraction
A total of 2068 radiomics features were extracted for each VOI. In reference to [12], four non-texture features that describe the geometric characteristics were calculated, including tumor volume, size, solidity, and eccentricity. In view of the effects of varying extraction parameters on texture features, three extraction parameters, respectively, isotropic voxel size, quantization of gray levels, and quantization algorithm, were adopted, thereby leading to 2064 textural features for each patient. The textural features consisted of Global (extracted from the intensity histogram with 100 bins of the tumor region), grey-level co-occurrence matrix (GLCM), grey-level run length matrix (GLRLM), grey-level size zone matrix (GLSZM) and neighbourhood grey-tone difference matrix (NGTDM) [28][29][30]. The extraction was conducted with a MATLAB toolkit for radiomics analysis (https://github.com/mvallieres/radiomics). The detailed extraction parameters and description are available in Additional file 1: Note S2 and Table S1.

Feature selection
The feature selection was performed within the training cohort. Maximum Relevance Minimum Redundancy (mRMR) [31], which has good trade-off between the maximum relevance and minimum redundancy, was firstly explored to identify a well-ranked feature set that included 100 features. Referring to [12,32], the 0.632+ bootstrap method combined with the area under the receiver operating characteristic curve (AUC) metric were adopted to evaluated the predictability of features (Additional file 1: Note S3). One thousand iterations were performed with 63.2% random data resampling

Development of radiomics model
Once the discriminative features were identified, radiomics models were built based on different feature sets. A new feature set was composed when one feature from higher to lower rank was added, which contributed to 20 radiomics models. We used the random forest classifier [33] to evaluate the capability of foregoing radiomics models across 10-fold cross-validation in the training cohort, with 150 decision trees used for training ultimately. The model that possessed the most superior properties was determined for further analysis.

Statistical analysis
First, Mann-Whitney U test was used to compare the difference in age and other continuous variables between TM and NM. Chi-square test was performed to analyze the differences based on factors, such as gender and clinical stages. Statistical analysis was performed on SPSS version 22.0 (IBM, Armonk, NY, USA). The robustness of features against delineation differences versus the accurate VOIs was quantified using the intra-class correlation coefficient (ICC). Features with ICC ≥ 0.9 were considered excellent robust. The performance of diverse VOI-derived radiomics models were assessed by AUC, and the differences were compared by the method of DeLong et al. [34] using the MedCalc version 15.2.2 (MedCalc Software bvba, Ostend, Belgium). Note that a two-tailed p value less than 0.05 indicated statistical significance in this work.

Results
For the metastasis differentiation in NPC before radiotherapy, no significant differences were observed between NM and TM groups except in histologic grade (p < 0.05; Table 1). In the prediction of SLN metastasis in breast cancer, NM and TM groups had no significant differences in all characteristics (p > 0.05; Table 1). Inter-observer differences are summarized in Fig. 3. Colors in the heatmap indicated that margin differences of ROIs from two radiologists were concentrated between 0 and 8 pixels for all datasets.

Feature robustness analysis
ICCs for features against all VOI variations were distributed in a wide range for all scans ( Fig. 4a and b). ICC values in Smoothing which represented the smallest differences were most concentrated with the smallest effect on feature values, except for T2-FS images, for which Dilation had more concentrated distribution with the narrowest ICC range of 0.134-0.999. Dilation7 which changed the most in VOIs, revealed the largest ICC range with the great effect on feature values. The features extracted from breast cancer data were more sensitive to VOI variations compared with NPC, showing fewer robust features as a whole (Fig. 4c). Smoothing resulted in the maximum number of robust features, whereas Dilation7 worked the other way around.

Feature selection analysis
As a matter of convenience, the top-predictive features selected from diverse VOIs were re-indexed according to feature type. Each symbol in Fig. 5 represents one type of feature, and features in area filled with gray represent the same top-predictive features as accurate VOIs. The features selected under diverse VOIs showed considerable differences (Fig. 5), which indicated great effects of delineation differences on feature selection. Under any variation in the two tasks, not more than 15% top-predictive features were identical to the accurate VOIs, particularly for CET1-w images. This result was the case for no common features. Analogously, there was a large difference in features contributing the best radiomics models (see solidfilled symbols in Fig. 5).

Prediction performance analysis
As seen in Table 3 Table S2.

Model performance across testing data from diverse VOIs
On the basis of feature parameters obtained from the radiomics models, we assessed stability by validating the model using data from diverse VOIs, as shown in Fig. 6. The predictive AUCs using CET1-w images, trained by data from the Dilation7 model and tested by data from other VOIs, were all above 0.7 and seemed relatively stable. Poor prediction results were still represented by Dilation5 and Dilation7 models using T2-w images. The prediction results changed in relatively large ranges across different VOI-operated validation data in SLN group. The model with training and validation data undergoing the same delineation outperformed other models in most cases.

Discussion
In this work, we investigated the influence of tumor delineation on radiomics analysis in detail within two disease groups. The tolerance of the delineation differences was explored to provide references for tumor delineation in future radiomics studies. Application of various operations to VOIs corresponded to diverse delineations in clinical practice. The results illustrated that delineation differences of VOIs had an effect on the radiomics feature values, feature selection, and prediction performance which depended on specific disease as well as MRI sequences. The experiment results provided strong evidence that the larger the VOIs changed, the greater the influence on the feature values (Fig. 4). According to the number of robust features, different diseases had discrepant sensitivity to VOI variations, consistent with a previous discovery [18]. This result could be explained from the fact that the tumors in breast cancer are larger with illdefined margins, which cause great changes on the feature values across larger variation. A comparison of toppredictive features showed that even slight smoothing on VOIs could lead to large differences in feature selection. This agrees with the discovery in [19], only one texture feature appeared on both contour-focused segmentation and the one with shrinkage of 2 mm. Probably because the variations exactly weaken the correlation with the class of certain features by changing the feature values, which resulted in a new order of top-predictive features. A feature possessing good distinguishing characteristics does not stand out under all conditions, and thus depended on the specific analysis task.
The study also demonstrated that delineation differences of VOIs affected prediction performance of radiomics models. Stable and prominent performance from VOIs across Smoothing and Dilation indicated the tolerance of corresponding differences for radiomics models and corroborated the feasibility that the radiologists smoothly outline the lesions or slightly larger of 3 pixels width around the tumor. Note that it is not that bigger is better for VOIs. The worse performance from VOIs across Dilation5 and Dilation 7 in NPC group (Fig. 2) could be explained by the dilated area that contained more areas of the nasal cavity which exhibits low-signal intensity. This increased the effect of certain features, tending to confusion classification and facilitating feature sets with poor differentiation property. However, in breast images, more soft tissues containing complex textures were associated to capture heterogeneity for predicting SLN metastasis, indicating that the peritumoral regions had a positive influence to a certain extent. This finding is consistent with past researches  [35,36]. Braman et al. showed that the textural analysis of peritumoral regions contributed to the prediction of pathological complete response in neoadjuvant chemotherapy on pretreating breast cancer DCE-MRI [36]. This explanation also holds true for the worse performance from VOIs across erosion. The radiomics models with good predictive properties might not necessarily perform well on the validation data from VOIs of diverse delineations, which implied that the VOIs of training and validation data should be outlined on the basis of the same criterion. In this regards, a unified standard should be referred in the delineation of VOIs, e.g., slight larger delineation with 3 pixels width around the tumors or lesions for all images. We suppose this assists in more accurate analysis, as the same proposal by Welch et al. [37]. In particular, the Dila-tion7 model distinctly reflected stable performance against all the variations using CET1-w images. The modeling features are shown in Additional file 1: Table S3. Beyond our expectation, no features showed high robustness, whether in one or all variations. We can infer that features which are not robust to the differences in VOIs may not result in poor prediction performance, which is similar with the observation of past researches [38,39]. The results also confirmed the insufficiency of simply analyzing the effects of differences on features robustness. In fact, whether the final performance of the radiomics models exist substantial differences is the most important issue, as emphasized in [40].
The present work also has several generalizability issues and limitations. First, while the number of patient population was small, an independent validation cohort was divided for radiomics model evaluation devoid of information leakage between feature selection/training phases. We believe this makes the results reliable and generalizable. It is in demand of more patient data for stronger verification in future research. Second, we used simple morphological operations to change tumor margin. Other contour randomization processing methods that provide stochastic components in the delineation of VOIs are lacking. For the purpose of determining the feasibility of alternative delineation of VOIs, relative changing of the tumor-focused delineation is easier to implement from a medical point of view. Third, regarding the differences in image resolutions between MR images, we changed the size of VOIs at pixel level to better adapt to the delineation of different scenarios. As radiologists delineate the VOIs in term of the original images, which does not involve image resampling and additional preprocessing. Fourth, the effects of the diverse delineations were analyzed and synthesized in spite of differences in tumor location and imaging manifestations within two disease groups. More types of diseases should be further assessed to provide more comprehensive references. Additionally, we only assessed the effects of diverse delineations of VOIs using MRI. The effect on the radiomics analysis for other modalities, such as PET/ CT, is still unclear.

Conclusions
The differences in delineation of VOIs could lead to considerable differences in feature value and feature selection. The influence on prediction depended on specific disease as well as MRI sequences, among which smooth or slight larger delineation with 3 pixels width around the tumors or lesions were feasible. In addition, predefining a unified standard is suggested in the delineation to promote reliable analysis. Despite several limitations, we believe these findings are of great significance as a reference for tumor delineation in future radiomics analysis.
Additional file 1. Note S1 : the inclusion criteria and flow chart of the patient data. Note S2: radiomics features. Note S3: 0.632+ bootstrap feature selection. Figure S1. The inclusion criteria and flow chart of nasopharyngeal carcinoma data. Figure S2. The inclusion criteria and flow chart of breast cancer data. Table S1. Radiomics features type and number. Table S2. Prediction results of radiomics models from diverse VOIs on the training cohorts of two disease groups. Table S3. Radiomics features for the Dilation7 (dilation with structural element radius size of 7) model as well as the ICCs for metastasis estimation in nasopharyngeal carcinoma