Predicting microvascular invasion in hepatocellular carcinoma: a deep learning model validated across hospitals

Background The accuracy of preoperative estimation of microvascular invasion (MVI) in hepatocellular carcinoma (HCC) by clinical observers is low. Most recent studies constructed MVI predictive models using radiological and/or radiomics features extracted from computed tomography (CT) images. These methods, however, rely heavily on human experience and require manual tumor contouring. We developed a deep learning-based framework for preoperative MVI prediction that uses arterial phase (AP) CT images with simple tumor labeling and without the need for manual feature extraction. The model was further validated on CT images originally scanned at multiple different hospitals. Methods AP CT images were acquired for 309 patients from China Medical University Hospital (CMUH). Images of 164 patients, who underwent CT scanning at 54 different hospitals but were referred to CMUH, were also collected. Deep learning (ResNet-18) and machine learning (support vector machine) models were constructed with AP images and/or patients' clinical factors (CFs), and their performance was compared systematically. All models were independently evaluated on two patient cohorts: a validation set (within CMUH) and an external set (other hospitals). Subsequently, the explainability of the best model was visualized using the gradient-weighted class activation map (Grad-CAM). Results The ResNet-18 model built with AP images and patients' CFs was superior to the other models, achieving the highest AUC of 0.845. When evaluated on the external set, the model produced an AUC of 0.777, approaching its performance on the validation set. Model interpretation with Grad-CAM revealed that MVI-relevant imaging features on CT images were captured and learned by the ResNet-18 model. Conclusions This framework provides evidence of the generalizability and robustness of ResNet-18 in predicting MVI using AP CT images scanned at multiple different hospitals.
Attention heatmaps obtained from model explainability further confirmed that ResNet-18 focused on imaging features on CT overlapping with the conditions used by radiologists to estimate MVI clinically.


Introduction
Hepatocellular carcinoma (HCC) is a common cancer worldwide and is now ranked as the fourth leading cause of cancer-related death [1,2]. Currently, liver transplantation (LT), surgical resection and radiofrequency ablation (RFA) are the three potentially curative therapies for HCC [3]. Despite these treatments, the 5-year recurrence rates after LT, surgical resection and RFA are 10-15%, 42-52% and 42-70%, respectively [1,4,5]. HCC positive for microvascular invasion (MVI) has been reported to recur often within 2 years [3]. Several studies on MVI have reported that MVI is an independent risk factor for tumor recurrence and overall survival after resection [6][7][8]. However, the accuracy of estimating MVI preoperatively by clinical observers is usually low [9].
Previous studies demonstrated preoperative prediction of MVI using computed tomography (CT) images and clinical factors (CFs) [10][11][12]. Ma et al. (2019) used the least absolute shrinkage and selection operator (LASSO) method for radiomic feature extraction and multivariable logistic regression analysis for CF predictor selection to construct models for preoperative MVI prediction. Xu et al. (2019) applied a feature-selection support vector machine (SVM) to CT radiomics and then combined them with radiological features and CFs (e.g. age, gender, Child-Pugh class, hepatic virus infection, etc.) to develop a computational approach for predicting MVI status and long-term clinical survival outcomes of patients with HCC. Peng et al. (2018) developed a multivariable logistic regression model for MVI status in hepatitis B virus-related HCC patients by including radiomics and CFs. In addition to radiomic and radiological features, Banerjee et al. (2015) [1] demonstrated accurate prediction of histological MVI using radiogenomic venous invasion (RVI) as a contrast-enhanced CT biomarker of MVI. RVI was derived from an association mapping of CT imaging traits with the expression of a 91-gene HCC-specific "venous invasion" gene signature. They reported that RVI is associated with early disease recurrence and lower overall survival.
Feature-based methods are one of the main approaches to constructing MVI predictive models [10,[12][13][14]. For instance, potential radiomic features extracted from CT images were selected using supervised or unsupervised methods before the development of predictive statistical or machine learning models. However, several shortcomings can be identified in these methods. First, the types of hand-crafted features and the numbers of selected radiomic features included in model development varied from study to study. Extracting features from CT images by hand is usually tedious and relies heavily on the observers' experience, and some underlying imaging information relevant to HCC may not be faithfully captured by less experienced observers. Second, these methods cannot take into account the detailed information provided in the images pixel by pixel. Third, some studies [10,12,13,15] used images with tumor contouring performed manually by radiologists; this increases radiologists' workload and consumes considerable time when a predictive model developed with such images is added to the clinical workflow. Manual tumor contouring may also introduce selection bias and influence the performance of predictive models. In addition, external validation of predictive model performance using CT images obtained from multiple medical centers has been limited, which restricts the deployment of a well-developed model at hospitals that use scanning parameters different from those of the medical center that built the model.
To tackle the above limitations, we demonstrate a deep learning-based framework in which feature extraction is performed automatically. A pretrained convolutional neural network (CNN) was applied to analyze HCC CT imaging features, which were concatenated with CFs to predict MVI preoperatively. In addition to using CT images from one hospital for model training and validation, external validation of the predictive models using CT images obtained across hospitals was also performed. This provided evidence of the generalizability and robustness of our constructed CNN model and its applicability to CT images obtained at other hospitals. Furthermore, model interpretability was examined by adopting activation heatmaps to visualize and verify whether the regions the model focused on for MVI prediction were in line with the clinical decision workflow performed by radiologists during MVI diagnosis. As a whole, this study provides a new milestone for future deep learning-related research to predict MVI in early-stage HCC across hospitals.

Patients
CT images of HCC patients were collected retrospectively from China Medical University Hospital (CMUH). All HCC cases occurring between January 2007 and December 2020 were identified from the electronic medical records of CMUH. The following inclusion criteria were used: (1) surgical resection or LT for initial MVI diagnosis was performed; (2) HCC was confirmed and MVI status was recorded in the pathological report; (3) clinical data, including age, gender, maximum tumor diameter (MTD), Child-Pugh score, alpha-fetoprotein (AFP), and hepatitis B and C status, were available one week before surgery; (4) dynamic CT images included pre-contrast enhancement, late arterial phase (AP) and portal venous phase (PVP) acquired within 3 months before surgery; and (5) presence of a single tumor and no gross venous invasion. Subsequently, HCC cases were excluded based on the following criteria: (1) locoregional therapy (i.e., ablation, trans-arterial chemoembolization or radiation therapy) conducted before imaging; (2) presence of other malignant liver tumors; and (3) presence of two or more HCC tumors. Finally, a total of 309 HCC patients (232 men and 77 women) fulfilled these inclusion and exclusion criteria. While these patients had their initial CT scanning and subsequent follow-ups at CMUH, another cohort of patients who had their initial CT scanning at other hospitals but were referred to CMUH was also screened with the same criteria, eventually yielding 164 HCC patients (127 men and 37 women). Since these patients were referred from 54 hospitals and underwent their first CT imaging outside of CMUH, they were eligible for external validation of model performance.
Before model development, the 309 patients were randomly split at a ratio of 70:30 into training (N = 216) and validation (N = 93) sets. Since one patient might have multiple CT slices, we separated our data set by patient to avoid distributing CT slices of the same patient into both the training and validation sets (i.e., data leakage). Random separation of patient data at a ratio of 70:30 was repeated many times to generate multiple combinations of training and validation sets, each having different numbers of CT slices in the two sets. The combination with more CT slices in the training set and no significant differences in MVI status or CFs between the two sets was selected for model development. The demographics of the CFs for the training and validation sets are shown in Table 1. The 164 patients referred from other hospitals were used as the external validation set; their CF demographics are shown in Table 2.
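The patient-level split described above can be sketched as follows. This is a minimal illustration, not the authors' code; the `split_by_patient` helper and the toy data are assumptions. The key point is that all slices of one patient land in exactly one set.

```python
# Patient-level 70:30 train/validation split to avoid data leakage:
# every CT slice of a given patient goes into the same set.
import random

def split_by_patient(slices_by_patient, train_ratio=0.7, seed=0):
    """slices_by_patient: dict mapping patient_id -> list of CT slice ids."""
    patients = sorted(slices_by_patient)
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_train = int(len(patients) * train_ratio)
    train_ids, val_ids = set(patients[:n_train]), set(patients[n_train:])
    train = [s for p in train_ids for s in slices_by_patient[p]]
    val = [s for p in val_ids for s in slices_by_patient[p]]
    return train, val, train_ids, val_ids

# toy example: 5 patients with 1..5 slices each (15 slices in total)
data = {f"p{i}": [f"p{i}_slice{j}" for j in range(i + 1)] for i in range(5)}
train, val, train_ids, val_ids = split_by_patient(data)
assert train_ids.isdisjoint(val_ids)  # no patient appears in both sets
```

Repeating this with different seeds mimics the paper's repeated random separation; each repetition yields different slice counts per set.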

Feature selection and statistical analysis for patients' clinical factors
According to previous studies [10,12,16], age, gender, maximum tumor diameter (MTD), Child-Pugh score, AFP, and hepatitis B status are the features commonly used to estimate MVI. In addition to hepatitis B, some patients enrolled in this study were found to have hepatitis C as well. Consequently, we decided to include gender, age, MTD, AFP, Child-Pugh score, and hepatitis B and C status as patients' CFs for model development. Subsequently, a multivariate logistic regression analysis was performed on the association of these CFs with MVI status.

CT images
All CT images were scanned using multi-detector CT scanners (BrightSpeed 16: N=91, LightSpeed 16: N=45, LightSpeed VCT 64: N=46, and Optima 660: N=126; GE Medical Systems, Milwaukee, WI) except one case, which was scanned with a 640-row spiral CT scanner (Aquilion ONE, Canon Medical Systems, Hong Kong). A 1.5 mL/kg body weight bolus of Iohexol or Iodixanol (Omnipaque or Visipaque, GE Healthcare) was injected intravenously via a power injector at a flow rate of at least 2.7 mL/s. After pre-contrast enhanced CT images were captured, a smart prep technique was used: the aorta was scanned repeatedly after contrast medium injection, and AP scanning was started once the density of the aorta rose above 120 Hounsfield units (HU). During scanning, a target region was placed at the level of the abdominal aorta above the orifice of the celiac artery. PVP images were obtained 8-15 seconds after AP, and delay phase (DP) images were obtained 80-90 seconds after PVP. The scanning parameters were 120 kV, auto current in mA, 0.8 s rotation time and a collimation of 1.25 mm. Axial slices were reconstructed with a slice width of 5.0 mm and a slice interval of 5.0 mm. The CT scanning protocols performed at other hospitals for patients in the external validation set contained pre-contrast enhanced CT, AP and PVP; however, the scanning parameters used by other hospitals might vary from one another.
Since the process of scanning AP images, which started after the density of the aorta rose above 120 HU, was consistently performed in all hospitals, while the timings of scanning PVP and DP images often differ across hospitals, we decided to use CT images of AP (not DP and/or PVP) to eliminate data drift in CT images due to inconsistent imaging parameters applied by distinct medical centers. A total of 1927 AP CT slices were collected from the 309 patients in the training and validation sets: the 216 patients in the training set contributed 1186 slices (MVI+: 578; MVI-: 608) and the 93 patients in the validation set contributed 741 slices (MVI+: 377; MVI-: 364). For the external validation set, a total of 1418 AP CT slices (MVI+: 505; MVI-: 913) were obtained from the 164 patients. All slices were confirmed to contain only one HCC tumor. All slices had a resolution of 512 × 512 with grayscale values ranging from 0 to 255. The grayscale values were transformed from HU by applying a linear transformation with a window level of 70 and a window width of 200.
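The HU-to-grayscale windowing described above (level 70, width 200, so HU values in [-30, 170] map linearly to [0, 255]) can be sketched as follows; the `window_hu` helper is illustrative, not the authors' code.

```python
import numpy as np

def window_hu(hu, level=70.0, width=200.0):
    """Linearly map HU values in [level - width/2, level + width/2] to 0-255,
    clipping values outside the window."""
    lo, hi = level - width / 2.0, level + width / 2.0
    img = (np.clip(hu, lo, hi) - lo) / (hi - lo) * 255.0
    return img.astype(np.uint8)

# with the stated window, HU -30 maps to 0, HU 70 (the level) to ~127,
# and HU 170 to 255; values outside the window are clamped
pixels = window_hu(np.array([-500.0, -30.0, 70.0, 170.0, 500.0]))
```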

Labeling of regions of interest
Labeling of regions of interest (ROIs) on all AP slices was performed by a radiologist with 15 years of experience. A circled ROI, which contained an HCC tumor at the center and was approximately 1 cm larger than the tumor boundary to ensure full coverage of the tumor, was labeled on every AP CT slice (see Fig. 1). This step was accomplished by the radiologist by carefully comparing all sets of CT images and the pathological findings for each patient. Labeling the HCC tumor with a circled ROI on AP CT slices reduced a substantial amount of the labeling time and effort spent by the radiologist compared to the delineation of tumor boundaries performed in other studies [10,12,13,15]. As a consequence, manual tumor contouring, which is usually tedious and time-consuming, was not required in this deep learning framework. The circled ROI was then converted to a square bounding box fitting the entire tumor and covering the ROI. A square bounding box was used because deep learning models only accept square or rectangular images as inputs.

Imaging cropping with optimal margin
The study of Banerjee et al. (2015) revealed that peritumoral regions in CT images provide information, such as a hypodense halo and tumor-liver difference, which may indicate the presence of MVI [1]. In contrast, other features found in these regions, such as large areas of air, bone tissue, kidney, great vessels and the inferior vena cava, can act as artifacts and influence MVI prediction. Therefore, cropping a smaller region from the ROIs could help remove unnecessary noise or artifacts. Moreover, the margin used to crop the region is critical, as a cropped region that is too small may sacrifice important information relevant to MVI identification. Therefore, several different values were tested to search for the optimal cropping margin. The margin here is defined as the scale of the edge length of the labeled bounding box; a margin less than 1.0 produces a cropped region that is smaller than the labeled bounding box. Fig. 2A shows an example of a cropped region with a margin of 0.8. The HCC boundary fit nicely within the cropped region (blue square box), implying that the visual information provided in the cropped region is adequate for MVI prediction. All images were processed with a margin of 0.8 and were used in subsequent experiments.
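The margin-based cropping above can be sketched as scaling the bounding-box edge by the margin factor while keeping the crop centred on the tumor. The `crop_with_margin` helper is an assumption for illustration, not the authors' implementation.

```python
# Crop a region whose edge is `margin` times the labeled bounding-box edge,
# centred on the same point as the box (a margin of 0.8 shrinks the crop).
import numpy as np

def crop_with_margin(image, box, margin=0.8):
    """box = (x0, y0, x1, y1), a square bounding box in pixel coordinates."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half = (x1 - x0) * margin / 2.0
    xa, xb = int(round(cx - half)), int(round(cx + half))
    ya, yb = int(round(cy - half)), int(round(cy + half))
    h, w = image.shape[:2]
    xa, ya = max(xa, 0), max(ya, 0)      # clamp to image bounds
    xb, yb = min(xb, w), min(yb, h)
    return image[ya:yb, xa:xb]

img = np.zeros((512, 512), dtype=np.uint8)      # placeholder 512x512 AP slice
crop = crop_with_margin(img, (100, 100, 200, 200), margin=0.8)
# the 100-pixel box edge becomes an 80-pixel crop edge
```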

Data augmentation
After processing all images with the optimal margin, data augmentation was performed to reduce overfitting of the deep learning models; this also increased the data size for model training. Three different types of transformations were applied: (1) randomly rotating the images at angles varying from -10 to 10 degrees; (2) randomly cropping the images with sizes of 0.8 to 1.0 and aspect ratios of 0.95 to 1.05; and (3) horizontally flipping the images with a probability of 0.5. In addition to increasing the amount of data, data augmentation also served to increase the variation of tumor visual appearance in size and distortion. This aided in constructing deep learning models robust to the variability of HCC tumors across patients. Examples of augmented images are provided in step (4) of Fig. 1: for random rotation and horizontal flipping, the rotating angle is shown below each image; for random cropping, the cropped size and aspect ratio are shown below each image, with the red square box marking the resulting image. CT = computed tomography; ROI = region of interest; MVI = microvascular invasion; MTD = maximum tumor diameter; AFP = alpha-fetoprotein; HBsAg = hepatitis B surface antigen; HCsAg = hepatitis C surface antigen; FC = fully connected.

Deep learning model

A previous study [17] demonstrated that a deep learning CNN model, such as GoogLeNet or ResNet, pretrained on the ImageNet data set can benefit visual recognition tasks on medical images. Therefore, we adopted ResNet [18], a CNN model suitable for visual recognition tasks. Moreover, ResNet has been used by other HCC studies to predict surgical response [19] or recurrence [20] from CT images. Taking ResNet-18 as an example, the suffix number indicates the number of layers of computational blocks in the neural network.
A larger number of layers is usually required for a neural network to handle visual recognition tasks of high complexity. However, a deeper neural network has a higher risk of overfitting. To search for an optimal number of layers and a better CNN model, we tested ResNet models with different numbers of layers as well as other CNN architectures, e.g. VGG [21], ResNeXt [22], and DenseNet [23], using the training set (Fig. 3).

Machine learning model
In addition to deep learning models, SVM models were developed for comparison. SVM was applied by Ma et al. (2019) in previous MVI work. Here, an SVM model was built with CFs only for the purpose of testing whether CFs without imaging features would be sufficient to make accurate predictions. On the other hand, the SVM model built with both CFs and imaging features was compared to the ResNet-18 model trained with CFs and AP images. Eventually, the performance of the constructed SVM models was compared to that of the developed ResNet-18 models.

Deep learning
After image cropping and subsequent data augmentation, all AP slices were resized to 256×256; the input image size of the pretrained ResNet-18 model was accordingly changed from 224×224 to 256×256. During transfer learning, no layer was frozen, and all layers before the final fully connected layer were initialized with the weights of the model pretrained on the ImageNet data set. We used a stochastic gradient descent optimizer on the cross-entropy loss with a mini-batch size of 8, a weight decay of 0.00001, a momentum of 0.9, and a learning rate of 0.0003.

Machine learning
Prediction and metrics for model evaluation

As one patient had multiple CT slices and the MVI statuses of all slices might not be predicted to be the same by the model, we aggregated the prediction results on a per-patient basis by following the clinical decision workflow adopted by the radiologist. Under this guideline, as long as one of the slices shows MVI+, the patient is considered as having an HCC tumor with MVI+. A patient is only considered as having an HCC tumor with MVI- if all of his/her slices show MVI-. This is because an HCC tumor is three-dimensional and the portion with MVI+ might not be captured in every slice. For model evaluation, AUC, accuracy, sensitivity and specificity scores were computed for all the models developed in this study on the training, validation and external data sets. Before computing accuracy, sensitivity and specificity, the optimal threshold was first determined using Youden's J statistic from the receiver operating characteristic (ROC) curve of each model. The J statistic is the difference between the true positive rate and the false positive rate along the ROC curve, and the threshold maximizing this difference was taken as optimal. The optimal threshold was then used as the cut-off for the predicted probability of MVI to compute the metric scores.

Model explainability
In addition to prediction, explaining or interpreting how a deep learning model makes predictions has become increasingly important for gaining the trust of physicians, patients, regulators and other stakeholders [24]. Therefore, we explored the critical areas or features on an input image that contributed to the ResNet-18 model's predictions using the gradient-weighted class activation map (Grad-CAM) [25]. Grad-CAM visualizes the critical areas or features on an image 'focused on' by a model as attention heatmaps. We set the intensities of the heatmaps as the gradients of the model outputs with respect to the activation of the last convolutional layer in the ResNet-18 trained with AP images and CFs. The gradients were computed using the backpropagation algorithm.

Clinical factors
A total of 309 patients were included in the training and validation sets. After surgery, these patients were histopathologically identified and classified into two groups: the MVI+ group (96 patients, 31%) and the MVI- group (213 patients, 69%). The clinical factors, demographics and statistical comparisons of the training and validation sets are reported in Table 1. There were no significant differences between the MVI+ group and the MVI- group in the training and validation data sets. Moreover, there were no significant differences in age, gender, MTD, AFP, Child-Pugh score, HBsAg, or HCsAg between the training and validation sets. Using both the training and validation sets, a multivariate logistic regression analysis was performed to test the association of these clinical factors with MVI status, and the results are shown in Table 3. Multivariate logistic regression was used because there was more than one independent variable. The odds ratios of MTD, AFP and Child-Pugh score were found to be significant (p < 0.05), indicating that MTD, AFP and Child-Pugh score were independently associated with MVI status.
A total of 164 patients were included in the external data set. Although these patients were referred from multiple medical centers, all of them received subsequent surgical treatment at CMUH. Based on the histopathological results, 39 patients (24%) were MVI positive and 125 patients (76%) were MVI negative.
The clinical factors and demographics of the external data set are reported in Table 2. The CT images of patients referred from other hospitals were acquired using a larger variety of CT scanners with disparate manufacturing vendors compared to patients within CMUH (Fig.  4). As a consequence, the external set is useful in verifying the robustness of our developed model in making an accurate prediction.

Construction of CNN (ResNet-18) models for MVI prediction
In this study, a ResNet-18 model utilizing patients' AP CT images and CFs was developed to predict MVI preoperatively. The detailed process of constructing this ResNet model, including image cropping, data augmentation, resizing and image pre-processing, is shown in Fig. 1. Another ResNet-18 model utilizing only AP CT images was also constructed for performance comparison. We ran model training and validation 5 times for both of these ResNet-18 models. The ROC curves and mean AUC scores of these ResNet-18 models (AP image vs. AP image + CF) are presented in Fig. 5A & C. The accuracy, sensitivity, specificity and AUC score of these models (the best among the 5 repetitions) are reported in Table 4 for the training, validation and external sets. The model developed with both the patients' AP images and CFs produced an AUC score of 0.85 while the model for AP images alone had an AUC score of 0.82 on the validation set. For the external set, the AUC score of the model for AP images and CFs was 0.78 and that of the model for AP images alone was 0.75. The AUC score (0.78) on the external set was close to the mean AUC score (0.81) on the validation set, indicating that the ResNet-18 model for AP images and CFs was able to generalize to patients who had their CT imaging obtained at other hospitals. This also implied that the model can be applied to predict MVI status using CT images scanned on scanners manufactured by various vendors (Fig. 4). When looking at the results of multiple metrics on the external set (Fig. 5), the ResNet-18 model utilizing both AP images and CFs for MVI prediction had overall higher accuracy, sensitivity, specificity and AUC scores than the one utilizing only AP images.

Construction of SVM models for MVI prediction
Next, we trained SVM models using either CFs alone or CT imaging features combined with CFs to predict MVI preoperatively. An SVM model for CFs was constructed because we were interested in testing whether CFs alone were sufficient to predict MVI. We were also interested in finding out whether a hybrid model, i.e., one in which CT imaging features extracted by a deep learning model are used as inputs to a machine learning model [26,27], would perform as well as the ResNet-18 model. Therefore, we extracted 512 imaging features from the ResNet-18 model that was developed with AP images only and fused them with the corresponding patients' CFs. These data were then used as inputs to develop an SVM model for AP images and CFs. The ROC curves and AUC scores of the two constructed SVM models (CF and AP image + CF) are presented in Fig. 5B & C. Meanwhile, the accuracy, sensitivity and specificity of these models are reported in Table 4. Unlike the ROC curves of the ResNet-18 models, those of the two SVM models overlapped less (Fig. 5B). The SVM model for CFs only produced higher AUC scores on the validation and external sets than the SVM model for both AP images and CFs (Fig. 5C). Although its AUC score exceeded 0.95 on the training set (Fig. 5C and Table 4), the SVM model using both AP images and CFs for MVI prediction was unable to generalize well to the validation and external sets (see Fig. 5C, Table 4), indicating that this SVM model was overfitting. On the other hand, less overfitting was observed in the SVM model using only CFs for MVI prediction.

Comparison of CNN (ResNet-18) versus SVM models
In the training set, all ResNet-18 and SVM models achieved AUC scores above 0.95 except the SVM model that used only patients' CFs for MVI prediction (Fig. 5C). The ROC curves of the SVM model that used both AP images and CFs (Fig. 5B) on the validation and external sets revealed that its performance was the worst among the four models. Furthermore, when comparing the performance of the four models on the validation and external sets, the AUC scores of the ResNet-18 model for AP images and CFs were the highest while those of the SVM model for AP images and CFs were the lowest. The multiple metric scores, except sensitivity, of the SVM model for AP images and CFs on the validation and external sets (Table 4) were also not as high as the respective scores of the other models. Overall, the ResNet-18 model utilizing both AP images and CFs for MVI prediction had the highest AUC scores (Fig. 5C) and most of its multiple metric scores (Table 4) were also higher than those of the other models.

Fig. 4 Computed tomography (CT) scanners used by China Medical University Hospital (CMUH; purple bars) and other hospitals (red bars) for CT imaging were manufactured by various different vendors. Each bar indicates the number of arterial phase CT images that were scanned using the particular scanner. CT images within CMUH contained arterial phase images in the training and validation sets, while CT images from other hospitals included images in the external data set. A larger variety of scanners was used by other hospitals compared to CMUH. CT images taken at CMUH were mainly scanned using scanners of four types: BrightSpeed, LightSpeed VCT, LightSpeed16 and Optima CT 660.

Explainability of ResNet-18 model
Representative CT images of true positives (Fig. 6A-B), false positives (Fig. 6C-D) and false negatives (Fig. 6E-F), each pair containing one case from the validation set and one from the external set, together with their respective attention heatmaps, are presented in Fig. 6. We noticed that the ResNet-18 model for AP images and CFs identified MVI with logic similar to that reported by Banerjee et al. (2015) for determining RVI. RVI, a contrast-enhanced CT biomarker of MVI derived from a 91-gene HCC "venous invasion" gene expression signature, was demonstrated to have a strong relationship with histological MVI and to predict MVI independently [28]. A three-trait decision tree (refer to Fig. 1B in [1]), which involves defining internal arteries, hypodense halo and tumor-liver difference on CT images, was used to describe the strategy used by the radiologist to identify MVI. A representation of this three-trait decision tree was re-drawn and is shown in Fig. 6G. According to Banerjee et al. (2015), an HCC tumor on a CT image is identified as MVI positive if its visual appearance satisfies these three conditions sequentially: (1) presence of internal arteries; (2) no hypodense halo; and (3) no tumor-liver difference. As shown in Fig. 6A-D, these representative images in the validation and external sets were classified as MVI positive by our ResNet-18 model (red arrows indicate presence of internal arteries) because they satisfied all three conditions. On the other hand, the two representative images in Fig. 6E-F satisfied only one or two of these three conditions and were classified as MVI negative by our ResNet-18 model (black arrows indicate presence of hypodense halo). For Fig. 6A-B, the MVI statuses predicted by the ResNet-18 model were consistent with the real MVI statuses, which were based on the histopathological results. The MVI statuses estimated preoperatively by the radiologist were uncertain for the cases shown in Fig. 6C-F.
Although conflicts were found between the actual histopathological diagnosis of MVI and the ResNet-18 model prediction in these cases, the MVI estimations made by the radiologist, based on the strategy presented in Fig. 6G, were consistent with the outcomes of our ResNet-18 model. Patients would be recommended to undergo a more detailed examination for further assessment of MVI status when the histopathological diagnosis differs from the ResNet-18 model prediction.

Discussion
A preoperative noninvasive assessment of MVI is helpful in assisting surgeons to determine the tumor resection area as well as in guiding subsequent treatment [15]. Previous studies used radiological and/or radiomics features extracted from CT images to construct regression or machine learning models for preoperative prediction of MVI status in HCC patients [10][11][12][13]. Moreover, patients' clinical variables, e.g. age, history of hepatic virus infection, AFP, etc., were also used in MVI predictive model development in other studies [10,12,16,29]. Radiological features, such as MTD, number of tumors, tumor margin and internal arteries, were obtained from image analysis performed by experienced radiologists who were blinded to the pathological and clinical data and the MVI status [12,15]. Meanwhile, radiomics is defined as a quantitative high-throughput method for converting medical images into high-dimensional data sets for feature extraction [30,31]. It has gained increasing attention in recent years because of its capability to decode tissue pathology and unveil "hidden" information that may be invisible to radiologists' or clinicians' eyes [32,33]. However, the application of both radiological and radiomics features is limited by the manual and tedious tumor contouring required before feature extraction, which increases radiologists' workload. Instead of using radiological and/or radiomics features, this study used CT images with simple tumor labeling and without the need for manual tumor contouring (refer to step 1 in Fig. 1). As a consequence, this not only saves time but also reduces bias arising from human selection and variability in experience.
Comparison to other studies

Model performance
Having high sensitivity in the preoperative assessment of MVI is actually more essential and critical than having high specificity (i.e. the proportion of MVI negatives that are correctly identified), because correctly identifying MVI-positive patients would help them receive more suitable surgical treatment, e.g. LT [34], and could even increase their subsequent survival [35,36]. This would also assist surgeons in deciding whether a larger resection area around the lesion should be removed to clear as much cancerous tissue as possible, and might thus aid in reducing HCC recurrence in MVI-positive patients.
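The sensitivity/specificity trade-off discussed above can be made concrete with the standard confusion-matrix definitions. The counts below are purely illustrative, not taken from the study's cohorts:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity (recall of MVI-positive patients) and specificity
    (recall of MVI-negative patients) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of MVI-positives correctly flagged
    specificity = tn / (tn + fp)  # fraction of MVI-negatives correctly cleared
    return sensitivity, specificity

# Hypothetical counts for illustration only:
sens, spec = sensitivity_specificity(tp=40, fn=10, tn=45, fp=15)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
# sensitivity = 40/50 = 0.80, specificity = 45/60 = 0.75
```

In this setting a false negative (an MVI-positive patient cleared by the model) is the costlier error, which is why the text prioritizes sensitivity.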

Phase of CT images
In addition to model performance, the phase of CT images used in this study was also different from that of the other two studies. One CT slice at a time from the volume was used as the input during ResNet-18 model training in this study. As a CT slice is two-dimensional (2D), our developed ResNet-18 model is a 2D-CNN model. Although hepatic tumors are 3D and it would be reasonable to use 3D volume samples as inputs for 3D-CNN classification tasks, such as malignant tumor identification or MVI detection, a 3D-CNN model is substantially larger than a 2D-CNN model and requires more complicated preprocessing steps [37]. A 3D-CNN model not only has far more parameters, it also requires more training data, training time, and storage space [38]. Furthermore, to elevate the model's learning ability, more advanced computing hardware is essential to support the whole training process. The ResNet-18 model for AP images and CFs, which is a 2D-CNN model, was compared to a 3D-CNN model in terms of speed and memory usage (Table 6). The running speeds of these two CNN models were tested separately on an NVIDIA GeForce GTX 1650 graphics processing unit (GPU) with 4 GB video RAM, an Intel Core i7-9700 central processing unit (CPU1), or an Intel Core i5-8265 central processing unit (CPU2). According to Fig. 1 in Jiang et al. (2021), 3 volumes of 16x64x64 CT slices were inputted into their 3D-CNN model. We imitated this by feeding 3D data of similar size as inputs to a 3D-CNN model. To perform a fair comparison, our ResNet-18 model was given a batch of 16 CT slices as input at a time. We also tested our ResNet-18 model with the average and maximal number of slices per patient in our data set as the input sizes; these serve as a better basis for a patient-level decision. We repeated the same run on GPU or CPU, respectively, 100 times to obtain an averaged running speed.
For 16 slices and the average number of slices per patient, the ResNet-18 model completed the run faster than the 3D-CNN model on both GPU and CPU. For the maximal number of slices per patient, the ResNet-18 model completed the run faster than the 3D-CNN model on GPU. In addition, the ResNet-18 model had an approximately 50% smaller memory footprint than the 3D-CNN model. As a result, a 2D-CNN model offers markedly higher efficiency and lower memory requirements than a 3D-CNN model. When the accuracy and AUC of a 2D-CNN model fall within an acceptable range, it would be more practical, and have higher potential than a 3D-CNN model, to be deployed as integrated diagnostic software or embedded in an HCC tumor scanning system for preoperative MVI prediction.
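The averaged-runtime protocol described above (repeating each run 100 times and reporting the mean) can be sketched with a simple timing harness. The two forward-pass functions below are lightweight stand-ins, not the actual ResNet-18 and 3D-CNN models from Table 6; the harness itself mirrors the repeat-and-average measurement only.

```python
import time

def mean_runtime(fn, n_runs=100):
    """Average wall-clock time of fn over n_runs repetitions,
    mirroring the 100-run averaging protocol described in the text."""
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs

# Stand-ins for the real forward passes (hypothetical workloads):
fake_2d_forward = lambda: sum(range(1000))   # batch of 16 AP slices
fake_3d_forward = lambda: sum(range(5000))   # 3 volumes of 16x64x64

t2d = mean_runtime(fake_2d_forward)
t3d = mean_runtime(fake_3d_forward)
print(f"2D: {t2d * 1e6:.1f} us/run, 3D: {t3d * 1e6:.1f} us/run")
```

Averaging over many repetitions smooths out transient effects (cache warm-up, background load), which is why a single timed run would not be a reliable basis for the 2D-vs-3D comparison.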

Limitations and future research
In this study, external validation of the MVI predictive models was performed using CT images of AP taken at other medical centers. Variation in CT acquisition parameters, e.g. the injected dosage and flow rate of iohexol or iodixanol and the timing of image scanning, as well as a lack of harmonization in the types of CT scanners, was observed across the medical centers. Since the same scanning criterion for AP is applied at all centers, AP images obtained at different centers have higher consistency in terms of HU density. Hence, external validation could reasonably be achieved by using AP images as inputs to our MVI predictive models, and the problem of data drift, which may degrade a model's predictive performance, could thus be mitigated. However, the sample size of the data set used for model development in this study was comparable to, and not dramatically larger than, those of other studies. Therefore, the predictive performance, e.g. AUC or accuracy, of our models could be improved by increasing the size of the training set. In addition to data size, manual labeling of ROIs on CT images was still required and was performed by only one radiologist in this study; labeling by other radiologists is needed to confirm that the results are reproducible. On the other hand, resizing smaller ROIs extracted from CT images might have introduced noise in the spatial domain. However, using an input size (256x256) larger than the tumor ROIs was necessary because the pre-trained ResNet-18 accepts only certain input sizes. The use of 2D slices in this study may also limit the amount of information available to the model for MVI prediction.
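The resizing step discussed above, upsampling a tumor ROI smaller than 256x256 to the network's fixed input size, can be sketched as follows. This is a minimal nearest-neighbour illustration in plain NumPy, not the interpolation routine actually used in the study's pipeline; note how each original pixel is replicated, which is one way such upsampling can introduce spatial-domain artifacts.

```python
import numpy as np

def resize_nearest(roi, size=256):
    """Upsample a square ROI to size x size by nearest-neighbour
    index replication, a minimal stand-in for the resizing applied
    before feeding a fixed-input-size network such as ResNet-18."""
    h, w = roi.shape
    rows = np.arange(size) * h // size   # map output rows to source rows
    cols = np.arange(size) * w // size   # map output cols to source cols
    return roi[np.ix_(rows, cols)]

roi = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)  # toy 64x64 ROI
img = resize_nearest(roi)
print(img.shape)  # (256, 256)
```

Here each source pixel is duplicated into a 4x4 block; smoother interpolation (e.g. bilinear) reduces blockiness but still adds no true high-frequency information, which is the noise concern raised above.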