Machine learning-based CT radiomics approach for predicting WHO/ISUP nuclear grade of clear cell renal cell carcinoma: an exploratory and comparative study

Purpose To investigate the predictive performance of machine learning-based CT radiomics for differentiating between low- and high-nuclear grade clear cell renal cell carcinomas (CCRCCs). Methods This retrospective study enrolled 406 patients with pathologically confirmed low- and high-nuclear grade CCRCCs according to the WHO/ISUP grading system, who were divided into training and testing cohorts. Radiomics features were extracted from nephrographic-phase CT images using PyRadiomics. A support vector machine (SVM) was combined with each of three feature selection algorithms, namely least absolute shrinkage and selection operator (LASSO), recursive feature elimination (RFE), and ReliefF, to determine the most suitable classification model. Clinicoradiological, radiomics, and combined models were constructed using the radiological and clinical characteristics with significant differences between the groups, selected radiomics features, and a combination of both, respectively. Model performance was evaluated by receiver operating characteristic (ROC) curve, calibration curve, and decision curve analyses. Results The SVM-ReliefF algorithm outperformed SVM-LASSO and SVM-RFE in distinguishing low- from high-grade CCRCCs. The combined model showed better prediction performance than the clinicoradiological and radiomics models (p < 0.05, DeLong test), achieving the highest efficacy, with area under the ROC curve (AUC) values of 0.887 (95% confidence interval [CI] 0.798–0.952), 0.859 (95% CI 0.748–0.935), and 0.828 (95% CI 0.731–0.929) in the training, validation, and testing cohorts, respectively. The calibration and decision curves also indicated the favorable performance of the combined model. Conclusion A combined model incorporating the radiomics features and clinicoradiological characteristics can better predict the WHO/ISUP nuclear grade of CCRCC preoperatively, thus providing effective and noninvasive assessment.
Supplementary Information The online version contains supplementary material available at 10.1186/s13244-021-01107-1.


Introduction
Renal cell carcinoma accounts for 5% and 3% of all diagnosed cancers in men and women, respectively, and clear cell renal cell carcinoma (CCRCC) represents the most common subtype (∼ 80%) [1][2][3]. Because CCRCC has a relatively poor prognosis and its biological aggressiveness significantly affects outcomes, there is great interest in improving diagnostic accuracy so that antineoplastic protocols can be started at an early stage [4]. The pathological nuclear grade is an independent prognostic factor for CCRCC [5,6]. Although the four-tiered Fuhrman grading system (FGS) was previously widely used for the pathological classification of CCRCC, the 2016 World Health Organization/International Society of Urological Pathology (WHO/ISUP) grading system has achieved widespread usage and has now replaced the FGS globally [7,8]. This system can be simplified into a two-tiered classification that combines grades I and II as low-grade and grades III and IV as high-grade. Low-grade cancers are generally considered less aggressive than high-grade ones [9]. The two-tiered classification has been verified to predict cancer-specific mortality and guide clinical practice in the same way as four-tiered systems, while reducing inter-observer variability and facilitating clinical practice [10,11].
Percutaneous biopsy is a common method for identifying the pathology of lesions, but it remains controversial because it is invasive, is subject to sampling bias, and may increase the risk of complications [12,13]. Moreover, tumor heterogeneity refers to the existence of different subpopulations of cells, which can show distinct genotypes and divergent biological behaviors in different parts of a tumor. Because it is impractical to biopsy every part of an entire tumor, a noninvasive approach that can provide more information about lesions without the spatial and temporal restrictions of tissue sampling is urgently needed [14].
Although routine computed tomography (CT) is a standard noninvasive method for detecting CCRCC, it has limited power to differentiate renal cancer histologic grade with high consistency and accuracy [15]. Because resecting radiographically suspicious CCRCC without a tissue diagnosis is recommended, which may lead to overtreatment in patients with low-grade CCRCC [16,17], a noninvasive method for preoperatively differentiating between low- and high-nuclear grade CCRCCs is urgently needed. Radiomics analysis enables the measurement of repetitive texture patterns at the voxel or pixel level of medical images that are beyond identification by the naked eye [18][19][20]. Previous investigations have shown that CT-based radiomics analysis performed efficiently in differentiating between low- and high-grade CCRCCs [21][22][23][24] and might therefore be a promising noninvasive assessment for predicting the nuclear grade of CCRCC. To our knowledge, most studies constructed machine learning (ML) models using only radiomics features extracted from CT images rather than a comprehensive model combining those features with clinicoradiological information. Furthermore, no previous studies have evaluated the performance of nephrographic-phase (NP) CT radiomics analysis for predicting the nuclear grade of CCRCC. Therefore, this study aims to investigate whether radiomics features extracted from NP CT images, combined with clinicoradiological characteristics, have potential in preoperatively differentiating the WHO/ISUP nuclear grade of CCRCC.

Patient cohort
This retrospective study was approved by the Institutional Review Board of the First Affiliated Hospital of Chongqing Medical University, and the requirement for informed consent from patients was waived. The initial query yielded a target population of 808 patients with pathologically confirmed CCRCC who underwent partial or radical nephrectomy between January 2013 and October 2020 in our institution. After applying the following exclusion criteria, a total of 406 patients with 330 low-grade and 76 high-grade CCRCCs were included in this study: (1) pathology grade that was not classified according to the WHO/ISUP grading system (n = 243); (2) absence of NP CT images (n = 117); (3) images with poor definition or severe artifacts (n = 31); (4) a history of radiotherapy or chemotherapy before surgery (n = 10); and (5) radiomics features that could not be extracted due to an undersized tumor volume (n = 1). The flowchart of this study is presented in Fig. 1. Moreover, the synthetic minority oversampling technique was used to oversample the high-grade CCRCC cases and balance the data [25].
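The oversampling step mentioned above can be illustrated with a minimal sketch of the core idea behind the synthetic minority oversampling technique: a new minority-class sample is placed at a random position on the segment between an existing minority sample and one of its nearest minority neighbours. The function name `smote`, the toy feature vectors, and the parameters `k` and `seed` are assumptions for illustration only; the source does not state which implementation or settings were used, nor whether oversampling was restricted to the training data (a common precaution against information leakage).

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature vectors standing in for high-grade cases.
high_grade = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_cases = smote(high_grade, n_new=6, k=2)
print(len(new_cases))
```

Because every synthetic point is a convex combination of two real minority samples, it always stays inside the range already spanned by the minority class.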

Nuclear grade and clinical characteristics
Two independent histopathological specialists re-evaluated each CCRCC sample regarding nuclear grade based on the criteria of the 2016 WHO/ISUP classification [8]. Discordant reports were resolved by a third senior histopathologist. Four hematoxylin-eosin-stained slides at different magnifications from four patients with WHO/ISUP grade I-IV CCRCCs are shown in Additional file 1: Figure S1. Data on clinical characteristics presumed to be potentially correlated with grade (age, sex, body mass index [BMI], smoking history, hypertension history, diabetes history, tumor location, resection surgical procedure, etc.) were extracted from the electronic medical record system of our institution.

CT acquisition
All patients underwent a routine preoperative abdominal CT scan performed on a GE Discovery 750 HD (GE Healthcare, Milwaukee, WI) multidetector scanner. The parameters for CT imaging were as follows: tube voltage, 120-140 kV; tube current, 220-300 mAs; detector collimation, 0.625×64 mm; matrix, 512 × 512; slice thickness, 5 mm. All patients were injected with nonionic intravenous contrast agent, via the antecubital vein with mechanical power injector, according to their weight (1 mL/kg body weight, with a maximum of 150 mL).

Image analysis
The semantic annotations of CT images and the corresponding diagnostic criteria were as follows: (a) tumor size, defined as the maximum diameter on transverse images; (b) intratumoral necrosis, defined as the nonenhanced fluid region of the tumor, which was greater than 50% of the tumor [27]; (c) cystic degeneration, defined as a target lesion showing uniform water density and signal intensity, but no enhancement on contrast-enhanced examination [28]; (d) intratumoral calcification, interpreted as obvious dense shadows in the parenchyma that were speckled, lined, or shell-shaped; (e) violation of the renal capsule, interpreted as an abnormal lesion violating the margin of the renal capsule; (f) intratumoral angiogenesis, defined as vascular enhancement observed in the parenchyma of the cortical stage tumor [27,29]; (g) venous invasion, interpreted as radiological characteristics of tumor thrombosis in the renal vein and inferior vena cava [27]; (h) perinephric metastasis, defined as perinephric invasion on CT images; and (i) distant metastasis, considered as metastasis in the lung, liver, bone, brain, or other organs via the blood or lymphatics. In our study, two radiologists with 10 or more years of experience in renal imaging who were blinded to histopathological results independently identified and evaluated these characteristics. Any discrepancy was resolved by consensus through discussion, and the agreed results were used for further analysis.
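Agreement between two readers on a binary CT finding, which the Statistical analysis section evaluates with kappa statistics, can be sketched with Cohen's kappa: the observed agreement corrected for the agreement expected by chance. The function name `cohen_kappa` and the reader ratings below are hypothetical and serve only to show the arithmetic.

```python
def cohen_kappa(reader_1, reader_2):
    """Cohen's kappa for two raters assigning categorical labels to the same cases."""
    n = len(reader_1)
    categories = set(reader_1) | set(reader_2)
    # Observed agreement: fraction of cases where both readers gave the same label.
    observed = sum(1 for a, b in zip(reader_1, reader_2) if a == b) / n
    # Chance agreement: product of the readers' marginal label frequencies, summed over labels.
    expected = sum(
        (reader_1.count(c) / n) * (reader_2.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)


# Hypothetical presence (1) / absence (0) calls for one finding in ten patients.
reader_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
reader_2 = [1, 1, 0, 0, 0, 0, 1, 0, 0, 1]
kappa = cohen_kappa(reader_1, reader_2)
print(round(kappa, 3))
```

In this toy case the readers agree on 8 of 10 patients, but because chance agreement is already 0.52, kappa is noticeably lower than the raw agreement.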

Tumor segmentation
All CT images were downloaded in DICOM format from the picture archiving and communication system (Carestream, Canada) at their original dimensions and resolution and loaded into ITK-SNAP software version 3.8 [30]. A radiologist with ≥ 10 years of experience in abdominal imaging who was blinded to the pathological results (reader 1) manually delineated the regions of interest (ROIs) in a slice-by-slice manner (Fig. 2).
To evaluate the reproducibility of radiomics features, ROI-based radiomics features of 30 randomly selected patients (from the whole study cohort) were re-extracted by reader 1 and by another radiologist with 15 years of experience (reader 2). Thereafter, the intraclass correlation coefficient (ICC) values of both intra- and inter-observer agreement analyses were calculated to evaluate the consistency and reproducibility of feature extraction, and features with ICC values > 0.80 were included in the subsequent analysis. Inter-observer variation refers to the discrepancy between the results obtained by two or more observers performing the same ROI delineation. Intra-observer variation refers to the discrepancy in the measurements of one observer when repeating the same delineation more than once.
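The reproducibility filter described above can be sketched as follows. The source does not state which ICC form was used, so this example assumes the two-way random-effects, absolute-agreement, single-measurement form, often written ICC(2,1); the function name `icc_2_1` and the feature values are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per subject and one column per rater."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical values of two features, each measured by two readers on four patients.
feature_values = {
    "feature_a": [[10.1, 10.3], [12.0, 11.8], [9.5, 9.6], [14.2, 14.0]],
    "feature_b": [[1.0, 3.0], [2.0, 1.0], [3.0, 2.5], [1.5, 3.5]],
}
stable = [name for name, table in feature_values.items() if icc_2_1(table) > 0.80]
print(stable)
```

The same function applies to intra-observer agreement by treating the two delineation sessions of one reader as the two columns.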

Radiomics feature extraction
All images were preprocessed before radiomics feature extraction as follows: first, the images and ROIs were resampled to an isotropic voxel size of 1 × 1 × 1 mm³ using B-spline interpolation; second, we focused on the chosen region and divided by the standard deviation to normalize the images; third, the gray levels of the image were discretized using a fixed bin width of 25 in the histogram. The open-source PyRadiomics library [31] was employed to extract radiomics features, which were divided into the following three subgroups: (1) descriptors of the size and shape of the ROI, such as the volume and maximum surface, compactness, and sphericity of the tumor; (2) first-order statistics features, such as the mean, median, maximum, and minimum values, that described the distribution of voxel intensities within the tumor; and (3) second- and higher-order statistics features (texture features) that reflected changes in the gray levels of image space and were used to measure the inter-relationships between voxel distributions within the tumor. The texture features included gray-level co-occurrence matrix, gray-level run-length matrix, gray-level size-zone matrix, and gray-level dependence matrix features.
Fig. 2 ROI delineation on two patients with low-grade CCRCC and on two patients with high-grade CCRCC (c, d)
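The normalization and discretization steps described above can be sketched on a plain list of intensities. Two assumptions are made here: the normalization is read as a z-score (centering on the mean before dividing by the standard deviation), which the source wording does not fully specify, and the fixed-bin-width rule follows the commonly documented form floor(x / w) - floor(min / w) + 1. The function names and intensity values are hypothetical, and the two steps are shown independently rather than chained.

```python
import math
import statistics


def zscore(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def discretize_fixed_bin_width(values, bin_width=25):
    """Map intensities to integer gray levels starting at 1, using a fixed bin width."""
    lowest_bin = math.floor(min(values) / bin_width)
    return [math.floor(v / bin_width) - lowest_bin + 1 for v in values]


# Hypothetical voxel intensities inside an ROI.
roi_intensities = [0, 24, 25, 49, 50, 130]
gray_levels = discretize_fixed_bin_width(roi_intensities, bin_width=25)
normalized = zscore(roi_intensities)
print(gray_levels)
```

A fixed bin width keeps the meaning of one gray level constant across patients, whereas a fixed bin count would rescale it to each tumor's intensity range.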

Prediction model construction
The following three models were built to predict the WHO/ISUP grade in this study: clinicoradiological, radiomics, and combined models. To construct the clinicoradiological model, univariate regression was first used to analyze radiological and clinical characteristics, such as sex, age, and intratumoral necrosis. Significant variables were then entered into the multivariate regression model, and variables with a p value < 0.05 were retained. The radiomics-based ML model was constructed using a support vector machine (SVM). To obtain the best prediction performance, three feature selection algorithms, namely least absolute shrinkage and selection operator (LASSO), recursive feature elimination (RFE), and ReliefF, were each employed to select suitable radiomics features, and the features selected by each algorithm were fed into the SVM separately to compare prediction performance. The combined model was constructed and analyzed using the SVM by pooling the selected radiological and clinical characteristics with the radiomics features. All of these procedures were implemented using the scikit-learn library in Python (version 3.6).
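To make the feature-ranking idea behind ReliefF concrete, the following is a simplified sketch for two classes: a feature gains weight when it differs between a sample and its nearest neighbours of the other class (misses) and loses weight when it differs between the sample and its nearest neighbours of the same class (hits). This is not the implementation used in the study, which is not specified beyond the algorithm name; the function name `relieff_weights`, the choice of Manhattan distance, the parameter `k`, and the toy data are all assumptions, and the subsequent SVM training is not shown.

```python
def relieff_weights(X, y, k=3):
    """Simplified binary ReliefF: one weight per feature, higher means more discriminative."""
    n, d = len(X), len(X[0])
    # Scale every feature to [0, 1] so that differences are comparable across features.
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    span = [(hi[j] - lo[j]) or 1.0 for j in range(d)]
    Z = [[(row[j] - lo[j]) / span[j] for j in range(d)] for row in X]

    def dist(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))

    weights = [0.0] * d
    for i in range(n):
        same = [r for r in range(n) if r != i and y[r] == y[i]]
        other = [r for r in range(n) if y[r] != y[i]]
        hits = sorted(same, key=lambda r: dist(Z[i], Z[r]))[:k]
        misses = sorted(other, key=lambda r: dist(Z[i], Z[r]))[:k]
        for j in range(d):
            hit_diff = sum(abs(Z[i][j] - Z[r][j]) for r in hits) / len(hits)
            miss_diff = sum(abs(Z[i][j] - Z[r][j]) for r in misses) / len(misses)
            weights[j] += (miss_diff - hit_diff) / n
    return weights


# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
X = [[0.10, 0.5], [0.20, 0.1], [0.15, 0.9], [0.90, 0.4], [0.80, 0.8], [0.95, 0.2]]
y = [0, 0, 0, 1, 1, 1]
weights = relieff_weights(X, y, k=2)
ranking = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
print(ranking)
```

The top-ranked features would then be passed to the classifier; each class needs at least two samples for the hit search to be defined.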

Model evaluation
Performance metrics, including sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), and area under the receiver operating characteristic (ROC) curve (AUC), were used to evaluate the performance of the three prediction models. The DeLong test was performed as a nonparametric approach for comparing the AUC values of ROC curves. In the testing cohort, calibration curve analysis was used to assess the agreement between the predicted and observed outcomes of the model, accompanied by the Hosmer-Lemeshow test. Furthermore, decision curve analysis (DCA) was conducted to demonstrate the clinical net benefit of the model. The net reclassification index (NRI) was used to evaluate the prediction ability of the model in terms of clinical utility. To minimize perturbation problems in feature selection and to examine the reproducibility of experimental results [32], we randomly assigned the patients to a training or testing cohort 10 times. The split of the original dataset into cohorts was stratified and shuffled to ensure a similar CCRCC nuclear grade distribution across the datasets. Overall, 30% of the data were taken as an independent testing cohort, whereas the rest were used as the training and validation cohorts for the model via fivefold cross-validation. Stratification into the training cohort was performed automatically without user intervention to avoid selection bias. Subsequently, the model was reconstructed and verified repeatedly.
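The repeated stratified splitting described above can be sketched as follows, using the class counts reported in the Patient cohort section (330 low-grade and 76 high-grade cases) before any oversampling. The function name `stratified_split` and the use of the run index as the random seed are assumptions; the fivefold cross-validation applied to the remaining 70% is not shown.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Shuffle indices within each class and hold out the same fraction of every class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# 0 = low-grade, 1 = high-grade, matching the cohort sizes reported in the text.
labels = [0] * 330 + [1] * 76
# Ten independent splits, one per repetition of the experiment.
splits = [stratified_split(labels, test_fraction=0.3, seed=run) for run in range(10)]
train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))
```

Holding out the same fraction of each class keeps the proportion of high-grade cases nearly identical in the training and testing cohorts of every run.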

Statistical analysis
Categorical variables are expressed as counts (n) and percentages (%), whereas continuous variables are presented as mean values ± standard deviations or as medians with interquartile ranges. Differences in characteristics across the three datasets were analyzed using one-way analysis of variance or the Kruskal-Wallis test for normally or non-normally distributed continuous variables, respectively, followed by a post hoc test, as appropriate. Student's t test or the Wilcoxon test was used for the comparison of continuous variables between groups. Categorical variables were compared using the Chi-square test or Fisher's exact test. The inter-observer agreement of CT findings for low- and high-grade CCRCCs between the two radiologists was evaluated using kappa statistics. Forward stepwise regression was used to refine the regression model according to the Akaike information criterion. To correct for multiple comparisons, we adjusted the p values by false discovery rate correction using the Benjamini-Hochberg method [33]. All statistical analyses were performed using R software version 3.5.2 (http://www.rproject.org) with the "pROC", "rms", and "DecisionCurve" packages. A two-tailed p value of < 0.05 was considered statistically significant.
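The Benjamini-Hochberg adjustment mentioned above can be illustrated with a short sketch: each p value is multiplied by the number of tests divided by its rank, and a running minimum from the largest p value downward keeps the adjusted values monotone. The function name `benjamini_hochberg` and the p values are hypothetical; the study itself reports using R for its statistics.

```python
def benjamini_hochberg(p_values):
    """Return false-discovery-rate adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p values from four univariate comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print([round(p, 3) for p in adjusted])
```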

Clinicoradiological characteristics between the low- and high-grade groups
Of the 406 patients enrolled in this study, 240 were male and 166 were female, with an average age of 57.48 ± 12.10 years (range 16-83 years). The baseline clinical characteristics of patients are summarized in Table 1. In the patient cohort, 330 patients were diagnosed with low-grade CCRCC, whereas the rest were diagnosed with high-grade CCRCC. The majority of patients with high-grade tumors (n = 56, 73.7%) underwent radical nephrectomy, whereas most patients with low-grade tumors underwent partial nephrectomy (n = 192, 58.2%, p < 0.001). The high-grade and low-grade groups differed significantly with respect to tumor size (5.82 ± 2.85 cm vs. 4.21 ± 2.02 cm; range: 0.8-14.6 cm, p < 0.001), hematuria symptoms (p = 0.023), distant metastasis (p = 0.035), intratumoral necrosis (p < 0.001), calcification (p < 0.001), violation of the renal capsule (p < 0.001), angiogenesis (p < 0.001), venous invasion (p < 0.001), and perinephric metastasis (p < 0.001). Table 2 shows the differences in clinicoradiological features between patients with low- and high-grade CCRCCs in the training, validation, and testing cohorts. The clinicoradiological characteristics of these cohorts are summarized in Table 3. Except for violation of the renal capsule (p < 0.001), no significant differences in either clinical or radiological features were identified between the different cohorts.

Clinicoradiological model construction
Kappa analysis indicated high inter-observer agreement between the two radiologists on CT findings for low- and high-grade CCRCCs, with kappa values of 0.779-0.923 (Table 4). Based on the results of univariate analysis, the indicators that differed significantly between the high- and low-grade groups, namely tumor size, hematuria symptoms, intratumoral necrosis, calcification, violation of the renal capsule, angiogenesis, venous invasion, and perinephric metastasis, were included in the multivariate analysis.

Radiomics feature extraction and radiomics model construction
A total of 972 features of NP CT images were extracted from the ROIs using the PyRadiomics package, and those with ICC values > 0.8 on both intra- and inter-observer agreement analyses were retained. A dimensionality [...] 0.634-892), SVM-ReliefF was the best performer among the three classifiers. A comparison of the AUCs of the three algorithms in each dataset is displayed in Fig. 4.

Comparison of the performance among clinicoradiological, radiomics, and combined models
As the optimum algorithm of the three classifiers, SVM-ReliefF was chosen to predict the WHO/ISUP nuclear grade of CCRCC by analyzing the features contained in the clinicoradiological, radiomics, and combined models. AUC, sensitivity, specificity, PPV, and NPV were calculated to assess the prediction performance of the models. As shown in Fig. 5a-c, compared with the clinicoradiological and radiomics models, the combined model showed the best predictive efficacy in distinguishing low- from high-grade CCRCCs, with the highest AUC values in the training, validation, and testing cohorts (p < 0.05, DeLong test). The detailed predictive performance of the three models is summarized in Table 6, and the confusion matrices of the combined model in the testing cohort for the 10 random splits are shown in Additional file 1: Figure S2.
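The performance metrics named above can be computed from predicted labels and scores as in the following sketch. The AUC is obtained from its rank interpretation: the probability that a randomly chosen high-grade case receives a higher score than a randomly chosen low-grade case, with ties counted as one half. The function names, the 0.5 cutoff, and the toy labels and scores are hypothetical and unrelated to the values reported in the study.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy, PPV, and NPV for binary labels (1 = high-grade)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = confusion_metrics(y_true, y_pred)
print(metrics, auc(y_true, scores))
```

Unlike the other metrics, the AUC does not depend on the chosen cutoff, which is why it is the quantity compared across models with the DeLong test.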

Clinical usefulness
The calibration curves of the three models for predicting low- and high-nuclear grade in CCRCC are shown in Fig. 6a. The calibration curve for the combined model demonstrated good agreement between observations and predictions in the testing cohort according to the Hosmer-Lemeshow test (p = 0.487, Fig. 6a), followed by the radiomics model (p = 0.321, Fig. 6a). However, observations and predictions differed for the clinicoradiological model in the testing cohort (p = 0.04, Fig. 6a). DCA indicated a higher net benefit for the combined model in distinguishing low- from high-grade CCRCCs than for the other models (Fig. 6b). The threshold probability was within the range of 0.15-0.98. In the testing cohort, both the combined and radiomics models achieved better discrimination performance than the clinicoradiological model (p = 0.010 and 0.021, NRI test). Additionally, the discrimination ability of the combined model was superior to that of the radiomics model (p = 0.038, NRI test).
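The net benefit plotted in a decision curve can be illustrated with the standard definition: true positives per patient minus false positives per patient weighted by the odds of the threshold probability. The sketch below also computes the "treat all" reference strategy. The function names, labels, and predicted probabilities are hypothetical; the study reports using R packages for this analysis.

```python
def net_benefit(y_true, probs, threshold):
    """Net benefit of treating patients whose predicted probability reaches the threshold."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, probs) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, probs) if p >= threshold and t == 0)
    # False positives are penalized by the odds implied by the threshold probability.
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: classify every patient as high-grade."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical outcomes (1 = high-grade) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
nb_model = net_benefit(y_true, probs, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
print(nb_model, nb_all)
```

A model is considered useful at a given threshold when its net benefit exceeds both the "treat all" value and zero, the net benefit of treating no one.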

Discussion
In this study, we used NP CT-based radiomics features and clinicoradiological characteristics to build three models (clinicoradiological, radiomics, and combined) for distinguishing between low- and high-grade CCRCCs. The results demonstrated that NP CT-based radiomics was valuable in predicting the WHO/ISUP nuclear grade of CCRCC, and that combining the radiomics features with clinicoradiological characteristics improved the predictive performance compared with the clinicoradiological and radiomics models alone. The combined model exhibited the best predictive performance and clinical usefulness with satisfactory reproducibility and reliability. Although percutaneous biopsy is the routine way to identify the preoperative pathology grade, it is an invasive approach that is subject to sampling bias and carries a risk of complications [12,13]. Some emerging imaging technologies, such as dual-energy spectral CT, intravoxel incoherent motion imaging, and diffusion kurtosis imaging, could provide valuable information for assessing the pathological grade of CCRCC [34,35]. As a recommended noninvasive detection technology for CCRCC, CT may provide information that improves the accuracy of percutaneous biopsy. CT radiomics, as a burgeoning technique, can quantify tumor heterogeneity from the spatial arrangement of imaging voxels with signal intensity variations and detect imperceptible differences in the intensity distribution of medical images, thus noninvasively predicting the pathological grade of tumors with outstanding performance [36][37][38]. Recently, the WHO/ISUP grading system has taken the place of the former Fuhrman grading system and gained acceptance in current clinical practice [39]. Only a few published papers have studied the application of CT radiomics to predicting the WHO/ISUP nuclear grade of CCRCC [40][41][42][43]. However, no previous studies used radiomics features extracted from NP CT images combined with clinicoradiological characteristics to develop the prediction model.
Most previous studies constructed ML models based only on CT radiomics features, ignoring the importance of traditional clinical and radiological information [26,41,44]. In our study, clinical and radiological parameters identified by the multivariate regression model as potential risk factors for the WHO/ISUP nuclear grade of CCRCC were fed into the ML model, and the radiomics features combined with the clinicoradiological characteristics showed better performance for the discrimination of CCRCC grades. Our result is in concordance with the results of previous studies [22,40,42,45–47] and is reinforced by previous studies on the association between clinicoradiological characteristics and the nuclear grade of CCRCC [22,48]. Xu et al. [49] observed that coagulative necrosis often occurs in the CT images of patients with high-grade CCRCC. In addition, our study found that intratumoral necrosis, calcification, angiogenesis, and perinephric metastasis could be risk factors for the pathological grade of CCRCC. The previously mentioned studies have shown the potential of quantitative CT features in preoperatively predicting the WHO/ISUP nuclear grade of CCRCC, but their sample sizes were relatively small. Our study, with a larger sample size and an independent testing cohort, provides support for the reproducibility of CT radiomics in predicting the WHO/ISUP nuclear grade of CCRCC. Furthermore, to our knowledge, we are the first to demonstrate that radiomics features from NP CT images alone can achieve favorable predictive performance in distinguishing low- from high-grade CCRCCs.
Preoperative noninvasive knowledge of CCRCC grade may contribute to clinical management and influence clinical decisions. The new WHO/ISUP grading system is a prognostic factor for CCRCC, and its grades are strongly related to patient outcomes and tumor biological behavior [50]. If low-grade CCRCC can be identified preoperatively, the treatment may differ: patients with low-grade CCRCC may be candidates for less invasive procedures, such as radiofrequency ablation and nephron-sparing surgery, whereas radical interventions are strongly recommended in patients with high-grade CCRCC [11]. Moreover, partial nephrectomy can preserve part of the renal function, thus reducing rates of infection, overall mortality, and the incidence of cardiovascular disease [51]. In clinical management, patients with low-grade CCRCC are less likely to suffer from paraneoplastic syndrome and distant metastasis, so accurate preoperative prediction of CCRCC grade may reduce unnecessary examinations, such as positron emission tomography-computed tomography and radionuclide imaging, decreasing the economic burden and the incidence of complications resulting from the use of contrast agents. According to the latest update of the European Association of Urology Guidelines on renal cell carcinoma [7], multiphasic contrast-enhanced CT imaging of the abdomen is strongly recommended for the diagnostic assessment and staging of renal tumors in patients with suspected CCRCC. Therefore, medical images can become a valuable source of information, and radiomics may be used as a noninvasive method for characterizing and classifying lesions. Compared with percutaneous biopsy, radiomics has the advantages of being noninvasive, easily repeatable, and free of procedural complications. Our result indicates that combining NP CT-based radiomics and clinicoradiological characteristics provides good predictive performance in distinguishing between patients with low- and high-grade CCRCCs. This could provide a reference for clinicians in choosing a suitable treatment strategy. However, larger prospective studies with multicentric data are necessary to validate the performance of our proposed combined model in the future. Good performance does not always imply a clinically applicable and reliable model [52]; however, we found that most previous studies did not evaluate the clinical utility of their models [21,22]. In our study, we used calibration curve and decision curve analyses to evaluate the three predictive models, which showed that the combined model has higher clinical usefulness, with good agreement between observations and predictions and favorable discrimination performance, thus indicating practical value.
This study has several limitations. First, although 406 subjects with comprehensive data were included, this retrospective study was conducted in a single institution, which may inevitably result in selection bias and limit generalizability to other institutions. Therefore, further studies should enroll larger sample sizes from different centers and scanners to improve the generalization of the prediction model. Moreover, only single-phase CT images were used in this study, and comparison with other phases should be considered. Second, an automatic segmentation algorithm should be developed to replace the manual delineation of ROIs and increase the stability of the prediction model. Third, although we performed calibration and decision curve analyses on the prediction models and found that the combined model had the best discrimination ability, its clinical application should be further validated using larger prospective studies with multicentric data. Fourth, CCRCC is only one subtype of malignant renal tumor. Despite its high occurrence, other renal cancer subtypes could have similar radiological features and therefore should be evaluated in future studies.
In conclusion, we demonstrated that NP CT images could become a valuable source of information, and that radiomics analysis of these images may be used as a noninvasive method for distinguishing low- from high-grade CCRCCs. The ML model combining the radiomics features with clinicoradiological characteristics could improve the predictive performance for the WHO/ISUP nuclear grade of CCRCC, which may be a promising and feasible way to assist in clinical management and therapeutic decisions.