Patients
This exploratory study retrospectively analyzed data from patients with cervical or uterine malignancies at a tertiary referral center between July 2010 and January 2018. The Institutional Review Board approved the study, and informed consent was waived. The experiments used two cohorts of patients: (1) the cervical dataset, which served as the source domain for the pretrained model and had been used to establish the tumor segmentation model in a previous report [15]; it comprised 144 patients with cervical cancer for training and 25 patients for testing; and (2) the uterine dataset, which served as the target domain for the transfer learning experiments.
The inclusion criteria for patients in the uterine dataset were as follows: (a) female sex, (b) age of 20–80 years, and (c) clinical diagnosis of uterine malignancy. The exclusion criteria were as follows: (a) contraindication to MRI due to a cardiac pacemaker or cochlear implant; (b) history of major pelvic surgery, total hip replacement, or magnetic substance implantation in the pelvis; (c) significant major systemic disease, such as renal failure, heart failure, stroke, acute myocardial infarction/unstable angina, poorly controlled diabetes mellitus, or poorly controlled hypertension; and (d) pregnancy or breastfeeding.
Of 345 consecutive patients enrolled, we excluded 16 patients who had no visible tumors and nine patients with susceptibility artifacts on DW imaging. Thus, the data of 320 patients in the uterine dataset were included in the final analysis (Fig. 2). Among them, 256 patients (80%) were randomly assigned to the training dataset, and the remaining 64 patients (20%) formed the testing dataset. All data were exported anonymously.
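The 80/20 allocation described above can be illustrated with a minimal sketch of a patient-level random split. The function name, the integer patient identifiers, and the fixed seed are assumptions made for illustration; the randomization procedure actually used in the study is not described in the text.

```python
import random

def split_patients(patient_ids, test_fraction=0.2, seed=0):
    """Randomly split patient IDs into (train, test) lists at the patient level."""
    ids = sorted(patient_ids)          # fixed starting order for reproducibility
    rng = random.Random(seed)          # hypothetical seed, not from the study
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]

# 320 hypothetical patient identifiers standing in for the uterine dataset
train_ids, test_ids = split_patients(range(320))
print(len(train_ids), len(test_ids))
```

Splitting by patient rather than by image keeps all slices of one patient on the same side of the split, so the testing dataset stays independent of the training dataset.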
MRI data and image annotation
MRI studies were performed using two MRI scanners: Skyra (n = 248) and Trio TIM (n = 72) (Siemens Healthineers). All patients underwent the standard MR protocol of Chang Gung Memorial Hospitals, which follows the European Society of Urogenital Radiology guidelines for female pelvic imaging [16]. The imaging protocol included T1-weighted, T2-weighted, DW, and contrast-enhanced T1-weighted images acquired in the sagittal and axial planes. DW imaging used single-shot echo planar imaging with b-values of 0 and 1000 s/mm² to generate the ADC map (repetition time/echo time, 3700–8200 ms/65–85 ms; slice thickness/interval, 4 mm/1 mm; field of view, 200 × 200 mm; matrix, 256 × 256). Between 14 and 22 slices were acquired per patient to cover the whole tumor. The sagittal DW images and the corresponding ADC maps of each slice were used as input sources for training and testing.
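For readers unfamiliar with how an ADC map relates to the two b-value images, the following is a minimal sketch that assumes the standard mono-exponential diffusion model, S(b) = S(0) · exp(−b · ADC). The text does not state how the ADC maps were computed (they are typically generated on the scanner console), so the function name and the clipping constant are illustrative assumptions.

```python
import numpy as np

def adc_map(s0, s1000, b=1000.0, eps=1e-6):
    """Voxel-wise ADC (mm^2/s) from b = 0 and b = 1000 s/mm^2 images.

    Assumes the mono-exponential model S(b) = S(0) * exp(-b * ADC).
    """
    s0 = np.clip(np.asarray(s0, dtype=float), eps, None)        # avoid division by zero
    s1000 = np.clip(np.asarray(s1000, dtype=float), eps, None)  # avoid log(0)
    return -np.log(s1000 / s0) / b

# Synthetic example: a uniform tissue with ADC = 1.0e-3 mm^2/s
s0 = np.full((4, 4), 1000.0)
s1000 = s0 * np.exp(-1000.0 * 1.0e-3)
adc = adc_map(s0, s1000)
```

Noise can make the b = 1000 signal exceed the b = 0 signal in some voxels, which yields negative ADC values under this model; how such voxels are handled is a separate design choice not covered here.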
Regions of interest (ROIs) of tumor contours were delineated by consensus of two gynecologic radiologists (Y.L.H. and G.L., with 7 and 13 years of experience in gynecology, respectively) using an in-house interface developed in MATLAB (MathWorks, Natick, MA, USA). Both readers were blinded to clinical outcomes. The ROIs were drawn to avoid the adjacent normal endometrium and myometrium, and the normal cervical stroma was excluded when assessing cervical invasion. The labeled ROIs were used as the ground truth for model training.
Network and training
In an optimization study, we compared the performance of the U-Net and DeepLab V3+ architectures for tumor segmentation in cervical cancer. The DeepLab V3+ architecture was adopted because it produced higher preliminary accuracies (Additional file 1). The DW images with b-values of 0 and 1000 s/mm² and the ADC maps were used as three-channel input for training. Xception was used as the backbone (first 356 layers) of the DeepLab V3+ network. The networks were trained from randomly initialized weights using the Adam stochastic optimization method [17]. The signal intensities of all images were normalized to a mean of 0 and a standard deviation of 1 [18]. We implemented data augmentation on each training image set so that the amount of image data increased sixfold (the original image plus rotations of 20°, −20°, 60°, and −60° and a horizontal flip). Finally, 10,164 images from the 256 patients in the uterine training dataset were used for training. The learning rate was 0.001, the batch size was 2, and training ran for 100 epochs until convergence. The network was implemented in Python 3.5.4 using Keras 2.1.4 and TensorFlow 1.5.0. The code for the DeepLab V3+ model is available at https://github.com/bonlime/keras-deeplab-v3-plus.
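The preprocessing and augmentation steps can be sketched as follows. This is an illustration under stated assumptions, not the study's pipeline: normalization is assumed to be applied per image and per channel, rotations are assumed to keep the original image size, and the function names and interpolation settings are hypothetical.

```python
import numpy as np
from scipy import ndimage

ANGLES = (20, -20, 60, -60)  # rotation angles in degrees, as listed in the text

def zscore(channel):
    """Normalize one channel to mean 0 and standard deviation 1 (assumed per image)."""
    channel = np.asarray(channel, dtype=float)
    std = channel.std()
    centered = channel - channel.mean()
    return centered / std if std > 0 else centered

def make_input(b0, b1000, adc):
    """Stack the normalized b0, b1000, and ADC slices into an (H, W, 3) array."""
    return np.stack([zscore(b0), zscore(b1000), zscore(adc)], axis=-1)

def augment(image, mask):
    """Return six (image, mask) pairs: original, four rotations, horizontal flip."""
    pairs = [(image, mask)]
    for angle in ANGLES:
        rot_img = ndimage.rotate(image, angle, axes=(0, 1), reshape=False, order=1)
        # Nearest-neighbour interpolation keeps the mask binary.
        rot_mask = ndimage.rotate(mask, angle, axes=(0, 1), reshape=False, order=0)
        pairs.append((rot_img, rot_mask))
    pairs.append((image[:, ::-1], mask[:, ::-1]))  # flip along the width axis
    return pairs

# Synthetic slices standing in for one sagittal DW/ADC slice and its ROI
rng = np.random.default_rng(0)
b0, b1000, adc = (rng.random((32, 32)) for _ in range(3))
mask = np.zeros((32, 32), dtype=np.uint8)
mask[10:20, 12:22] = 1
pairs = augment(make_input(b0, b1000, adc), mask)
```

Applying the same geometric transform to the image and its mask keeps the ground-truth contour aligned with the tumor after augmentation.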
Model experiments
The pretrained model was established using DeepLab V3+ on the cervical dataset (n = 144). We evaluated three strategies for model training and prediction: (a) UT-only model: training from scratch using the uterine dataset without TL from the cervical model; (b) TL model: starting from the pretrained cervical model and fine-tuning certain layers using the uterine dataset; and (c) Aggregated model: training from scratch using the combined cervical and uterine datasets. The aggregated model was proposed to test generalization across both cervical and uterine cancers.
To investigate the effect of freezing/tuning layers on TL performance, we examined three cutoff layers in the TL model. The layers before the cutoff layer were frozen, whereas those after it were fine-tuned on the target domain data (Fig. 1). (a) L1: the first layer following the Xception model of the encoder. This was to retain the low-level features learned in the source domain and retrain the high-level features from the target domain. (b) L2: a deep layer following the Atrous Spatial Pyramid Pooling at the end of the encoder. This was to retain the low-level features and most of the high-level features of the encoder in the source domain and retrain the last layer in the encoder. (c) L3: the layer at decoder initiation. This was to retain all the extracted features of the source domain in the encoder and retrain from the start of the decoder.
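The freeze/fine-tune scheme can be sketched with a small helper that works on any model exposing a Keras-style `layers` list whose items have `name` and `trainable` attributes. The helper name and the layer names in the toy example are hypothetical, and treating the cutoff layer itself as trainable is an assumption, since the text only specifies the layers before and after it.

```python
from types import SimpleNamespace

def freeze_before(model, cutoff_layer_name):
    """Freeze all layers listed before the cutoff layer; leave the rest trainable.

    Returns the index of the cutoff layer in model.layers.
    """
    names = [layer.name for layer in model.layers]
    if cutoff_layer_name not in names:
        raise ValueError("unknown layer: %s" % cutoff_layer_name)
    cutoff = names.index(cutoff_layer_name)
    for index, layer in enumerate(model.layers):
        # Assumption: the cutoff layer itself is fine-tuned.
        layer.trainable = index >= cutoff
    return cutoff

# Toy stand-in for a network; layer names are illustrative only.
toy_model = SimpleNamespace(layers=[
    SimpleNamespace(name=name, trainable=True)
    for name in ("xception_exit", "aspp_projection", "decoder_start", "logits")
])
cutoff_index = freeze_before(toy_model, "decoder_start")
print([(layer.name, layer.trainable) for layer in toy_model.layers])
```

With a real Keras model, the model must be compiled again after changing `trainable` for the change to take effect. In a branched network such as DeepLab V3+, the order of `model.layers` does not strictly follow one path through the graph, so the cutoff index should be checked against the architecture.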
To assess the influence of data size on training performance, we examined different training data sizes by randomly sampling 2% (n = 5), 5% (n = 13), 10% (n = 26), 20% (n = 51), 50% (n = 128), and 100% (n = 256) of the patients in the uterine training dataset (Fig. 2). The independent testing dataset, comprising patients with uterine cancer (n = 64) and cervical cancer (n = 25), was used to test the performance of each group.
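The subset sizes quoted above follow from rounding each fraction of the 256 training patients to the nearest integer, as the sketch below shows. Drawing each subset independently (rather than nesting smaller subsets inside larger ones) is an assumption, as are the function name and seed.

```python
import random

FRACTIONS = (0.02, 0.05, 0.10, 0.20, 0.50, 1.00)

def draw_training_subsets(train_ids, fractions=FRACTIONS, seed=0):
    """Randomly sample a patient subset for each training-size fraction."""
    train_ids = list(train_ids)
    rng = random.Random(seed)  # hypothetical seed, not from the study
    subsets = {}
    for fraction in fractions:
        n_patients = round(len(train_ids) * fraction)
        subsets[fraction] = rng.sample(train_ids, n_patients)
    return subsets

subsets = draw_training_subsets(range(256))
print({fraction: len(ids) for fraction, ids in subsets.items()})
```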
Evaluation of model performances
The accuracy of segmentation was estimated using the Dice similarity coefficient (DSC) [19], defined as \(D(X, Y) = \frac{2\left| X \cap Y \right|}{\left| X \right| + \left| Y \right|}\), where X and Y denote the predicted segmentation and the ground truth, respectively, and \(\left| \cdot \right|\) denotes the number of voxels in a region. The trained model with the highest DSC in each group was selected as the final model for prediction in the testing dataset.
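A minimal sketch of the DSC for binary masks is given below. Returning 1.0 when both masks are empty is a convention assumed here; the text does not say how empty masks were treated.

```python
import numpy as np

def dice_coefficient(prediction, ground_truth):
    """Dice similarity coefficient between two binary masks of equal shape."""
    x = np.asarray(prediction, dtype=bool)
    y = np.asarray(ground_truth, dtype=bool)
    if x.shape != y.shape:
        raise ValueError("masks must have the same shape")
    total = x.sum() + y.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(x, y).sum() / total

prediction = np.array([[1, 1], [0, 0]])
ground_truth = np.array([[1, 0], [0, 0]])
dsc = dice_coefficient(prediction, ground_truth)  # 2 * 1 / (2 + 1)
```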
Extraction of ADC radiomics
To assess the reliability of the ROIs predicted by the established models, we compared the radiomics features of the ADC values within tumor ROIs obtained by manual segmentation and by the automatic segmentation models. The 14 first-order radiomics features of tumors were calculated using PyRadiomics software [20] based on the 3D volumes of ROIs on ADC images.
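To make concrete what first-order features of an ADC volume are, the sketch below computes a few of them directly with NumPy over the voxels inside a 3D ROI. It is an illustrative subset with hypothetical names, not the PyRadiomics implementation and not the full set of 14 features used in the study.

```python
import numpy as np

def first_order_features(adc_volume, roi_mask):
    """Compute a few first-order statistics of ADC values inside a 3D ROI."""
    values = np.asarray(adc_volume, dtype=float)[np.asarray(roi_mask, dtype=bool)]
    if values.size == 0:
        raise ValueError("ROI mask selects no voxels")
    mean = values.mean()
    centered = values - mean
    variance = (centered ** 2).mean()          # population variance
    std = np.sqrt(variance)
    skewness = (centered ** 3).mean() / std ** 3 if std > 0 else 0.0
    return {
        "mean": mean,
        "median": np.median(values),
        "minimum": values.min(),
        "maximum": values.max(),
        "range": values.max() - values.min(),
        "p10": np.percentile(values, 10),
        "p90": np.percentile(values, 90),
        "variance": variance,
        "skewness": skewness,
    }

# Toy 2 x 2 x 2 volume; the ROI selects the voxels with values 5 to 8.
volume = np.arange(1, 9, dtype=float).reshape(2, 2, 2)
features = first_order_features(volume, volume > 4)
```

Computing the same dictionary once from the manual ROI and once from a predicted ROI gives the paired feature values whose agreement is then assessed.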
Statistics
Statistical analysis was performed using GraphPad Prism software version 8.0 for Mac (GraphPad Software, San Diego, CA, USA). Differences in DSC among the trained models were assessed using analysis of variance (ANOVA) with Tukey’s post hoc analysis. The stability of each model was assessed through k-fold cross-validation, using ANOVA on the DSCs between the labeled and predicted ROIs of each trained model. The reliability of the radiomics features of tumor ROIs was evaluated using the intraclass correlation coefficient (ICC) between manual segmentation and the automatic segmentation models.
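As an illustration of the reliability analysis, the sketch below computes one common form of the ICC from a table with one row per patient and one column per segmentation method. The text does not specify which ICC form was used; the two-way random-effects, absolute-agreement, single-measurement form, often written ICC(2,1), is an assumption here, as is the function name.

```python
import numpy as np

def icc_2_1(measurements):
    """ICC(2,1) for an (n_subjects, k_methods) array of feature values."""
    x = np.asarray(measurements, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand_mean) ** 2).sum()  # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand_mean) ** 2).sum()  # between methods
    ss_error = ((x - grand_mean) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Column 0: feature from manual ROIs; column 1: same feature from predicted ROIs.
identical = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
offset = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
icc_identical = icc_2_1(identical)
icc_offset = icc_2_1(offset)
```

Because this form measures absolute agreement, a constant offset between the two methods lowers the ICC even when the values are perfectly correlated, which is the behavior of interest when checking whether automatic ROIs reproduce manual feature values.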