CT and MRI radiomics of bone and soft-tissue sarcomas: a systematic review of reproducibility and validation strategies

Gitto, Salvatore; Cuocolo, Renato; Albano, Domenico; Morelli, Francesco; Pescatori, Lorenzo Carlo; Messina, Carmelo; Imbriaco, Massimo; Sconfienza, Luca Maria

doi:10.1186/s13244-021-01008-3

Original Article
Open access
Published: 02 June 2021

Salvatore Gitto^1 (ORCID: orcid.org/0000-0002-3623-7822),
Renato Cuocolo^2,3,
Domenico Albano^4,5,
Francesco Morelli^6,
Lorenzo Carlo Pescatori^7,
Carmelo Messina^1,4,
Massimo Imbriaco^8 &
Luca Maria Sconfienza^1,4

Insights into Imaging volume 12, Article number: 68 (2021)

Abstract

Background

Feature reproducibility and model validation are two main challenges of radiomics. This study aims to systematically review radiomic feature reproducibility and predictive model validation strategies in studies dealing with CT and MRI radiomics of bone and soft-tissue sarcomas. The ultimate goal is to promote achieving a consensus on these aspects in radiomic workflows and facilitate clinical transferability.

Results

Of 278 identified papers, 49 published between 2008 and 2020 were included. They dealt with radiomics of bone (n = 12) or soft-tissue (n = 37) tumors. Eighteen (37%) studies included a feature reproducibility analysis. Inter-/intra-reader segmentation variability was the theme of reproducibility analysis in 16 (33%) investigations, outnumbering the analyses focused on image acquisition or post-processing (n = 2, 4%). The intraclass correlation coefficient was the most commonly used statistical method to assess reproducibility, with thresholds for reproducible features ranging from 0.6 to 0.9. At least one machine learning validation technique was used for model development in 25 (51%) papers, and K-fold cross-validation was the most commonly employed. A clinical validation of the model was reported in 19 (39%) papers. It was performed using a separate dataset from the primary institution (i.e., internal validation) in 14 (29%) studies and an independent dataset related to different scanners or from another institution (i.e., independent validation) in 5 (10%) studies.

Conclusions

Approaches to radiomic feature reproducibility and model validation varied widely among the studies dealing with musculoskeletal sarcomas. Both issues should be addressed in future investigations to bring the field of radiomics from a preclinical research area to the clinical stage.

Key points

Radiomic studies focused on CT and MRI of musculoskeletal sarcomas were reviewed.
Feature reproducibility analysis and model validation strategies varied widely among these studies.
Radiomic feature reproducibility was assessed in fewer than half of the studies.
Only 10% of the studies included an independent clinical validation of the model.

Background

Bone and soft-tissue primary malignant tumors or sarcomas are rare entities with several histological subtypes, and each has an incidence < 1/100,000/year [1, 2]. Among them, osteosarcoma is the most common sarcoma of the bone. Along with Ewing sarcoma, it has a higher incidence in the second decade of life, while chondrosarcoma is the most prevalent bone sarcoma in adulthood [1]. The most frequent soft-tissue sarcomas are liposarcoma and leiomyosarcoma [2]. Due to the rarity of these diseases, bone and soft-tissue sarcomas are managed in tertiary sarcoma centers according to current guidelines [1, 2]. Both biopsy and imaging integrate clinical data prior to the beginning of any treatment, with the former representing the reference standard for preoperative diagnosis [1, 2]. However, biopsy may be inaccurate in large, heterogeneous tumors due to sampling errors, and, in turn, inaccurate diagnosis may lead to inadequate treatment and subsequent need for further interventions, with increased morbidity. Additionally, the risk of biopsy tract contamination remains a concern. Imaging already plays a pivotal role in the assessment of bone and soft-tissue sarcomas. Magnetic resonance imaging (MRI) and computed tomography (CT) are employed for local and general staging, respectively [1, 2]. These modalities may certainly benefit from new imaging-based tools such as those based on radiomics, which may potentially provide additional information regarding both diagnosis and prognosis noninvasively [3].

The term “radiomics” combines “radio,” referring to medical images, and “omics,” which indicates the analysis of large amounts of data representing an entire set of some kind, such as the genome (genomics) and proteome (proteomics) [3]. Therefore, “radiomics” includes extraction and analysis of large numbers of quantitative parameters, known as radiomic features, from medical images [4]. This technique has recently gained much attention in oncologic imaging as it can potentially quantify tumor heterogeneity, which can be challenging to capture by means of qualitative imaging assessment or sampling biopsies. In particular, radiomic studies to date have focused on discriminating tumor grades and types before treatment, monitoring response to therapy and predicting outcome [5].

Despite its great potential as a noninvasive tumor biomarker, radiomics still faces challenges preventing its clinical implementation. Two main initiatives have addressed methodological issues of radiomic studies to bridge the gap between academic endeavors and real-life application. In 2017, Lambin et al. proposed the Radiomics Quality Score that details the sequential steps to follow in radiomic pipelines and offers a tool to assess methodological rigor in their implementation [6]. In 2020, the Image Biomarkers Standardization Initiative produced and validated reference values for radiomic features, which enable verification and calibration of different software for radiomic feature extraction [7]. However, numerous challenges still remain to ensure clinical transferability of radiomics. As radiomics is essentially a two-step approach consisting of data extraction and analysis, in the first step (i.e., data extraction), the main challenge is reproducibility of radiomic features, which can be influenced by image acquisition parameters, region of interest segmentation technique and image post-processing technique [8, 9]. In the second step (i.e., data analysis), models can be built upon either conventional statistical methods or machine learning algorithms with the aim of predicting the diagnosis or outcome of interest. In either case, the main challenge is model validation [9].

The challenges of reproducibility and validation strategies in radiomics have been recently addressed in a review focusing on renal masses [10]. The aim of our study is to systematically review radiomic feature reproducibility and predictive model validation strategies in studies dealing with CT and MRI radiomics of bone and soft-tissue sarcomas. The ultimate goal is to promote and facilitate achieving a consensus on these aspects in radiomic workflows.

Methods

Reviewers

No local ethics committee approval was needed for this systematic review. Literature search, study selection, and data extraction were performed independently by two recently boarded radiologists with experience in musculoskeletal tumors and radiomics (S.G. and F.M.). Disagreements were resolved by consensus among these two readers and a third reviewer, a radiologist with a doctorate in artificial intelligence and radiomics (R.C.). The Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guidelines [11] were followed.

Literature search

An electronic literature search was conducted on EMBASE (Elsevier) and PubMed (MEDLINE, U.S. National Library of Medicine and National Institutes of Health) databases for articles published up to December 31, 2020, and dealing with CT and MRI radiomics of bone and soft-tissue sarcomas. A controlled vocabulary was adopted using medical subject headings in PubMed and the thesaurus in EMBASE. Search syntax was built by combining search terms related to two main domains, namely “musculoskeletal sarcomas” and “radiomics.” The exact search query was: (“sarcoma”/exp OR “sarcoma”) AND (“radiomics”/exp OR “radiomics” OR “texture”/exp OR “texture”). Studies were first screened by title and abstract, and then, the full text of eligible studies was retrieved for further review. The references of selected publications were checked for additional publications to include.

Inclusion and exclusion criteria

Inclusion criteria were: (1) original research papers published in peer-reviewed journals; (2) focus on CT or MRI radiomics-based characterization of sarcomas located in bone and soft tissues for either diagnosis- or prognosis-related tasks; (3) statement that local ethics committee approval was obtained, or ethical standards of the institutional or national research committee were followed.

Exclusion criteria were: (1) papers not dealing with mass characterization, such as those focused on computer-assisted diagnosis and detection systems; (2) papers dealing with head and neck, retroperitoneal or visceral sarcomas; (3) animal, cadaveric or laboratory studies; (4) papers not written in English.

Data extraction

Data were extracted to a spreadsheet with a drop-down list for each item, as defined by the first author, grouped into three main categories, namely baseline study characteristics, radiomic feature reproducibility strategies, and predictive model validation strategies. Items regarding baseline study characteristics included first author’s last name, year of publication, study aim, tumor type, study design, reference standard, imaging modality, database size, use of public data, segmentation process, and segmentation style. Those concerning radiomic feature reproducibility strategies included reproducibility assessment based on repeated segmentations, reproducibility assessment related to acquisition or post-processing techniques, statistical method used for reproducibility analysis, and cut-off or threshold used for reproducibility analysis. Finally, data regarding predictive model validation strategies included the use of machine learning validation techniques, clinical validation performed on a separate internal dataset, and clinical validation performed on an external or independent dataset.

Results

Baseline study characteristics

A flowchart illustrating the literature search process is presented in Fig. 1. After screening 278 papers and applying our eligibility criteria, 49 papers were included in this systematic review. Tables 1 and 2 detail the characteristics of papers dealing with radiomics of bone (n = 12) and soft-tissue (n = 37) tumors, respectively.

Table 1 Characteristics of the papers dealing with bone sarcomas included in the systematic review

Table 2 Characteristics of the papers dealing with soft-tissue sarcomas included in the systematic review

All studies were published between 2008 and 2020. Twenty-three out of 49 investigations (47%) were published in 2020, 14 (29%) in 2019, 4 (8%) in 2018, and 8 (16%) between 2008 and 2017. The design was prospective in 6 studies (12%) and retrospective in the remaining 43 (88%). The imaging modality of choice was MRI in 42 (86%), including one or multiple MRI sequences, and CT in 7 (14%) cases. The median size of the database was 60 patients (range 19–226). Public data were used only in 3 (6%) studies.

The research was aimed at predicting either diagnosis or prognosis, as follows: benign versus malignant tumor discrimination (n = 14); grading (n = 10); tumor histotype discrimination (n = 4); proliferation index Ki67 expression (n = 1); survival (n = 12); response to therapy, either chemotherapy or radiotherapy (n = 8); local and/or metastatic relapse (n = 9). It should be noted that the aim was twofold in some studies, as detailed in Tables 1 and 2. In those focused on diagnosis-related tasks, including benign versus malignant discrimination, grading, tumor histotype discrimination, and proliferation index expression, histology was the reference standard in all cases except for benign lesions, which were diagnosed on the basis of stable imaging findings over time in two papers [12, 13]. In studies focused on prediction of response to chemotherapy or radiotherapy, the reference standard was histology if lesions were surgically treated, based on the percentage of viable tumor and necrosis relative to the surgical tissue specimen, or consistent imaging findings if lesions were not operated. In studies focused on prediction of tumor relapse, the reference standard was histology or consistent imaging findings. In studies dealing with survival prediction, survival was assessed based on follow-up.

Regarding segmentation, the process was performed manually in 45 (92%) studies and semiautomatically in 4 (8%) studies. In no case was the segmentation process fully automated. The following segmentation styles were identified: 2D without multiple sampling in 11 (23%) studies; 2D with multiple sampling in 3 (6%); 3D in 35 (71%). Of note, a single slice showing maximum tumor extension was chosen in all studies employing 2D segmentation without multiple sampling, except for one case [14] where it was chosen based on signal intensity homogeneity.

Reproducibility strategies

Eighteen (37%) of the 49 studies included a reproducibility analysis of the radiomic features in their workflow. In 16 (33%) investigations [13, 15,16,17,18,19,20,21,22,23,24,25,26,27,28,29], the reproducibility of radiomic features was assessed on the basis of repeated segmentations performed by different readers and/or the same reader at different time points. Two (4%) studies presented an analysis to assess the reproducibility based on different acquisition [30] or post-processing [31] techniques. Of note, segmentations were validated by a second experienced reader in 15 studies [12, 32,33,34,35,36,37,38,39,40,41,42,43,44,45] without, however, addressing the issue of radiomic feature reproducibility.

The intraclass correlation coefficient (ICC) was the statistical method used in most of the papers reporting a reproducibility analysis [13, 15,16,17,18, 20, 22,23,24,25, 27,28,29, 31]. ICC threshold ranged between 0.6 [13] and 0.9 [22] for reproducible features. The following statistical methods were used less commonly: analysis of variance [30, 31]; Cronbach alpha statistic [26]; Pearson correlation coefficient [19], and Spearman correlation coefficient [21].
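As an illustration of how an ICC-based reproducibility filter might work, the following minimal sketch computes a two-way random-effects, absolute-agreement, single-rater ICC (often written ICC(2,1)) for each feature across repeated segmentations and keeps the features above a chosen cut-off. The ICC form, the function names and the 0.75 threshold are assumptions made for illustration; the reviewed studies did not necessarily use this form, and their thresholds ranged from 0.6 to 0.9.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table of n lesions (rows) by k segmentations (columns).

    Assumption: two-way random effects, absolute agreement, single rater.
    """
    n, k = len(ratings), len(ratings[0])
    if n < 2 or k < 2:
        raise ValueError("need at least 2 lesions and 2 segmentations")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares: lesions, readers, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    denominator = msr + (k - 1) * mse + k * (msc - mse) / n
    # A constant feature has no variance to partition; treat it as not reproducible.
    return 0.0 if denominator == 0 else (msr - mse) / denominator


def reproducible_features(feature_tables, threshold=0.75):
    """Return names of features whose ICC meets the (illustrative) threshold."""
    return [name for name, table in feature_tables.items()
            if icc_2_1(table) >= threshold]
```

Here `feature_tables` is a hypothetical mapping from a feature name to its lesion-by-reader table of values. Raising or lowering `threshold` within the 0.6 to 0.9 range reported above changes how many features survive the filter.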

Validation strategies

At least one machine learning validation technique was used in 25 (51%) of the 49 papers. K-fold cross-validation was used in most of the studies [13, 25, 28, 31,32,33, 37, 38, 40, 43, 44, 46,47,48,49,50]. The following machine learning validation techniques were used less commonly: bootstrapping [42, 51]; leave-one-out cross-validation [34, 35, 41]; leave-p-out cross-validation [52]; Monte Carlo cross-validation [23]; nested cross-validation [25, 27]; random-split cross-validation [20]. Figure 2 provides an overview of machine learning validation techniques. Figure 3 illustrates an example of a radiomics-based machine learning pipeline.
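The resampling idea behind K-fold cross-validation can be sketched in a few lines. The example below is a minimal, library-free illustration, not the procedure of any reviewed study. The function names, the fixed random seed and the `fit_and_score` callback are hypothetical. Setting `k` equal to the number of samples gives leave-one-out cross-validation.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding spreads samples so that fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, k, fit_and_score, seed=0):
    """Average a user-supplied score over the k folds.

    fit_and_score(train_idx, test_idx) is a hypothetical callback that
    trains on train_idx and returns a numeric score on test_idx.
    """
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

In this sketch any feature selection would have to happen inside `fit_and_score`, using only the training indices, so that information from the held-out fold does not leak into model development.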

Clinical validation

A clinical validation of the radiomics-based prediction model was reported in 19 (39%) of the 49 papers. It was performed on a separate set of data from the primary institution, i.e., internal test set, in 14 (29%) studies [15, 16, 22, 24, 28, 31, 32, 35, 37, 38, 41, 46, 47, 52]. It was performed on an independent set of data from the primary institution (related to a different scanner) or from an external institution, i.e., external test set, in 5 (10%) studies [25, 27, 29, 43, 51].
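The distinction between internal and external test sets can be made concrete with a small sketch. It assumes each patient record carries a hypothetical `site` label, which could denote an institution or, following the definition above, a scanner. The function name, the 30% hold-out fraction and the seed are illustrative choices, not values taken from the reviewed studies.

```python
import random


def split_cohort(patients, primary_site, internal_test_fraction=0.3, seed=0):
    """Split records into training, internal test and external test sets.

    patients: list of dicts with a hypothetical "site" key.
    Internal test set: held-out cases from the primary site.
    External test set: all cases from any other site (or scanner).
    """
    primary = [p for p in patients if p["site"] == primary_site]
    external_test = [p for p in patients if p["site"] != primary_site]
    random.Random(seed).shuffle(primary)  # shuffles the copy, not the input
    n_test = round(len(primary) * internal_test_fraction)
    internal_test = primary[:n_test]
    training = primary[n_test:]
    return training, internal_test, external_test
```

Neither test set is used during model development in this design. Only the external set probes how the model behaves on data acquired elsewhere.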

Discussion

This systematic review focused on the radiomics literature regarding MRI and CT of bone and soft-tissue sarcomas with particular emphasis on reproducibility and validation strategies. The number of papers reporting the assessment of radiomic feature reproducibility and the use of independent or external clinical validation was relatively small. This finding is in line with recent literature reviews showing that the quality of sarcoma radiomics studies is low [53, 54], which may hamper performance generalizability of radiomic models on independent cohorts and, consequently, their practical application [53]. Thus, these issues need to be addressed in the radiomic workflow of future studies to facilitate clinical transferability.