Automated detection of lung cancer-caused metastasis by classifying scintigraphic images using convolutional neural network with residual connection and hybrid attention mechanism

Background Whole-body bone scan is a widely used tool for surveying bone metastases caused by various primary solid tumors, including lung cancer. Scintigraphic images are characterized by low specificity, bringing a significant challenge to manual analysis by nuclear medicine physicians. Convolutional neural networks can be used to develop automated image classification by automatically extracting hierarchical features and classifying high-level features into classes. Results Using a convolutional neural network, a multi-class classification model has been developed to detect skeletal metastasis caused by lung cancer in clinical whole-body scintigraphic images. The proposed method consists of image aggregation, hierarchical feature extraction, and high-level feature classification. Experimental evaluations on a set of clinical scintigraphic images have shown that the proposed multi-class classification network is workable for automated detection of lung cancer-caused metastasis, achieving average scores of 0.7782, 0.7799, 0.7823, 0.7764, and 0.8364 for accuracy, precision, recall, F-1 score, and AUC value, respectively. Conclusions The proposed multi-class classification model can not only predict whether an image contains lung cancer-caused metastasis, but also differentiate between subclasses of lung cancer (i.e., adenocarcinoma and non-adenocarcinoma). In the context of two-class (i.e., metastatic vs. non-metastatic) classification, the proposed model obtained a higher accuracy score of 0.8310.


Background
Skeletal metastasis is common in several prevalent cancers, including prostate, breast, and lung cancers [1], with 80% of all skeletal metastatic lesions originating from one of these primary sites [2]. Metastasis-related deaths account for up to 90% of all lung cancer mortality [3]. Early detection of skeletal metastasis is extremely important not only for decreasing morbidity but also for disease staging, outcome prediction, and treatment planning [4].
Skeletal scintigraphy (bone scan) and positron emission tomography (PET) are commonly used for surveying bone metastasis [5, 6]. Compared to PET, bone scan is more affordable and more widely available due to its low-cost equipment and radiopharmaceutical. Bone scan is typically characterized by high sensitivity but low specificity, bringing a significant challenge to manual analysis of bone scan images by nuclear medicine physicians. The reasons for the low specificity are manifold, mainly including low spatial resolution, accumulation of the radiopharmaceutical in normal skeletal structures, soft tissues, or viscera, and uptake in benign processes [7].
Automated analysis of bone scan images is therefore desirable for accurate diagnosis of skeletal metastasis. There has been a substantial body of work aimed at developing automated diagnosis approaches using conventional machine learning models to classify bone scan images [5, 8-11], where the image features were manually extracted by researchers. Such handcrafted features, however, often suffer from insufficient representational capability and unsatisfactory performance in clinical tasks [6].
Convolutional neural network (CNN), a mainstream branch of deep learning techniques, has achieved huge success in the automated analysis of natural images [12-14] and medical images [14-17] due to its ability to automatically extract hierarchical features from images in an optimal way. CNN-based automated classification methods have been proposed to detect metastasis caused by a variety of primary tumors, including prostate cancer [18-23], breast cancer [22-24], lung cancer [25, 26], and combinations thereof [25-27]. The main purpose of existing works is to develop two-class classification models that determine whether or not an image contains metastasized lesion(s) (normal vs. metastatic). Differently, in our previous works [28, 29], a series of CNN-based methods was proposed to classify whole-body scintigraphic images for automated detection of skeletal metastases, in which we did not distinguish between the primary cancers.
Targeting automated detection of skeletal metastasis caused by lung cancer, in this work we propose a CNN-based multiclass classification network to classify whole-body scintigraphic images acquired from patients with clinically diagnosed lung cancer using a SPECT (single photon emission computed tomography) imaging device (i.e., GE SPECT Millennium MPR). The proposed network can not only determine whether an image contains lung cancer-caused skeletal metastasis, but also differentiate between subclasses of lung cancer (i.e., adenocarcinoma and non-adenocarcinoma).
The main contributions of this work can be summarized as follows. First, to the best of our knowledge, we are the first to attempt to solve the problem of automated detection of skeletal metastasis originating from various subclasses of lung cancer. Second, we convert the detection problem into the multiclass classification of low-resolution, large-size scintigraphic images using a CNN-based end-to-end network that first extracts hierarchical features from images, then aggregates these features, and finally classifies the high-level features into classes. Lastly, we use a group of scintigraphic images acquired from patients with clinically diagnosed lung cancer to evaluate the proposed method. Experimental results have shown that our CNN-based classification network performs well at distinguishing SPECT images between non-metastatic and metastatic, as well as between the subclasses of metastasis.
The rest of this paper is organized as follows. We present the proposed method in the "Methods" section. We report the experimental evaluation conducted on clinical SPECT images in the "Results" section. In the "Discussion" section, we provide a brief discussion of the reasons that cause the misclassifications. In the "Conclusions" section, we conclude this work and point out future research directions.

Methods
To automatically detect metastasis of lung cancer in scintigraphic images, an image fusion operation is first employed to enhance the lesion(s) in low-resolution whole-body scintigraphic images by aggregating the anterior- and posterior-view images of each bone scan. Parametric variation-based data augmentation is then applied to expand the dataset used in this work and thereby improve the classification performance of the CNN-based network as much as possible. A CNN-based end-to-end network is developed to classify the fused images by first extracting hierarchical features from images, then aggregating these features, and finally classifying the high-level features into the classes of concern, i.e., without metastasis (NoMet), adenocarcinoma metastasis (ADMet), and non-adenocarcinoma metastasis (nADMet). Figure 1 provides an overview of the proposed multiclass classification method, comprising three main steps, i.e., view aggregation, data augmentation, and image classification.

View aggregation
During SPECT imaging, two whole-body images were collected for each patient, corresponding to the anterior and posterior views, respectively. When a primary tumor (e.g., lung cancer) invades bone tissue, there will be an area of increased radionuclide uptake in the image. It is common, however, for the metastatic areas to have varied uptake intensity in the anterior- and posterior-view images. Enhancing the metastatic areas in the images therefore becomes crucial for accurate detection of metastasis. A pixel-wise view aggregation method is proposed to 'excite' the metastatic pixels while 'squeezing' the normal pixels by fusing the two views, as shown in Fig. 2.
Let I_Ant and I_Post denote the anterior- and posterior-view images, respectively. The pixel-wise view aggregation method works as follows.

Image flipping
The posterior-view image I_Post is flipped horizontally around its central vertical line to obtain an image I^T_Post.

Pixel aligning
A horizontal line sweeps each image (i.e., I_Ant and I^T_Post) line by line to find the critical points P_Ceil and P_Floor by examining the pixel values that represent uptake intensity. Similarly, a vertical line sweeps each image column by column to find the critical points P_Left and P_Right. The two images I_Ant and I^T_Post are then aligned according to these four critical points.

Pixel-wise image addition
The aligned images I_Ant and I^T_Post are aggregated to generate a composite image I_Comp according to Eq. 1.

I_Comp = f(I_Ant, I^T_Post) (1)

where f is the aggregation function, i.e., the pixel-wise addition operation.
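The flip and pixel-wise addition steps above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the alignment by the four critical points has already been performed; it is not the authors' implementation.

```python
import numpy as np

def aggregate_views(i_ant, i_post):
    """Pixel-wise aggregation of the anterior- and posterior-view images.

    The posterior view is flipped horizontally to match the anterior
    orientation, then the two views are added pixel-wise (the
    aggregation function f in Eq. 1). Alignment is assumed done.
    """
    i_post_t = np.fliplr(i_post)  # flip around the central vertical line
    # widen the dtype so 16-bit uptake values do not overflow on addition
    i_comp = i_ant.astype(np.int32) + i_post_t.astype(np.int32)
    return i_comp
```

Note the cast to a wider integer type: since each pixel is a 16-bit unsigned integer, adding two views in the original dtype could overflow.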

Data augmentation
It is widely accepted that the classification performance of CNN-based models depends on the size of the dataset, with high classification accuracy generally requiring a large dataset. A variety of methods can be utilized to augment a dataset, including parametric variation and adversarial learning techniques. In this work, we use the parametric variation technique since it can produce samples that have the same distribution as the original ones at a lower time complexity. Specifically, image translation and rotation are used, which are detailed as follows [30].

Image rotation
Given a constant r ∈ [0, r_T], an image will be randomly rotated by r degrees in either the left or right direction around its geometric center. The parameter r_T is experimentally determined according to the distribution of radiotracer uptake across all images in the dataset. Figure 3d depicts the image obtained by rotating the image in Fig. 3a to the right by 3°.
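The random rotation above can be sketched as follows. This is a pure-NumPy, nearest-neighbor sketch for illustration only; the authors' actual interpolation scheme is not specified, and r_t = 3 is used here merely as an example bound.

```python
import numpy as np

def rotate_image(img, deg):
    """Nearest-neighbor rotation about the geometric center (pure NumPy)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    th = np.deg2rad(deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # inverse mapping: for each output pixel, find its source coordinate
    sy = cy + (ys - cy) * np.cos(th) - (xs - cx) * np.sin(th)
    sx = cx + (ys - cy) * np.sin(th) + (xs - cx) * np.cos(th)
    sy, sx = np.rint(sy).astype(int), np.rint(sx).astype(int)
    ok = (sy >= 0) & (sy < h) & (sx >= 0) & (sx < w)
    out = np.zeros_like(img)
    out[ok] = img[sy[ok], sx[ok]]
    return out

def random_rotate(img, r_t=3, rng=None):
    """Rotate by a random angle r in [0, r_t], left or right (Fig. 3d)."""
    rng = np.random.default_rng() if rng is None else rng
    r = rng.uniform(0, r_t) * rng.choice([-1, 1])  # magnitude and direction
    return rotate_image(img, r)
```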

Image translation
Given a constant t ∈ [0, t_T], an image will be randomly translated by +t or −t pixels in either the horizontal or vertical direction. The parameter t_T is experimentally determined according to the distribution of radiotracer uptake across all images in the dataset. Figure 3c shows the result of translating the image in Fig. 3a by +3 pixels horizontally.

Network structure
Table 1 outlines the structure of the proposed 26-layer CNN-based classification network, consisting of one convolution layer (Conv), one normalization layer (Norm), one pooling layer (Pool), a set of residual convolution layers with attached attention (RA-Conv) with varied kernel sizes, one global average pooling layer (GAP), and one Softmax layer. An input 256 × 1024 scintigraphic image is convolved by the Conv layer with a 7 × 7 filter to compute a feature map, followed by a batch normalization layer and a max pooling layer with a 3 × 3 kernel. The subsequent convolutional layers are organized as residual convolutions with the hybrid attention module inside or outside of the convolution. A global average pooling layer is used to alleviate over-fitting while speeding up training. The Softmax layer outputs the class of an image as a real-valued probability. The main layers are detailed as follows.
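The paper does not state the strides and padding of the stem. Assuming the common ResNet-style choices (stride 2 with padding 3 for the 7 × 7 convolution, stride 2 with padding 1 for the 3 × 3 max pooling), the spatial size of the feature map entering the RA-Conv stages can be checked with the standard output-size formula:

```python
def conv_out(size, kernel, stride, pad):
    """Standard output-size formula for a convolution/pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# Stem applied to a 256 x 1024 single-channel scintigraphic image:
# 7x7 conv (stride 2, pad 3) followed by 3x3 max pool (stride 2, pad 1).
h, w = 256, 1024
h, w = conv_out(h, 7, 2, 3), conv_out(w, 7, 2, 3)  # after the Conv layer
h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)  # after the Pool layer
print(h, w)  # feature-map size entering the RA-Conv stages
```

Under these assumed hyperparameters the stem reduces the input by a factor of 4 in each spatial dimension.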

Normalization layer
Batch normalization [31] is used to accelerate network training by making normalization a part of the model architecture and performing normalization for each training mini-batch. With batch normalization, we can use much higher learning rates and be less careful about initialization.

RA-Conv layer
Figure 4 demonstrates the structure of residual convolution with the hybrid attention mechanism. We use residual connections to reduce the number of training parameters and the training time, and we introduce a hybrid attention mechanism so that the network focuses on the more important areas (i.e., lesions) of the feature maps. Specifically, we use inRA-Conv (outRA-Conv) to indicate that the hybrid attention module is located inside (outside) the residual convolution; the corresponding classifiers are named Classifier-inRAC and Classifier-outRAC, respectively.

The cascaded hybrid attention module in Fig. 4 combines channel and spatial attention and computes complementary attention by focusing on 'what' (channel attention) and 'where' (spatial attention), respectively [32]. Specifically, let F be the 2D feature map input to the channel attention sub-module, whose output F′ is further processed by the spatial attention sub-module to produce a refined 2D feature map M according to Eq. 2.

M = f_S(F′) ⊗ F′, where F′ = f_C(F) ⊗ F (2)

Here ⊗ is the element-wise multiplication, and f_C and f_S denote the channel and spatial attention functions, respectively, which are given in Eqs. 3 and 4.

f_C(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) (3)

f_S(F′) = σ(f_{k×k}([AvgPool(F′); MaxPool(F′)])) (4)

where σ is the sigmoid function, MLP is a multi-layer perceptron, AvgPool (MaxPool) is the average (max) pooling, and f_{k×k} is a convolutional operation with a kernel size of k × k.
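A minimal NumPy sketch of the hybrid (channel-then-spatial) attention of Eqs. 2-4, following the CBAM design of [32]. For brevity, the k × k spatial convolution is replaced here by a fixed average, and the shared MLP weights W0/W1 are illustrative placeholders, so this is a structural sketch rather than the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W0, W1):
    """Eq. 3: sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))); F is (H, W, C)."""
    avg = F.mean(axis=(0, 1))                   # global average pool -> (C,)
    mx = F.max(axis=(0, 1))                     # global max pool -> (C,)
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0)  # shared two-layer MLP with ReLU
    return sigmoid(mlp(avg) + mlp(mx))          # per-channel weights in (0, 1)

def spatial_attention(F):
    """Eq. 4 with the k x k convolution replaced by a fixed average."""
    avg = F.mean(axis=2)                        # channel-wise average pool -> (H, W)
    mx = F.max(axis=2)                          # channel-wise max pool -> (H, W)
    return sigmoid((avg + mx) / 2.0)            # per-pixel weights in (0, 1)

def hybrid_attention(F, W0, W1):
    """Eq. 2: channel attention first ('what'), then spatial ('where')."""
    Fp = F * channel_attention(F, W0, W1)[None, None, :]
    return Fp * spatial_attention(Fp)[:, :, None]
```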

Softmax layer
The network output nodes apply the Softmax function over the unordered classes. The Softmax function is defined in Eq. 5 [33].

f(x_j) = e^{x_j} / Σ_{i=1}^{n} e^{x_i} (5)

where f(x_j) is the score of the j-th output node, x_j is the network input to the j-th output node, and n is the number of output nodes. Each output value f(x_j) is a probability between 0 and 1, and the outputs sum to 1.
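Eq. 5 can be implemented directly; the max-shift below is a standard numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(x):
    """Eq. 5: exponentiate and normalize so the outputs sum to 1."""
    e = np.exp(x - np.max(x))  # max-shift for numerical stability
    return e / e.sum()
```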

Results
In this section, we provide an experimental evaluation of the proposed network using a set of clinical whole-body scintigraphic images.

Dataset
In this retrospective study, the whole-body scintigraphic images were collected from the Department of Nuclear Medicine, Gansu Provincial Tumor Hospital from Jan 2014 to Dec 2019 using a single-head gamma camera (GE SPECT Millennium MPR). SPECT imaging was performed between 2 and 3 h after intravenous injection of 99mTc-MDP (20-25 mCi) using a parallel-beam low-energy high-resolution (LEHR) collimator (energy peak = 140 keV, intrinsic energy resolution ≤ 9.5%, energy window = 20%, and intrinsic spatial resolution ≤ 6.9 mm). Each SPECT image was stored in a DICOM (Digital Imaging and Communications in Medicine) file with an image size of 256 × 1024. Each pixel is represented by a 16-bit unsigned integer, differing from natural images in which pixel values range from 0 to 255. A total of 506 patients clinically diagnosed with lung cancer were included in this study. Figure 5 shows the distribution of patients with respect to gender and age.
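Because the pixels are 16-bit rather than 8-bit, some intensity normalization is typically applied before feeding images to a network. The min-max scaling below is an assumed preprocessing step for illustration only; the paper does not specify its normalization scheme.

```python
import numpy as np

def normalize_16bit(img):
    """Min-max normalize a 16-bit SPECT image to [0, 1].

    An assumed preprocessing choice (not specified in the paper) that
    maps the wide 16-bit uptake range onto a fixed interval.
    """
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    if hi <= lo:
        return np.zeros_like(img)  # constant image: nothing to scale
    return (img - lo) / (hi - lo)
```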
To keep the balance between samples in the different subclasses, we randomly selected 226 images from the NoMet class and grouped the original images into dataset D1, as shown in Table 2. Applying the data augmentation technique to D1, we obtained the augmented dataset D2. Dataset D3 was obtained by applying view aggregation to the images in D2.
For a supervised image classification problem, a CNN-based model is evaluated by comparing the automated classification results against ground truth (human performance), which is often obtained by manually labeling images. However, manually labeling low-resolution, large-size SPECT images is time-consuming, laborious, and subjective. To facilitate labeling SPECT images, we developed an annotation system based on LabelMe (http://labelme.csail.mit.edu/Release3.0/) released by MIT.
With the LabelMe-based annotation system, imaging findings including the DICOM file and the textual diagnostic report can be imported into the system in advance. In the labeling process, three nuclear medicine physicians from the Department of Nuclear Medicine, Gansu Provincial Tumor Hospital manually labeled areas on the visual presentation of the DICOM file with a shape tool (e.g., polygon or rectangle). Each labeled area is annotated with a self-defined code combined with the name of the disease or body part. The manual labels for all images serve as the ground truth in our experiments and together form an annotation file that is fed to the classifiers.

Experimental setup
The evaluation metrics we use are accuracy, precision, recall, specificity, F-1 score, and AUC (Area Under the ROC Curve); the first five are defined in Eqs. 6-10.

Accuracy = (TP + TN) / (TP + TN + FP + FN) (6)

Precision = TP / (TP + FP) (7)

Recall = TP / (TP + FN) (8)

Specificity = TN / (TN + FP) (9)

F-1 score = 2 × Precision × Recall / (Precision + Recall) (10)

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. It is desirable that a classifier show both a high true positive rate (TPR = Recall) and a low false positive rate (FPR = 1 − Specificity). The ROC curve plots the true positive rate (y-axis) against the false positive rate (x-axis), and the AUC value is the area under this curve. Statistically, the AUC value equals the probability that a randomly chosen positive image is ranked higher than a randomly chosen negative image; the closer the AUC value is to 1, the better the classifier performs. We divided each dataset (D1, D2, and D3) into a training set and a testing set with a ratio of 7:3; that is, 70% of the samples in each dataset were used to train the classifiers and the remaining 30% to test them. Images (including augmented ones) from the same patient were never divided into different subsets because they would show strong similarities. The parameter settings are shown in Table 3.
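The patient-level split described above can be sketched as a simple group-aware split: patients, not images, are partitioned, so all images (including augmented copies) of one patient land in the same subset. `grouped_split` and its parameters are illustrative, not the authors' implementation.

```python
import numpy as np

def grouped_split(patient_ids, test_ratio=0.3, seed=0):
    """Split sample indices 70/30 so that all images of one patient
    (including augmented copies) fall into the same subset."""
    rng = np.random.default_rng(seed)
    patients = np.unique(patient_ids)
    rng.shuffle(patients)
    n_test = int(round(len(patients) * test_ratio))
    test_patients = set(patients[:n_test])
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    return train_idx, test_idx
```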
The experiments are run with TensorFlow 2.0 on an Intel Core i7-9700 PC with 32 GB RAM running Windows 10.
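For reference, the count-based metrics of Eqs. 6-10 can be computed directly; the sketch below shows the binary (metastatic vs. non-metastatic) case.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Eqs. 6-10 for a binary task (positive class = 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate (TPR)
    specificity = tn / (tn + fp)       # 1 - FPR
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1
```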

Experimental results
For the proposed multiclass classifiers Classifier-inRAC and Classifier-outRAC, Table 4 reports the scores of the evaluation metrics obtained on the testing samples in dataset D3, showing that Classifier-inRAC performs better than Classifier-outRAC. Results in Table 5 further show that Classifier-inRAC obtains the best performance on the aggregated samples in the augmented dataset (i.e., D3). We further examine the ability of Classifier-inRAC to differentiate between the subclasses of images in dataset D3, providing the confusion matrix in Fig. 7 and the scores of the evaluation metrics in Fig. 8.
Experimental results in Figs. 7 and 8 show that differentiating between the metastatic subclasses is more challenging than distinguishing metastatic from non-metastatic images; for instance, 22 ADMet images were incorrectly identified as nADMet.
With the testing samples in dataset D3, we examine the impact of network structure and depth on the classification performance of the proposed classifier Classifier-inRAC. Table 6 reports the scores of the evaluation metrics obtained after removing the residual connection and/or the hybrid attention module from Classifier-inRAC.
The scores in Table 6 show that the best performance is obtained when Classifier-inRAC has both the residual connection and the hybrid attention module. Taken separately, the residual connection has a more positive impact on classification performance than the hybrid attention mechanism.
Following the architectural design of Classifier-inRAC, we define two classifiers with different network depths, which are given in Table 7. Figure 9 reports the scores of the evaluation metrics obtained by the classifiers defined in Table 7 and by Classifier-inRAC, showing the comparative advantage of the proposed classifier in classifying whole-body images.

We further test the performance of Classifier-inRAC on two-class classification by merging the metastatic subclasses (i.e., ADMet and nADMet) in dataset D3. Specifically, the dataset for two-class classification consists of metastatic images (n = 638, ≈ 67.51%) and non-metastatic images (n = 307, ≈ 32.49%). Table 8 reports the scores of the evaluation metrics on two-class classification of the testing samples, and Fig. 10 depicts the corresponding confusion matrix.
The results of two-class classification show that our classifier performs better at differentiating between metastatic and non-metastatic images than at classifying images into the different subclasses.
A comparative analysis has also been performed between the proposed model and two classical deep models, Inception-v1 [34] and VGG-11 [35], whose network structures are given in Table 9.
The scores of the evaluation metrics obtained by the three classifiers on dataset D3 are reported in Table 10, showing that our model is better suited to classifying lung cancer-caused metastatic images than the classical models. A possible reason is that the network structure of our model (i.e., residual convolution combined with hybrid attention) is capable of extracting more representative features of metastatic lesions.

Discussion
In this section, we discuss the possible reasons for the misclassifications using the group of examples in Fig. 11. The explanations below were provided by one nuclear medicine physician and one oncologist from Gansu Provincial Tumor Hospital.

Misclassification between the metastatic and non-metastatic
Uptake of 99mTc-MDP in benign processes (e.g., knee arthritis) is detected as metastatic lesions by the developed classifier due to its visually similar appearance to skeletal metastasis (see Fig. 11a). Furthermore, normal bone shows a higher concentration of activity in trabecular bone with a large mineralizing surface area, such as the spine. This brings a huge challenge to CNN-based automated classification of SPECT images, with some metastatic images consequently being misclassified as non-metastatic.

Misclassification between the diseased subclasses
It is very challenging to accurately classify metastatic images since skeletal metastases are often distributed irregularly in the axial skeleton and typically vary in size, shape, and intensity [7]. The irregularly distributed radioactivity of ADMet can mimic that of nADMet, and vice versa, resulting in misclassification between ADMet and nADMet (see Fig. 11b).

Multiclass classification vs. two-class classification
Multiclass classification aims not only to determine whether an image contains lung cancer-caused skeletal metastasis, but also to differentiate between subclasses of lung cancer (i.e., ADMet and nADMet). This is more difficult than answering whether an image contains metastasis at all (i.e., two-class classification). Accordingly, the proposed classifier Classifier-inRAC obtained accuracy scores of 0.7782 and 0.8310 for the multiclass and two-class classification problems, respectively. Metastatic lesions are further examined through a statistical analysis in Fig. 12 (characteristics of metastatic lesions in the ADMet and nADMet subclasses: a, shape; b, body region; c, uptake intensity). Mottling, patchy, and punctate lesions dominate both the ADMet and nADMet subclasses, as shown in Fig. 12a. The chest (vertebrae and ribs) is the main body region in which lung cancer-caused metastasis is frequently present, as shown in Fig. 12b. As shown in Fig. 12c, the distribution of detected uptake intensity ranges widely, with 44% of lesions falling into [50, 100]; much higher uptake is often detected in the regions of the urinary bladder and the injection point. This further reveals that it is more difficult to develop an automated method for analyzing scintigraphic images than natural images, in which pixel values range from 0 to 255.
To alleviate the issues mentioned above, technical solutions need to be developed in the future. With a large-scale dataset of SPECT images, representative image features could be extracted for each subclass by CNN-based end-to-end classifiers, which would help improve the performance of distinguishing between metastatic and non-metastatic images. Moreover, statistical analysis conducted on large-scale SPECT images together with pathologic findings has the potential to support a multi-modal fusion classifier, enabling higher classification performance between metastatic images caused by the various subclasses of lung cancer.

Conclusions
Targeting the automated detection of lung cancer-caused metastasis with SPECT scintigraphy, we have developed a convolutional neural network with a hybrid attention mechanism in this work. Parametric variation was first conducted to augment the dataset of original images. An end-to-end CNN-based classification network was then proposed to automatically extract features from images, aggregate these features, and classify the high-level features into classes. Clinical whole-body scintigraphic images were utilized to evaluate the developed network. Experimental results have demonstrated that our self-defined network performs well in detecting lung cancer-caused metastasis as well as in differentiating between subclasses of lung cancer. An analysis has also been conducted to compare the proposed model with other related models. The results reveal that our method can be used to determine whether an image contains lung cancer-caused skeletal metastasis and to differentiate between subclasses of lung cancer.
In the future, we plan to extend our work in the following directions. First, we intend to collect more images and laboratory findings to improve the proposed multiclass classification model, with the hope of developing a robust and effective computer-aided diagnosis system. Second, we will attempt to develop deep learning-based approaches that can classify whole-body SPECT images containing multiple lesions from various primary diseases in a single image.