Deep learning model based on multi-lesion and time series CT images for predicting the benefits from anti-HER2 targeted therapy in stage IV gastric cancer

Objective To develop and validate a deep learning model based on multi-lesion and time series CT images in predicting overall survival (OS) in patients with stage IV gastric cancer (GC) receiving anti-HER2 targeted therapy. Methods A total of 207 patients were enrolled in this multicenter study, with 137 patients for retrospective training and internal validation, 33 patients for prospective validation, and 37 patients for external validation. All patients received anti-HER2 targeted therapy and underwent pre- and post-treatment CT scans (baseline and at least one follow-up). The proposed deep learning model evaluated the multiple lesions in time series CT images to predict risk probabilities. We further evaluated and validated the risk score of the nomogram combining a two-follow-up lesion-based deep learning model (LDLM-2F), tumor markers, and clinical information for predicting the benefits from treatment (Nomo-LDLM-2F). Results In the internal validation and prospective cohorts, the one-year AUCs for Nomo-LDLM-2F using the time series medical images and tumor markers were 0.894 (0.728–1.000) and 0.809 (0.561–1.000), respectively. In the external validation cohort, the one-year AUC of Nomo-LDLM-2F without tumor markers was 0.771 (0.510–1.000). Patients with a low Nomo-LDLM-2F score derived survival benefits from anti-HER2 targeted therapy significantly compared to those with a high Nomo-LDLM-2F score (all p < 0.05). Conclusion The Nomo-LDLM-2F score derived from multi-lesion and time series CT images holds promise for the effective readout of OS probability in patients with HER2-positive stage IV GC receiving anti-HER2 therapy. Critical relevance statement The deep learning model using baseline and early follow-up CT images aims to predict OS in patients with stage IV gastric cancer receiving anti-HER2 targeted therapy. This model highlights the spatiotemporal heterogeneity of stage IV GC, assisting clinicians in the early evaluation of the efficacy of anti-HER2 therapy. Key points • Multi-lesion and time series model revealed the spatiotemporal heterogeneity in anti-HER2 therapy. • The Nomo-LDLM-2F score was a valuable prognostic marker for anti-HER2 therapy. • CT-based deep learning model incorporating time-series tumor markers improved performance. Graphical Abstract Supplementary Information The online version contains supplementary material available at 10.1186/s13244-024-01639-2.


Introduction
Gastric cancer (GC) is globally the fifth most prevalent and third leading cause of cancer-related mortality [1].Human epidermal growth factor receptor 2 (HER2) overexpression has been detected in 17-30.5% of patients with GC [2,3].Anti-HER2 targeted therapy has proven to be a practical treatment approach for GC [4].In the Trastuzumab for Gastric Cancer (ToGA) study, trastuzumab combined with chemotherapy extended overall survival (OS) to 16 months for patients with HER2-positive advanced GC.However, the ToGA study reported a modest objective response rate of 47.3% [5].Therefore, identifying the patients who have the potential to benefit from anti-HER2 targeted therapy has long been overdue.
GC is distinguished by high spatiotemporal heterogeneity, which plays a crucial role in resistance to anti-HER2 therapy [6].The temporal heterogeneity exhibits a change in the HER2 status before and after treatment, while the spatial heterogeneity manifests as discordant HER2 expression between primary and metastatic lesions [7].The methods to evaluate the spatiotemporal heterogeneity of stage IV GC are lacking in clinical practice.Multi-spot sampling under gastroscopy can only represent a static snapshot of primary tumors at a particular time point, which does not reflect the heterogeneous features of various metastases to targeted therapy over time.Therefore, it is imperative to develop a model that can predict the long-term prognosis of anti-HER2 targeted therapy in both temporal and spatial dimensions so that personalized therapeutics can be managed.
Clinicians mainly use tumor markers and radiographic images to track dynamic tumor changes [8].RECIST v1.1 is generally used to estimate the therapeutic efficacy in advanced GC, focusing exclusively on the unidimensional measurement of lesions rather than considering the overall landscape.However, the primary tumor cannot be identified as the target lesion due to the hollow nature of the stomach.In contrast, evidence supports the notion that artificial intelligence can reveal longitudinal heterogeneity during treatment [9][10][11][12].The deep learning model developed by Xu et al. utilized baseline and follow-up CT images at months 1, 3, and 6 to predict survival and cancer-specific outcomes for chemoradiationtreated non-small cell lung cancer [8].Lu et al. used an unlimited number of time series CT images to train the deep learning model and used baseline and two-month follow-up images to predict OS in patients with metastatic colon cancer [9].No study has used deep learning models based on time series CT images to screen out patients who benefit from anti-HER2 targeted therapy in GC as of yet.We believed that the early radiological changes were worth mining, because they demonstrated the tumor's responsiveness or resistance to targeted therapy in a temporal dimension.
Multi-lesion and time series images collected in this study could better reflect the spatiotemporal heterogeneity of stage IV GC.We constructed an attention-based deep learning framework that automatically discerns features from multiple lesions of different time points for OS prediction in anti-HER2 targeted therapy.We applied this framework to time series CT images and tumor markers to introduce the lesion-based deep learning model (LDLM) and the tumor marker-based deep learning model (TDLM).Furthermore, we built a nomogram (Nomo-LDLM) by combining the deep learning models with clinical information to achieve accurate early prediction of OS probability.

Data collection
We retrospectively enrolled patients with HER2-positive advanced GC treated with trastuzumab from four centers between November 2011 and November 2019 and prospectively enrolled patients at center 1 between December 2019 and December 2020.The ethics committee of Peking University Cancer Hospital (PUCH) approved this study (No. 2020KT08).OS was defined as the duration from the initiation of anti-HER2 therapy to death from any cause or to the most recent follow-up.
The details of patient recruitment are shown in Fig. 1 and Text S1.A total of 375 patients diagnosed as HER2positive advanced GC were enrolled, of whom 207 patients meeting the criteria were included in the analysis.The retrospectively collected 137 patients from center 1 were randomly split into the training and internal validation cohort in a 2:1 ratio (n = 91 and 46, respectively), and the prospectively collected 33 patients were included in the prospective cohort.The external validation cohort included 37 patients from centers 2-4 (25 from The First Affiliated Hospital of Zhengzhou University, 10 from Nanjing Drum Tower Hospital of Nanjing University Medical School, 2 from Ruijin Hospital of Shanghai Jiao Tong University).We included baseline and post-treatment CT images up to four follow-ups in the training cohort for model construction and validated its performance of early prediction in other cohorts only using baseline and post-treatment CT images up to two follow-ups.Moreover, the baseline and post-treatment tumor markers up to two follow-ups were also recorded in center 1 for model improvement (tumor markers in center 2/3/4 were not included).A total of 680 CT scans and 703 times examination of tumor markers were collected.Table S1 displays the protocol details of the CT scans.

Preprocessing data
The workflow is illustrated in the Graphical abstract.For CT images, two radiologists manually provided bounding boxes at the maximum slice of the primary tumor and target lesions (M.H. and J.Y., both with two years of diagnostic experience) with ITK-SNAP (version 3.8).According to RECIST v1.1, for each patient, the radiologists selected up to five target lesions whose diameters were larger than 10 mm at baseline.Then, the radiologists used bounding boxes to mark each lesion and kept the size of the bounding boxes on the baseline and follow-up images consistent for maintaining scale information.The bounding boxes were then reviewed by a senior radiologist (L.T., with 18 years of diagnostic expertise).All readers were blinded to demographic information.
As shown in the Graphical abstract, the images were processed as follows before being fed into the network: (1) extracting 1.5-fold the annotated bounding box regions from CT slices as the regions of interest (ROIs) of the lesions; (2) symmetrically padding rectangular ROIs to its minimum circumscribed square to generate image patches and then resampling them to 224 × 224 resolution; (3) normalizing the lung metastasis with the window level of -400 and the window width of 1500, and abdominal lesions with the window level of 50 and the window width of 350; and (4) augmenting images by random rotation of -30 to 30° to cope with the uncertainty in clinical settings.We applied the same operations to the upper and lower layers of the annotated lesions and obtained a 3-channel image (224 × 224 × 3) to provide richer contextual information (Fig. S1, S2, Text S2, and Table S2).For tumor markers, we normalized all data with the mean and variance in the training cohort (Table S3).

Models based on deep learning
We constructed a two-level attention-based deep learning framework using Transformer architecture [13][14][15].The first level was the temporal heterogeneity Transformer (TH-former), which modeled the feature changes in lesions or tumor markers over time.The second level, the object heterogeneity Transformer (OH-former), modeled the interactions between the features of multiple lesions or tumor markers and generated a descriptive signature of features for each patient.This framework was instantiated as the LDLM for the lesions and the TDLM for the tumor markers (Fig. S4, Texts S3 and S4).
When training LDLM, the CT images put into the model contained a baseline and up to four follow-ups, which improved the robustness of the model by reducing timedependent signal noise [10,16].Then, the CT images obtained at baseline and up to two follow-up visits were incorporated into the model in the validation cohorts for early on-treatment prediction.
For multiple ROIs of a patient, the features were firstly extracted by a feature extractor and then were combined using attention weighting.Figure S5 examples a visualization of feature fusion.We first defined a learnable aggregation vector; then through the attention weighting module of Transformer, the importance of time-serial inter-lesion heterogeneity was passed onto this vector.With this aggregation, each patient would have only one outcome vector, regardless of the number of target lesions.Finally, we used a multi-layer perceptron with softmax layer to generate the predicted OS probability, a continuous variable ranging from 0 to 1, with low values indicating poor OS (< 12 months) and high values indicating good OS (> 12 months).In addition, we used masking operations to deal with the missing data in the clinical settings (Text S5).The code can be found on GitHub https:// anony mous.4open.scien ce/r/ HER2/.

Models based on RECIST v1.1 and tumor burden
We established two prognostic models based on RECIST v1.1 and measurements of tumor burden (TB-delta model).For the RECIST model, we assessed the relative changes in diameter between the baseline and the second follow-up based on RECIST criteria.TB-delta was measured on the percentage change of the summed area of target lesions between baseline and the second followup.If the patient only had baseline and the first follow-up CT images, the result of RECIST and TB-delta were calculated based on these two examinations.

Overall model using the nomogram
The deep networks paid much attention to higherorder features [17].In contrast, the size-based RECIST focused on first-order features, providing a good complement to the deep networks.We constructed a nomogram, called Nomo-LDLM, based on multiple variables (LDLM score, TDLM score, RECIST, and clinical information) through Cox regression to achieve complementary advantages.We also constructed a nomogram without the TDLM score, Nomo-w/o TDLM, for performance comparison in the external validation cohort without tumor markers.

Explanation of deep learning models
The core module of the Transformer blocks (TH-former and OH-former) modeled complex relationships between multiple components (time points and objects).We depicted the last layer of the Transformer as attention maps, which highlighted the contribution of different time points and distinguished the relationship between lesions from different organs in high-and lowrisk groups.In particular, we used Gradient-weighted Class Activation Mapping (GradCAM) algorithm to generate the heatmap visualization of the LDLM.The high-response regions in the heatmap represented that the model paid more attention to the part of lesions, indicating a strong relationship between the lesions and risk probability at the pixel level [18].

Clinical characteristics
This study finally included 207 patients with HER2positive stage IV GC from 4 centers (Table 1).Kaplan-Meier analysis in different centers and cohorts revealed no significant difference in survival (p = 0.22 and 0.064, respectively, Fig. S6).A total of 680 CT scans were used in the study, and 765 lesions (207 primary and 558 target lesions, average 3.70 ± 1.23 per patient) involving 10 anatomical sites were examined.Only 17 individuals, accounting for 8% of the total, underwent just one follow-up CT examination, and other patients collected baseline and at least two follow-up CT images.A total of 4104 bounding boxes were delineated on the arterial and venous phases of the full-body CT (only the venous phase for liver metastases).No significant differences in the number and area of target lesions were found among the four cohorts (p > 0.0083, Bonferroni correction, Tables S1 and S4).

Development and validation of LDLMs
The model was trained with 2125 bounding boxes of 331 lesions in the training cohort and then predicted the risk probability of the patients longitudinally over time.We found that the deep learning model based on two follow-ups was overall better than LDLM-BS and LDLM-1F (baseline (BS); one follow-up (1F); two follow-ups (2F), Table 3), with the one-year AUC of internal validation, external validation, and prospective cohorts 0.844 (0.673-0.971), 0.683 (0.350-0.957), and 0.690 (0.419-0.962), respectively.We selected the optimal cutoff value according to the Youden index on the training cohort.Patients with predicted risk probability greater than the cutoff value were classified as the high-risk group, and those with lower risk probability were classified as the low-risk group.Kaplan-Meier analysis was performed between the two groups to compare the prognostic stratification ability of models based on the baseline with no follow-up or up to two follow-up CT scans (Fig. 2).We found that the LDLM-BS did not yield a statistical disparity in prognosis between high-and low-risk groups among the three cohorts (p > 0.05).With the addition of the follow-up CT scans, the stratification in OS between the high-and the low-risk groups gradually became more separable (LDLM-2F: all p < 0.05, log-rank test).
In terms of prognostic prediction, the RECIST performed relatively well; however, it failed to stratify risk in all cohorts (Fig. S8).The Nomo-w/o TDLM or Nomo-LDLM-2F outperformed the other models in the four cohorts, suggesting that the addition of follow-up images provided additional information on temporal heterogeneity, enabling more precise prognostic stratification.The nomogram with TDLM performed the best in the internal validation and prospective cohorts, indicating that the addition of tumor markers refined the multidimensional model (Table 3, Fig. 4B, C).

GradCAM for visualization of regions highlighted in LDLM
We used GradCAM to localize the saliency information correlated to prognosis for clustering lesions where tumors or peripheral areas were activated.GradCAM identified the superiority of LDLM in revealing the spatiotemporal differences between high-and low-risk groups.Figure 5 displays two patients, one from the low-risk group (I) HER2 3 + , assessed as SD at the second follow-up (TB = -27.49%); the disease progressed after 25 cycles of treatment, and the OS was 34.1 months.GradCAM mapping focused on the primary tumor and the marginal part of the lymph node.Another patient in the high-risk group (II) was also HER2 3 + and diagnosed as SD at the second follow-up (TB = -11.34%); the disease progressed at the third follow-up, and the OS was 10.97 months.In the GradCAM maps, LDLM paid great attention to the primary tumor and liver metastases.As shown in Fig. 6A, the lymph nodes had a strong synergistic correlation with peritoneal metastases in the high-risk group.In general, the lymph nodes had a high correlation with other metastases.We also observed the importance of attention patterns on different time points in the high-and low-risk groups (Fig. 6B).LDLM focused more on the first follow-up in the high-risk group compared with the low-risk group, suggesting that patients with poor prognosis were more likely to occur drug resistance at an early stage.

Discussion
We established a nomogram built with clinical data and radiological signatures using multi-lesion and time series CT images from multiple centers, which could estimate the OS outcomes of patients with stage IV GC treated with anti-HER2 targeted therapy at an early stage.
The heterogeneity between primary and metastatic lesions is crucial for treatment evaluation and clinical decision-making.The high lesion-level heterogeneity of GC makes it difficult to accurately assess the overall treatment response from a single lesion [19,20].Most previous radiomics studies used only primary [9,21] or target lesions [10] to predict prognosis.In our ablation experiments, we found that using primary lesions alone was not effective while using target lesions alone achieved higher but unsatisfactory performance.Our model simultaneously considered primary and target lesions, achieving promising performance and suggesting that the complementary information was useful (Table S5).We exploited GradCAM maps to visualize the importance of inter-tumor interactions in LDLM.The lesion interaction pattern of high-risk patients was more complex.It focused the most on the relationship between peritoneal metastases with primary tumors and lymph nodes, which might be related to the highly invasive biological behavior.
Furthermore, tracking tumor evolution is essential to predict the prognosis of targeted therapies.Previous radiomics studies focused on pre-treatment images at a single time point.In this study, we provided explicit temporal information by introducing temporal position, enabling the model to distinguish baseline from followup scans.Embedding without temporal position would weaken the model's performance because it focused only on the overall characteristics of a patient at multiple time points rather than sequential information (Table S6).The follow-up CT scans are a routine part of antitumor treatment practice.Our model provided additional dynamic information over time without extra invasive examinations, helping clinicians assess patients' suitability for anti-HER2 targeted therapy.
Further, previous studies used only the cross-entropy loss function [22,23] or survival loss [24,25] for training the prognostic model.However, Table S7 shows that using " l ce + l surv " outperformed using either alone.The cross-entropy loss paid more attention to the discernible representations between different groups, and the survival loss was responsible for ordering the relationship between all samples, providing additional information for marginal samples.We verified through ablation experiments that aggregating both losses improved the model's performance (Text S6 and Tables S7 and S8).
This study had several limitations.We included patients from multiple centers; however, the sample size was still limited.Hence, we collected up to four follow-up CT images in the training cohort to reduce time-dependent signal noise.We also set multiple random seeds to ensure reproducibility (Text S7 and Table S9).Furthermore, although we used the bounding boxes to reduce workload, the radiologists still manually segmented the ROIs.Moreover, this was a single imaging modality prediction model.We hope to incorporate other modalities such as magnetic resonance imaging or pathological images to make it more reliable and robust.We have developed in-house software at our center (Additional file 2: video), experienced fellows quickly annotated the ROIs for risk probability calculations, and combine clinical information and RECIST evaluation results to assess patients' long-term prognosis.Thus, data from larger  In conclusion, this study demonstrated, based on baseline and early follow-up CT images, that deep learning models could predict the OS in patients with stage IV GC receiving anti-HER2 targeted therapy.The analysis of multi-lesion and time series CT images can simultaneously focus on the spatiotemporal heterogeneity of stage IV GC, which may help clinicians make early treatment decisions.

Fig. 2
Fig. 2 Overall survival Kaplan-Meier analysis was performed with the cutoff value derived from the training cohort by the Youden index.Internal validation, external validation, and prospective cohorts were stratified into high-and low-risk groups by LDLM with no follow-up to two follow-ups (p < 0.05, log-rank test)

Fig. 3 Fig. 4 A
Fig. 3 Developed deep learning nomogram.The nomogram was built in the training cohort, incorporating the deep learning signature, RECIST, tumor markers, sex, and HER2 status (A).Calibration curve of the nomogram in the training, internal validation, and prospective cohorts (B)

Fig. 5
Fig. 5 Visualization of the LDLM model using the GradCAM method.Attention maps for two patients with stage IV GC are shown in this figure.They were both assessed as SD at the second follow-up and had varying survival times.The LDLM model paid more attention to the strong response areas valuable for OS prediction.LDLM, lesion-based deep learning model

Table 1
Characteristics of patients, lesions, and tumor markers * Chi's square test or Fisher's

Table 3
Performance comparisons of different models in predicting overall survival and AUCs in predicting one-year survival in four cohorts AUC Area under the curve, C-index Concordance index, HR Hazard ratio