Has the quality of reporting improved since it became mandatory to use the Standards for Reporting Diagnostic Accuracy?

Objectives To investigate whether the leading journal Radiology making the Standards for Reporting Diagnostic Accuracy (STARD) mandatory in 2016 improved the quality of reporting of diagnostic accuracy studies.

Methods A validated search term was used to identify diagnostic accuracy studies published in Radiology in 2015 and 2019. STARD adherence was assessed by two independent reviewers. Each item was scored as yes (1 point) if adequately reported or as no (0 points) if not, and the total STARD score per article was calculated. Wilcoxon–Mann–Whitney tests were used to evaluate differences in the total STARD scores between 2015 and 2019. In addition, the total STARD score was compared between studies stratified by study design, citation rate, and mode of data collection.

Results The median number of reported STARD items for the total of 66 diagnostic accuracy studies from 2015 and 2019 was 18.5 (interquartile range [IQR] 17.5–20.0) of 29. The total STARD score significantly improved from a median of 18.0 (IQR 15.5–19.5) in 2015 to a median of 19.5 (IQR 18.5–21.5) in 2019 (p < 0.001). No significant differences were found between studies stratified by mode of data collection (prospective vs. retrospective studies, p = 0.68), study design (cohort vs. case–control studies, p = 0.81), or citation rate (two groups divided by median split [< 0.56 citations/month vs. ≥ 0.56 citations/month], p = 0.54).

Conclusions Making use of the STARD checklist mandatory significantly increased adherence to reporting standards for diagnostic accuracy studies and should be considered by editors and publishers for widespread implementation.

Critical relevance statement Editors may consider making reporting guidelines mandatory to improve scientific quality.

Supplementary Information The online version contains supplementary material available at 10.1186/s13244-023-01432-7.


Introduction
Diagnostic accuracy studies play an important role in introducing a new diagnostic test into clinical practice [1] because diagnostic test accuracy compared with an established reference standard provides information about how well diagnostic tests may improve clinical decision making [2]. Diagnostic accuracy studies are at risk of bias [3,4] because measures of diagnostic accuracy, such as sensitivity and specificity, are not fixed values but reflect the performance of the index test under certain study and test circumstances [2,4-6]. Therefore, a detailed description of the methodology, setting, and subjects is crucial for readers to judge the trustworthiness of the results (internal validity) and appraise the applicability of the medical test in clinical practice (external validity, i.e., generalizability) [5].
In the past, studies published in journals with high impact factors had shortcomings in reporting diagnostic accuracy, leading to overestimation of test performance and improper recommendations with disadvantages for patient outcomes [7]. Furthermore, "incomplete reporting has been identified as a major source of avoidable waste in biomedical research" [8] and growing health care costs [9,10]. Following the successful CONSORT (Consolidated Standards of Reporting Trials) initiative [11], the Standards for Reporting Diagnostic Accuracy (STARD) statement was published in 2003 [12] and updated in 2015 [8]. It consists of a checklist of 30 essential items to guide authors in planning and reporting diagnostic accuracy studies [8]. Since then, STARD has been endorsed by more than 200 biomedical journals [13].
In February 2016, the use of reporting guideline checklists became mandatory for all original research manuscripts submitted to Radiology, which had endorsed STARD since its publication [14,15]. We used this as an opportunity to investigate the reporting quality of diagnostic accuracy studies published in Radiology before and after guideline implementation and to evaluate whether reporting quality improved after mandating reporting guideline use. Further, we analyzed whether the total STARD score differed between studies stratified by study design, citation rate, and data collection.

Methods
Although this analysis does not fulfill all criteria of a meta-analysis, it complied with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [16]. Because it is not a systematic review or meta-analysis, our analysis was not eligible for registration in the international prospective register of systematic reviews (PROSPERO) [17].

Literature search
To identify diagnostic accuracy studies published in Radiology in 2015 and 2019, we performed a systematic literature review in MEDLINE (using PubMed) based on a validated search strategy proposed by Devillé et al. [18] (Additional file 1: Table S1). The full search strategy is detailed in Additional file 1: Table S2. Additionally, we manually searched the website of Radiology for eligible studies not identified in MEDLINE. PubMed was last searched on April 8, 2020; the website of Radiology on June 23, 2020.

Study selection
Articles were included if (1) the article reported at least one measure of diagnostic accuracy (sensitivity, specificity, likelihood ratios, predictive values, area under the receiver operating characteristic curve, accuracy), (2) the results of at least one medical imaging test were compared against a reference standard, and (3) the study was conducted in human subjects. Articles dealing with predictive or prognostic accuracy as well as commentaries, editorials, letters, reviews, and model development studies were excluded. Two reviewers (A.S., an advanced medical student with 3 years of experience in performing literature reviews of diagnostic accuracy studies, and A.T., a dentist with 1 year of experience in this field) independently reviewed all studies for inclusion; discrepancies were resolved in consensus meetings with a third reviewer (B.K., a physician with 8 years of experience in radiological research). First, we screened all titles, keywords, and abstracts to identify potentially eligible articles. Then, the full texts of the remaining articles were assessed for eligibility. The following information was extracted from each included article: publication year (2015 vs. 2019), mode of data collection (prospective vs. retrospective), and study design (cohort vs. case–control study).

Adherence to STARD
Although two studies reported good reproducibility of the STARD checklist [19,20], two reviewers (A.S., A.T.) independently pilot-tested the STARD checklist on four articles from 2014 to 2020. Uncertainties regarding the explanation and elaboration of each item were discussed to ensure that the reviewers agreed on the interpretation of the STARD criteria. For the purpose of our analysis, we excluded item 11 (rationale for choosing the reference standard (if alternatives exist)) from the STARD checklist, following the approach of Wilczynski [5,21]: if no information regarding this item was found in an article, it was not possible to reliably determine whether the authors simply forgot to mention it in the manuscript or omitted it because no alternatives existed. Thus, the final checklist consisted of 29 items. Each item was scored as yes (1 point) if adequately reported or as no (0 points) if not. As items 10, 12, 13, and 21 refer to both the index test and the reference standard, we split these items and counted each of the two modalities as half an item (0.5 points). Both reviewers (A.S., A.T.) independently evaluated all included articles according to the 29-item checklist. Discrepancies were resolved in consensus meetings; if no consensus could be reached, a third reviewer (B.K.) made the final decision. Reviewers were not blinded to journal, publication year, or authors. The reviewers did not evaluate the methodological quality of the studies [22] but the quality of reporting [8].
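The half-point scheme for the split items can be illustrated with a short sketch (in Python for illustration only; the study's own analysis was performed in R, and the item numbers and ratings below are hypothetical examples, not data from the study):

```python
# Illustrative sketch of the per-article STARD scoring described above.
# Items 10, 12, 13, and 21 are split between index test and reference
# standard, each half contributing 0.5 points.
HALF_POINT_ITEMS = {10, 12, 13, 21}

def stard_score(ratings):
    """ratings maps item number -> 'yes'/'no' for whole items, or a
    (index_test, reference_standard) pair of 'yes'/'no' for split items.
    Returns the total STARD score (0 to 29)."""
    score = 0.0
    for item, rating in ratings.items():
        if item in HALF_POINT_ITEMS:
            score += 0.5 * sum(r == "yes" for r in rating)
        else:
            score += 1.0 if rating == "yes" else 0.0
    return score
```

For example, a hypothetical article rated yes on item 1, yes for the index test but no for the reference standard on item 10, and no on item 28 would receive 1.5 points.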

Data and statistical analysis
We calculated the total STARD score for each included article by adding the number of reported STARD items (range, 0–29), and report the median and interquartile range (IQR) of the total STARD scores. Assuming that each item is of equal weight, a higher score suggests better reporting quality. The Wilcoxon–Mann–Whitney test was used to compare the STARD score between papers published in 2015 and papers published in 2019. This comparison was performed for all studies as well as for the following subgroups: prospective studies, retrospective studies, cohort studies, case–control studies, studies with a citation rate above the median, and studies with a citation rate below the median. In addition, the Wilcoxon–Mann–Whitney test was applied to analyze whether the total STARD score differed between studies stratified by study design (cohort vs. case–control studies), citation rate (equal to or above vs. below the median citation rate), and mode of data collection (prospective vs. retrospective). Vargha and Delaney's A was used as the effect size measure.
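Vargha and Delaney's A has a direct interpretation: it is the probability that a randomly chosen score from one group exceeds a randomly chosen score from the other, with ties counting half, so A = 0.5 indicates no effect. A minimal sketch (in Python for illustration; the study's analysis itself was performed in R):

```python
from itertools import product

def vargha_delaney_a(group1, group2):
    """Vargha and Delaney's A: P(X > Y) + 0.5 * P(X == Y) for random
    draws X from group1 and Y from group2. A = 0.5 means no effect;
    values near 0 or 1 indicate large effects."""
    wins = sum(1.0 if x > y else (0.5 if x == y else 0.0)
               for x, y in product(group1, group2))
    return wins / (len(group1) * len(group2))
```

Note that A equals the Mann–Whitney U statistic of the first group divided by the number of pairs, so it can be read off the same ranking that the Wilcoxon–Mann–Whitney test uses.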
The citation rate was calculated by dividing the total number of times each article had been cited by April 30, 2021, by the total number of months since publication (print version). These numbers were provided by the citation index reported in Web of Science (Thomson Reuters, New York, NY, USA).
Cohen's κ statistic was used to calculate interrater reliability. According to Landis and Koch [23], a κ value of 0.41–0.60 indicates moderate agreement; 0.61–0.80, substantial agreement; and 0.81–1.00, (almost) perfect agreement between the reviewers. p values less than 0.05 were considered statistically significant. The code for the statistical analysis was written in R, version 4.2.0.
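Cohen's κ compares the observed agreement between the two reviewers with the agreement expected by chance from each reviewer's marginal rating frequencies. A minimal sketch (again in Python for illustration, with hypothetical ratings; the study's code was written in R):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters' categorical judgments (e.g., the
    per-item yes/no STARD ratings of the two reviewers)."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    # Observed proportion of items on which the raters agree
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from the raters' marginal distributions
    freq1, freq2 = Counter(rater1), Counter(rater2)
    expected = sum(freq1[c] * freq2[c] for c in freq1) / n ** 2
    return (observed - expected) / (1 - expected)
```

With hypothetical ratings where the reviewers agree on 3 of 4 items, this yields κ = 0.5, i.e., moderate agreement on the Landis and Koch scale.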

Results

Search results and study characteristics
Details of the search results and the characteristics of the 66 included studies are provided in Table 1.

Adherence to STARD
The median number of reported STARD items for the 66 diagnostic accuracy studies analyzed was 18.5 (IQR 17.5-20.0) of 29, with a range of 13 to 24.5. A list of all included studies with individual total STARD scores is provided in Additional file 1: Table S3.
Studies published in 2019 showed a 2.2-point higher total STARD score (95% CI 1.…) than studies published in 2015. No difference in the total STARD score was found between studies stratified by mode of data collection (p = 0.68, Vargha and Delaney's A = 0.47), study design (p = 0.81, Vargha and Delaney's A = 0.53), or citation rate (p = 0.54, Vargha and Delaney's A = 0.54). Detailed results are provided in Table 2.

Item-specific adherence to STARD
The results for adherence to individual STARD items and comparisons of reporting frequencies between studies published in 2015 and 2019 are shown in Table 3.

Discussion
Shortcomings in reporting diagnostic accuracy studies hamper an objective assessment of the clinical performance of diagnostic tests [24]. To improve reporting quality, the STARD statement was developed [12]. In our analysis, we assessed the reporting quality of 66 diagnostic accuracy studies published before and after use of the STARD guidelines became mandatory. We found that (1) adherence to the STARD 2015 checklist was moderate (median 18.5 of 29 items), (2) mandating guideline use had a significant effect on the total STARD score (p < 0.001), and (3) further improvement is especially necessary to ensure adequate reporting of items that are prone to bias and variation [3,8], such as prespecified definitions of test positivity cutoffs, handling of indeterminate and missing results, sample size calculations, and cross-tabulations.
Compared with a previous study by our group on diagnostic accuracy studies published in European Radiology [25], we found a higher average number of reported items. This could be because European Radiology is a STARD-endorsing journal, whereas use of the STARD checklist is mandatory for studies submitted to Radiology. Making STARD and other checklists mandatory may therefore be considered by Insights into Imaging and other journals of the European Society of Radiology Journal Family to improve the quality of reporting.
Choi et al. [26] reported a mean total STARD score of 20 of 27 items (74%), indicating a relatively high overall reporting quality. This could be because the authors excluded item 28 (providing a registration number). In our study, we found the lowest adherence rate for this item (9%, 6/66), which might have lowered our total scores. Furthermore, Choi et al. also found no effect of the citation rate on STARD adherence. This is in line with the results reported by Hogan et al. [27] in 2020 and in contrast with the results of the large assessment by Dilauro et al. [28], who found a weak positive correlation between the total STARD score and the citation rate. Most of the above-mentioned studies additionally compared the reporting quality of diagnostic accuracy studies in journals that had endorsed STARD with those that had not. Their results revealed that STARD endorsement had a relevant impact on the total STARD score [26,27,29].
To the best of our knowledge, ours is the first investigation explicitly assessing the impact of mandatory guideline use on reporting quality over time.
A summary of the relevant literature on STARD adherence is provided in Table 4.
Our study has some potential limitations. First, we searched MEDLINE using a validated search strategy to identify relevant diagnostic accuracy studies. Since this search strategy has 80.0% sensitivity and 97.3% specificity [18], some studies may not have been captured by our search filter. We minimized this risk by additionally identifying further studies through a manual search of the journal's website. Second, we excluded item 11, with the qualifier "if alternatives exist", from the original STARD 2015 checklist for the reasons mentioned above. This may have affected the results of our analysis depending on the performance of item 11. Additionally, we focused on a single journal to be able to draw direct comparisons after the policy change in 2016. Because of these two points, the generalizability of our results may be limited, and further studies in journals making such a policy change are warranted. Also, by choosing articles published in 2019 instead of 2020 or 2021, the immediacy of our data might be affected. We made this decision because of the ongoing COVID-19 crisis since 2020, which brought a large increase in submissions on this single topic and a reduction in diagnostic accuracy studies. Third, we were rather strict in assigning scores. For example, baseline characteristics (item 20) were only judged as satisfactorily reported when information other than sex and age, such as underlying conditions, was also provided. In addition, several items are prone to subjective assessment. To reduce rater bias, we explicitly defined each item, performed pilot exercises, and resolved discrepancies in consensus meetings. Finally, the update of STARD was released in October 2015. Consequently, some authors of studies published in 2015 may not yet have had access to the revised checklist.
Nevertheless, we decided to use this list for all studies because the update was intended to facilitate the use of STARD and to highlight items prone to bias and variation, as suggested by recent evidence [8]. Interestingly, five of the nine new checklist items were already frequently reported in our study sample: items 2 (structured summary), 3 (clinical background), 26 (study limitations), 27 (implications for practice), and 30 (sources of funding), which may suggest that reporting of these items has already been adopted.
In conclusion, our results showed overall adherence to reporting guidelines in diagnostic accuracy studies to be moderate to good. With the STARD guidelines being mandatory since 2016, studies published in 2019 had a relevantly higher total STARD score than those published in 2015. Making the STARD guidelines mandatory may thus positively affect the reporting quality of diagnostic accuracy studies. This should encourage journals and publishers to add mandatory reporting guidelines to their author instructions.

Abbreviations
CI: Confidence interval
IQR: Interquartile range
STARD: Standards for Reporting Diagnostic Accuracy