
Table 5. Results of the NLP analysis for all three questions. Results were compared with the gold standard (markings provided by the independent board-certified reviewers) and averaged over ten cycles.

From: Natural language processing for automatic evaluation of free-text answers — a feasibility study based on the European Diploma in Radiology examination

| Dataset | Macro precision | Macro recall | Macro F1 score | Weighted precision | Weighted recall | Weighted F1 score |
|---|---|---|---|---|---|---|
| Case 980 “unstructured question 1” (n = 96) | 0.24 | 0.24 | 0.24 | 0.26 | 0.27 | 0.26 |
| Case 959 “unstructured question 2” (n = 327) | 0.29 | 0.32 | 0.26 | 0.4 | 0.32 | 0.33 |
| Case 457 “more structured question” (n = 111) | 0.60 | 0.47 | 0.45 | 0.62 | 0.55 | 0.50 |
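
The difference between the macro and weighted columns can be illustrated with a small sketch. This is not the authors' code: the function name `averaged_scores`, the label names, and the toy data are hypothetical, and it assumes the common convention in which per-class scores are averaged either equally (macro) or in proportion to each class's frequency in the gold standard (weighted).

```python
from collections import Counter


def averaged_scores(gold, predicted):
    """Return macro- and weighted-averaged (precision, recall, F1).

    gold and predicted are equal-length lists of class labels.
    Hypothetical helper for illustration only.
    """
    labels = sorted(set(gold) | set(predicted))
    support = Counter(gold)  # number of gold-standard items per class
    total = len(gold)
    pairs = list(zip(gold, predicted))

    per_class = {}
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[label] = (precision, recall, f1)

    # Macro: every class counts equally, however rare it is.
    macro = tuple(
        sum(per_class[l][i] for l in labels) / len(labels) for i in range(3)
    )
    # Weighted: each class counts in proportion to its gold-standard support.
    weighted = tuple(
        sum(per_class[l][i] * support[l] for l in labels) / total
        for i in range(3)
    )
    return {"macro": macro, "weighted": weighted}


# Toy data with made-up grading labels (not taken from the study).
gold = ["correct", "correct", "correct", "partial", "wrong", "wrong"]
predicted = ["correct", "correct", "partial", "partial", "wrong", "correct"]

scores = averaged_scores(gold, predicted)
for name, (p, r, f1) in scores.items():
    print(name, round(p, 2), round(r, 2), round(f1, 2))
```

Under this convention the macro F1 score is the mean of the per-class F1 values, not the harmonic mean of macro precision and macro recall, so it can fall below both of them, as in the Case 959 row. Weighted recall equals overall accuracy, because each class's recall is weighted by its share of the gold-standard answers.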