Natural language processing for automatic evaluation of free-text answers — a feasibility study based on the European Diploma in Radiology examination

Insights into Imaging

Table 5 Results of the NLP analysis for all three questions. Results were compared with the gold standard (markings provided by the independent board-certified reviewers) and averaged over ten cycles

Dataset	Macro precision	Macro recall	Macro F1 score	Weighted precision	Weighted recall	Weighted F1 score
Case 980 “unstructured question 1” n = 96	0.24	0.24	0.24	0.26	0.27	0.26
Case 959 “unstructured question 2” n = 327	0.29	0.32	0.26	0.40	0.32	0.33
Case 457 “more structured question” n = 111	0.60	0.47	0.45	0.62	0.55	0.50
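
To make the difference between the macro and weighted columns in Table 5 concrete, the following is a minimal sketch of how such averages are conventionally computed for a multi-class marking task. It is not the study's own evaluation code: the function name `prf_report` and the toy marks are hypothetical and chosen for illustration only. The sketch assumes each answer receives exactly one mark (class) from the gold standard and one from the NLP model.

```python
from collections import Counter


def prf_report(y_true, y_pred):
    """Return macro and weighted (precision, recall, F1) for single-label data.

    Hypothetical helper for illustration; not taken from the study.
    """
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of gold-standard items per class
    pairs = list(zip(y_true, y_pred))

    per_class = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = (precision, recall, f1)

    n = len(y_true)
    # Macro: unweighted mean over classes, so rare marks count as much as common ones.
    macro = tuple(
        sum(per_class[label][i] for label in labels) / len(labels) for i in range(3)
    )
    # Weighted: mean over classes, weighted by how often each mark occurs in the gold standard.
    weighted = tuple(
        sum(per_class[label][i] * support[label] for label in labels) / n
        for i in range(3)
    )
    return {"macro": macro, "weighted": weighted}


# Invented toy marks (0, 1 or 2 points), not data from the examination.
gold_marks = [0, 0, 0, 0, 1, 1, 2, 2]
model_marks = [0, 0, 0, 1, 1, 2, 2, 2]

report = prf_report(gold_marks, model_marks)
for name, (precision, recall, f1) in report.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because the macro F1 score is the mean of the per-class F1 scores rather than the harmonic mean of macro precision and macro recall, it can fall below both of them, as in the row for Case 959.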