Abstract
This paper introduces a novel, entity-aware metric, termed Radiological Report (Text) Evaluation (RaTEScore), to assess the quality of medical reports generated by AI models. RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions. Technically, we developed a comprehensive medical NER dataset, RaTE-NER, and trained an NER model specifically for this purpose. This model enables the decomposition of complex radiological reports into constituent medical entities. The metric itself is derived by comparing the similarity of entity embeddings, obtained from a language model, based on their types and clinical relevance. Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
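
To make the entity-level comparison concrete, the following is a minimal illustrative sketch and not the RaTEScore implementation. It assumes that entities have already been extracted as (text, type) pairs, that similarity is only counted between entities of the same type, and that the two directed scores are combined with a harmonic mean. The type labels, the hand-written toy embeddings, and all function names are hypothetical; a real system would obtain embeddings from a language model and entities from an NER model.

```python
import math
from typing import Dict, List, Tuple

Entity = Tuple[str, str]  # (entity text, entity type); type labels are hypothetical

# Hand-written toy embeddings standing in for language-model embeddings (assumption).
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "pleural effusion": [1.0, 0.0, 0.0],
    "fluid in pleural space": [0.9, 0.1, 0.0],
    "left lung": [0.0, 1.0, 0.0],
}


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_match(entity: Entity, others: List[Entity]) -> float:
    """Highest similarity to any same-type entity in `others`, or 0.0 if none."""
    text, etype = entity
    sims = [
        max(0.0, cosine(TOY_EMBEDDINGS[text], TOY_EMBEDDINGS[other_text]))
        for other_text, other_type in others
        if other_type == etype  # type gate: a negated finding has a different type
    ]
    return max(sims, default=0.0)


def directed_score(source: List[Entity], target: List[Entity]) -> float:
    """Average best-match similarity of `source` entities against `target`."""
    if not source:
        return 0.0
    return sum(best_match(e, target) for e in source) / len(source)


def entity_score(candidate: List[Entity], reference: List[Entity]) -> float:
    """Harmonic mean of the precision-like and recall-like directed scores."""
    p = directed_score(candidate, reference)
    r = directed_score(reference, candidate)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


reference = [("pleural effusion", "abnormality"), ("left lung", "anatomy")]
synonym = [("fluid in pleural space", "abnormality"), ("left lung", "anatomy")]
negated = [("pleural effusion", "non-abnormality"), ("left lung", "anatomy")]

score_same = entity_score(reference, reference)
score_synonym = entity_score(synonym, reference)
score_negated = entity_score(negated, reference)
```

In this toy setup the synonym report scores close to the identical report because its embedding is nearly parallel to the reference entity, while the negated report scores lower because the negated finding carries a different type and therefore finds no match.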