The Invalsi Benchmark: measuring Language Models' Mathematical and Language understanding in Italian

Abstract

While Italian is by all metrics a high-resource language, there is currently no language model pre-trained exclusively on Italian. As a result, fewer benchmarks are available to evaluate the performance of language models in Italian. This work presents two new benchmarks that evaluate models' mathematical understanding and language understanding in Italian. These benchmarks are based on real tests taken by students aged 11 to 18 within the Italian school system, and have therefore been validated by several experts in didactics and pedagogy. To validate this dataset, we evaluate the performance of the 9 language models that perform best when writing in Italian, including our own fine-tuned models. We show that this is a challenging benchmark on which current language models do not exceed 60% accuracy. We believe that the release of this dataset paves the way for improving future models' mathematical and language understanding in Italian.
