BAMBI: Developing Baby Language Models for Italian

Abstract

This paper presents BAMBI (BAby language Models Bootstrapped for Italian), a series of Baby Language Models (BabyLMs) trained on data that mimic the linguistic input received by a five-year-old Italian-speaking child. The BAMBI models are tested using a benchmark specifically designed to evaluate language models, which takes into account the amount of training input the models received. The BAMBI models are compared against a large language model (LLM) and a multimodal language model (VLM) to study the contribution of extralinguistic information to language acquisition. The results of our evaluation align with the existing literature on English language models, confirming that while reduced training data support the development of relatively robust syntactic competence, they are insufficient for fostering semantic understanding. However, the gap between the training resources (data and computation) of the BAMBI models and the LLMs is not fully reflected in their performance: despite LLMs' massive training, their performance is not much better than that of BAMBI models. This suggests that strategies beyond scaling training resources, such as data curation, inclusion of multimodal input, and other training strategies such as curriculum learning, could play a crucial role in shaping model performance.
