Abstract
While large language models exhibit certain cross-lingual generalization capabilities, they suffer from performance degradation (PD) on unseen closely-related languages (CRLs) and dialects relative to their high-resource language neighbour (HRLN). However, we currently lack a fundamental understanding of what kinds of linguistic distances contribute to PD, and to what extent. Furthermore, studies of cross-lingual generalization are confounded by unknown quantities of CRL language traces in the training data, and by the frequent lack of availability of evaluation data in lower-resource related languages and dialects. To address these issues, we model phonological, morphological, and lexical distance as Bayesian noise processes to synthesize artificial languages that are controllably distant from the HRLN. We analyse PD as a function of underlying noise parameters, offering insights on model robustness to isolated and composed linguistic phenomena, and the impact of task and HRL characteristics on PD. We calculate parameter posteriors on real CRL-HRLN pair data and show that they follow computed trends of artificial languages, demonstrating the viability of our noisers. Our framework offers a cheap solution for estimating task performance on an unseen CRL given HRLN performance using its posteriors, as well as for diagnosing observed PD on a CRL in terms of its linguistic distances from its HRLN, and opens doors to principled methods of mitigating performance degradation.