Abstract
Seminal work by Huebner et al. (2021) showed that language models (LMs) trained on English Child-Directed Language (CDL) can reach syntactic abilities similar to those of LMs trained on much larger amounts of adult-directed written text, suggesting that CDL could provide more effective LM training material than the commonly used internet-crawled data. However, the generalizability of these results across languages, model types, and evaluation settings remains unclear. We test this by comparing models trained on CDL vs. Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal-pair benchmarks. Our results on these benchmarks show inconsistent benefits of CDL, with CDL-trained models in most cases outperformed by Wikipedia-trained models. We then identify various shortcomings in previous benchmarks and introduce a novel testing methodology, FIT-CLAMS, which uses a frequency-controlled design to enable balanced comparisons across training corpora. Through minimal-pair evaluations and regression analysis, we show that training on CDL does not yield stronger generalizations for acquiring syntax, and we highlight the importance of controlling for frequency effects when evaluating syntactic ability.