Domain Regeneration: How well do LLMs match syntactic properties of text domains?

Abstract

Recent improvements in large language model performance have, in all likelihood, been accompanied by improvements in how well they can approximate the distribution of their training data. In this work, we explore the following question: which properties of text domains do LLMs faithfully approximate, and how well do they do so? Applying observational approaches familiar from corpus linguistics, we prompt a commonly used, open-source LLM to regenerate text from two domains of permissively licensed English text which are often contained in LLM training data: Wikipedia and news text. This regeneration paradigm allows us to investigate whether LLMs can faithfully match the original human text domains in a fairly semantically controlled setting. We investigate varying levels of syntactic abstraction, from simpler properties like sentence length and article readability to more complex, higher-order properties such as dependency tag distribution, parse depth, and parse complexity. We find that the majority of the regenerated distributions show a shifted mean, a lower standard deviation, and a reduction of the long tail, as compared to the human originals.
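As a rough illustration of the kind of comparison described above, the following sketch computes the mean, standard deviation, and an upper-tail quantile of sentence lengths for two texts. It is not the paper's pipeline: the regex sentence splitter, the whitespace tokenization, the nearest-rank quantile, the helper names `sentence_lengths` and `summarize`, and the two toy strings are all assumptions made for illustration.

```python
import re
import statistics


def sentence_lengths(text):
    """Return the number of whitespace-delimited tokens in each sentence.

    Assumption: sentences end at '.', '!' or '?' followed by whitespace.
    A real study would use a proper sentence segmenter.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def summarize(lengths, tail_quantile=0.95):
    """Summarize a length distribution: mean, population std, tail, max."""
    ordered = sorted(lengths)
    # Nearest-rank style index for the upper-tail quantile (an assumption).
    idx = min(len(ordered) - 1, int(tail_quantile * len(ordered)))
    return {
        "mean": statistics.mean(ordered),
        "std": statistics.pstdev(ordered),
        "tail": ordered[idx],
        "max": ordered[-1],
    }


# Toy inputs, invented for illustration only (not data from the paper).
human_text = (
    "The river rose. After three days of rain that nobody had forecast, "
    "the town council met in the old hall to decide what to do. They argued."
)
regenerated_text = (
    "The river rose after heavy rain. The town council met to discuss it. "
    "They debated the options."
)

human_stats = summarize(sentence_lengths(human_text))
regenerated_stats = summarize(sentence_lengths(regenerated_text))
```

Comparing `human_stats` with `regenerated_stats` field by field gives a simple view of whether the mean has shifted, the spread has narrowed, or the longest sentences have disappeared; the same summary could be applied to other per-sentence measurements such as parse depth.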
