Theoretical and Methodological Framework for Studying Texts Produced by Large Language Models

Abstract

This paper addresses the conceptual, methodological and technical challenges in studying large language models (LLMs) and the texts they produce from a quantitative linguistics perspective. It builds on a theoretical framework that distinguishes between the LLM as a substrate and the entities the model simulates. The paper advocates for a strictly non-anthropomorphic approach to models while cautiously applying methodologies used in studying human linguistic behavior to the simulated entities. While natural language processing researchers focus on the models themselves, their architecture, evaluation, and methods for improving performance, we as quantitative linguists should strive to build a robust theory concerning the characteristics of texts produced by LLMs, how they differ from human-produced texts, and the properties of simulated entities. Additionally, we should explore the potential of LLMs as an instrument for studying human culture, of which language is an integral part.
