Abstract
We critically examine the scientific methodology behind contemporary large language model (LLM) research. To do so, we assess over 2,000 research works released between 2020 and 2024 against criteria typical of what is considered good research (e.g., presence of statistical tests and reproducibility), and cross-validate this assessment with arguments that are at the centre of controversy (e.g., claims of emergent behaviour). We find multiple trends, such as declines in ethics disclaimers, a rise of LLMs as evaluators, and an increase in claims of LLM reasoning abilities without leveraging human evaluation. We note that conference checklists are effective at curtailing some of these issues, but balancing velocity and rigour in research cannot rely solely on these. We tie all these findings to those of recent meta-reviews and extend recommendations on how to address what does, does not, and should work in LLM research.