Abstract
Building effective and efficient Transformer-based large language models (LLMs) has recently become a research focus, which requires maximizing model language capabilities while minimizing training and deployment costs. Existing efforts have primarily described complex relationships among model performance, parameter size, and data size, and have searched for the optimal compute allocation to train LLMs. However, they overlook the impacts of context length and attention head configuration (the number of query and key-value heads in grouped-query attention) on training and inference. In this paper, we systematically compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost. Then, we extend the existing scaling methods, which are based solely on parameter size and training compute, to guide the construction of cost-optimal LLMs during both training and inference. Our quantitative scaling studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs. Our findings provide valuable insights for developing practical LLMs, especially in long-context processing scenarios. We will publicly release our code and data.