Context-Scaling versus Task-Scaling in In-Context Learning

Abstract

Transformers exhibit In-Context Learning (ICL), where these models solve new tasks by using examples in the prompt without additional training. In our work, we identify and analyze two key components of ICL: (1) context-scaling, where model performance improves as the number of in-context examples increases, and (2) task-scaling, where model performance improves as the number of pre-training tasks increases. While transformers are capable of both context-scaling and task-scaling, we empirically show that standard Multi-Layer Perceptrons (MLPs) with vectorized input are only capable of task-scaling. To understand how transformers are capable of context-scaling, we first propose a significantly simplified transformer architecture without key, query, or value weights. We show that it performs ICL comparably to the original GPT-2 model in various statistical learning tasks, including linear regression and teacher-student settings. Furthermore, a single block of our simplified transformer can be viewed as a data-dependent feature map followed by an MLP. This feature map on its own is a powerful predictor that is capable of context-scaling but is not capable of task-scaling. We show empirically that concatenating the output of this feature map with vectorized data as an input to MLPs enables both context-scaling and task-scaling. This finding provides a simple setting to study context-scaling and task-scaling for ICL.
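
To make the notion of context-scaling concrete, the following is a minimal illustrative sketch and not the architecture from the paper. It assumes one possible reading of "no key, query, value weights": a single softmax attention step with all three weight matrices set to the identity, so the prediction for a query is a softmax-weighted average of the in-context labels. The function names, the Gaussian in-context linear regression setup, and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def softmax_feature_map(X_ctx, y_ctx, x_query):
    """Weight-free attention readout (illustrative assumption, not the paper's model).

    Scores are plain inner products between the query and each context input,
    i.e. attention with identity key, query, and value matrices.
    """
    scores = X_ctx @ x_query
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()
    # Prediction: softmax-weighted average of the context labels.
    return float(weights @ y_ctx)


def mean_squared_error(n_context, n_tasks=2000, dim=2, seed=0):
    """Average squared error over freshly sampled linear regression tasks.

    Each task draws its own weight vector w, so nothing is learned across
    tasks: any improvement comes only from having more in-context examples.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_tasks):
        w = rng.standard_normal(dim)                      # hidden task vector
        X_ctx = rng.standard_normal((n_context, dim))     # in-context inputs
        x_query = rng.standard_normal(dim)                # query input
        pred = softmax_feature_map(X_ctx, X_ctx @ w, x_query)
        errors.append((pred - x_query @ w) ** 2)
    return float(np.mean(errors))


if __name__ == "__main__":
    # Print the error for increasing context lengths.
    for n in (4, 32, 256, 2048):
        print(n, mean_squared_error(n))
```

Because this predictor has no trainable parameters, it cannot benefit from additional pre-training tasks, which mirrors the abstract's statement that the feature map alone context-scales but does not task-scale. The low input dimension is a deliberate choice to keep the variance of the exponential weights manageable in this toy setting.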
