AweDist: Attention-aware Embedding Distillation for New Input Token Embeddings

Abstract

Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary. New tokens can be added to solve this problem, when coupled with a good initialization for their new embeddings. However, existing embedding initialization methods either require expensive further training or pretraining of additional modules. In this paper, we propose AweDist and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens. Experimental results with a wide range of open-weight models show that AweDist is able to outperform even strong baselines.
