Abstract
Large Language Models have not yet been broadly adapted for the analysis of scientific datasets due in part to the unique difficulties of tokenizing numbers. We propose xVal, a numerical encoding scheme that represents any real number using just a single token. xVal represents a given real number by scaling a dedicated embedding vector by the number value. Combined with a modified number-inference approach, this strategy renders the model end-to-end continuous when considered as a map from the numbers of the input string to those of the output string. This leads to an inductive bias that is generally more suitable for applications in scientific domains. We empirically evaluate our proposal on a number of synthetic and real-world datasets. Compared with existing number encoding schemes, we find that xVal is more token-efficient and demonstrates improved generalization.
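
The core encoding idea, scaling one dedicated embedding vector by the number's value, can be illustrated with a minimal sketch. This is not the authors' implementation: the placeholder token name `[NUM]`, the helper `xval_embed`, the toy vocabulary, and the random embedding table are all assumptions made for illustration, and the modified number-inference approach on the output side is not shown.

```python
import numpy as np


def xval_embed(tokens, vocab, table, num_token="[NUM]"):
    """Embed a mixed sequence of text tokens and numbers (illustrative only).

    Text tokens get an ordinary lookup in `table`. Every number shares the
    single row assigned to `num_token`, multiplied by the number's value,
    so any real number occupies exactly one position in the sequence.
    """
    rows = []
    for tok in tokens:
        if isinstance(tok, (int, float)):
            # Assumed scheme: value * dedicated [NUM] embedding vector.
            rows.append(float(tok) * table[vocab[num_token]])
        else:
            rows.append(table[vocab[tok]])
    return np.stack(rows)


# Toy vocabulary and randomly initialised embedding table (hypothetical).
rng = np.random.default_rng(0)
vocab = {"[NUM]": 0, "mass": 1, "=": 2}
table = rng.normal(size=(len(vocab), 4))

emb = xval_embed(["mass", "=", 2.5], vocab, table)
emb_nearby = xval_embed(["mass", "=", 2.5 + 1e-6], vocab, table)
```

Because the number enters only as a multiplicative factor, `emb` and `emb_nearby` differ by a correspondingly small amount, which is the input-side continuity the abstract refers to. A digit-based tokenization would instead map nearby values to different discrete token sequences.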