BenchCLAMP: A Benchmark for Evaluating Language Models on Semantic Parsing

Abstract

We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing, which produces semantic outputs based on the analysis of input text through constrained decoding of a prompted or fine-tuned language model. Developers of pretrained language models currently benchmark on classification, span extraction and free-text generation tasks. Semantic parsing is neglected in language model evaluation because of the complexity of handling task-specific architectures and representations. Recent work has shown that generation from a prompted or fine-tuned language model can perform well at semantic parsing when the output is constrained to be a valid semantic representation. BenchCLAMP includes context-free grammars for six semantic parsing datasets with varied output meaning representations, as well as a constrained decoding interface to generate outputs covered by these grammars. We provide low, medium, and high resource splits for each dataset, allowing accurate comparison of various language models under different data regimes. Our benchmark supports both prompt-based learning as well as fine-tuning, and provides an easy-to-use toolkit for language model developers to evaluate on semantic parsing.
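To make the idea of grammar-constrained decoding concrete, the following is a minimal sketch and not the BenchCLAMP interface. It assumes a toy grammar S -> "(" NAME S* ")" over a five-token vocabulary, and a hypothetical `toy_scores` function standing in for the next-token scores of a language model. At each step the decoder keeps only the tokens that leave the prefix extendable to a valid output and greedily picks the best-scoring one among them.

```python
# Toy illustration of grammar-constrained greedy decoding.
# All names here (allowed_next_tokens, toy_scores, ...) are hypothetical
# and are not part of the BenchCLAMP toolkit.

EOS = "<eos>"
NAMES = ("find", "event")
VOCAB = ("(", ")") + NAMES + (EOS,)


def allowed_next_tokens(prefix):
    """Return the tokens that keep `prefix` extendable to a string of the
    toy grammar  S -> "(" NAME S* ")"."""
    if not prefix:
        return {"("}                 # every output starts with "("
    if prefix[-1] == "(":
        return set(NAMES)            # "(" must be followed by a NAME
    depth = prefix.count("(") - prefix.count(")")
    if depth == 0:
        return {EOS}                 # top-level expression is closed
    return {"(", ")"}                # open a child or close the current node


def toy_scores(prefix):
    """Stand-in for language model scores (assumption: fixed, ignores prefix).
    Unconstrained greedy decoding would emit ")" forever, which is invalid."""
    return {")": 3.0, "find": 2.5, "(": 2.0, "event": 1.0, EOS: 0.5}


def constrained_greedy_decode(score_fn=toy_scores, max_steps=20):
    """Greedy decoding restricted to grammar-valid continuations.
    If max_steps is exhausted, the returned prefix may be incomplete."""
    prefix = []
    for _ in range(max_steps):
        scores = score_fn(prefix)
        allowed = allowed_next_tokens(prefix)
        token = max(allowed, key=lambda t: scores[t])
        if token == EOS:
            break
        prefix.append(token)
    return prefix


if __name__ == "__main__":
    print(" ".join(constrained_greedy_decode()))
```

In a real setting the score function would come from a prompted or fine-tuned language model, and the set of allowed tokens would be computed by an incremental parser for the dataset's context-free grammar rather than by the hand-written depth check used here.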
