TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability

Abstract

Understanding the relationship between training data and model behavior during pretraining is crucial, but existing workflows make this process cumbersome, fragmented, and often inaccessible to researchers. We present TokenSmith, an open-source library for interactive editing, inspection, and analysis of datasets used in Megatron-style pretraining frameworks such as GPT-NeoX, Megatron, and NVIDIA NeMo. TokenSmith supports a wide range of operations including searching, viewing, ingesting, exporting, inspecting, and sampling data, all accessible through a simple user interface and a modular backend. It also enables structured editing of pretraining data without requiring changes to training code, simplifying dataset debugging, validation, and experimentation. TokenSmith is designed as a plug-and-play addition to existing large language model pretraining workflows, thereby democratizing access to production-grade dataset tooling. TokenSmith is hosted on GitHub, with accompanying documentation and tutorials. A demonstration video is also available on YouTube.
