TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability

  • 2025-07-25 16:37:58
  • Mohammad Aflah Khan, Ameya Godbole, Johnny Tian-Zheng Wei, Ryan Wang, James Flemings, Krishna Gummadi, Willie Neiswanger, Robin Jia

Abstract

Understanding the relationship between training data and model behavior during pretraining is crucial, but existing workflows make this process cumbersome, fragmented, and often inaccessible to researchers. We present TokenSmith, an open-source library for interactive editing, inspection, and analysis of datasets used in Megatron-style pretraining frameworks such as GPT-NeoX, Megatron, and NVIDIA NeMo. TokenSmith supports a wide range of operations including searching, viewing, ingesting, exporting, inspecting, and sampling data, all accessible through a simple user interface and a modular backend. It also enables structured editing of pretraining data without requiring changes to training code, simplifying dataset debugging, validation, and experimentation. TokenSmith is designed as a plug-and-play addition to existing large language model pretraining workflows, thereby democratizing access to production-grade dataset tooling. TokenSmith is hosted on GitHub, with accompanying documentation and tutorials. A demonstration video is also available on YouTube.

 
