Abstract
Understanding the relationship between training data and model behavior during pretraining is crucial, but existing workflows make this process cumbersome, fragmented, and often inaccessible to researchers. We present TokenSmith, an open-source library for interactive editing, inspection, and analysis of datasets used in Megatron-style pretraining frameworks such as GPT-NeoX, Megatron, and NVIDIA NeMo. TokenSmith supports a wide range of operations, including searching, viewing, ingesting, exporting, inspecting, and sampling data, all accessible through a simple user interface and a modular backend. It also enables structured editing of pretraining data without requiring changes to training code, simplifying dataset debugging, validation, and experimentation. TokenSmith is designed as a plug-and-play addition to existing large language model pretraining workflows, thereby democratizing access to production-grade dataset tooling. TokenSmith is hosted on GitHub, with accompanying documentation and tutorials. A demonstration video is also available on YouTube.