Mechanistic Interpretability for Neural TSP Solvers

Abstract

Neural networks have advanced combinatorial optimization, withTransformer-based solvers achieving near-optimal solutions on the TravelingSalesman Problem (TSP) in milliseconds. However, these models operate as blackboxes, providing no insight into the geometric patterns they learn or theheuristics they employ during tour construction. We address this opacity byapplying sparse autoencoders (SAEs), a mechanistic interpretability technique,to a Transformer-based TSP solver, representing the first application ofactivation-based interpretability methods to operations research models. Wetrain a pointer network with reinforcement learning on 100-node instances, thenfit an SAE to the encoder's residual stream to discover an overcompletedictionary of interpretable features. Our analysis reveals that the solvernaturally develops features mirroring fundamental TSP concepts: boundarydetectors that activate on convex-hull nodes, cluster-sensitive featuresresponding to locally dense regions, and separator features encoding geometricpartitions. These findings provide the first model-internal account of whatneural TSP solvers compute before node selection, demonstrate that geometricstructure emerges without explicit supervision, and suggest pathways towardtransparent hybrid systems that combine neural efficiency with algorithmicinterpretability. Interactive feature explorer:https://reubennarad.github.io/TSP_interp

Quick Read (beta)

loading the full paper ...