Abstract
Synthetic datasets constructed from formal languages allow fine-grained examination of the learning and generalization capabilities of machine learning systems for sequence classification. This article presents a new benchmark for machine learning systems on sequence classification called MLRegTest, which contains training, development, and test sets from 1,800 regular languages. Different kinds of formal languages represent different kinds of long-distance dependencies, and correctly identifying long-distance dependencies in sequences is a known challenge for ML systems attempting to generalize successfully. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals (string, tier-string, subsequence, or combinations thereof). The logical complexity and choice of literal provide a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacities of different ML systems to learn such long-distance dependencies. Finally, the performance of different neural networks (simple RNN, LSTM, GRU, transformer) on MLRegTest is examined. The main conclusion is that performance depends significantly on the kind of test set, the class of language, and the neural network architecture.