Abstract
We study the learnability of languages in the Next Symbol Prediction (NSP)setting, where a learner receives only positive examples from a languagetogether with, for every prefix, (i) whether the prefix itself is in thelanguage and (ii) which next symbols can lead to an accepting string. Thissetting has been used in prior works to empirically analyze neural sequencemodels, and additionally, we observe that efficient algorithms for the NSPsetting can be used to learn the (truncated) support of language models. Weformalize the setting so as to make it amenable to PAC-learning analysis. Whilethe setting provides a much richer set of labels than the conventionalclassification setting, we show that learning concept classes such as DFAs andBoolean formulas remains computationally hard. The proof is via a constructionthat makes almost all additional labels uninformative, yielding a reductionfrom the conventional learning problem to learning with NSP labels. Undercryptographic assumptions, the reduction implies that the problem of learningDFAs is computationally hard in the NSP setting.