Abstract
Language learning refers to the problem of inferring a mathematical modelwhich accurately represents a formal language. Many language learningalgorithms learn by asking certain types of queries about the language beingmodeled. Language learning is of practical interest in the field ofcybersecurity, where it is used to model the language accepted by a program'sinput parser (also known as its input processor). In this setting, a learnercan only query a string of its choice by executing the parser on it, whichlimits the language learning algorithms that can be used. Most practicalparsers can indicate not only whether the string is valid or not, but alsowhere the parsing failed. This extra information can be leveraged intoproducing a type of query we call the prefix query. Notably, no existinglanguage learning algorithms make use of prefix queries, though some askmembership queries i.e., they ask whether or not a given string is valid. Whenthese approaches are used to learn the language of a parser, the prefixinformation provided by the parser remains unused. In this work, we present PL*, the first known language learning algorithm tomake use of the prefix query, and a novel modification of the classical L*algorithm. We show both theoretically and empirically that PL* is able to learnmore efficiently than L* due to its ability to exploit the additionalinformation given by prefix queries over membership queries. Furthermore, weshow how PL* can be used to learn the language of a parser, by adapting it to amore practical setting in which prefix queries are the only source ofinformation available to it; that is, it does not have access to any labelledexamples or any other types of queries. We demonstrate empirically that, evenin this more constrained setting, PL* is still capable of accurately learning arange of languages of practical interest.