Abstract
Dates often contribute towards highly impactful medical decisions, but it is rarely clear how to extract this data. AI has only just begun to be used to transcribe medical documents, and common methods are either to trust the output produced by a complex AI model, or to parse the text using regular expressions. Recent work has established that regular expressions are an explainable form of logic, but it is difficult to decompose these into the component parts that are required to construct precise UNIX timestamps. First, we tested publicly-available regular expressions, and we found that these were unable to capture a significant number of our dates. Next, we manually created easily-decomposable regular expressions, and we found that these were able to detect the majority of real dates, but also many sequences of text that only look like dates. Finally, we used regular expression synthesis to automatically identify regular expressions from the reverse-engineered UNIX timestamps that we created. We found that regular expressions created by regular expression synthesis detect far fewer sequences of text that only look like dates than those that were manually created, at the cost of a slight increase in the number of missed dates. Overall, our results show that regular expressions can be created through regular expression synthesis to identify complex dates and date ranges in text transcriptions. To our knowledge, our proposed way of learning deterministic logic by reverse-engineering several many-to-one mappings and feeding these into a regular expression synthesiser is a new approach.