Abstract
The performance of most error-correction (EC) algorithms that operate on genomic sequencer reads depends on the proper choice of their configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction. We do this in a data-driven manner, motivated by the observation that different configuration parameters are optimal for different datasets, i.e., from different instruments and organisms. We use language modeling techniques from the Natural Language Processing (NLP) domain in our algorithmic suite, Athena, to automatically tune the performance-sensitive configuration parameters. Through the use of N-Gram and Recurrent Neural Network (RNN) language modeling, we validate the intuition that EC performance can be computed quantitatively and efficiently using the perplexity metric, prevalent in NLP. After training the language model, we show that the perplexity metric calculated for runtime data has a strong negative correlation with the quality of correction of the erroneous next-generation sequencing (NGS) reads. Therefore, we use the perplexity metric to guide a hill climbing-based search, converging toward the best $k$-value. Our approach is suitable for both de novo and comparative sequencing (resequencing), eliminating the need for a reference genome to serve as the ground truth. This is important because the use of a reference genome often carries its biases forward through the stages of the pipeline.
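
The idea of a perplexity-guided search can be illustrated with a minimal sketch. It is not Athena's implementation: the function names (`train_ngram`, `perplexity`, `hill_climb_k`), the character-level n-gram model with add-one smoothing, and the fixed step size are all assumptions made for illustration. The `score` callable is a hypothetical stand-in for "run the EC tool with this k and return the perplexity of its output", and the sketch assumes that lower perplexity indicates better correction.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: reads contain only these four bases


def train_ngram(reads, n=3):
    """Count n-grams and their (n-1)-base contexts over a list of reads."""
    ngrams, contexts = Counter(), Counter()
    for read in reads:
        for i in range(len(read) - n + 1):
            gram = read[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts


def perplexity(reads, model, n=3):
    """Perplexity of reads under an add-one-smoothed n-gram model."""
    ngrams, contexts = model
    log_prob, count = 0.0, 0
    for read in reads:
        for i in range(len(read) - n + 1):
            gram = read[i:i + n]
            # Add-one smoothing (an illustrative choice) avoids zero probabilities.
            p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + len(ALPHABET))
            log_prob += math.log(p)
            count += 1
    return math.exp(-log_prob / count) if count else float("inf")


def hill_climb_k(score, k_start, k_min, k_max, step=2):
    """Move k to the better neighbor until neither neighbor lowers the score."""
    k, best = k_start, score(k_start)
    while True:
        neighbors = [c for c in (k - step, k + step) if k_min <= c <= k_max]
        if not neighbors:
            return k, best
        candidate = min(neighbors, key=score)
        candidate_score = score(candidate)
        if candidate_score >= best:
            return k, best  # local minimum reached
        k, best = candidate, candidate_score


if __name__ == "__main__":
    model = train_ngram(["ACGTACGT"])
    print(perplexity(["ACGTACGT"], model), perplexity(["AAAAAAAA"], model))
    # Hypothetical score curve standing in for per-k perplexity measurements.
    print(hill_climb_k(lambda k: (k - 17) ** 2 + 5.0, k_start=11, k_min=5, k_max=31))
```

In a real setting each call to `score` would involve a full EC run, so caching scores per k would be a natural refinement. Hill climbing only finds a local optimum, which is adequate when the score curve over k has a single minimum.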