Statistical Linear Models in Virus Genomic Alignment-free Classification: Application to Hepatitis C Viruses

Abstract

Viral sequence classification is an important task in pathogen detection, epidemiological surveys and evolutionary studies. Statistical learning methods are widely used to classify and identify viral sequences in environmental samples. These methods face several challenges associated with the nature and properties of viral genomes, such as recombination, mutation rate and diversity. In addition, new generations of sequencing technologies raise further difficulties by generating massive amounts of fragmented sequences. While linear classifiers are often used to classify viruses, the accuracy space of existing models remains largely unexplored in the context of alignment-free approaches. In this study, we present an exhaustive assessment procedure exploring the power of linear classifiers in genotyping and subtyping partial and complete genomes, and we apply it to hepatitis C viruses (HCV). Several variables are considered in this investigation: the classifier types (generative and discriminative) and their hyper-parameters (smoothing value and regularization penalty function), the classification task (genotyping and subtyping), the length of the tested sequences (partial and complete) and the length of k-mer words. Overall, several classifiers perform well under specific combinations of the experimental variables mentioned above. Finally, we provide the procedure and benchmark data to allow for more robust assessment of classification from virus genomes.
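
To make the alignment-free setting concrete, the following is a minimal sketch of one generative linear classifier of the kind the abstract refers to: a multinomial naive Bayes model over k-mer counts with an additive smoothing value. It is an illustration under stated assumptions, not the procedure used in the study. The function names (`kmer_counts`, `train_multinomial_nb`, `predict`), the toy sequences and the labels `type_1` and `type_2` are all hypothetical, and the choices `k=2` and `alpha=1.0` are arbitrary.

```python
from collections import Counter
from itertools import product
import math


def kmer_counts(seq, k):
    """Count overlapping k-mers; windows containing non-ACGT symbols are skipped."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def train_multinomial_nb(sequences, labels, k, alpha=1.0):
    """Fit per-class log priors and smoothed k-mer log likelihoods.

    alpha is the additive smoothing value (an assumed hyper-parameter name).
    """
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    class_sizes = Counter(labels)
    class_totals = {}
    for seq, label in zip(sequences, labels):
        class_totals.setdefault(label, Counter()).update(kmer_counts(seq, k))
    model = {}
    for label, counts in class_totals.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        log_prior = math.log(class_sizes[label] / len(labels))
        log_like = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
        model[label] = (log_prior, log_like)
    return model


def predict(model, seq, k):
    """Return the class whose score, linear in the k-mer counts, is highest."""
    counts = kmer_counts(seq, k)

    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(n * log_like[w] for w, n in counts.items())

    return max(model, key=score)


# Hypothetical toy data, not real HCV sequences.
train_seqs = ["ACACACACACAC", "CACACACACACA", "GTGTGTGTGTGT", "TGTGTGTGTGTG"]
train_labels = ["type_1", "type_1", "type_2", "type_2"]
model = train_multinomial_nb(train_seqs, train_labels, k=2, alpha=1.0)
print(predict(model, "ACACACAC", k=2))
```

The class score is a weighted sum of k-mer counts plus a bias term, which is what makes this model linear. A discriminative counterpart would learn the weights directly under a regularization penalty instead of estimating them from class-wise frequencies.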
