Statistical Linear Models in Virus Genomic Alignment-free Classification: Application to Hepatitis C Viruses

  • 2019-11-06 18:23:53
  • Amine M. Remita, Abdoulaye Baniré Diallo


Amine M. Remita
Department of Computer Science
Université du Québec à Montréal

Abdoulaye Baniré Diallo
Department of Computer Science
Université du Québec à Montréal

Abstract

Viral sequence classification is an important task in pathogen detection, epidemiological surveys and evolutionary studies. Statistical learning methods are widely used to classify and identify viral sequences in environmental samples. These methods face several challenges associated with the nature and properties of viral genomes, such as recombination, mutation rate and diversity. New generations of sequencing technologies raise further difficulties by generating massive amounts of fragmented sequences. While linear classifiers are often used to classify viruses, the accuracy space of existing models has been little explored in the context of alignment-free approaches. In this study, we present an exhaustive assessment procedure exploring the power of linear classifiers in genotyping and subtyping partial and complete genomes. It is applied to the Hepatitis C viruses (HCV). Several variables are considered in this investigation, such as classifier types (generative and discriminative) and their hyper-parameters (smoothing value and regularization penalty function), the classification task (genotyping and subtyping), the length of the tested sequences (partial and complete) and the length of k-mer words. Overall, several classifiers perform well given a precise combination of the experimental variables mentioned above. Finally, we provide the procedure and benchmark data to allow for more robust assessment of classification from virus genomes.

Keywords: Viral sequence classification, statistical linear models for classification, generative and discriminative models.

978-1-7281-1867-3/19/$31.00 ©2019 IEEE

I Introduction

Nucleotide sequence classification aims to assign an unlabeled or new sequence (complete or partial) to a group of known sequences based on their characteristics. Sequence classification is used in several biomedical and comparative genomics fields, such as pathogen detection [1, 2], taxonomic assignment of metagenomic reads [3, 4], and epidemiological and evolutionary studies [5]. This task can be performed with an alignment-free approach, which does not rely on building a multiple sequence alignment of known and unknown sequences. Such an approach avoids the time and resource costs of sequence alignment [6]. Alignment-free approaches for sequence classification have shown promising results compared to approaches based on alignment and phylogeny [7, 8, 6, 9, 10, 11].

In previous works tackling taxonomic classification of metagenomic and viromic sequences, various tools implement statistical learning methods. For instance, in metagenomics, RDP [3] and NBC [12] both implement a naive Bayes classifier. Also for taxonomic assignment, PhyloPythiaS+ implements a structured-output Support Vector Machine (SVM) framework [13]. A logistic regression model with L1 regularization was used in VirFinder, a tool for identifying viral sequences from metagenomic data [2]. For virus genome typing, a variable-order Markov model was implemented in COMET to classify Human Immunodeficiency Viruses 1 and 2 (HIV-1 and HIV-2, respectively) and Hepatitis C viruses (HCV) [8]. SVM was used as the core model in the CASTOR-KRFE method to extract a minimal set of features and classify viruses [10]. Other tools, such as CASTOR [9] and KAMERIS [11], offer different types of classifiers depending on which performs best on a specific family of viruses. In particular, some methods were developed and implemented in tools that perform genotyping and subtyping of HCV genomic sequences (for more details see Table A.I, available in the online supplemental material). Most of the statistical learning classifiers mentioned above are linear and fall into two categories: 1) generative classifiers, which model the distribution of the input and output data (sequences and their taxonomic classes); 2) discriminative classifiers, which either model the posterior distribution of output data (taxonomic classes given a sequence) or find a discriminant function that maps a sequence directly to its class [14, 15].

In the context of alignment-free methods, composition representations of nucleotide sequences are widely used, especially in sequence comparison and classification. They are based on the counts or frequencies of overlapping sub-sequences of length k in a given sequence [6]. These sub-sequences are known as words or k-mers in the sequence classification literature.

Here we assess the performance of generative and discriminative linear classifiers in genotyping and subtyping of partial and complete genomes of HCV. We examine the potential of these classifiers, trained with complete genomes, to classify partial genomic sequences. Classification of genomic fragments is a challenge posed by current DNA sequencing technologies, which generate massive amounts of nucleotide fragments. We show that a global profile of k-mer counts from complete genomes is sufficient to estimate the parameters of the models and correctly classify genomic fragments, so there is no need for an explicit sampling of fragments in the training step. Several variables are considered in our assessment, such as classifier types and their hyper-parameters, classification task, tested sequence lengths and k-mer lengths.

The rest of this article is organized as follows. Section II gives an overview of linear models that can be used in virus sequence classification. Sections III and IV introduce the datasets and the benchmark procedure, respectively. Section V reports the overall performance of each model and discusses how to choose adequate experimental settings.

II Linear models for sequence classification

II-A Sequence representation

A nucleotide fragment $S$ is a sequence of $l \in \mathbb{N}^*$ ordered nucleotides $S = (s_1, s_2, \ldots, s_l) \in \mathcal{A}^l$, where $\mathcal{A} = \{A, C, G, T\}$ is the alphabet of nucleotides. A k-mer word $u_a$ is a subsequence of length $k$ in $S$ at position $a$ such that $u_a = S[a, a+k-1]$ for $a = 1, 2, \ldots, d$ and $d = l-k+1$. $S$ can also be represented as a count vector $\mathbf{x} = (x_1, x_2, \ldots, x_m)$ over the $m$ possible k-mer words, where $m = |\mathcal{A}|^k$ and $x_i$ is the number of occurrences of the k-mer word $u_i$ in the sequence $S$, given by

$$x_i = \sum_{j=1}^{d} \mathbb{I}(u_i, S[j, j+k-1]) \tag{1}$$

where $\mathbb{I}(u_i, v) = 1$ if $u_i = v$, and $0$ otherwise. Thus, a dataset $D$ of $n$ nucleotide sequences is represented by an $m \times n$ matrix $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$.
Finally, given a taxonomic rank (e.g. species or genotype), nucleotide sequences are labeled by a set of disjoint taxonomic classes $\mathcal{T} = \{T_1, T_2, \ldots, T_t\}$, $t \in \mathbb{N}^*$.
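The count vector of equation 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `kmer_count_vector` is hypothetical, and skipping windows that contain characters outside {A, C, G, T} is an assumption, since the text does not say how ambiguous bases are handled.

```python
from itertools import product

ALPHABET = "ACGT"

def kmer_count_vector(seq, k):
    """Return (words, counts): the m = 4**k k-mer words and the vector x of Eq. (1)."""
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    counts = [0] * len(words)
    for j in range(len(seq) - k + 1):      # d = l - k + 1 sliding windows
        i = index.get(seq[j:j + k])        # assumption: unknown words are skipped
        if i is not None:
            counts[i] += 1
    return words, counts

words, x = kmer_count_vector("ACGTACG", 2)
print({w: c for w, c in zip(words, x) if c})
# {'AC': 2, 'CG': 2, 'GT': 1, 'TA': 1}
```

Stacking one such vector per genome gives the matrix X used by all the classifiers described next.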

TABLE I: Hepatitis C virus datasets from [10].

| Dataset | Group | Taxon | Avg. seq. length | Classification | No. of instances [min-max] | No. of classes |
|---|---|---|---|---|---|---|
| HCVGENCG | IV ((+)ssRNA) | Hepatitis C virus | 9538 | Genotypes | 284 [17-80] | 6 |
| HCVSUBCG | IV ((+)ssRNA) | Hepatitis C virus | 9538 | Subtypes | 284 [4-25] | 18 |

II-B Linear classifiers

In multiclass classification problems, a linear classifier generates linear decision boundaries to separate instances of different classes [15]. It is defined by an $m \times t$ matrix of weights $\mathbf{W} = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_t)$ and an activation function $f(\cdot)$, which may be nonlinear. It has the form $f(\mathbf{w}^T\mathbf{x} + w_0)$ [14], where $w_0$ is the intercept of the model and $\mathbf{w}_t = (w_{t1}, w_{t2}, \ldots, w_{tm})$ is the vector of weights for class $T_t$. Classifiers differ in how they compute the weights $\mathbf{W}$ to optimize the classification and in how they define the function $f(\cdot)$. To determine the class of a new sequence represented by a vector $\mathbf{x}$, a classifier either models the posterior class probabilities $P(T_t|\mathbf{x})$ and assigns $\mathbf{x}$ to the taxonomic class $\hat{T}$ that maximizes the posterior density,

$$\hat{T} = \arg\max_{t} P(T_t|\mathbf{x}) \tag{2}$$

or finds a discriminant function that maps the vector $\mathbf{x}$ directly onto a class label [14].
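As a small illustration of this decision rule, the sketch below scores a count vector against one weight vector per class and returns the index of the best class. The function name and the numeric weights are hypothetical, and using one intercept per class is an assumption made for the example. When f is monotonic and applied to each score, taking the argmax of the raw linear scores gives the same class.

```python
def predict_class(W, w0, x):
    """Return the index of the class with the largest linear score w_t . x + w0_t."""
    scores = [sum(wi * xi for wi, xi in zip(w_t, x)) + b for w_t, b in zip(W, w0)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical weights for t = 2 classes over m = 3 k-mer counts.
W = [[0.5, -1.0, 0.0], [-0.5, 1.0, 0.2]]
w0 = [0.0, -0.1]
print(predict_class(W, w0, [4, 1, 0]))  # class 0 scores 1.0, class 1 scores -1.1
```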

Moreover, modeling the posterior class probabilities could be done using generative or discriminative approaches.

II-B1 Generative classifiers

A generative approach models the joint probability density of the class and the sequence, $P(T_t, \mathbf{x})$, and uses Bayes' theorem (equation 3) to compute the posterior.

$$P(T_t|\mathbf{x}) = \frac{P(T_t, \mathbf{x})}{P(\mathbf{x})} = \frac{P(\mathbf{x}|T_t)\,P(T_t)}{P(\mathbf{x})} \tag{3}$$

With classifiers based on this approach, we can sample data from the modeled joint density $P(T_t, \mathbf{x})$ in the space of vectors $\mathbf{x}$. Substituting equation 3 in equation 2 gives:

$$\hat{T} = \arg\max_{t} P(\mathbf{x}|T_t)\,P(T_t) \tag{4}$$

The probability density $P(\mathbf{x})$ is constant over all classes and is therefore dropped in equation 4. The prior density of taxonomic classes $P(T_t)$ can be estimated using either a Maximum Likelihood Estimation (MLE) method or a Bayesian approach. Probabilistic classifiers differ in how they model the class-conditional density $P(\mathbf{x}|T_t)$. Here we provide details of two types of generative classifiers: multinomial Bayes and Markov chain classifiers.

Multinomial Bayes (MB) classifiers model the class-conditional density by a multinomial distribution given by

$$P(\mathbf{x}|T_t) = \frac{d!}{\prod_{i=1}^{m} x_i!} \prod_{i=1}^{m} P(u_i|T_t)^{x_i}. \tag{5}$$

The class-conditional probabilities of k-mers $P(u_i|T_t)$ are the MB classifier parameters. Each parameter can be estimated by an MLE method or by a Bayesian approach (by adding a smoothing value $\alpha$).
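The following sketch shows how equations 4 and 5 could be evaluated in log space. It is an illustration under stated assumptions, not the paper's code: the names `fit_mb` and `predict_mb` are hypothetical, the Bayesian estimate is taken to be additive smoothing, (count + α) / (total + α·m), and the multinomial coefficient is dropped because it does not depend on the class. Setting `alpha=0` gives the MLE.

```python
import math

def fit_mb(X, y, alpha):
    """Estimate log P(T) and log P(u_i | T) for each class from count vectors."""
    m = len(X[0])
    model = {}
    for label in sorted(set(y)):
        totals = [0.0] * m
        for x, lab in zip(X, y):
            if lab == label:
                for i, c in enumerate(x):
                    totals[i] += c
        denom = sum(totals) + alpha * m
        # With alpha = 0 (MLE), k-mers unseen in the class get probability zero.
        log_p = [math.log((c + alpha) / denom) if c + alpha > 0 else -math.inf
                 for c in totals]
        log_prior = math.log(sum(1 for lab in y if lab == label) / len(y))
        model[label] = (log_prior, log_p)
    return model

def predict_mb(model, x):
    """Return the class maximizing log P(T) + sum_i x_i log P(u_i | T)."""
    def score(label):
        log_prior, log_p = model[label]
        # Zero counts are skipped so that 0 * -inf is never evaluated.
        return log_prior + sum(c * lp for c, lp in zip(x, log_p) if c)
    return max(model, key=score)

model = fit_mb([[3, 1], [1, 3]], ["a", "b"], alpha=1.0)
print(predict_mb(model, [5, 0]))  # a
```

With `alpha=0`, a single k-mer that was never seen in a class drives that class score to minus infinity, which is one way to picture the MLE failures reported later for long k-mers.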

Markov chain (Markov) classifiers model each nucleotide sequence as a (k-1)-order Markov chain [16]:

$$P(S|T_t) = \prod_{i=k}^{l} P(s_i \,|\, S[i-k+1, i-1], T_t) = \prod_{i=k}^{l} \frac{P(u_{i-k+1}|T_t)}{P(v_{i-k+1}|T_t)}, \tag{6}$$

where $u_{i-k+1} = S[i-k+1, i]$ is a k-mer and $v_{i-k+1} = S[i-k+1, i-1]$ is a (k-1)-mer, both starting at position $i-k+1$. Therefore, the sequence $S$ is represented by two vectors $\mathbf{x}$ and $\mathbf{z}$ corresponding to the profiles of words $u$ and $v$, respectively. The class-conditional density can be approximated by:

$$P(S|T_t) = P(\mathbf{x}, \mathbf{z}|T_t) \approx \prod_{i=1}^{m} \frac{P(u_i|T_t)^{x_i}}{P(v_i|T_t)^{z_i}}. \tag{7}$$

As with an MB classifier, the class-conditional probabilities of words $u_i$ and $v_i$ of a Markov chain classifier can be determined by MLE or Bayesian methods.
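A Markov chain classifier of this kind can be sketched as follows. The sketch estimates each transition probability directly as a smoothed ratio of a k-mer count to the count of its (k-1)-mer prefix, which is one standard estimator and an assumption here; the paper does not spell out its estimator, and the function names are hypothetical. The sketch requires `alpha > 0`, and it ignores the probability of the first k-1 nucleotides, as equation 6 does.

```python
import math
from collections import Counter

def fit_markov(seqs, k):
    """Count k-mers and their (k-1)-mer prefixes in the sequences of one class."""
    kmers, prefixes = Counter(), Counter()
    for s in seqs:
        for j in range(len(s) - k + 1):
            u = s[j:j + k]
            kmers[u] += 1
            prefixes[u[:-1]] += 1
    return kmers, prefixes

def markov_log_likelihood(seq, model, k, alpha):
    """Sum of log transition probabilities of seq under one class model (alpha > 0)."""
    kmers, prefixes = model
    ll = 0.0
    for j in range(len(seq) - k + 1):
        u = seq[j:j + k]
        # Smoothed P(last nucleotide of u | first k-1 nucleotides of u).
        ll += math.log((kmers[u] + alpha) / (prefixes[u[:-1]] + 4 * alpha))
    return ll

class_a = fit_markov(["AAAAAAAA"], k=2)
class_c = fit_markov(["CCCCCCCC"], k=2)
query = "AAAA"
print(markov_log_likelihood(query, class_a, 2, 0.01) >
      markov_log_likelihood(query, class_c, 2, 0.01))  # True
```

A query is assigned to the class with the largest log-likelihood plus log prior, as in equation 4.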

II-B2 Discriminative classifiers

A discriminative approach models the posterior density $P(T_t|\mathbf{x})$ directly, without assuming any distribution $P(\mathbf{x}|T_t)$ for the input data. A binary logistic regression (LR) models the posterior density using the logistic function, which has the form:

$$P(\mathcal{T}=t_0|\mathbf{x}) = \frac{\exp(\mathbf{w}^T\mathbf{x}+w_0)}{1+\exp(\mathbf{w}^T\mathbf{x}+w_0)} \tag{8}$$

and

$$P(\mathcal{T}=t_1|\mathbf{x}) = \frac{1}{1+\exp(\mathbf{w}^T\mathbf{x}+w_0)}.$$

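Regularized LR fits the vector of parameters $\mathbf{w}$ by jointly minimizing the loss function and the regularization penalty function $R$ as follows:

$$\min_{\mathbf{w}} \; C \sum_{\mathbf{x}, t} \log\left(1+\exp\left(-t(\mathbf{w}^T\mathbf{x}+w_0)\right)\right) + \lambda R(\mathbf{w}) \tag{9}$$

where $t \in \{-1, 1\}$, $C$ is a cost parameter and $\lambda$ is the regularization rate. $R$ can take several forms, such as the L1 norm ($\|\mathbf{w}\|_1 = \sum_i |w_i|$) or the squared L2 norm ($\|\mathbf{w}\|_2^2 = \sum_i w_i^2$).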
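To make equations 8 and 9 concrete, the sketch below evaluates the logistic function and the regularized objective for a given weight vector. It only computes the objective value; it does not perform the minimization, and the function names are hypothetical. Both functions are written so that large positive or negative scores do not overflow.

```python
import math

def logistic(z):
    """exp(z) / (1 + exp(z)), evaluated without overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def lr_objective(w, w0, X, t, C, lam, penalty="l2"):
    """Eq. (9): C * sum of logistic losses + lam * R(w), with labels t in {-1, +1}."""
    loss = 0.0
    for x, ti in zip(X, t):
        margin = ti * (sum(wi * xi for wi, xi in zip(w, x)) + w0)
        # log(1 + exp(-margin)) rewritten to stay finite for large |margin|
        loss += max(0.0, -margin) + math.log1p(math.exp(-abs(margin)))
    if penalty == "l1":
        reg = sum(abs(wi) for wi in w)   # L1 norm
    else:
        reg = sum(wi * wi for wi in w)   # squared L2 norm
    return C * loss + lam * reg

print(logistic(0.0))                                         # 0.5
print(lr_objective([1.0, -2.0], 0.0, [], [], 1.0, 1.0, "l1"))  # 3.0
```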

Contrary to LR, a linear support vector machine (LSVM) yields a discriminant function that maps input sequences to taxonomic classes. For a soft-margin LSVM classifier, the optimal parameters $\mathbf{w}$ are obtained by jointly minimizing the hinge loss function (or its square) and the penalty function as follows:

$$\min_{\mathbf{w}} \; C \sum_{\mathbf{x}, t} \max\left(0, 1 - t(\mathbf{w}^T\mathbf{x}+w_0)\right) + \lambda R(\mathbf{w}). \tag{10}$$
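The objective of equation 10 can be evaluated in the same way as the LR objective. The sketch below is illustrative only (the name `svm_objective` is hypothetical) and covers both the plain and the squared hinge loss; it computes the value of the objective for given weights and does not solve the optimization problem.

```python
def svm_objective(w, w0, X, t, C, lam, squared=True, penalty="l2"):
    """Eq. (10): C * sum of (squared) hinge losses + lam * R(w), labels t in {-1, +1}."""
    loss = 0.0
    for x, ti in zip(X, t):
        score = sum(wi * xi for wi, xi in zip(w, x)) + w0
        slack = max(0.0, 1.0 - ti * score)          # zero once the margin is >= 1
        loss += slack * slack if squared else slack
    if penalty == "l1":
        reg = sum(abs(wi) for wi in w)
    else:
        reg = sum(wi * wi for wi in w)
    return C * loss + lam * reg

# One training point with margin 0.5: hinge loss 0.5, squared hinge loss 0.25.
print(svm_objective([1.0], 0.0, [[0.5]], [1], C=1.0, lam=0.0, squared=False))  # 0.5
print(svm_objective([1.0], 0.0, [[0.5]], [1], C=1.0, lam=0.0, squared=True))   # 0.25
```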

The LR and LSVM classifiers described here are binary. For multiclass taxonomic assignment, a one-versus-rest strategy can be used, in which one classifier is learned per taxonomic class against all the other classes.
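The one-versus-rest strategy can be sketched independently of the binary learner. In the example below, `train_one_vs_rest` and `predict_one_vs_rest` are hypothetical names, and `centroid_trainer` is a deliberately simple stand-in for a binary LR or LSVM learner, used only so that the example runs on its own.

```python
def train_one_vs_rest(X, y, train_binary):
    """Learn one binary scorer per class; targets are +1 for the class, -1 for the rest."""
    return {label: train_binary(X, [1 if yi == label else -1 for yi in y])
            for label in sorted(set(y))}

def predict_one_vs_rest(models, x):
    """Assign x to the class whose binary scorer returns the largest score."""
    return max(models, key=lambda label: models[label](x))

def centroid_trainer(X, t):
    """Toy stand-in for a binary linear learner: w is the difference of class centroids."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    mp = mean([x for x, ti in zip(X, t) if ti == 1])
    mn = mean([x for x, ti in zip(X, t) if ti == -1])
    w = [a - b for a, b in zip(mp, mn)]
    w0 = -sum(wi * (a + b) / 2 for wi, a, b in zip(w, mp, mn))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + w0

X = [[5, 0, 0], [0, 5, 0], [0, 0, 5]]
y = ["1a", "1b", "2a"]
models = train_one_vs_rest(X, y, centroid_trainer)
print(predict_one_vs_rest(models, [4, 1, 0]))  # 1a
```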

III Benchmark datasets: Hepatitis C viruses

Hepatitis C viruses (HCV) are an important cause of chronic liver disease and cancer [17]. The World Health Organization has estimated that 71 million people have chronic hepatitis C infection and approximately 399 000 people die each year from hepatitis C [18]. HCV have a positive-sense single-stranded RNA genome of about 9600 nucleotides. They belong to the Flaviviridae family of viruses. HCV are classified into six confirmed genotypes with at least 30% divergence among their genomes. At a lower taxonomic level, genotypes are divided into several subtypes with 20% divergence [19]. In this paper, we used two HCV datasets that were constructed in our previous work [10]. Each dataset contains 284 complete genomes, labeled with 6 genotypes in the HCVGENCG dataset and 18 subtypes in the HCVSUBCG dataset (Table I).

IV Experimental setting

We investigated the behavior and performance of generative and discriminative linear classifiers in genotyping and subtyping HCV genomic sequences. Two cross-validation based strategies were devised to assess the abilities of both classifier types to classify whole-length (complete) and partial (fragment) genomes (described in Algorithm 1 and Algorithm 2, respectively). For both strategies the classifiers were trained with complete genomes. Five-fold cross-validation is performed in all classification tasks. In each iteration the weighted F-measure was calculated on the test sequence data. Then, the performance was averaged over all iterations. The F-measure is given by the equation:

$$\text{F-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \tag{11}$$
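For a multiclass task, equation 11 is applied per class and the per-class values are then combined. The sketch below assumes that "weighted" means an average of per-class F-measures weighted by the number of true instances of each class, which is the usual convention; the text does not define the weighting explicitly, and the function name is hypothetical.

```python
def weighted_f_measure(y_true, y_pred):
    """Support-weighted average of per-class F-measures (Eq. 11 applied per class)."""
    n = len(y_true)
    total = 0.0
    for label in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f = 2 * recall * precision / (recall + precision)
        else:
            f = 0.0
        total += f * (tp + fn) / n      # weight by class support
    return total

print(weighted_f_measure(["a", "a", "b", "b"], ["a", "a", "b", "b"]))  # 1.0
```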
Algorithm 1: Evaluation with complete genomes
Input:
  D: complete genome sequences
  𝒯: respective taxonomic classes of D
  CLF: classifier with hyper-parameters
Output:
  F_list: list of F-measure scores
begin
  foreach k ∈ [k_min … k_max] do
    X ← build_kmer_matrix(D, k)
    stratified split of X and 𝒯 into n folds
    foreach fold i ∈ [1 … n] do
      // Build X_train and X_test from complete genomes
      X_train ← X − X[i] ; 𝒯_train ← 𝒯 − 𝒯[i]
      X_test ← X[i] ; 𝒯_test ← 𝒯[i]
      // Learn and test model
      M ← learn_model(CLF, X_train, 𝒯_train)
      𝒯_pred ← test_model(M, X_test)
      F_k += compute_f_measure(𝒯_test, 𝒯_pred) / n
    append(F_list, F_k)
end

Algorithm 2: Evaluation with genomic fragments
Input:
  D: complete genome sequences
  𝒯: respective taxonomic classes of D
  CLF: classifier with hyper-parameters
  ft_size: size of genomic fragments
Output:
  F_list: list of F-measure scores
begin
  foreach k ∈ [k_min … k_max] do
    stratified and shuffled split of D and 𝒯 into n folds
    foreach fold i ∈ [1 … n] do
      // Build X_train from complete genomes
      D_train ← D − D[i] ; 𝒯_train ← 𝒯 − 𝒯[i]
      X_train ← build_kmer_matrix(D_train, k)
      // Build X_test from fragments
      D_test ← D[i] ; 𝒯_test ← 𝒯[i]
      G_test, 𝒯_Gtest ← fragment_genomes(D_test, 𝒯_test, ft_size)
      X_test ← build_kmer_matrix(G_test, k)
      // Learn and test model
      M ← learn_model(CLF, X_train, 𝒯_train)
      𝒯_pred ← test_model(M, X_test)
      F_k += compute_f_measure(𝒯_Gtest, 𝒯_pred) / n
    append(F_list, F_k)
end
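The inner loop shared by both algorithms can be sketched in Python as below. This is a simplified illustration under assumptions: `stratified_folds` and `cross_validate` are hypothetical names, the fold assignment is a plain shuffled round-robin within each class, and the learner, predictor and scoring function are passed in as callables rather than taken from any particular library.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, seed=0):
    """Return n_folds lists of indices with roughly equal class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % n_folds].append(idx)   # deal class members round-robin
    return folds

def cross_validate(X, labels, n_folds, learn_model, test_model, score):
    """Average score over folds: train on n-1 folds, test on the held-out fold."""
    f_k = 0.0
    for test_idx in stratified_folds(labels, n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = learn_model([X[i] for i in train_idx], [labels[i] for i in train_idx])
        pred = test_model(model, [X[i] for i in test_idx])
        f_k += score([labels[i] for i in test_idx], pred) / n_folds
    return f_k

# A class with only four sequences lands once in each of four folds.
labels = ["6f"] * 4 + ["1a"] * 8
print([sum(labels[i] == "6f" for i in fold) for fold in stratified_folds(labels, 4)])
# [1, 1, 1, 1]
```

For Algorithm 2, the only change would be that the held-out genomes are cut into fragments before their k-mer vectors are built.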

V Results and discussion

Here, we present the experimental results when classifiers are trained on complete genomes but tested either with complete genomes or with genomic fragments. Then we highlight the overall observations that could guide subsequent studies.

V-A Evaluation with complete genomes

The classifier performances were assessed in genotyping and subtyping of HCV complete genomes. The evaluation was based on a cross-validation strategy as described in Algorithm 1. In the subtyping data (HCVSUBCG), the class 6f is underrepresented since it contains only four sequences. To ensure a large coverage of subtypes, we opted not to discard this class and to use a 4-fold cross-validation, so that this class is represented by at least one sequence in each training fold. The results are presented as a function of k-mer length (k), from 4 to 15 nucleotides.

Fig. 1: Averaged weighted F-measures of generative models tested on complete genomes. Filled regions correspond to the mean ± standard deviation of weighted F-measures of cross-validation iterations.
TABLE II: Best and worst averaged weighted F-measures of linear models tested on complete genomes and their corresponding k lengths.

| Classifier | Model | Genotyping: best F-measure | k lengths | Genotyping: worst F-measure | k lengths | Subtyping: best F-measure | k lengths | Subtyping: worst F-measure | k lengths |
|---|---|---|---|---|---|---|---|---|---|
| Multinomial Bayes | MLE | 0.837 ± 0.019 | 6 | 0.448 ± 0.073 | 4 | 0.654 ± 0.039 | 5 | 0.364 ± 0.061 | 9-15 |
| Multinomial Bayes | alpha=1e-100 | 0.997 ± 0.007 | 9-15 | 0.448 ± 0.073 | 4 | 0.975 ± 0.019 | 11-15 | 0.545 ± 0.067 | 4 |
| Multinomial Bayes | alpha=1e-10 | 0.997 ± 0.007 | 9-15 | 0.448 ± 0.073 | 4 | 0.975 ± 0.019 | 9-15 | 0.545 ± 0.067 | 4 |
| Multinomial Bayes | alpha=1e-5 | 0.997 ± 0.007 | 9-15 | 0.448 ± 0.073 | 4 | 0.977 ± 0.017 | 9 | 0.545 ± 0.067 | 4 |
| Multinomial Bayes | alpha=1e-2 | 0.997 ± 0.007 | 10-13 | 0.448 ± 0.073 | 4 | 0.977 ± 0.017 | 9 | 0.545 ± 0.067 | 4 |
| Multinomial Bayes | alpha=1 | 0.997 ± 0.007 | 11-15 | 0.447 ± 0.073 | 4 | 0.977 ± 0.017 | 10-13 | 0.545 ± 0.067 | 4 |
| Markov | MLE | 0.900 ± 0.045 | 5 | 0.002 ± 0.003 | 7 | 0.849 ± 0.036 | 5 | 0.000 ± 0.001 | 6 |
| Markov | alpha=1e-100 | 0.997 ± 0.007 | 8-9 | 0.509 ± 0.060 | 14 | 0.975 ± 0.019 | 8 | 0.432 ± 0.051 | 15 |
| Markov | alpha=1e-10 | 0.997 ± 0.007 | 8-9 | 0.499 ± 0.054 | 15 | 0.975 ± 0.019 | 8 | 0.372 ± 0.034 | 15 |
| Markov | alpha=1e-5 | 0.997 ± 0.007 | 8 | 0.487 ± 0.038 | 15 | 0.975 ± 0.019 | 8 | 0.275 ± 0.023 | 15 |
| Markov | alpha=1e-2 | 0.997 ± 0.007 | 8 | 0.303 ± 0.022 | 15 | 0.975 ± 0.019 | 8 | 0.124 ± 0.017 | 15 |
| Markov | alpha=1 | 0.959 ± 0.018 | 7 | 0.017 ± 0.017 | 13 | 0.939 ± 0.007 | 6 | 0.007 ± 0.011 | 14 |
| Logistic Regression | LR_L1 | 0.997 ± 0.007 | 4-11 | 0.996 ± 0.007 | 12-15 | 0.975 ± 0.019 | 4 | 0.941 ± 0.006 | 15 |
| Logistic Regression | LR_L2 | 0.997 ± 0.007 | 4-15 | 0.997 ± 0.007 | 4-15 | 0.975 ± 0.019 | 4-5 | 0.967 ± 0.009 | 13-15 |
| Linear SVM | LSVM_L1 | 1.000 ± 0.000 | 9-10 | 0.989 ± 0.015 | 13-14 | 0.975 ± 0.019 | 4 | 0.950 ± 0.018 | 12 |
| Linear SVM | LSVM_L2 | 0.997 ± 0.007 | 4-15 | 0.997 ± 0.007 | 4-15 | 0.975 ± 0.019 | 4-5 | 0.971 ± 0.018 | 13 |

V-A1 Generative models

In this study, we assessed the performance of two types of generative classifiers: multinomial Bayes (MB) and Markov chain (Markov) classifiers. In order to infer the parameters of these classifiers (class-conditional densities) we used Maximum Likelihood Estimation (MLE) and Bayesian approaches. For the Bayesian approach to estimating MB and Markov parameters we used different α values for smoothing: (α=1e-100, 1e-10, 1e-5, 1e-2 or 1). The evaluation results of generative models on complete genomes are shown in Figure 1 and Table II.

All MB models have almost the same performance in terms of weighted F-measure when k ∈ {4,5} (in genotyping from 0.447±0.072 to 0.789±0.068 and in subtyping from 0.545±0.06 to 0.676±0.032, respectively). However, starting from k=6, the performance of MB models based on MLE (MLE-MB) is lower than that of models based on a Bayesian approach (B-MB). The weighted F-measure of the MLE-MB model drops from its maximum value of 0.837±0.019 with k=6 to 0.465±0.035 with k ∈ [11,15] in genotyping, and from 0.654±0.039 with k=5 to 0.364±0.061 with k ∈ [9,15] in subtyping. The performance of B-MB models improves as the k-mer length increases, in both genotyping and subtyping. Furthermore, with k ∈ [11,13] in genotyping and with k ∈ {9,10} in subtyping, all B-MB models reach a maximum weighted F-measure of 0.997.

Markov chain models based on MLE (MLE-Markov) show their best performance at k=5, with weighted F-measures of 0.900±0.045 and 0.849±0.036 in genotyping and subtyping, respectively. At k=7 in genotyping and at k=6 in subtyping, their performance drops drastically to an F-measure of 0.002±0.003 and 0.000±0.001, respectively, and remains very low for longer k-mers. Unlike B-MB models, Markov models based on a Bayesian approach (B-Markov) do not keep their performance high after reaching their maximum weighted F-measure. For the majority of Markov models with smoothing, the maximum weighted F-measures occur at k=8 and are equal to 0.997±0.007 in genotyping and 0.975±0.019 in subtyping. Moreover, across all classification experiments with B-Markov models, smoothing with α=1 shows the best performance with k ∈ [4,6] but lower performance than the other Bayesian models when k>7. In contrast, smoothing with α=1e-100 shows the best performance with all lengths of k except k ∈ [4,7].

V-A2 Discriminative models

Discriminative models were represented by two classifiers in our work: logistic regression (LR) and linear support vector machine (LSVM). For both classifiers, we evaluated L1 and L2 penalties for regularization. The squared hinge loss function was used for the LSVM classifier. Performance results on complete genomes are shown in Figures 2 and A.1 for each model. In genotype classification, LR and LSVM models classify the data with near-perfect weighted F-measures (>0.989±0.015) across all k-mer lengths. The best performance is shown by LSVM using L1-based regularization with k ∈ [9,10] (weighted F-measure =1.000±0.000). In subtype classification, the performance of LR and LSVM models decreases slightly, although the weighted F-measures remain greater than 0.941 for all experiments (see Table II). The maximum weighted F-measure value of 0.975±0.019 is reached by all discriminative models in subtyping. In general, L2-based models perform better than L1-based models, which is most clearly seen when k>8.

The evaluation with complete genomes shows that the overall performance varies substantially according to classifier types, their hyper-parameters and k-mer lengths. We observed different trends in classification performance between generative and discriminative models, but the same trends when comparing genotyping and subtyping tasks. Most models could achieve a high weighted F-measure value of 0.997 in genotyping and 0.939 in subtyping, except for MLE-based models. However, the k-mer lengths at which these weighted F-measure values are reached differ from one model instance to another. Using an SVM classifier, [10] reported quite similar results when classifying the same HCV genomic datasets (weighted F-measure =1.000 and 0.986 in genotyping and subtyping, respectively [10]).

V-B Evaluation with genomic fragments

Fig. 2: Averaged weighted F-measures of generative and discriminative models tested on different fragment lengths at subtyping (HCVSUBCG dataset). Filled regions correspond to the mean ± standard deviation of weighted F-measures of cross-validation iterations.

All proposed classifiers were trained with HCV complete genomes and tested with genomic fragments belonging to the same taxonomic classes as the genomes. We used a cross-validation strategy with 5-fold stratified and shuffled splits on the complete genome data. As described in Algorithm 2, in each iteration a model is learned on a training set composed of complete genomes and is then tested on fragments of the complete genomes in the test set. Up to 1000 fragments were sampled per class. The classifiers were evaluated separately with fragments of lengths 100 bp, 250 bp, 500 bp and 1000 bp. The results of this evaluation are shown in Figure 2, and in Figure A.1 and Tables A.II-V available in the online supplemental material.
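The fragment generation step could look like the sketch below. The text only states the fragment lengths and the cap of 1000 fragments per class, so the rest is assumed for illustration: fragments are cut as consecutive non-overlapping windows, the cap is applied by random sampling without replacement, and the function signature (including `max_per_class` and `seed`) is hypothetical.

```python
import random

def fragment_genomes(genomes, labels, ft_size, max_per_class=1000, seed=0):
    """Cut genomes into fragments of ft_size and keep at most max_per_class per class."""
    rng = random.Random(seed)
    by_class = {}
    for genome, label in zip(genomes, labels):
        # Assumption: consecutive non-overlapping windows; a short tail is dropped.
        frags = [genome[s:s + ft_size]
                 for s in range(0, len(genome) - ft_size + 1, ft_size)]
        by_class.setdefault(label, []).extend(frags)
    fragments, frag_labels = [], []
    for label, frags in by_class.items():
        if len(frags) > max_per_class:
            frags = rng.sample(frags, max_per_class)
        fragments.extend(frags)
        frag_labels.extend([label] * len(frags))
    return fragments, frag_labels

genomes = ["ACGT" * 25, "TTGA" * 25]          # two toy "genomes" of 100 nt
frags, frag_labels = fragment_genomes(genomes, ["1a", "2b"], ft_size=10, max_per_class=5)
print(len(frags), frag_labels.count("1a"))    # 10 5
```

Each fragment inherits the class of the genome it was cut from, which is what allows the F-measure to be computed on fragments.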

V-B1 Generative models

With all fragment lengths, all MB models (including MLE-MB) have almost the same performance with k lengths of 4 and 5 in both genotyping and subtyping. For these k-mer lengths, the weighted F-measure ranges from 0.299±0.021 to 0.857±0.019 for genotyping and from 0.079±0.006 to 0.829±0.017 for subtyping (see Tables A.II-V). At k=7, the performance of MB models starts to diverge: MLE-MB and B-MB with α=1e-100 have lower F-measure values than the other models. After reaching its maximum, MLE-MB performance starts to drop at k=7 for all fragment lengths in genotyping. However, the performance of B-MB models (including the model with α=1e-100) increases after k=8 and exceeds a weighted F-measure of 0.980 at some values of k for all fragment lengths and classification tasks.

Concerning Markov chain models, the performance of MLE-Markov reaches its maximum at k=6 for all fragment lengths and both classification tasks, except when testing on 500 bp and 1000 bp fragments in subtyping, where the maxima are at k=5. The maximum values range from 0.769±0.023 to 0.883±0.016 in genotyping and from 0.752±0.040 to 0.826±0.016 in subtyping (see Tables A.II-V). Beyond that point, MLE-Markov performance on fragments falls and remains very low. Among B-Markov models in genotyping, the model with α=1e-100 consistently has the maximal weighted F-measure values for all fragment lengths, with k=9 (values up to 0.988±0.016). Furthermore, in fragment subtyping, the performance of B-Markov models is highest at k ∈ [6,8] depending on the smoothing value. The model with α=1e-5 reaches the best weighted F-measure at k=8 with all fragment lengths for this classification task (values up to 0.987±0.013). Subsequently, for both classification tasks, the performance drops gradually when k>8 and remains low, with weighted F-measure <0.472±0.068.

V-B2 Discriminative models

LR and LSVM models show similar behaviors when tested with different fragment lengths. As shown in Figures 2 and A.1 and in Tables A.II-V, models with the L2 regularization penalty perform better than those with the L1 penalty, independently of the classification variables (k-mer lengths, fragment lengths and typing tasks). For LR and LSVM L2-based models, the weighted F-measure exceeds 0.900 when k-mer lengths are equal to or greater than 8, 7, 6 and 5 for 100 bp, 250 bp, 500 bp and 1000 bp fragments, respectively. Furthermore, in both classification tasks, the worst performance of L2-based models occurs when k equals 4 and the fragment length is 100 bp (weighted F-measures <0.332±0.011). However, L2-based models maintain a higher performance when k is larger. Models with the L1 regularization penalty have lower performance and their weighted F-measure does not exceed 0.767±0.026. In fragment genotyping, LR models with the L1 penalty have their highest weighted F-measure between 0.376±0.016 and 0.767±0.026, with k=7 for 100 bp fragments and k=8 for other fragment lengths. Moreover, in subtyping, they show a maximum weighted F-measure between 0.206±0.006 and 0.662±0.017, with k=8 for 100 and 500 bp, k=9 for 250 bp and k=6 for 1000 bp fragments. LSVM models regularized by the L1 penalty generally have lower results than L1-based LR models. In fragment genotyping, the maximum weighted F-measure for L1-based LSVM is between 0.323±0.007 and 0.730±0.019, with k=7 for 100 bp and 250 bp fragments, k=8 for 500 bp fragments and k=6 for 1000 bp fragments. In fragment subtyping, the maximum performance is obtained with k=8 for all fragment lengths except 1000 bp fragments (k=6), with weighted F-measures between 0.187±0.009 and 0.642±0.021. Unlike L2-based models, the capability of L1-based models declines in both classification tasks when k-mer lengths are longer.

V-C Overall remarks

We evaluated a set of linear classifiers in genotyping and subtyping of HCV genomic sequences represented by k-mer profiles. This study evaluated several variables: the level of the taxonomic classification task (genotyping or subtyping), classifier types (generative or discriminative) and their hyper-parameters, genomic sequence lengths (complete or partial) and k-mer lengths (from 4 to 15).

The results of this evaluation show that classification at low-level taxonomic clades is more difficult than at higher levels. In general, the classifiers performed better at genotyping than at subtyping HCV sequences. Several studies on viral [9, 10] and metagenomic [3, 20, 21] taxonomic classification have reported similar results, where the performance is better at high-level classifications. Genomic sequences are more similar at low-level than at high-level clades, which makes it more difficult to discriminate between sequences at low-level clades.

At both taxonomic levels, at least one generative model and one discriminative model reached a weighted F-measure >0.950, depending on their hyper-parameters and lengths of k. Hence, we did not notice a clear advantage of either type of classifier in terms of weighted F-measure. However, within each type, one model setting gave a clear advantage. For instance, B-MB with α=1e-100 and LSVM with the L2 penalty are the best choices among generative and discriminative models, respectively, as they were stable across all classification experiments.
As observed in previous studies [20, 21, 22], generative classifiers (MB and Markov) are sensitive to how their parameters (class-conditional densities) are inferred, either by MLE or by Bayesian approaches. The MLE approach can overfit and produce a sparse parameter matrix W, because k-mers unseen in the training step get null estimates [20, 23] and very small probabilities can underflow the numerical precision [14]. Bayesian-based models avoid null estimates and numerical underflow by adding positive, non-zero values α (pseudo-counts) [24]. Hence, the MLE approach leads to inferior results compared to the Bayesian approach, except for short k-mers (4 and 5), where both have similar performance since most k-mers are expected to be seen in the dataset. Moreover, when k ∈ {6,7}, Bayesian models (MB and Markov) with larger α perform better than with smaller α. However, this trend reverses when k-mers are longer than 7 nucleotides. [21] reported comparable results in classifying microbial 16S and fungal 28S rRNA sequences with MB classifiers. At k=8, models with α>0.1 have lower accuracies than models with α<0.0001 in classifying full-length rRNA sequences [21].
Finally, the B-Markov model performance decreases when k>10 despite the parameter smoothing. Larger k lengths produce sparser k-mer profiles and, when estimating Markov model parameters, the division of k-mer by (k-1)-mer probability densities in equation 7 gives small values that can underflow the numerical precision.
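The underflow mentioned here is easy to reproduce. The sketch below multiplies a small probability once per k-mer of a genome-sized sequence and compares the result with the equivalent sum of logarithms. The numbers are illustrative only, and working in log space is shown as a common remedy; whether the evaluated implementations do so is not stated in the text.

```python
import math

def direct_product(p, d):
    """Multiply probability p by itself d times in plain floating point."""
    result = 1.0
    for _ in range(d):
        result *= p
    return result

def log_space_sum(p, d):
    """Same quantity on the log scale: d * log(p)."""
    return d * math.log(p)

p = 1e-5     # illustrative per-k-mer probability
d = 9500     # roughly the number of k-mers in one HCV genome

print(direct_product(p, d))                  # 0.0: the product underflows
print(math.isfinite(log_space_sum(p, d)))    # True: the log-likelihood stays finite
```

Once two class likelihoods have both underflowed to zero they can no longer be compared, which is consistent with the collapse of some models at large k.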

Discriminative classifiers (LR and LSVM) classified HCV genotypes and subtypes almost perfectly with complete genomes represented by k-mers of any length. LR and LSVM models behave similarly across all classification experiments when they use the same regularization penalty. This suggests that their loss functions (logistic loss for LR and squared hinge loss for LSVM) converge towards comparable results. When evaluating different classifiers for HIV-1 genome subtyping, [11] concluded that SVM-based classifiers and logistic regression achieved the highest performances. The authors reported accuracies of 96.49% and 95.32% for LSVM and LR, respectively, at k=6 [11].
In our study, the form of regularization did not influence the performance of either classifier, whereas it played a crucial role in classifying genomic fragments when model parameters are learned from global profiles of k-mers from complete genomes. Unlike the L2 penalty, regularization with the L1 penalty substantially decreases the performance of classifiers on partial genomes. On the one hand, linear classifiers produce a sparse parameter matrix W with L1 regularization but a dense matrix with L2 regularization [23]. On the other hand, genomic fragments also generate sparse vectors X since they are partial. Hence, with L1-based models, a mismatch between the non-zero entries of W and those of X can easily occur if W was not learned from the same distribution as X, as in our evaluation with fragments.
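A toy example can make this mismatch concrete. All weights and counts below are made up for illustration and do not come from the study: `w_l1` mimics a sparse weight vector, `w_l2` a dense one, and `x_fragment` a short fragment that happens to contain none of the k-mers retained by the sparse model.

```python
def linear_score(w, w0, x):
    """Linear score w . x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def overlap(w, x):
    """Number of features where both the weight and the count are non-zero."""
    return sum(1 for wi, xi in zip(w, x) if wi != 0 and xi != 0)

# Hypothetical weights over six k-mers.
w_l1 = [0.0, 0.0, 2.0, 0.0, 0.0, -1.5]      # sparse, as L1 regularization tends to give
w_l2 = [0.3, -0.2, 0.9, 0.1, -0.4, -0.7]    # dense, as L2 regularization tends to give

x_genome = [3, 1, 4, 2, 5, 1]      # a complete genome covers most k-mers
x_fragment = [1, 0, 0, 1, 0, 0]    # a short fragment covers only a few

print(overlap(w_l1, x_genome), overlap(w_l1, x_fragment))   # 2 0
print(overlap(w_l2, x_genome), overlap(w_l2, x_fragment))   # 6 2
print(linear_score(w_l1, 0.5, x_fragment))                  # 0.5, the intercept alone
```

When the overlap is zero, the score reduces to the intercept and carries no information about the fragment.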

Evaluated classifiers perform better when tested with genomic sequences homologous to those used in the learning step. Surprisingly, though, some generative and discriminative models trained with complete genomes perform very well when classifying partial sequences. We observed that longer fragments are classified better than shorter ones. This was also observed in previous studies of viral genomic [8] and metagenomic [12, 21, 25] taxonomic classification. Longer fragments generate less sparse data vectors and provide more information for classification [21]. Moreover, the evaluated models need longer k-mers to reach their maximum performance when fragments are shorter, as with the MB classifier in [12] and LR in [2].
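A simple counting argument shows why shorter fragments give sparser vectors. A fragment of length L contains L-k+1 overlapping k-mers, so at most min(L-k+1, 4^k) of the 4^k possible features can be non-zero. The Python sketch below computes this upper bound for the fragment lengths used in our evaluation; the function name is hypothetical and the bound ignores repeated k-mers, which only make the vectors sparser.

```python
def max_nonzero_fraction(length, k):
    """Upper bound on the fraction of non-zero k-mer features for one fragment."""
    windows = max(length - k + 1, 0)   # number of overlapping k-mers
    n_features = 4 ** k                # size of the k-mer vocabulary (A, C, G, T)
    return min(windows, n_features) / n_features


for k in (4, 8, 12):
    for length in (100, 250, 500, 1000):
        bound = max_nonzero_fraction(length, k)
        print(f"k={k:2d} L={length:4d} max non-zero fraction={bound:.2e}")
```

The bound grows linearly with the fragment length and shrinks exponentially with k, which is consistent with longer fragments being easier to classify.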

Within the HCV datasets, the choice of the k-mer length depends on all classification variables, including the taxonomic classification task, classifier types, hyper-parameters and sequence lengths. Mostly, k-mers with lengths between 8 and 10 are good options for classification tasks since they maximize the chance of achieving an optimal performance. However, this observation does not hold for MLE-based models, where the best option is k ∈ [5, 6]. Previous works in virus typing and identification reported optimal k-mer lengths between 6 and 9 [8, 2, 11, 10]. Moreover, studies in metagenomic taxonomic assignment showed that MB classifiers need k ∈ [12, 15] to achieve good performance [12, 20, 25].

VI Conclusion

In this paper, we provide an exhaustive procedure to assess the classification performance of different discriminative and generative classifiers with complete and partial genomes. We apply it to a benchmark of HCV genomes for genotyping and subtyping using several k-mer lengths. The results highlight that no single classifier leads across the different experimental settings. Thus, exploring adequate experimental settings is required to capture the best performance. For HCV genomic data, these experimental settings capture the highest predictive performance and allow comparison with state-of-the-art tools. Furthermore, most models perform well in both fragment and complete genome predictions. However, the hyper-parameters used to estimate the model parameters and the k-mer lengths must vary in order to approach the optimal performance. This study will be generalized to other viruses and the framework will be released to allow reproducible and accurate experimental settings for virus classification.

Acknowledgment

We would like to thank Dylan Lebatteux, Golrokh Vitae and Hayda Almeida for helpful discussion.
This research was enabled in part by support provided by Calcul Québec and Compute Canada. It has also been supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Fonds de recherche du Québec - Nature et technologies (FRQNT), Génome Québec and Genome Canada through grants to ABD. AMR is an NSERC and FRQNT fellow.

References

  • [1] S. Flygare et al., “Taxonomer: an interactive metagenomics analysis portal for universal pathogen detection and host mRNA expression profiling,” Genome Biology, vol. 17, no. 1, p. 111, 2016.
  • [2] J. Ren et al., “VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data,” Microbiome, vol. 5, no. 1, p. 69, 2017.
  • [3] Q. Wang et al., “Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy,” Applied and Environmental Microbiology, vol. 73, no. 16, pp. 5261–5267, 2007.
  • [4] K. R. Patil et al., “The phylopythias web server for taxonomic assignment of metagenome sequences,” PLoS ONE, vol. 7, no. 6, p. e38581, 2012.
  • [5] A. van Belkum et al., “Role of Genomic Typing in Taxonomy, Evolutionary Genetics, and Microbial Epidemiology,” Clinical Microbiology Reviews, vol. 14, no. 3, pp. 547–560, 2001.
  • [6] A. Zielezinski et al., “Alignment-free sequence comparison: benefits, applications, and tools,” Genome biology, vol. 18, no. 1, p. 186, 2017.
  • [7] A. L. Bazinet and M. P. Cummings, “A comparative evaluation of sequence classification programs,” BMC Bioinformatics, vol. 13, no. 1, p. 92, 2012.
  • [8] D. Struck et al., “Comet: adaptive context-based modeling for ultrafast hiv-1 subtype identification,” Nucleic acids research, vol. 42, no. 18, p. e144, 2014.
  • [9] M. A. Remita et al., “A machine learning approach for viral genome classification,” BMC bioinformatics, vol. 18, no. 1, p. 208, 2017.
  • [10] D. Lebatteux et al., “Toward an alignment-free method for feature extraction and accurate classification of viral sequences,” Journal of Computational Biology, vol. 26, no. 6, pp. 519–535, 2019.
  • [11] S. Solis-Reyes et al., “An open-source k-mer based machine learning tool for fast and accurate subtyping of hiv-1 genomes,” PLOS ONE, vol. 13, no. 11, pp. 1–21, 2018.
  • [12] G. Rosen et al., “Metagenome Fragment Classification Using N-Mer Frequency Profiles,” Advances in Bioinformatics, vol. 2008, pp. 1–12, 2008.
  • [13] I. Gregor et al., “PhyloPythiaS+ : a self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes,” PeerJ, vol. 4, p. e1603, 2016.
  • [14] C. M. Bishop, Pattern recognition and machine learning.   Springer, New York, 2006.
  • [15] T. Hastie et al., The elements of statistical learning: data mining, inference, and prediction, 2nd ed.   Springer, New York, 2009.
  • [16] R. Durbin et al., Biological sequence analysis: probabilistic models of proteins and nucleic acids.   Cambridge University Press, 1998.
  • [17] C. Giannini and C. Brechot, “Hepatitis c virus biology,” Cell death and differentiation, vol. 10, no. S1, p. S27, 2003.
  • [18] “World health organization - hepatitis c,” https://www.who.int/en/news-room/fact-sheets/detail/hepatitis-c, accessed: 2019-06-03.
  • [19] P. Simmonds et al., “Consensus proposals for a unified system of nomenclature of hepatitis C virus genotypes,” Hepatology, vol. 42, no. 4, pp. 962–973, 2005.
  • [20] G. L. Rosen and S. D. Essinger, “Comparison of Statistical Methods to Classify Environmental Genomic Fragments,” IEEE Transactions on NanoBioscience, vol. 9, no. 4, pp. 310–316, 2010.
  • [21] K.-L. Liu and T.-T. Wong, “Naïve bayesian classifiers with multinomial models for rrna taxonomic assignment,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, no. 5, pp. 1–1, 2013.
  • [22] H. Vinje et al., “Comparing K-mer based methods for improved classification of 16S sequences,” BMC Bioinformatics, vol. 16, no. 1, p. 205, 2015.
  • [23] K. P. Murphy, Machine Learning: A Probabilistic Perspective.   The MIT Press, 2012.
  • [24] T.-T. Wong, “Alternative prior assumptions for improving the performance of naïve bayesian classifiers,” Data Mining and Knowledge Discovery, vol. 18, no. 2, pp. 183–213, 2009.
  • [25] N. Matsushita et al., “Metagenome fragment classification based on multiple motif-occurrence profiles,” PeerJ, vol. 2, p. e559, 2014.
  • [26] C. Kuiken et al., “The Los Alamos hepatitis C sequence database,” Bioinformatics, vol. 21, no. 3, pp. 379–384, 2004.
  • [27] T. de Oliveira et al., “An automated genotyping system for analysis of HIV-1 and other microbial sequences,” Bioinformatics, vol. 21, no. 19, pp. 3797–3800, 2005.
  • [28] J. Kim et al., “A classification approach for genotyping viral sequences based on multidimensional scaling and linear discriminant analysis,” BMC bioinformatics, vol. 11, no. 1, p. 434, 2010.
  • [29] P. Qiu et al., “HCV genotyping using statistical classification approach.” Journal of biomedical science, vol. 16, 2009.
  • [30] A. Fabijanska and S. Grabowski, “Viral Genome Deep Classifier,” IEEE Access, vol. 7, pp. 81 297–81 307, 2019.
TABLE A.I: Software and methods for HCV typing

|  | Software | Method | Availability | Reference |
| --- | --- | --- | --- | --- |
| Alignment-based | LANL HCVBlast | Blast against a database of HCV sequences | Web interface, https://hcv.lanl.gov | [26] |
| Alignment-based | HCV Typing Tool | Construct a phylogenetic tree | Web interface, https://www.genomedetective.com/app/typingtool/hcv | [27] |
| Alignment-based | MuLDAS | Construct distance matrix + Multidimensional scaling + Linear discriminant analysis (LDA) | Web interface, http://gsa.muldas.org/index.cgi | [28] |
| Alignment-based | Qiu et al. (2009) | Construct Position Weight Matrix + SVM or random forest | Not available | [29] |
| Alignment-free | COMET | K-mer profiles + Variable-order Markov model | Web interface, https://comet.lih.lu/index.php?cat=hcv | [8] |
| Alignment-free | CASTOR | RFLP-based features + Feature selection + Machine learning classifiers | Web interface, http://castor.bioinfo.uqam.ca | [9] |
| Alignment-free | KAMERIS | K-mer profiles + Machine learning classifiers | Open source code (MIT license), https://github.com/stephensolis/kameris.git | [11] |
| Alignment-free | CASTOR-KRFE | K-mer profiles + SVM-RFE feature selection + SVM classifier | Open source code (MIT license), https://github.com/bioinfoUQAM/CASTOR_KRFE.git | [10] |
| Alignment-free | VGDC | Deep convolutional neural network | Free source code, https://github.com/afabijanska/VGDC.git | [30] |

RFLP stands for restriction fragment length polymorphism

TABLE A.II: Averaged weighted F-measures of linear models tested on fragments of length 100 bp and their corresponding k lengths.

| Classifier | Model | Genotyping: best F-measure | k | Genotyping: worst F-measure | k | Subtyping: best F-measure | k | Subtyping: worst F-measure | k |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Multinomial Bayes | MLE | 0.749 ± 0.027 | 7 | 0.264 ± 0.007 | 4 | 0.728 ± 0.035 | 7 | 0.128 ± 0.004 | 4 |
| Multinomial Bayes | alpha=1e-100 | 0.982 ± 0.011 | 14 | 0.264 ± 0.007 | 4 | 0.952 ± 0.024 | 13 | 0.128 ± 0.004 | 4 |
| Multinomial Bayes | alpha=1e-10 | 0.982 ± 0.011 | 14 | 0.264 ± 0.007 | 4 | 0.956 ± 0.022 | 11 | 0.128 ± 0.004 | 4 |
| Multinomial Bayes | alpha=1e-5 | 0.981 ± 0.011 | 12 | 0.264 ± 0.007 | 4 | 0.957 ± 0.020 | 11 | 0.128 ± 0.004 | 4 |
| Multinomial Bayes | alpha=1e-2 | 0.975 ± 0.010 | 12 | 0.264 ± 0.007 | 4 | 0.955 ± 0.019 | 11 | 0.128 ± 0.004 | 4 |
| Multinomial Bayes | alpha=1 | 0.963 ± 0.008 | 12 | 0.264 ± 0.007 | 4 | 0.940 ± 0.021 | 13 | 0.128 ± 0.004 | 4 |
| Markov | MLE | 0.769 ± 0.023 | 6 | 0.016 ± 0.005 | 9 | 0.752 ± 0.040 | 6 | 0.001 ± 0.001 | 8 |
| Markov | alpha=1e-100 | 0.967 ± 0.014 | 9 | 0.074 ± 0.008 | 15 | 0.937 ± 0.029 | 8 | 0.030 ± 0.002 | 15 |
| Markov | alpha=1e-10 | 0.965 ± 0.014 | 9 | 0.073 ± 0.007 | 15 | 0.941 ± 0.026 | 8 | 0.030 ± 0.002 | 15 |
| Markov | alpha=1e-5 | 0.963 ± 0.017 | 9 | 0.073 ± 0.007 | 15 | 0.944 ± 0.024 | 8 | 0.030 ± 0.002 | 15 |
| Markov | alpha=1e-2 | 0.965 ± 0.016 | 8 | 0.066 ± 0.007 | 15 | 0.941 ± 0.024 | 8 | 0.029 ± 0.003 | 15 |
| Markov | alpha=1 | 0.891 ± 0.018 | 7 | 0.042 ± 0.006 | 15 | 0.735 ± 0.036 | 7 | 0.013 ± 0.001 | 14 |
| Logistic Regression | LR_L1 | 0.376 ± 0.016 | 7 | 0.113 ± 0.009 | 15 | 0.206 ± 0.006 | 8 | 0.031 ± 0.006 | 15 |
| Logistic Regression | LR_L2 | 0.960 ± 0.011 | 13 | 0.332 ± 0.011 | 4 | 0.936 ± 0.023 | 10 | 0.176 ± 0.006 | 4 |
| Linear SVM | LSVM_L1 | 0.323 ± 0.007 | 7 | 0.071 ± 0.006 | 15 | 0.187 ± 0.009 | 8 | 0.015 ± 0.002 | 13 |
| Linear SVM | LSVM_L2 | 0.963 ± 0.007 | 12 | 0.308 ± 0.007 | 4 | 0.941 ± 0.021 | 10 | 0.165 ± 0.006 | 4 |
TABLE A.III: Averaged weighted F-measures of linear models tested on fragments of length 250 bp and their corresponding k lengths.

| Classifier | Model | Genotyping: best F-measure | k | Genotyping: worst F-measure | k | Subtyping: best F-measure | k | Subtyping: worst F-measure | k |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Multinomial Bayes | MLE | 0.804 ± 0.009 | 6 | 0.301 ± 0.019 | 4 | 0.715 ± 0.028 | 6 | 0.212 ± 0.012 | 4 |
| Multinomial Bayes | alpha=1e-100 | 0.990 ± 0.010 | 13 | 0.301 ± 0.019 | 4 | 0.978 ± 0.014 | 14 | 0.212 ± 0.012 | 4 |
| Multinomial Bayes | alpha=1e-10 | 0.989 ± 0.010 | 13 | 0.301 ± 0.019 | 4 | 0.980 ± 0.013 | 14 | 0.212 ± 0.012 | 4 |
| Multinomial Bayes | alpha=1e-5 | 0.988 ± 0.012 | 13 | 0.301 ± 0.019 | 4 | 0.980 ± 0.014 | 13 | 0.212 ± 0.012 | 4 |
| Multinomial Bayes | alpha=1e-2 | 0.985 ± 0.011 | 13 | 0.301 ± 0.019 | 4 | 0.977 ± 0.014 | 13 | 0.212 ± 0.012 | 4 |
| Multinomial Bayes | alpha=1 | 0.974 ± 0.015 | 14 | 0.301 ± 0.019 | 4 | 0.965 ± 0.016 | 15 | 0.211 ± 0.011 | 4 |
| Markov | MLE | 0.850 ± 0.012 | 6 | 0.010 ± 0.005 | 8 | 0.782 ± 0.028 | 6 | 0.002 ± 0.001 | 7 |
| Markov | alpha=1e-100 | 0.985 ± 0.014 | 9 | 0.232 ± 0.030 | 15 | 0.968 ± 0.017 | 8 | 0.066 ± 0.005 | 15 |
| Markov | alpha=1e-10 | 0.984 ± 0.013 | 9 | 0.230 ± 0.029 | 15 | 0.972 ± 0.016 | 8 | 0.067 ± 0.005 | 15 |
| Markov | alpha=1e-5 | 0.983 ± 0.013 | 9 | 0.193 ± 0.023 | 15 | 0.973 ± 0.015 | 8 | 0.065 ± 0.005 | 15 |
| Markov | alpha=1e-2 | 0.979 ± 0.015 | 8 | 0.127 ± 0.016 | 15 | 0.972 ± 0.016 | 8 | 0.053 ± 0.005 | 15 |
| Markov | alpha=1 | 0.930 ± 0.014 | 7 | 0.057 ± 0.006 | 15 | 0.840 ± 0.023 | 6 | 0.021 ± 0.003 | 15 |
| Logistic Regression | LR_L1 | 0.489 ± 0.012 | 8 | 0.201 ± 0.009 | 15 | 0.344 ± 0.021 | 9 | 0.093 ± 0.003 | 14 |
| Logistic Regression | LR_L2 | 0.973 ± 0.012 | 15 | 0.434 ± 0.016 | 4 | 0.964 ± 0.018 | 14 | 0.300 ± 0.002 | 4 |
| Linear SVM | LSVM_L1 | 0.441 ± 0.018 | 7 | 0.089 ± 0.010 | 15 | 0.315 ± 0.010 | 8 | 0.033 ± 0.006 | 13 |
| Linear SVM | LSVM_L2 | 0.972 ± 0.009 | 15 | 0.422 ± 0.023 | 4 | 0.967 ± 0.014 | 11 | 0.288 ± 0.005 | 4 |
TABLE A.IV: Averaged weighted F-measures of linear models tested on fragments of length 500 bp and their corresponding k lengths.

| Classifier | Model | Genotyping: best F-measure | k | Genotyping: worst F-measure | k | Subtyping: best F-measure | k | Subtyping: worst F-measure | k |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Multinomial Bayes | MLE | 0.831 ± 0.011 | 6 | 0.358 ± 0.027 | 4 | 0.755 ± 0.022 | 6 | 0.340 ± 0.016 | 4 |
| Multinomial Bayes | alpha=1e-100 | 0.991 ± 0.010 | 14 | 0.358 ± 0.027 | 4 | 0.985 ± 0.014 | 12 | 0.340 ± 0.016 | 4 |
| Multinomial Bayes | alpha=1e-10 | 0.990 ± 0.011 | 14 | 0.358 ± 0.027 | 4 | 0.986 ± 0.013 | 12 | 0.340 ± 0.016 | 4 |
| Multinomial Bayes | alpha=1e-5 | 0.989 ± 0.014 | 11 | 0.358 ± 0.027 | 4 | 0.986 ± 0.013 | 13 | 0.340 ± 0.016 | 4 |
| Multinomial Bayes | alpha=1e-2 | 0.987 ± 0.014 | 11 | 0.358 ± 0.027 | 4 | 0.985 ± 0.012 | 14 | 0.340 ± 0.016 | 4 |
| Multinomial Bayes | alpha=1 | 0.982 ± 0.012 | 14 | 0.358 ± 0.027 | 4 | 0.975 ± 0.013 | 15 | 0.340 ± 0.015 | 4 |
| Markov | MLE | 0.867 ± 0.015 | 6 | 0.012 ± 0.007 | 8 | 0.778 ± 0.014 | 5 | 0.002 ± 0.001 | 7 |
| Markov | alpha=1e-100 | 0.987 ± 0.014 | 9 | 0.378 ± 0.050 | 15 | 0.981 ± 0.013 | 8 | 0.105 ± 0.003 | 15 |
| Markov | alpha=1e-10 | 0.985 ± 0.013 | 9 | 0.344 ± 0.047 | 15 | 0.984 ± 0.013 | 8 | 0.097 ± 0.003 | 15 |
| Markov | alpha=1e-5 | 0.985 ± 0.013 | 9 | 0.283 ± 0.036 | 15 | 0.985 ± 0.013 | 8 | 0.084 ± 0.003 | 15 |
| Markov | alpha=1e-2 | 0.982 ± 0.014 | 8 | 0.180 ± 0.021 | 15 | 0.983 ± 0.013 | 8 | 0.067 ± 0.003 | 15 |
| Markov | alpha=1 | 0.930 ± 0.018 | 7 | 0.063 ± 0.008 | 15 | 0.897 ± 0.020 | 6 | 0.029 ± 0.006 | 15 |
| Logistic Regression | LR_L1 | 0.621 ± 0.009 | 8 | 0.296 ± 0.009 | 15 | 0.493 ± 0.009 | 8 | 0.184 ± 0.011 | 14 |
| Logistic Regression | LR_L2 | 0.979 ± 0.011 | 8 | 0.558 ± 0.013 | 4 | 0.976 ± 0.014 | 13 | 0.480 ± 0.009 | 4 |
| Linear SVM | LSVM_L1 | 0.587 ± 0.003 | 8 | 0.160 ± 0.031 | 15 | 0.462 ± 0.006 | 8 | 0.057 ± 0.014 | 14 |
| Linear SVM | LSVM_L2 | 0.982 ± 0.011 | 13 | 0.540 ± 0.011 | 4 | 0.978 ± 0.013 | 10 | 0.460 ± 0.011 | 4 |
TABLE A.V: Averaged weighted F-measures of linear models tested on fragments of length 1000 bp and their corresponding k lengths.

| Classifier | Model | Genotyping: best F-measure | k | Genotyping: worst F-measure | k | Subtyping: best F-measure | k | Subtyping: worst F-measure | k |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Multinomial Bayes | MLE | 0.848 ± 0.012 | 6 | 0.417 ± 0.029 | 4 | 0.727 ± 0.019 | 6 | 0.407 ± 0.029 | 4 |
| Multinomial Bayes | alpha=1e-100 | 0.994 ± 0.009 | 15 | 0.417 ± 0.029 | 4 | 0.987 ± 0.013 | 14 | 0.407 ± 0.029 | 4 |
| Multinomial Bayes | alpha=1e-10 | 0.993 ± 0.010 | 13 | 0.417 ± 0.029 | 4 | 0.987 ± 0.012 | 11 | 0.407 ± 0.029 | 4 |
| Multinomial Bayes | alpha=1e-5 | 0.990 ± 0.015 | 14 | 0.417 ± 0.029 | 4 | 0.988 ± 0.012 | 11 | 0.407 ± 0.029 | 4 |
| Multinomial Bayes | alpha=1e-2 | 0.989 ± 0.015 | 11 | 0.417 ± 0.029 | 4 | 0.987 ± 0.011 | 10 | 0.407 ± 0.029 | 4 |
| Multinomial Bayes | alpha=1 | 0.986 ± 0.014 | 13 | 0.417 ± 0.030 | 4 | 0.984 ± 0.011 | 14 | 0.407 ± 0.028 | 4 |
| Markov | MLE | 0.883 ± 0.016 | 6 | 0.036 ± 0.013 | 8 | 0.826 ± 0.016 | 5 | 0.002 ± 0.001 | 7 |
| Markov | alpha=1e-100 | 0.988 ± 0.016 | 9 | 0.472 ± 0.068 | 15 | 0.983 ± 0.011 | 9 | 0.199 ± 0.015 | 15 |
| Markov | alpha=1e-10 | 0.987 ± 0.016 | 9 | 0.409 ± 0.059 | 15 | 0.986 ± 0.013 | 8 | 0.139 ± 0.013 | 15 |
| Markov | alpha=1e-5 | 0.987 ± 0.016 | 8 | 0.347 ± 0.051 | 15 | 0.987 ± 0.013 | 8 | 0.104 ± 0.011 | 15 |
| Markov | alpha=1e-2 | 0.984 ± 0.015 | 8 | 0.209 ± 0.035 | 15 | 0.987 ± 0.012 | 8 | 0.057 ± 0.008 | 15 |
| Markov | alpha=1 | 0.943 ± 0.014 | 7 | 0.044 ± 0.007 | 15 | 0.917 ± 0.022 | 6 | 0.014 ± 0.005 | 14 |
| Logistic Regression | LR_L1 | 0.767 ± 0.026 | 8 | 0.411 ± 0.016 | 15 | 0.662 ± 0.017 | 6 | 0.272 ± 0.021 | 15 |
| Logistic Regression | LR_L2 | 0.989 ± 0.008 | 14 | 0.695 ± 0.014 | 4 | 0.979 ± 0.014 | 12 | 0.675 ± 0.014 | 4 |
| Linear SVM | LSVM_L1 | 0.730 ± 0.019 | 6 | 0.220 ± 0.027 | 15 | 0.642 ± 0.021 | 6 | 0.132 ± 0.013 | 13 |
| Linear SVM | LSVM_L2 | 0.990 ± 0.010 | 15 | 0.691 ± 0.011 | 4 | 0.985 ± 0.013 | 10 | 0.653 ± 0.015 | 4 |
Fig. A.1: Averaged weighted F-measures of generative and discriminative models tested on different fragment lengths at genotyping (HCVGENCG dataset). Filled regions correspond to the mean ± standard deviation of weighted F-measures of cross-validation iterations.