KiloGrams: Very Large N-Grams for Malware Classification

Abstract

N-grams have been a common tool for information retrieval and machine learning applications for decades. In nearly all previous works, only a few values of $n$ are tested, with $n > 6$ being exceedingly rare. Larger values of $n$ are not tested due to computational burden or the fear of overfitting. In this work, we present a method to find the top-$k$ most frequent $n$-grams that is 60$\times$ faster for small $n$, and can tackle large $n \geq 1024$. Despite the unprecedented size of $n$ considered, we show how these features still have predictive ability for malware classification tasks. More importantly, large $n$-grams provide benefits in producing features that are interpretable by malware analysis, and can be used to create general purpose signatures compatible with industry standard tools like Yara. Furthermore, the counts of common $n$-grams in a file may be added as features to publicly available human-engineered features that rival the efficacy of professionally-developed features when used to train gradient-boosted decision tree models on the EMBER dataset.
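
To make the task concrete, the following is a minimal sketch of the naive exact-counting baseline for top-$k$ byte $n$-grams, together with the per-file count features the abstract mentions. It is not the method presented in this work: it keeps every distinct $n$-gram in memory, which is the kind of computational burden that makes large $n$ impractical. The function names `top_k_ngrams` and `ngram_count_features` are hypothetical, and the sketch assumes each file is available as a Python `bytes` object and that the vocabulary is non-empty with all $n$-grams of equal length.

```python
from collections import Counter


def top_k_ngrams(blobs, n, k):
    """Return the k most frequent byte n-grams over an iterable of bytes objects.

    Naive baseline (an illustration, not the paper's algorithm): every
    distinct n-gram is stored, so memory grows with the corpus.
    """
    counts = Counter()
    for data in blobs:
        # Slide a window of width n over the file, one byte at a time.
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts.most_common(k)


def ngram_count_features(data, vocabulary):
    """Count overlapping occurrences of each selected n-gram in one file.

    Assumes `vocabulary` is a non-empty list of equal-length bytes objects.
    """
    n = len(vocabulary[0])
    local = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return [local[gram] for gram in vocabulary]
```

Under these assumptions, `top_k_ngrams(files, n, k)` selects the vocabulary once over the training corpus, and `ngram_count_features(file_bytes, vocabulary)` yields one count per selected $n$-gram that could be appended to an existing feature vector.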
