Abstract
Recent growth and proliferation of malware have tested practitioners' ability to promptly classify new samples according to malware families. In contrast to labor-intensive reverse engineering efforts, machine learning approaches have demonstrated increased speed and accuracy. However, most existing deep-learning malware family classifiers must be calibrated using a large number of samples that are painstakingly manually analyzed before training. Furthermore, as novel malware samples arise that are beyond the scope of the training set, additional reverse engineering effort must be employed to update the training set. The sheer volume of new samples found in the wild creates substantial pressure on practitioners' ability to reverse engineer enough malware to adequately train modern classifiers. In this paper, we present MalMixer, a malware family classifier using semi-supervised learning that achieves high accuracy with sparse training data. We present a domain-knowledge-aware data augmentation technique for malware feature representations, enhancing few-shot performance of semi-supervised malware family classification. We show that MalMixer achieves state-of-the-art performance in few-shot malware family classification settings. Our research confirms the feasibility and effectiveness of lightweight, domain-knowledge-aware data augmentation methods for malware features and shows the capabilities of similar semi-supervised classifiers in addressing malware classification issues.