Abstract
Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.