Visual Perception Engine: Fast and Flexible Multi-Head Inference for Robotic Vision Tasks

Abstract

Deploying multiple machine learning models on resource-constrained robotic platforms for different perception tasks often results in redundant computations, large memory footprints, and complex integration challenges. In response, this work presents Visual Perception Engine (VPEngine), a modular framework designed to enable efficient GPU usage for visual multitasking while maintaining extensibility and developer accessibility. Our framework architecture leverages a shared foundation model backbone that extracts image representations, which are efficiently shared, without any unnecessary GPU-CPU memory transfers, across multiple specialized task-specific model heads running in parallel. This design eliminates the computational redundancy inherent in the feature extraction component when deploying traditional sequential models while enabling dynamic task prioritization based on application demands. We demonstrate our framework's capabilities through an example implementation using DINOv2 as the foundation model with multiple task heads (depth, object detection, and semantic segmentation), achieving up to 3x speedup compared to sequential execution. Building on CUDA Multi-Process Service (MPS), VPEngine offers efficient GPU utilization and maintains a constant memory footprint while allowing per-task inference frequencies to be adjusted dynamically during runtime. The framework is written in Python and is open source with ROS2 C++ (Humble) bindings for ease of use by the robotics community across diverse robotic platforms. Our example implementation demonstrates end-to-end real-time performance at $\geq$50 Hz on NVIDIA Jetson Orin AGX for TensorRT-optimized models.
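
The shared-backbone idea can be illustrated with a minimal, CPU-only Python sketch. It is not VPEngine's API: the names `backbone`, `Head`, `SharedBackbonePipeline`, `every_n_frames`, and `set_rate` are hypothetical, the "models" are toy functions on lists of floats, and the heads run one after another rather than in parallel processes under CUDA MPS. The sketch only shows the two properties described above: features are extracted once per frame and reused by every head, and each head's rate can be changed at runtime.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = List[float]


def backbone(image: List[float]) -> Features:
    """Toy stand-in for a foundation model (e.g. DINOv2): one pass per frame."""
    mean = sum(image) / len(image)
    return [x - mean for x in image]


@dataclass
class Head:
    """Hypothetical task head that consumes the shared features."""
    name: str
    fn: Callable[[Features], float]
    every_n_frames: int = 1  # run this head on every n-th frame


class SharedBackbonePipeline:
    """Runs the backbone once per frame and fans features out to the heads."""

    def __init__(self, heads: List[Head]) -> None:
        self.heads: Dict[str, Head] = {h.name: h for h in heads}
        self.backbone_calls = 0
        self.frame_idx = 0

    def set_rate(self, name: str, every_n_frames: int) -> None:
        # Adjust a head's rate at runtime (assumed interface, for illustration).
        self.heads[name].every_n_frames = every_n_frames

    def step(self, image: List[float]) -> Dict[str, float]:
        self.backbone_calls += 1
        feats = backbone(image)  # computed once, reused by every head below
        outputs = {}
        for head in self.heads.values():
            if self.frame_idx % head.every_n_frames == 0:
                outputs[head.name] = head.fn(feats)
        self.frame_idx += 1
        return outputs


pipeline = SharedBackbonePipeline([
    Head("depth", lambda f: max(f)),
    Head("detection", lambda f: sum(abs(x) for x in f)),
    Head("segmentation", lambda f: min(f), every_n_frames=2),
])

results = [pipeline.step([1.0, 2.0, 3.0]) for _ in range(4)]

# Head invocations over 4 frames: depth 4 + detection 4 + segmentation 2 = 10.
head_calls = sum(len(r) for r in results)
print(pipeline.backbone_calls, head_calls)  # 4 backbone passes serve 10 head calls
```

In a sequential deployment where every model carries its own feature extractor, the same ten head invocations would need ten backbone passes instead of four. That redundancy is what a shared backbone removes, although the actual speedup depends on the relative cost of the backbone and the heads.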
