Abstract
We present VITA, a Vision-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Traditional flow matching and diffusion policies sample from standard source distributions (e.g., Gaussian noise) and require additional conditioning mechanisms like cross-attention to condition action generation on visual information, creating time and space overheads. VITA proposes a novel paradigm that treats latent images as the flow source, learning an inherent mapping from vision to action while eliminating separate conditioning modules and preserving generative modeling capabilities. Learning flows between fundamentally different modalities like vision and action is challenging due to sparse action data lacking semantic structures and dimensional mismatches between high-dimensional visual representations and raw actions. We address this by creating a structured action latent space via an autoencoder as the flow matching target, up-sampling raw actions to match visual representation shapes. Crucially, we supervise flow matching with both encoder targets and final action outputs through flow latent decoding, which backpropagates action reconstruction loss through sequential flow matching ODE solving steps for effective end-to-end learning. Implemented as simple MLP layers, VITA is evaluated on challenging bi-manual manipulation tasks on the ALOHA platform, including 5 simulation and 2 real-world tasks. Despite its simplicity, MLP-only VITA outperforms or matches state-of-the-art generative policies while reducing inference latency by 50-130% compared to conventional flow matching policies requiring different conditioning mechanisms or complex architectures. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks like those in ALOHA benchmarks.
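To make the vision-to-action flow concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear interpolation path between the visual latent and the action latent, a fixed-step Euler solver, and toy stand-ins for the action autoencoder (repeat to up-sample, average to decode). The learned MLP velocity field is replaced by a hypothetical oracle, and all names and dimensions (`LATENT_DIM`, `encode_action`, `euler_solve`, and so on) are illustrative only.

```python
import random

LATENT_DIM = 8   # hypothetical shared latent size (assumption)
ACTION_DIM = 2   # hypothetical raw action size (assumption)
REPEAT = LATENT_DIM // ACTION_DIM


def encode_action(action):
    """Toy stand-in for the action encoder: up-sample by repeating components."""
    return [a for a in action for _ in range(REPEAT)]


def decode_action(latent):
    """Toy stand-in for the action decoder: average each repeated group."""
    return [sum(latent[i:i + REPEAT]) / REPEAT
            for i in range(0, LATENT_DIM, REPEAT)]


def target_velocity(z0, z1):
    """Regression target for a linear path z_t = (1 - t) * z0 + t * z1."""
    return [b - a for a, b in zip(z0, z1)]


def euler_solve(velocity_fn, z0, num_steps=4):
    """Integrate dz/dt = velocity_fn(z, t) from t = 0 to t = 1 with Euler steps."""
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(z, k * dt)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


def mse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


random.seed(0)
# The flow source is a visual latent, not Gaussian noise (random stand-in here).
z_vision = [random.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
action = [0.3, -0.7]
z_action = encode_action(action)  # flow matching target in latent space


def oracle_velocity(z, t):
    """Hypothetical perfect velocity field; a trained MLP would approximate this."""
    return target_velocity(z_vision, z_action)


z_pred = euler_solve(oracle_velocity, z_vision)
action_pred = decode_action(z_pred)

# The two supervision signals described above: the encoder-target loss in
# latent space and the action reconstruction loss after decoding the ODE output.
latent_loss = mse(z_pred, z_action)
action_loss = mse(action_pred, action)
```

In an actual policy the velocity field would be a trained network, and the action loss would be backpropagated through the Euler steps. The oracle only shows that integrating the target velocity from the visual latent lands on the action latent, so no separate conditioning input is needed.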