Abstract
Trustworthy robot behavior requires not only high levels of task success but also that the robot can reliably quantify how likely it is to succeed. To this end, we present the first systematic study of confidence calibration in vision-language-action (VLA) foundation models, which map visual observations and natural-language instructions to low-level robot motor commands. We begin with extensive benchmarking to understand the critical relationship between task success and calibration error across multiple datasets and VLA variants, finding that task performance and calibration are not in tension. Next, we introduce prompt ensembles for VLAs, a lightweight, Bayesian-inspired algorithm that averages confidence across paraphrased instructions and consistently improves calibration. We further analyze calibration over the task time horizon, showing that confidence is often most reliable after making some progress, suggesting natural points for risk-aware intervention. Finally, we reveal differential miscalibration across action dimensions and propose action-wise Platt scaling, a method to recalibrate each action dimension independently to produce better confidence estimates. Our aim in this study is to begin to develop the tools and conceptual understanding necessary to render VLAs both highly performant and highly trustworthy via reliable uncertainty quantification.
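
As a rough illustration of two ideas named above, the following minimal sketch averages a confidence score over paraphrased instructions and measures calibration with expected calibration error (ECE). It rests on assumptions not stated in the abstract: `confidence_fn` is a hypothetical stand-in for whatever scalar confidence a VLA exposes, the function names are illustrative, and equal-width binned ECE is one common calibration metric that may differ from the one used in the study.

```python
import numpy as np


def prompt_ensemble_confidence(confidence_fn, observation, paraphrases):
    """Average a confidence score over paraphrased instructions.

    confidence_fn is a hypothetical callable (observation, instruction) -> float
    in [0, 1]; it stands in for a VLA's confidence output (an assumption).
    """
    scores = [confidence_fn(observation, p) for p in paraphrases]
    return float(np.mean(scores))


def expected_calibration_error(confidences, successes, n_bins=10):
    """Equal-width binned ECE: weighted mean of |confidence - success rate|."""
    confidences = np.asarray(confidences, dtype=float)
    successes = np.asarray(successes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, right-closed bins, so 0.0 maps to the first bin
    # and 1.0 maps to the last bin.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(confidences[mask].mean() - successes[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the fraction of samples in the bin
    return float(ece)


if __name__ == "__main__":
    # Stub confidence function for demonstration only (not a real model).
    stub_scores = {"pick up the cup": 0.6, "grasp the cup": 0.8}
    stub_fn = lambda obs, instruction: stub_scores[instruction]
    print(prompt_ensemble_confidence(stub_fn, None, list(stub_scores)))
    print(expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0]))
```

In this sketch the ensemble is a plain unweighted mean; a weighted average over paraphrases would be an equally plausible reading of "Bayesian-inspired" and is left open here.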