Abstract
Fusing information from human observations can help robots overcome sensing limitations in collaborative tasks. However, an uncertainty-aware fusion framework requires a grounded likelihood representing the uncertainty of human inputs. This paper presents a Feature Pyramid Likelihood Grounding Network (FP-LGN) that grounds spatial language by learning relevant map image features and their relationships with spatial relation semantics. The model is trained as a probability estimator to capture aleatoric uncertainty in human language using three-stage curriculum learning. Results showed that FP-LGN matched expert-designed rules in mean Negative Log-Likelihood (NLL) and demonstrated greater robustness with lower standard deviation. Collaborative sensing results demonstrated that the grounded likelihood successfully enabled uncertainty-aware fusion of heterogeneous human language observations and robot sensor measurements, achieving significant improvements in human-robot collaborative task performance.