Abstract
This work targets what we consider to be the foundational step for urban airborne robots: a safe landing. Our attention is directed toward what we deem the most crucial aspect of the safe landing perception stack: segmentation. We present a streamlined reactive UAV system that employs visual servoing by harnessing the capabilities of open vocabulary image segmentation. Thanks to its open vocabulary methodology, this approach can adapt to various scenarios with minimal adjustments, bypassing the necessity for extensive data accumulation for refining internal models. Given the limitations imposed by local authorities, our primary focus centers on operations originating from altitudes of 100 meters. This choice is deliberate, as numerous preceding works have dealt with altitudes up to 30 meters, aligning with the capabilities of small stereo cameras. Consequently, we leave the remaining 20 m to be navigated using conventional 3D path planning methods. Utilizing monocular cameras and image segmentation, our findings demonstrate the system's capability to successfully execute landing maneuvers at altitudes as low as 20 meters. However, this approach is vulnerable to intermittent and occasionally abrupt fluctuations in the segmentation between frames in a video stream. To address this challenge, we enhance the image segmentation output by introducing what we call a dynamic focus: a masking mechanism that self-adjusts according to the current landing stage. This dynamic focus guides the control system to avoid regions beyond the drone's safety radius projected onto the ground, thus mitigating the problems with fluctuations. Through the implementation of this supplementary layer, our experiments improved the landing success rate almost tenfold compared to global segmentation. All the source code is open source and available online (github.com/MISTLab/DOVESEI).
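The idea of projecting a safety radius onto the ground and masking the segmentation around it can be illustrated with a minimal sketch. It is not the implementation from the repository: it assumes a downward-facing pinhole camera over flat ground and a circular mask centred on the image. The function names, the field of view, the image size, and the 5 m radius used below are all hypothetical and chosen only for illustration.

```python
import math


def safety_radius_px(altitude_m, radius_m, hfov_deg, image_width_px):
    """Project a ground-plane safety radius into pixels.

    Assumption: nadir-pointing pinhole camera over flat ground, so the
    ground footprint width is 2 * h * tan(hfov / 2).
    """
    if altitude_m <= 0:
        raise ValueError("altitude must be positive")
    footprint_m = 2.0 * altitude_m * math.tan(math.radians(hfov_deg) / 2.0)
    return radius_m * image_width_px / footprint_m


def focus_mask(width, height, radius_px):
    """Boolean mask, True for pixels inside a circle centred on the image.

    Assumption: the circular shape is an illustrative choice, not
    necessarily the mask shape used by the authors.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    r2 = radius_px ** 2
    return [
        [(x - cx) ** 2 + (y - cy) ** 2 <= r2 for x in range(width)]
        for y in range(height)
    ]


def apply_focus(safe_map, mask):
    """Keep 'safe' segmentation labels only where the focus mask is True."""
    return [
        [s and m for s, m in zip(safe_row, mask_row)]
        for safe_row, mask_row in zip(safe_map, mask)
    ]
```

With these hypothetical values (90 degree horizontal field of view, 640 px wide image, 5 m safety radius), the projected radius is about 16 px at 100 m and about 80 px at 20 m. The same ground-plane radius therefore covers a growing share of the image as the drone descends, which is one way a mask could adjust itself to the landing stage.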