Abstract
Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals, each consisting of multiple overlapping leaves, remains challenging. This problem is referred to as a hierarchical segmentation task and typically requires annotated training datasets, which are often species-specific and require notable human labor. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation method for rosette-shaped plant individuals from top-view images. We integrate a foundation segmentation model, which extracts leaf instances, and a vision-language model, which reasons about plants' structures, to extract plant individuals without additional training. Evaluations on datasets with multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at https://github.com/JunhaoXing/ZeroPlantSeg.