Abstract
Recent advancements in Generative AI, particularly in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), offer new possibilities for integrating cognitive planning into robotic systems. In this work, we present a novel framework for solving the object goal navigation problem that generates efficient exploration strategies. Our approach enables a robot to navigate unfamiliar environments by leveraging LLMs and LVLMs to understand the semantic structure of the scene. To address the challenge of representing complex environments without overwhelming the system, we propose a 3D modular scene representation, enriched with semantic descriptions. This representation is dynamically pruned using an LLM-based mechanism, which filters irrelevant information and focuses on task-specific data. By combining these elements, our system generates high-level sub-goals that guide the exploration of the robot toward the target object. We validate our approach in simulated environments, demonstrating its ability to enhance object search efficiency while maintaining scalability in complex settings.