Abstract
In this paper, we present the Optimized Prompt-based Unified System (OPUS), a framework that uses a Large Language Model (LLM) to control Pan-Tilt-Zoom (PTZ) cameras, providing contextual understanding of natural environments. To achieve this goal, OPUS improves cost-effectiveness by generating keywords from a high-level camera control API and transferring knowledge from larger closed-source language models to smaller ones through Supervised Fine-Tuning (SFT) on synthetic data. This enables efficient edge deployment while maintaining performance comparable to larger models like GPT-4. OPUS enhances environmental awareness by converting data from multiple cameras into textual descriptions for language models, eliminating the need for specialized sensory tokens. In benchmark testing, our approach significantly outperformed both traditional language model techniques and more complex prompting methods, achieving a 35% improvement over advanced techniques and 20% higher task accuracy than closed-source models like Gemini Pro. These results demonstrate OPUS's capability to simplify PTZ camera operations through an intuitive natural language interface. This approach eliminates the need for explicit programming and provides a conversational method for interacting with camera systems, representing a significant advancement in how users can control and utilize PTZ camera technology.