Abstract
Deploying conversational voice agents with large language models faces a critical challenge: cloud-based foundation models provide deep reasoning and domain knowledge but introduce latency that disrupts natural conversation, while on-device models respond immediately but lack sophistication. We propose conversational infill, a task where a lightweight on-device model generates contextually appropriate dialogue while seamlessly incorporating streaming knowledge from a powerful backend model. This approach decouples response latency from model capability, enabling systems that feel responsive while accessing the full power of large-scale models. We present ConvFill, a 360M-parameter model trained on synthetic multi-domain conversations. Evaluation across multiple backend models shows that conversational infill can be successfully learned, with ConvFill achieving accuracy improvements of 36-42% over standalone small models of the same size while consistently retaining sub-200ms response latencies. Our results demonstrate the promise of this approach for building on-device conversational agents that are both immediately responsive and knowledgeable.