Abstract
With the rapid development of mobile intelligent assistant technologies, multi-modal AI assistants have become essential interfaces for daily user interactions. However, current evaluation methods suffer from high manual costs, inconsistent standards, and subjective bias. This paper proposes an automated multi-modal evaluation framework based on large language models and multi-agent collaboration. The framework employs a three-tier agent architecture consisting of interaction evaluation agents, semantic verification agents, and experience decision agents. Through supervised fine-tuning of the Qwen3-8B model, we achieve high agreement with human expert evaluations. Experimental results on eight major intelligent agents demonstrate the framework's effectiveness in predicting user satisfaction and identifying generation defects.