Abstract
Large language models (LLMs) have advanced virtual educators and learners, bridging NLP with AI4Education. Existing work often lacks scalability and fails to leverage diverse, large-scale course content, with limited frameworks for assessing pedagogic quality. To this end, we propose WikiHowAgent, a multi-agent workflow leveraging LLMs to simulate interactive teaching-learning conversations. It integrates teacher and learner agents, an interaction manager, and an evaluator to facilitate procedural learning and assess pedagogic quality. We introduce a dataset of 114,296 teacher-learner conversations grounded in 14,287 tutorials across 17 domains and 727 topics. Our evaluation protocol combines computational and rubric-based metrics with human judgment alignment. Results demonstrate the workflow's effectiveness in diverse setups, offering insights into LLM capabilities across domains. Our datasets and implementations are fully open-sourced.