Abstract
Generative AI and large language models hold great promise in enhancingcomputing education by powering next-generation educational technologies forintroductory programming. Recent works have studied these models for differentscenarios relevant to programming education; however, these works are limitedfor several reasons, as they typically consider already outdated models or onlyspecific scenario(s). Consequently, there is a lack of a systematic study thatbenchmarks state-of-the-art models for a comprehensive set of programmingeducation scenarios. In our work, we systematically evaluate two models,ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with humantutors for a variety of scenarios. We evaluate using five introductory Pythonprogramming problems and real-world buggy programs from an online platform, andassess performance using expert-based annotations. Our results show that GPT-4drastically outperforms ChatGPT (based on GPT-3.5) and comes close to humantutors' performance for several scenarios. These results also highlightsettings where GPT-4 still struggles, providing exciting future directions ondeveloping techniques to improve the performance of these models.