Automatically Answering and Generating Machine Learning Final Exams

Abstract

Can a machine learn machine learning? We propose to answer this question using the same criteria we use to answer a similar question: can a human learn machine learning? We automatically answer final exams in MIT's, Harvard's, and Cornell's large machine learning courses and generate new questions at a human level. Recently, program synthesis and few-shot learning solved university-level problem set questions in mathematics and STEM courses at a human level. In this work, we solve questions from final exams that differ from problem sets in several ways: the questions are longer, have multiple parts, are more complicated, and span a broader set of topics. We provide a new dataset and benchmark of questions from machine learning final exams and code for automatically answering these questions and generating new questions. To make our dataset a reproducible benchmark, we use automatic checkers for multiple choice questions, questions with numeric answers, and questions with expression answers, and evaluate a large free language model, Meta's OPT, and compare the results with OpenAI's GPT-3, ChatGPT, and Codex. A student survey comparing the quality, appropriateness, and difficulty of machine-generated questions with human-written questions shows that across multiple aspects, machine-generated questions are indistinguishable from human-generated questions and are suitable for final exams. We perform ablation studies comparing zero-shot learning with few-shot learning, chain-of-thought prompting, GPT-3, ChatGPT, and OPT pre-trained on text and Codex fine-tuned on code on a range of machine learning topics and find that few-shot learning methods perform best. We make our data and code publicly available for the machine learning community.
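
The abstract mentions automatic checkers for multiple choice and numeric answers. As a minimal sketch of what such checkers might look like (not the authors' implementation; the function names `check_multiple_choice` and `check_numeric` and the 1% relative tolerance are assumptions made for illustration), one could write:

```python
import math


def check_multiple_choice(predicted: str, expected: str) -> bool:
    """Compare option labels, ignoring case, whitespace, parentheses, and periods.

    Hypothetical helper: the normalization rules are an assumption.
    """
    def normalize(label: str) -> str:
        return label.strip().strip("().").upper()

    return normalize(predicted) == normalize(expected)


def check_numeric(predicted: str, expected: float, rel_tol: float = 1e-2) -> bool:
    """Accept a numeric answer within a relative tolerance (assumed to be 1%)."""
    try:
        value = float(predicted)
    except ValueError:
        # The model's output could not be parsed as a number.
        return False
    return math.isclose(value, expected, rel_tol=rel_tol, abs_tol=1e-9)
```

Checking expression answers would additionally require symbolic comparison and is left out of this sketch.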
