Auditing an Automatic Grading Model with Deep Reinforcement Learning

Abstract

We explore the use of deep reinforcement learning to audit an automatic short-answer grading (ASAG) model. Automatic grading may decrease the time burden of rating open-ended items for educators, but a lack of robust evaluation methods for these models can result in uncertainty about their quality. Current state-of-the-art ASAG models are configured to match human ratings from a training set, and researchers typically assess their quality with accuracy metrics that signify agreement between model and human scores. In this paper, we show that a high level of agreement with human ratings does not give sufficient evidence that an ASAG model is infallible. We train a reinforcement learning agent to revise student responses with the objective of achieving a high rating from an automatic grading model in the fewest revisions. By analyzing the agent's revised responses that achieve a high grade from the ASAG model but would not be considered high-scoring responses according to a scoring rubric, we discover ways in which the automated grader can be exploited, exposing shortcomings in the grading model.
