Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions

Abstract

The use of large language models (LLMs) in hiring promises to streamline candidate screening, but it also raises serious concerns regarding accuracy and algorithmic bias where sufficient safeguards are not in place. In this work, we benchmark several state-of-the-art foundational LLMs, including models from OpenAI, Anthropic, Google, Meta, and Deepseek, and compare them with our proprietary domain-specific hiring model (Match Score) for job candidate matching. We evaluate each model's predictive accuracy (ROC AUC, Precision-Recall AUC, F1-score) and fairness (impact ratios from cut-off analysis across declared gender, race, and intersectional subgroups). Our experiments on a dataset of roughly 10,000 recent real-world candidate-job pairs show that Match Score outperforms the general-purpose LLMs on accuracy (ROC AUC 0.85 vs 0.77) and achieves significantly more equitable outcomes across demographic groups. Notably, Match Score attains a minimum race-wise impact ratio of 0.957 (near parity), versus 0.809 or lower for the best LLMs; for intersectional subgroups, the corresponding figures are 0.906 and 0.773. We discuss why pretraining biases may cause LLMs with insufficient safeguards to propagate societal biases in hiring scenarios, whereas a bespoke supervised model can more effectively mitigate these biases. Our findings highlight the importance of domain-specific modeling and bias auditing when deploying AI in high-stakes domains such as hiring, and caution against relying on off-the-shelf LLMs for such tasks without extensive fairness safeguards. Furthermore, we show with empirical evidence that there need not be a dichotomy between accuracy and fairness in hiring: a well-designed algorithm can achieve both accuracy in hiring and fairness in outcomes.
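
For readers unfamiliar with the fairness metric named above, the following is a minimal sketch of how an impact ratio at a score cut-off is commonly computed: each group's selection rate divided by the highest group's selection rate. The abstract does not specify the exact cut-off procedure, so the function name `impact_ratios`, the cut-off value, and the toy data are all illustrative assumptions and not the authors' implementation.

```python
from collections import defaultdict


def impact_ratios(scores, groups, cutoff):
    """Return each group's selection rate divided by the highest group's rate.

    Assumption: a candidate counts as "selected" when score >= cutoff.
    """
    selected = defaultdict(int)
    total = defaultdict(int)
    for score, group in zip(scores, groups):
        total[group] += 1
        if score >= cutoff:
            selected[group] += 1

    # Selection rate per group: selected candidates / all candidates in group.
    rates = {g: selected[g] / total[g] for g in total}
    best = max(rates.values())
    if best == 0:
        raise ValueError("no candidate meets the cutoff")
    return {g: rate / best for g, rate in rates.items()}


# Toy data (hypothetical): model scores and a declared group label per candidate.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.6, 0.2, 0.85]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

ratios = impact_ratios(scores, groups, cutoff=0.5)
min_ratio = min(ratios.values())  # the "minimum impact ratio" across groups
print(ratios, min_ratio)
```

A minimum ratio near 1.0 indicates near parity in selection rates across groups; the values of 0.957 and 0.809 quoted above are minimum ratios of this kind.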
