This study examines the use of natural language processing (NLP) models to evaluate whether language patterns used by item writers in a medical licensure exam might contain evidence of biased or stereotypical language. This type of bias in item language choices can be particularly impactful for items in a medical licensure assessment, as it could pose a threat to content validity and to the defensibility of test score validity evidence. To the best of our knowledge, this is the first attempt to use machine learning (ML) and NLP to explore language bias in a large item bank. Using a prediction algorithm trained on clusters of similar item stems, we demonstrate that our approach can be used to review large item banks for potentially biased language or stereotypical patient characteristics in clinical science vignettes. The findings may guide the development of methods to address stereotypical language patterns found in test items and enable efficient updating of those items, if needed, to reflect contemporary norms, thereby improving the evidence supporting the validity of the test scores.