Abstract
Speech disfluency commonly occurs in conversational and spontaneous speech. However, standard Automatic Speech Recognition (ASR) models struggle to accurately recognize these disfluencies because they are typically trained on fluent transcripts. Current research mainly focuses on detecting disfluencies within transcripts, overlooking their exact location and duration in the speech. Additionally, previous work often requires model fine-tuning and addresses limited types of disfluencies. In this work, we present an inference-only approach to augment any ASR model with the ability to detect open-set disfluencies. We first demonstrate that ASR models have difficulty transcribing speech disfluencies. Next, we propose a modification of the Connectionist Temporal Classification (CTC)-based forced alignment algorithm of \cite{kurzinger2020ctc} to predict word-level timestamps while effectively capturing disfluent speech. Additionally, we develop a model to classify alignment gaps between timestamps as either containing disfluent speech or silence. This model achieves an accuracy of 81.62% and an F1-score of 80.07%. We test the augmentation pipeline of alignment gap detection and classification on a disfluent dataset. Our results show that we captured 74.13% of the words that were initially missed by the transcription, demonstrating the potential of this pipeline for downstream tasks.