RepsNet: Combining Vision with Language for Automated Medical Reports

Abstract

Writing reports by analyzing medical images is error-prone for inexperienced practitioners and time-consuming for experienced ones. In this work, we present RepsNet, which adapts pre-trained vision and language models to interpret medical images and generate automated reports in natural language. RepsNet consists of an encoder-decoder model: the encoder aligns the images with natural language descriptions via contrastive learning, while the decoder predicts answers by conditioning on encoded images and prior context of descriptions retrieved by nearest neighbor search. We formulate the problem in a visual question answering setting to handle both categorical and descriptive natural language answers. We perform experiments on two challenging tasks, medical visual question answering (VQA-Rad) and report generation (IU-Xray), on radiology image datasets. Results show that RepsNet outperforms state-of-the-art methods with 81.08% classification accuracy on VQA-Rad 2018 and a BLEU-1 score of 0.58 on IU-Xray. Supplementary details are available at https://sites.google.com/view/repsnet
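
The two mechanisms named in the abstract, contrastive image-text alignment and nearest neighbor retrieval of prior context, can be illustrated with a minimal NumPy sketch. The abstract does not state the exact loss or similarity measure, so the symmetric InfoNCE-style loss, the cosine similarity, the temperature value, and the function names contrastive_loss and retrieve_prior_context are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not from the paper).

    Row i of image_emb and row i of text_emb are treated as a matching
    pair; every other row in the batch acts as a negative.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature          # (batch, batch) similarities
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def retrieve_prior_context(query_emb, text_bank, k=2):
    """Return indices of the k stored descriptions closest to the query.

    Uses cosine similarity (an assumption); text_bank has one row per
    stored description embedding.
    """
    sims = l2_normalize(text_bank) @ l2_normalize(query_emb)
    return np.argsort(-sims)[:k]
```

In this sketch, correctly paired embeddings yield a lower loss than mismatched ones, and the indices returned by the retrieval function would select the descriptions passed to a decoder as prior context.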
