On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation

Abstract

Vision-language models in pathology enable multimodal case retrieval and automated report generation. Many of the models developed so far, however, have been trained on pathology reports that include information which cannot be inferred from paired whole slide images (e.g., patient history), potentially leading to hallucinated sentences in generated reports. To this end, we investigate how the selection of information from pathology reports for vision-language modeling affects the quality of the multimodal representations and generated reports. More concretely, we compare a model trained on full reports against a model trained on preprocessed reports that only include sentences describing the cell and tissue appearances based on the H&E-stained slides. For the experiments, we built upon the BLIP-2 framework and used a cutaneous melanocytic lesion dataset of 42,433 H&E-stained whole slide images and 19,636 corresponding pathology reports. Model performance was assessed using image-to-text and text-to-image retrieval, as well as qualitative evaluation of the generated reports by an expert pathologist. Our results demonstrate that text preprocessing prevents hallucination in report generation. Despite the improvement in the quality of the generated reports, training the vision-language model on full reports showed better cross-modal retrieval performance.
