Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge

Abstract

Large language models (LLMs) are transforming the way information is retrieved, with vast amounts of knowledge being summarized and presented via natural language conversations. Yet LLMs are prone to highlight the most frequently seen pieces of information from the training set and to neglect the rare ones. In the field of biomedical research, the latest discoveries are key to academic and industrial actors, yet they are obscured by the abundance of an ever-increasing literature corpus (the information overload problem). Surfacing new associations between biomedical entities (e.g., drugs, genes, diseases) with LLMs thus becomes a challenge of capturing the long-tail knowledge of biomedical scientific production. Retrieval Augmented Generation (RAG) has been proposed to alleviate some of the shortcomings of LLMs by augmenting the prompts with context retrieved from external datasets. RAG methods typically select the context via maximum similarity search over text embeddings. In this study, we show that RAG methods leave out a significant proportion of relevant information due to clusters of over-represented concepts in the biomedical literature. We introduce a novel information-retrieval method that leverages a knowledge graph to downsample these clusters and mitigate the information overload problem. Its retrieval performance is about twice that of embedding similarity alternatives on both precision and recall. Finally, we demonstrate that embedding similarity and knowledge graph retrieval methods can be advantageously combined into a hybrid model that outperforms both, enabling potential improvements to biomedical question-answering models.
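
To make the retrieval issue concrete, the toy sketch below contrasts plain top-k cosine similarity search with a variant that caps how many results come from the same concept. It is an illustration under stated assumptions, not the method of this paper: the `concept` labels stand in for what a knowledge graph might supply, and the function names, the `per_concept` parameter, and the two-dimensional embeddings are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, docs):
    """Sort (doc_id, embedding, concept) triples by similarity to the query."""
    return sorted(docs, key=lambda d: cosine(query, d[1]), reverse=True)


def top_k_similarity(query, docs, k):
    """Plain maximum-similarity retrieval: keep the k closest documents."""
    return [doc_id for doc_id, _, _ in rank_by_similarity(query, docs)[:k]]


def top_k_downsampled(query, docs, k, per_concept=1):
    """Hypothetical variant: keep at most `per_concept` documents per concept.

    The concept label is an assumption standing in for knowledge-graph
    information; the paper's actual procedure is not reproduced here.
    """
    kept = []
    seen = defaultdict(int)
    for doc_id, _, concept in rank_by_similarity(query, docs):
        if seen[concept] < per_concept:
            kept.append(doc_id)
            seen[concept] += 1
        if len(kept) == k:
            break
    return kept


# Toy corpus: concept "geneA" is over-represented with near-duplicate passages.
query = [1.0, 0.0]
docs = [
    ("a1", [1.0, 0.05], "geneA"),
    ("a2", [1.0, 0.06], "geneA"),
    ("a3", [1.0, 0.07], "geneA"),
    ("b1", [1.0, 0.30], "geneB"),
    ("c1", [1.0, 0.40], "geneC"),
]

plain = top_k_similarity(query, docs, k=3)
diverse = top_k_downsampled(query, docs, k=3)
print(plain)
print(diverse)
```

In this toy corpus the plain search returns only passages about the over-represented concept, while the capped variant also surfaces the rarer ones, which is the kind of long-tail effect the abstract describes.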
