TopClustRAG at SIGIR 2025 LiveRAG Challenge

Abstract

We present TopClustRAG, a retrieval-augmented generation (RAG) system developed for the LiveRAG Challenge, which evaluates end-to-end question answering over large-scale web corpora. Our system employs a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages. Representative passages from each cluster are used to construct cluster-specific prompts for a large language model (LLM), generating intermediate answers that are filtered, reranked, and finally synthesized into a single, comprehensive response. This multi-stage pipeline enhances answer diversity, relevance, and faithfulness to retrieved evidence. Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating the effectiveness of clustering-based context filtering and prompt aggregation in large-scale RAG systems.
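
The clustering and prompt-construction steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the embedding model, the number of clusters, or how representative passages are chosen, so the toy 2-D embeddings, the farthest-first centroid initialization, the choice of the highest-scoring passage per cluster, and the names `kmeans`, `representatives`, and `build_prompts` are all assumptions made for illustration. The LLM calls, answer filtering, reranking, and final synthesis are omitted.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's K-Means; returns one cluster label per vector.

    Assumption: centroids are seeded farthest-first so the toy run is
    deterministic. The abstract does not say how K-Means was initialized.
    """
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors, key=lambda v: min(math.dist(v, c) for c in centroids)
        )
        centroids.append(list(farthest))

    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def representatives(passages, scores, labels, per_cluster=1):
    """Keep the top-scoring passages of each cluster (assumed selection rule)."""
    clusters = {}
    for passage, score, label in zip(passages, scores, labels):
        clusters.setdefault(label, []).append((score, passage))
    return {
        label: [
            p
            for _, p in sorted(items, key=lambda item: item[0], reverse=True)[
                :per_cluster
            ]
        ]
        for label, items in clusters.items()
    }


def build_prompts(question, reps):
    """Build one prompt per cluster from that cluster's representatives."""
    prompts = []
    for label in sorted(reps):
        context = "\n".join(reps[label])
        prompts.append(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return prompts


# Toy data: four retrieved passages with made-up embeddings and scores.
passages = ["passage A1", "passage A2", "passage B1", "passage B2"]
vectors = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
scores = [0.9, 0.4, 0.3, 0.8]  # hypothetical hybrid-retrieval scores

labels = kmeans(vectors, k=2)
reps = representatives(passages, scores, labels)
prompts = build_prompts("What is TopClustRAG?", reps)
```

In this toy run the two well-separated groups end up in different clusters, and each cluster contributes its single highest-scoring passage to its own prompt. In a pipeline like the one described, each prompt would be sent to the LLM to produce an intermediate answer before the filtering, reranking, and synthesis stages.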
