MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System

Abstract

Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper first introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Leveraging this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker towards generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively address the challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
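
The idea of extracting chunks through a list of regular expressions can be illustrated with a minimal sketch. The abstract does not specify the format of the generated expressions, so the sketch assumes each one anchors on the opening and closing words of a chunk and lets a non-greedy wildcard span the middle. The helpers `make_pattern` and `extract_chunks`, and the sample text, are hypothetical and not taken from the paper.

```python
import re


def make_pattern(start, end):
    # Assumed pattern format: literal chunk opening, any text, literal chunk ending.
    return re.escape(start) + r".*?" + re.escape(end)


def extract_chunks(text, patterns):
    """Apply each pattern in order, searching forward from the previous match."""
    chunks = []
    pos = 0
    for pattern in patterns:
        # DOTALL lets the wildcard cross newlines inside a chunk.
        match = re.compile(pattern, re.DOTALL).search(text, pos)
        if match is None:
            continue  # a pattern that does not match is skipped in this sketch
        chunks.append(match.group(0))
        pos = match.end()
    return chunks


text = (
    "RAG retrieves passages before generation. "
    "Retrieval quality depends on how text is split. "
    "Fixed-size splitting ignores meaning. "
    "Semantic splitting uses embeddings instead."
)

# Stand-in for the structured list a chunker might emit (hypothetical values).
patterns = [
    make_pattern("RAG retrieves", "text is split."),
    make_pattern("Fixed-size", "embeddings instead."),
]

chunks = extract_chunks(text, patterns)
for chunk in chunks:
    print(chunk)
```

Under this assumed format, the chunker only has to produce the first and last few words of each chunk rather than the full chunk text, and the chunks are copied verbatim from the original text.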
