Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception

Abstract

Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline, which impacts the quality of knowledge-intensive tasks. This paper introduces the concept of Meta-Chunking, which refers to a granularity between sentences and paragraphs, consisting of a collection of sentences within a paragraph that have deep linguistic logical connections. To implement Meta-Chunking, we designed two strategies based on LLMs: Margin Sampling Chunking and Perplexity Chunking. The former employs LLMs to perform binary classification on whether consecutive sentences need to be segmented, making decisions based on the probability difference obtained from margin sampling. The latter precisely identifies text chunk boundaries by analyzing the characteristics of perplexity distribution. Additionally, considering the inherent complexity of different texts, we propose a strategy that combines Meta-Chunking with dynamic merging to achieve a balance between fine-grained and coarse-grained text chunking. Experiments conducted on eleven datasets demonstrate that Meta-Chunking can more efficiently improve the performance of single-hop and multi-hop question answering based on RAG. For instance, on the 2WikiMultihopQA dataset, it outperforms similarity chunking by 1.32 while only consuming 45.8% of the time. Our code is available at https://github.com/IAAR-Shanghai/Meta-Chunking.
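
As a rough illustration of the Margin Sampling Chunking decision described in the abstract, the following minimal sketch groups sentences given precomputed probabilities. It is not the authors' implementation: the function name `margin_sampling_chunks`, the `(p_split, p_keep)` input layout, and the default `threshold` of 0.0 are all assumptions made for illustration, and the LLM call that would produce the probabilities is left out.

```python
from typing import List, Sequence, Tuple


def margin_sampling_chunks(
    sentences: Sequence[str],
    split_probs: Sequence[Tuple[float, float]],
    threshold: float = 0.0,  # assumed default, not taken from the paper
) -> List[List[str]]:
    """Group sentences into chunks from precomputed binary-decision probabilities.

    split_probs[i] is a hypothetical pair (p_split, p_keep): the probabilities
    an LLM assigns to "segment here" versus "keep together" for the gap between
    sentences[i] and sentences[i + 1].
    """
    if len(split_probs) != max(len(sentences) - 1, 0):
        raise ValueError("need exactly one probability pair per sentence gap")
    if not sentences:
        return []

    chunks: List[List[str]] = [[sentences[0]]]
    for sentence, (p_split, p_keep) in zip(sentences[1:], split_probs):
        # Margin: probability difference between the two candidate answers.
        if p_split - p_keep > threshold:
            chunks.append([sentence])    # start a new chunk at this boundary
        else:
            chunks[-1].append(sentence)  # extend the current chunk
    return chunks
```

Under these assumptions, calling `margin_sampling_chunks(["A.", "B.", "C."], [(0.2, 0.8), (0.9, 0.1)])` would keep the first two sentences together and start a new chunk at the third, because only the second gap has a positive margin.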
