Abstract
Assessing the quality of long-form, model-generated text is challenging, even with advanced LLM-as-a-Judge methods, due to performance degradation as input length increases. To address this issue, we propose a divide-and-conquer approach, which breaks down the comprehensive evaluation task into a series of localized scoring tasks, followed by a final global assessment. This strategy allows for more granular and manageable evaluations, ensuring that each segment of the text is assessed in isolation for both coherence and quality, while also accounting for the overall structure and consistency of the entire piece. Moreover, we introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations. By incorporating human-generated feedback directly into the evaluation process, this method allows the model to better align with human judgment. Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation, thereby reducing annotation costs in practical scenarios. Experimental results show that the proposed evaluation framework outperforms several representative baselines, highlighting the effectiveness of our approach.
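To make the pipeline concrete, the following is a minimal sketch of the local-then-global idea and of uncertainty-based sample selection. It is not the implementation from this work. All names (`Judge`, `split_into_segments`, `local_scores`, `global_score`, `select_for_annotation`) are hypothetical, and the sketch rests on three assumptions: segments are fixed-size word windows, the global assessment is approximated by the mean of the local scores, and uncertainty is the variance across repeated judge calls on the same segment.

```python
from statistics import mean, pvariance
from typing import Callable, List, Tuple

# Hypothetical judge interface: maps a text segment to a numeric score.
# In practice this would wrap an LLM call; here it is any callable.
Judge = Callable[[str], float]


def split_into_segments(text: str, max_words: int = 50) -> List[str]:
    """Split text into consecutive chunks of at most max_words words (assumed scheme)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def local_scores(segments: List[str], judge: Judge, n_samples: int = 3) -> List[Tuple[float, float]]:
    """Score each segment n_samples times and return (mean, variance) per segment."""
    stats = []
    for seg in segments:
        samples = [judge(seg) for _ in range(n_samples)]
        stats.append((mean(samples), pvariance(samples)))
    return stats


def global_score(stats: List[Tuple[float, float]]) -> float:
    """Assumed stand-in for the global assessment: the mean of the local mean scores."""
    return mean([m for m, _ in stats])


def select_for_annotation(stats: List[Tuple[float, float]], k: int) -> List[int]:
    """Return the indices of the k segments with the highest score variance."""
    ranked = sorted(range(len(stats)), key=lambda i: stats[i][1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy deterministic judge (an assumption for illustration): longer segments score higher.
    def toy_judge(segment: str) -> float:
        return float(len(segment.split()))

    segs = split_into_segments("one two three four five", max_words=2)
    stats = local_scores(segs, toy_judge)
    print(global_score(stats), select_for_annotation(stats, k=1))
```

Segments whose repeated scores disagree most are the ones sent to human annotators in this sketch; any other uncertainty measure could be swapped in without changing the surrounding structure.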