Abstract
Recent LLM benchmarks have tested models on a range of phenomena, but are still focused primarily on natural language understanding for extraction of explicit information, such as QA or summarization, with responses often targeting information from individual sentences. We are still lacking more challenging, and importantly also multilingual, benchmarks focusing on implicit information and pragmatic inferences across larger documents in the context of discourse tracking: integrating and aggregating information across sentences, paragraphs and multiple speaker utterances. To this end, we present DiscoTrack, an LLM benchmark targeting a range of tasks across 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations and bridging inference. Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.