Long-context Non-factoid Question Answering in Indic Languages

Abstract

Question Answering (QA) tasks, which involve extracting answers from a given context, are relatively straightforward for modern Large Language Models (LLMs) when the context is short. However, long contexts pose challenges due to the quadratic complexity of the self-attention mechanism. This challenge is compounded in Indic languages, which are often low-resource. This study explores context-shortening techniques, including Open Information Extraction (OIE), coreference resolution, Answer Paragraph Selection (APS), and their combinations, to improve QA performance. Compared to the baseline of unshortened (long) contexts, our experiments on four Indic languages (Hindi, Tamil, Telugu, and Urdu) demonstrate that context-shortening techniques yield an average improvement of 4% in semantic scores and 47% in token-level scores when evaluated on three popular LLMs without fine-tuning. Furthermore, with fine-tuning, we achieve an average increase of 2% in both semantic and token-level scores. Additionally, context-shortening reduces computational overhead. Explainability techniques like LIME and SHAP reveal that when the APS model confidently identifies the paragraph containing the answer, nearly all tokens within the selected text receive high relevance scores. However, the study also highlights the limitations of LLM-based QA systems in addressing non-factoid questions, particularly those requiring reasoning or debate. Moreover, verbalizing OIE-generated triples does not enhance system performance. These findings emphasize the potential of context-shortening techniques to improve the efficiency and effectiveness of LLM-based QA systems, especially for low-resource languages. The source code and resources are available at https://github.com/ritwikmishra/IndicGenQA.
