AI-Facilitated Analysis of Abstracts and Conclusions: Flagging Unsubstantiated Claims and Ambiguous Pronouns

Abstract

We present and evaluate a suite of proof-of-concept (PoC), structured workflow prompts designed to elicit human-like hierarchical reasoning while guiding Large Language Models (LLMs) in the high-level semantic and linguistic analysis of scholarly manuscripts. The prompts target two non-trivial analytical tasks within academic summaries (abstracts and conclusions): identifying unsubstantiated claims (informational integrity) and flagging semantically confusing ambiguous pronoun references (linguistic clarity). We conducted a systematic, multi-run evaluation on two frontier models (Gemini Pro 2.5 Pro and ChatGPT Plus o3) under varied context conditions. Our results for the informational integrity task reveal a significant divergence in model performance: while both models successfully identified an unsubstantiated head of a noun phrase (95% success), ChatGPT consistently failed (0% success) to identify an unsubstantiated adjectival modifier that Gemini correctly flagged (95% success), raising a question regarding the potential influence of the target's syntactic role. For the linguistic analysis task, both models performed well (80-90% success) with full manuscript context. Surprisingly, in a summary-only setting, Gemini's performance was substantially degraded, while ChatGPT achieved a perfect (100%) success rate. Our findings suggest that while structured prompting is a viable methodology for complex textual analysis, prompt performance may be highly dependent on the interplay between the model, task type, and context, highlighting the need for rigorous, model-specific testing.
