Abstract
Large language models (LLMs) have emerged as a potential solution to automate the complex processes involved in writing literature reviews, such as literature collection, organization, and summarization. However, it is yet unclear how good LLMs are at automating comprehensive and reliable literature reviews. This study introduces a framework to automatically evaluate the performance of LLMs in three key tasks of literature writing: reference generation, literature summary, and literature review composition. We introduce multidimensional evaluation metrics that assess the hallucination rates in generated references and measure the semantic coverage and factual consistency of the literature summaries and compositions against human-written counterparts. The experimental results reveal that even the most advanced models still generate hallucinated references, despite recent progress. Moreover, we observe that the performance of different models varies across disciplines when it comes to writing literature reviews. These findings highlight the need for further research and development to improve the reliability of LLMs in automating academic literature reviews.