ChatGPT Prompting Cannot Estimate Predictive Uncertainty in High-Resource Languages

Abstract

ChatGPT took the world by storm with its impressive abilities. Because it was released without documentation, scientists immediately attempted to identify its limits, mainly through its performance in natural language processing (NLP) tasks. This paper aims to join the growing literature on ChatGPT's abilities by focusing on its performance in high-resource languages and on its capacity to predict the accuracy of its answers by giving a confidence level. The analysis of high-resource languages is of interest because studies have shown that ChatGPT performs worse in low-resource languages than in English in NLP tasks, but no study so far has analysed whether it performs as well in other high-resource languages as it does in English. ChatGPT's confidence calibration has not been analysed before either, and it is critical for assessing ChatGPT's trustworthiness. To study these two aspects, five high-resource languages and two NLP tasks were chosen. ChatGPT was asked to perform both tasks in the five languages and to give a numerical confidence value for each answer. The results show that performance is similar across all the selected high-resource languages and that ChatGPT is not well calibrated: it is often overconfident and never gives low confidence values.
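To make the notion of confidence calibration concrete, the sketch below computes the expected calibration error (ECE), a common summary of the gap between stated confidence and observed accuracy. It is an illustration only: the abstract does not say which calibration measure the paper uses, the function name `expected_calibration_error` is hypothetical, confidences are assumed to lie in [0, 1], and the sample data are invented rather than taken from the paper.

```python
from typing import List, Tuple


def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> float:
    """Weighted mean gap between stated confidence and observed accuracy.

    Assumes each confidence is a probability in [0, 1] (an assumption;
    the abstract only says the confidence values are numerical).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one answer")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Group answers into equal-width confidence bins.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to the share of answers it holds.
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Invented example data, not results from the paper.
stated = [0.9, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8, 0.8]
was_correct = [True, True, False, False, True, True, True, False]
print(expected_calibration_error(stated, was_correct))
```

A value of 0 means that, within every bin, the stated confidence matches the fraction of correct answers. An overconfident model of the kind the abstract describes would show mean confidence above accuracy in the bins it actually uses, with the low-confidence bins left empty.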
