Abstract
Conversational assistants are becoming more and more popular, including in healthcare, partly because of the availability and capabilities of Large Language Models. There is a need for controlled, probing evaluations with real stakeholders which can highlight advantages and disadvantages of more traditional architectures and those based on generative AI. We present a within-group user study to compare two versions of a conversational assistant that allows heart failure patients to ask about salt content in food. One version of the system was developed in-house with a neurosymbolic architecture, and one is based on ChatGPT. The evaluation shows that the in-house system is more accurate, completes more tasks and is less verbose than the one based on ChatGPT; on the other hand, the one based on ChatGPT makes fewer speech errors and requires fewer clarifications to complete the task. Patients show no preference for one over the other.