GPT Sonography: Hand Gesture Decoding from Forearm Ultrasound Images via VLM

Abstract

Large vision-language models (LVLMs), such as the Generative Pre-trained Transformer 4-omni (GPT-4o), are emerging multi-modal foundation models which have great potential as powerful artificial-intelligence (AI) assistance tools for a myriad of applications, including healthcare, industrial, and academic sectors. Although such foundation models perform well in a wide range of general tasks, their capability without fine-tuning is often limited in specialized tasks. However, full fine-tuning of large foundation models is challenging due to enormous computation/memory/dataset requirements. We show that GPT-4o can decode hand gestures from forearm ultrasound data even with no fine-tuning, and improves with few-shot, in-context learning.
