Abstract
Humans can develop internal world models that encode common sense knowledge, telling them how the world works and predicting the consequences of their actions. This concept has emerged as a promising direction for establishing general-purpose machine-learning models in recent preliminary works, e.g., for visual representation learning. In this paper, we present CheXWorld, the first effort towards a self-supervised world model for radiographic images. Specifically, our work develops a unified framework that simultaneously models three aspects of medical knowledge essential for qualified radiologists, including 1) local anatomical structures describing the fine-grained characteristics of local tissues (e.g., architectures, shapes, and textures); 2) global anatomical layouts describing the global organization of the human body (e.g., layouts of organs and skeletons); and 3) domain variations that encourage CheXWorld to model the transitions across different appearance domains of radiographs (e.g., varying clarity, contrast, and exposure caused by collecting radiographs from different hospitals, devices, or patients). Empirically, we design tailored qualitative and quantitative analyses, revealing that CheXWorld successfully captures these three dimensions of medical knowledge. Furthermore, transfer learning experiments across eight medical image classification and segmentation benchmarks showcase that CheXWorld significantly outperforms existing SSL methods and large-scale medical foundation models. Code & pre-trained models are available at https://github.com/LeapLabTHU/CheXWorld.