While pretraining on large-scale image-text data from the Web has facilitated rapid progress on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained models lack "fine-grained" understanding, such as the ability to recognise relationships, verbs, and numbers in images. This has resulted in an increased interest in the community to either develop new benchmarks or models for such capabilities. To better understand and quantify progress in this direction, we investigate four competitive V&L models on four fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al., 2022) consistently outperforms other baselines, and that modelling innovations can impact performance more than scaling Web data, which even degrades performance sometimes. Through a deeper investigation of X-VLM, we highlight the importance of both novel losses and rich data sources for learning fine-grained skills. Finally, we inspect training dynamics, and discover that for some tasks, performance peaks early in training or significantly fluctuates, never converging.