Computer musicians refer to mesostructures as the intermediate levels of articulation between the microstructure of waveshapes and the macrostructure of musical forms. Examples of mesostructures include melody, arpeggios, syncopation, polyphonic grouping, and textural contrast. Despite their central role in musical expression, they have received limited attention in deep learning. Currently, autoencoders and neural audio synthesizers are only trained and evaluated at the scale of microstructure: i.e., local amplitude variations up to 100 milliseconds or so. In this paper, we formulate and address the problem of mesostructural audio modeling via a composition of a differentiable arpeggiator and time--frequency scattering. We empirically demonstrate that time--frequency scattering serves as a differentiable model of similarity between synthesis parameters that govern mesostructure. By exposing the sensitivity of short-time spectral distances to time alignment, we motivate the need for a time-invariant and multiscale differentiable time--frequency model of similarity at the level of both local spectra and spectrotemporal modulations.
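
The sensitivity of short-time spectral distances to time alignment can be illustrated with a minimal NumPy sketch. It is a toy illustration, not the method of this paper: the sample rate, note length, tone frequencies, and the helper names `arpeggio`, `spectrogram`, and `rel_dist` are all assumptions made for the example, and the time-averaged spectrum is only a crude stand-in for a time-invariant representation, not time--frequency scattering.

```python
import numpy as np

# All constants below are illustrative assumptions, not values from the paper.
SR = 8000          # sample rate in Hz
NOTE_LEN = 1024    # samples per note
WIN, HOP = 256, 128


def arpeggio(freqs):
    """Concatenate unit-amplitude sine tones, one per frequency in Hz."""
    t = np.arange(NOTE_LEN) / SR
    return np.concatenate([np.sin(2 * np.pi * f * t) for f in freqs])


def spectrogram(x):
    """Magnitude STFT with a Hann window, shape (frames, bins)."""
    n_frames = 1 + (len(x) - WIN) // HOP
    frames = np.stack([x[i * HOP : i * HOP + WIN] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * np.hanning(WIN), axis=1))


def rel_dist(a, b):
    """Euclidean distance between a and b, relative to the norm of a."""
    return np.linalg.norm(a - b) / np.linalg.norm(a)


pattern = [500, 1000, 1500, 2000] * 2
x = arpeggio(pattern)
x_shift = np.roll(x, NOTE_LEN)    # same arpeggio, circularly delayed by one note
x_rev = arpeggio(pattern[::-1])   # descending instead of ascending

S, S_shift, S_rev = spectrogram(x), spectrogram(x_shift), spectrogram(x_rev)

# Frame-by-frame spectrogram distance: depends on time alignment.
d_spec_shift = rel_dist(S, S_shift)
d_spec_rev = rel_dist(S, S_rev)

# Distance between time-averaged spectra: insensitive to the delay,
# but also blind to the order of the notes.
d_avg_shift = rel_dist(S.mean(axis=0), S_shift.mean(axis=0))
d_avg_rev = rel_dist(S.mean(axis=0), S_rev.mean(axis=0))

print("delay:    spectrogram", d_spec_shift, "time-averaged", d_avg_shift)
print("reversal: spectrogram", d_spec_rev, "time-averaged", d_avg_rev)
```

In this toy setting, the frame-by-frame distance is large even when the only change is a delay, whereas the time-averaged distance is small for both the delayed and the reversed arpeggio. Neither measure is adequate alone: the first penalizes misalignment, and the second discards the ordering of notes that defines the mesostructure.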