Human infants are able to acquire natural language seemingly easily at an early age. Their language learning seems to occur simultaneously with learning other cognitive functions as well as with playful interactions with the environment and caregivers. From a neuroscientific perspective, natural language is embodied, grounded in most, if not all, sensory and sensorimotor modalities, and acquired by means of crossmodal integration. However, characterising the underlying mechanisms in the brain is difficult and explaining the grounding of language in crossmodal perception and action remains challenging. In this paper, we present a neurocognitive model for language grounding which reflects bio-inspired mechanisms such as an implicit adaptation of timescales as well as end-to-end multimodal abstraction. It addresses developmental robotic interaction and extends its learning capabilities using larger-scale knowledge-based data. In our scenario, we utilise the humanoid robot NICO in obtaining the EMIL data collection, in which the cognitive robot interacts with objects in a children's playground environment while receiving linguistic labels from a caregiver. The model analysis shows that crossmodally integrated representations are sufficient for acquiring language merely from sensory input through interaction with objects in an environment. The representations self-organise hierarchically and embed temporal and spatial information through composition and decomposition. This model can also provide the basis for further crossmodal integration of perceptually grounded cognitive representations.