We investigate the behaviour of attention in neural models of visually grounded speech trained on two languages: English and Japanese. Experimental results show that attention focuses on nouns and that this behaviour holds true for two typologically very different languages. We also draw parallels between artificial neural attention and human attention and show that neural attention focuses on word endings, as has been theorised for human attention. Finally, we investigate how two visually grounded monolingual models can be used to perform cross-lingual speech-to-speech retrieval. For both languages, the enriched bilingual (speech-image) corpora with part-of-speech tags and forced alignments are distributed to the community for reproducible research.