An estimated half of the world's languages do not have a written form, making it impossible for these languages to benefit from any existing text-based technologies. In this paper, a speech-to-image generation (S2IG) framework is proposed which translates speech descriptions to photo-realistic images without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed S2IG framework, named S2IGAN, consists of a speech embedding network (SEN) and a relation-supervised densely-stacked generative model (RDG). SEN learns the speech embedding with the supervision of the corresponding visual information. Conditioned on the speech embedding produced by SEN, the proposed RDG synthesizes images that are semantically consistent with the corresponding speech descriptions. Extensive experiments on two public benchmark datasets, CUB and Oxford-102, demonstrate the effectiveness of the proposed S2IGAN in synthesizing high-quality and semantically consistent images from the speech signal, yielding good performance and a solid baseline for the S2IG task.