Abstract
Computational models of syntax are predominantly text-based. Here we propose that the most basic first step in the evolution of syntax can be modeled directly from raw speech in a fully unsupervised way. We focus on one of the most ubiquitous and elementary suboperations of syntax: concatenation. We introduce spontaneous concatenation: a phenomenon where convolutional neural networks (CNNs) trained on acoustic recordings of individual words start generating outputs with two or even three words concatenated, without ever accessing data with multiple words in the input. We replicate this finding in several independently trained models with different hyperparameters and training data. Additionally, networks trained on two words learn to embed words into novel unobserved word combinations. We also show that the concatenated outputs contain precursors to compositionality. To our knowledge, this is a previously unreported property of CNNs trained in the ciwGAN/fiwGAN setting on raw speech, and it has implications both for our understanding of how these architectures learn and for modeling syntax and its evolution in the brain from raw acoustic inputs. We also propose a potential neural mechanism called disinhibition that outlines a possible neural pathway towards concatenation and compositionality, and that suggests our modeling is useful for generating testable predictions for biological and artificial neural processing of speech.