Abstract
Learning in the brain is local and unsupervised (Hebbian). We derive the foundations of an effective human language model inspired by these microscopic constraints. It has two parts: (1) a hierarchy of neurons which learns to tokenize words from text (whichiswhatyoudowhenyoureadthis); and (2) additional neurons which bind the learned semanticless patterns of the tokenizer into a semanticful token (an embedding). The model permits continuous parallel learning without forgetting, and is a powerful tokenizer which performs a renormalization group transformation. This allows it to exploit redundancy, such that it generates tokens which are always decomposable into a basis set (e.g., an alphabet), and can mix features learned from multiple languages. We find that the structure of this model allows it to learn a natural language morphology WITHOUT data. The language data generated by this model predicts the correct distribution of word-forming patterns observed in real languages, and further demonstrates why, microscopically, human speech is broken up into words. This model provides the basis for understanding the microscopic origins of language and human creativity.