Ab Antiquo: Proto-language Reconstruction with RNNs

  • 2019-08-07 08:03:08
  • Carlo Meloni, Shauli Ravfogel, Yoav Goldberg

Abstract

Historical linguists have identified regularities in the process of historic sound change. The comparative method utilizes those regularities to reconstruct proto-words based on observed forms in daughter languages. Can this process be efficiently automated? We address the task of proto-word reconstruction, in which the model is exposed to cognates in contemporary daughter languages, and has to predict the proto-word in the ancestor language. We provide a novel dataset for this task, encompassing over 8,000 comparative entries, and show that neural sequence models outperform conventional methods applied to this task so far. Error analysis reveals variability in the ability of the neural model to capture different phonological changes, correlating with the complexity of the changes. Analysis of learned embeddings reveals that the models learn phonologically meaningful generalizations, corresponding to well-attested phonological shifts documented by historical linguistics.

 


Appendix

Dataset Creation

In order to perform the reconstruction task, we required a large dataset of cognates and their proto-words, in both orthographic and phonetic (IPA) forms.

Despite growing interest in recent years, high-quality digital resources for the tasks of proto-word reconstruction and cognate detection are scarce. Our departure point is the dataset provided by [ciobanu-dataset], which, to the best of our knowledge, is the most extensive dataset for proto-word reconstruction of a well-attested proto-language. The dataset contains 3,218 complete cognate sets in five Romance languages (Spanish, Italian, Portuguese, French, Romanian) together with their Latin etymological ancestor. Although it is a valuable resource, this dataset was constructed via an automatic method of cognate extraction, and a comparison with references on the development of Romance languages [LatinToRomance, HistoricalIntroduction] reveals several problems: false cognates, truncated forms, non-existent words, and mismatches between the part of speech of the cognates and that of the ancestor. Another salient problem concerns the grammatical case of Latin nouns: Romance languages derived their words from the Latin accusative case [RomanceLanguages], whereas the dataset lists Latin words in the nominative case, an inconsistency that makes the reconstruction inherently more challenging.

Lastly, as neural models often require large amounts of training data, we aimed to expand the dataset. We thus created a cleaned and extended dataset by scraping Wiktionary, followed by manual validation and cleaning.

Wiktionary scraping

We augment the existing dataset with a freely available resource: Wiktionary. Wiktionary entries for Latin words usually contain inflection tables, and often list the descendants in Romance languages; these descendants are, by definition, cognates. We scraped all Latin entries from Wiktionary and extracted the forms of the daughter languages (available in the “Descendants” section). This resulted in 5,598 additional comparative entries, for a total of 22,361 new individual words. Unlike the previous dataset, the Wiktionary-derived cognates are not based on automatic alignment between translations, but rather on direct human annotation. On the other hand, the Wiktionary-based entries are often incomplete, and include cognates in only a subset of the daughter languages.
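The extraction of daughter forms from a "Descendants" section might look roughly like the sketch below. The wikitext fragment, the `{{desc|<code>|<word>}}` pattern it matches, and the name `extract_descendants` are assumptions made for illustration; the scraper actually used and the exact markup of each entry may differ.

```python
import re

# Language codes for the five daughter languages in the dataset.
DAUGHTERS = {"it": "Italian", "es": "Spanish", "pt": "Portuguese",
             "fr": "French", "ro": "Romanian"}

# Assumed markup: descendants are listed as {{desc|<lang code>|<word>}}.
DESC_PATTERN = re.compile(r"\{\{desc\|([a-z]+)\|([^|}]+)")

def extract_descendants(wikitext):
    """Return {language: word} for the daughter languages found in an entry."""
    found = {}
    for code, word in DESC_PATTERN.findall(wikitext):
        language = DAUGHTERS.get(code)
        # Keep only the five target languages, first form per language.
        if language is not None and language not in found:
            found[language] = word.strip()
    return found

# Hypothetical entry fragment; Romanian is absent, which illustrates
# an incomplete cognate set, and Catalan (ca) is ignored.
sample = """
====Descendants====
* {{desc|it|rosa}}
* {{desc|es|rosa}}
* {{desc|pt|rosa}}
* {{desc|fr|rose}}
* {{desc|ca|rosa}}
"""
cognates = extract_descendants(sample)
```

Entries that yield fewer than five languages, as here, correspond to the incomplete cognate sets mentioned above.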

Form normalization

Using the Wiktionary-provided inflection tables, we declined the Latin nouns to the accusative case and conjugated verbs to the infinitive form. We did this both for the Wiktionary-based entries and for those in the original dataset. To check the accuracy of the automatic inflection, we compared a sample of around 100 Latin words against [Gaffiot] and found them all correct. Finally, Latin words in the [ciobanu-dataset] dataset for which we did not find a Wiktionary entry were inflected manually by consulting [Charlton, Gaffiot].
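As a minimal sketch of this normalization step, assuming each scraped entry is stored as a dictionary with a part of speech and an inflection table: the field names (`pos`, `forms`, `accusative_singular`, `present_active_infinitive`) and the choice of the singular are hypothetical, as is the fallback for other parts of speech, which the text does not describe.

```python
def normalize_latin(entry):
    """Pick the Latin form used as the reconstruction target."""
    forms = entry["forms"]
    if entry["pos"] == "noun":
        # Nouns: accusative case (singular assumed here).
        return forms["accusative_singular"]
    if entry["pos"] == "verb":
        # Verbs: infinitive form.
        return forms["present_active_infinitive"]
    # Assumption: other parts of speech keep their lemma.
    return entry["lemma"]

# Hypothetical entries built from inflection tables.
noun = {"lemma": "rosa", "pos": "noun",
        "forms": {"nominative_singular": "rosa",
                  "accusative_singular": "rosam"}}
verb = {"lemma": "amo", "pos": "verb",
        "forms": {"present_active_infinitive": "amare"}}

targets = [normalize_latin(noun), normalize_latin(verb)]
```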

Manual verification and cleaning

After collecting the Wiktionary dataset, we manually went through all the Latin words contained in [ciobanu-dataset], checking them against [Charlton, Gaffiot]. Additionally, we went over some suspicious-looking words from the daughter languages and verified them against [etymologicalDict] to ensure their etymological relatedness to the Latin source, fixing them where necessary. This sort of fix was not performed systematically, but we did fix or remove around 170 words.

Finally, we sampled 300 entries from the original [ciobanu-dataset] dataset prior to cleaning and 300 words from our cleaned and unified version of the dataset, and manually verified them. We found 43 mistakes in the original dataset and only 4 in our version, indicating that, while still not perfect, it is of substantially higher quality.
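Expressed as error rates over the two samples of 300, the counts reported above work out as follows. This is plain arithmetic on the stated numbers; the variable names are illustrative.

```python
SAMPLE_SIZE = 300
mistakes = {"original": 43, "cleaned": 4}

# Fraction of sampled items found to be mistaken in each version.
error_rate = {name: count / SAMPLE_SIZE for name, count in mistakes.items()}

for name, rate in error_rate.items():
    print(f"{name}: {rate:.1%}")
# original: 14.3%
# cleaned: 1.3%
```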

IPA transcription

To obtain the phonetic transcriptions in IPA, we used the transcription module of the eSpeak library, which offers transcriptions for all languages in our dataset, including Latin. While a human transcription would be preferable, a manual evaluation of 200 of the resulting transcriptions against several sources [LtOrtho, ItnOrtho, FrOrtho, SpOrtho, PtOrtho, RmOrtho] shows high accuracy: all 200 words were correct, except for minor systematic deviations, which we fixed globally to bring the transcriptions in line with phonological conventions. Specifically, we deleted the vowel symbols <ʊ> and <ɪ> in Italian and Romanian, which turned out to be foreign to those languages, changed the sequence <ɾɾ> to <r> in Spanish, and regularized the Portuguese transcriptions, which showed some phonological traits of Brazilian Portuguese.
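The global fixes for Italian, Romanian and Spanish can be sketched as simple string substitutions, assuming the symbols in question are ʊ, ɪ and ɾɾ in Unicode IPA. The function name and the example strings are hypothetical, and the Portuguese regularization is left out because its rules are not specified in the text.

```python
def clean_transcription(ipa, language):
    """Apply the language-specific global fixes to one transcription."""
    if language in ("Italian", "Romanian"):
        # Delete the vowel symbols that do not belong to these languages.
        ipa = ipa.replace("ʊ", "").replace("ɪ", "")
    elif language == "Spanish":
        # Collapse a doubled tap into a single trill symbol.
        ipa = ipa.replace("ɾɾ", "r")
    # Other languages (including Portuguese) pass through unchanged here.
    return ipa

# Synthetic strings, chosen only to exercise each rule.
examples = [
    clean_transcription("peɾɾo", "Spanish"),
    clean_transcription("kaɪʊza", "Italian"),
    clean_transcription("pɪp", "French"),
]
```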

Final dataset

The resulting dataset, used for all experiments in this work, contains 8,799 entries. The dataset was randomly split into training, evaluation and test sets, with 7,038 examples (80%) used for training, 703 (8%) for evaluation and 1,055 (12%) for testing.
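A random 80/8/12 split of this kind can be sketched as follows. The function name, the fixed seed and the integer rounding are assumptions, so this sketch is not guaranteed to reproduce the exact set sizes reported above.

```python
import random

def split_dataset(entries, train_pct=80, dev_pct=8, seed=0):
    """Shuffle the entries and cut them into train / evaluation / test."""
    entries = list(entries)
    random.Random(seed).shuffle(entries)  # seeded for reproducibility
    n_train = len(entries) * train_pct // 100
    n_dev = len(entries) * dev_pct // 100
    train = entries[:n_train]
    dev = entries[n_train:n_train + n_dev]
    test = entries[n_train + n_dev:]      # the remainder (about 12%)
    return train, dev, test

# Toy usage with 100 placeholder entries.
train, dev, test = split_dataset(range(100))
```

Splitting at the level of whole comparative entries keeps all cognates of one proto-word in the same partition.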

Overall, the dataset contains 41,563 distinct words across the different languages (for a total of 83,126 words counting both the orthographic and the phonetic datasets), with 7,384 Italian words, 7,183 Spanish words, 6,806 Portuguese words, 6,505 French words and 4,886 Romanian words. As vowel lengths were found to be difficult to recover, we created the following variations of the dataset: with and without vowel length (for both the orthographic and phonetic datasets), and without a contrast (for the phonetic dataset).
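The word counts above can be cross-checked with a few lines of arithmetic. Reading the remainder as the Latin words, one per entry, is an inference from the numbers rather than something the text states.

```python
per_language = {"Italian": 7384, "Spanish": 7183, "Portuguese": 6806,
                "French": 6505, "Romanian": 4886}
total_distinct = 41563

daughter_total = sum(per_language.values())  # words in the five daughter languages
remaining = total_distinct - daughter_total  # words not accounted for by them
both_datasets = 2 * total_distinct           # orthographic + phonetic versions

print(daughter_total, remaining, both_datasets)
# 32764 8799 83126
```

The remainder of 8,799 equals the number of entries in the dataset, which is consistent with one Latin proto-word per entry.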

Phoneme representations

Figure: hierarchical clustering of the phoneme representations for all languages in the dataset (image file clusts-ward-all.jpeg).

In this appendix, we show the hierarchical clustering created for all the languages in our dataset. As can be seen in the figure above, the results for the different languages exhibit representations similar to those found in the French clustering: the primary division in each language is between vowels and consonants. In Portuguese, Latin, Spanish and Romanian, some consonants are grouped together with vowels. These consonants are restricted to nasals, liquids and glides. Their inclusion can be explained by their special phonological status: all of them display similarities to vowels in their behavior. In all languages, phonologically related phonemes tend to be grouped under the same nodes. Among other things, glides are found either together with each other (as in French, Italian and Romanian) or with their vocalic counterparts (Latin, Spanish and Portuguese), consonants differing only in voicing are usually paired ([ʃ] and [ʒ]), front and back vowels form clusters, and allophones usually share the same node (Italian, French, Romanian, Spanish, Portuguese).
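To illustrate how such a hierarchy is built, here is a naive agglomerative clustering sketch using Ward's merge criterion. The choice of Ward linkage is an assumption suggested only by the figure's file name (clusts-ward-all.jpeg), and the two-dimensional vectors are invented stand-ins for the learned phoneme representations.

```python
from itertools import combinations

def ward_cost(a, b):
    """Increase in within-cluster variance if clusters a and b are merged."""
    (na, ca), (nb, cb) = a, b
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * dist2

def ward_clustering(vectors):
    """Return the sequence of merges as (members_a, members_b) tuples."""
    # Each cluster is keyed by its members and stores (size, centroid).
    clusters = {(name,): (1, tuple(vec)) for name, vec in vectors.items()}
    merges = []
    while len(clusters) > 1:
        # Pick the pair whose merge increases the variance the least.
        a, b = min(combinations(clusters, 2),
                   key=lambda p: ward_cost(clusters[p[0]], clusters[p[1]]))
        (na, ca), (nb, cb) = clusters.pop(a), clusters.pop(b)
        centroid = tuple((na * x + nb * y) / (na + nb)
                         for x, y in zip(ca, cb))
        clusters[a + b] = (na + nb, centroid)
        merges.append((a, b))
    return merges

# Hypothetical 2-D "embeddings": two vowels and two consonants.
toy = {"a": (1.0, 0.0), "e": (0.9, 0.1), "p": (0.0, 1.0), "b": (0.2, 0.9)}
merges = ward_clustering(toy)
```

On this toy input the two vowels merge first, then the two consonants, and the final merge joins the vowel and consonant groups, mirroring the primary vowel/consonant division described above.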

Rules of Phonetic Change

Latin phoneme (context)        Romanian  French  Italian  Spanish  Portuguese  Latin  Reconstruction  Correct
-------------------------------------------------------------------------------------------------------------
/e/ blocked syllable           pep       pep     pep      pep      pep         pep    pɪp             no
/o/ blocked syllable           pop       pup     pop      pop      pop         pop    pʊp             no
/ɛ/ blocked syllable           pjep      pɛp     pɛp      pjep     pɛp         pɛp    pep             no
/kt/ medially, before nasals   -         anta    anta     anta     anta        ankta  antam           no
/aɪ/                           pe        pe      pe       pe       pe          paɪ    pɛm             no
/ɔɪ/                           pe        pe      pe       pe       pe          pɔɪ    pɛm             no
/b/ intervocalic               aa        ava     ava      aβa      ava         aba    awam            no
/e/ free syllable              pe        pwa     pe       pe       pe          pe     pɛm             no
/o/ free syllable              po        pø      po       po       po          po     pʊm             no
/ɪ/ free syllable              pe        pwa     pe       pe       pe          pɪ     pɛm             no
/n/ before front vowels        ji        ɲi      ɲi       ɲi       ɲi          ni     ŋidɛm           no
/a/ before nasal               pɨn       pan     pan      pan      pan         pan    pan             yes
/a/ blocked syllable           pap       pap     pap      pap      pap         pap    pap             yes
/i/                            pi        pi      pi       pi       pi          pi     pi              yes
/u/                            pu        py      pu       pu       pu          pu     pu              yes
/ɪ/ blocked syllable           pep       pep     pep      pep      pep         pɪp    pɪp             yes
/ʊ/ blocked syllable           pup       pup     pop      pop      pop         pʊp    pʊp             yes
/ɔ/ blocked syllable           pop       pɔp     pɔp      pwep     pɔp         pɔp    pɔp             yes
/k/ before front vowels        tʃi       si      tʃi      θi       si          ki     ki              yes
/sk/ before front vowels       ʃti       si      ʃi       θi       ʃi          ski    ski             yes
/kt/ medially, elsewhere       apta      ata     atta     atʃa     ata         akta   aktam           yes
/aʊ/                           pau       pɔ      pɔ       po       po          paʊ    paʊm            yes
/pl/ word initial              pla       pla     pja      ʎa       ʃa          pla    plam            yes
/a/ free syllable              pa        pa      pa       pa       pa          pa     pam             yes
/ɛ/ free syllable              pje       pje     pje      pje      pɛ          pɛ     pɛm             yes
/w/                            va        va      va       ba       va          wa     wam             yes
/b/ word initial               ba        ba      ba       ba       ba          ba     bam             yes
/j/ word initial               ʒa        ʒa      dʒa      xa       ʒa          ja     jam             yes
/f/ word initial               fa        fa      fa       a        fa          fa     fam             yes
/f/ elsewhere                  afa       afa     afa      afa      afa         afa    affam           yes
/ʊ/ free syllable              pu        pø      po       po       po          pʊ     pʊpʊm           yes
/ɔ/ free syllable              po        pø      pwɔ      pwe      pɔ          pɔ     pɔdɛm           yes
/l/ before front vowels        ji        ji      ʎi       xi       ʎi          li     gɪlʊm           yes

Table: rules of phonetic change used as test inputs, with the model's Latin reconstruction for each.

The table above displays the set of test phonemes used to evaluate the model’s generalizations. Each row represents a distinct rule of phonetic change, which focuses on a single phoneme. The phoneme in question is bolded, and other consonants and vowels are added to simulate the phonological environment of the rule. The added consonants and vowels were chosen because they did not affect the evolution of the examined phonemes from Latin to the Romance languages.
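One reading of the "Correct" column, which the text does not spell out, is that a rule counts as learned when the target Latin phoneme appears in the model's reconstruction, regardless of any extra material such as a final -m. The sketch below applies that assumed criterion to a handful of rows from the table, written in Unicode IPA; the function name is hypothetical.

```python
def phoneme_recovered(phoneme, reconstruction):
    """Assumed criterion: the target phoneme occurs in the reconstruction."""
    return phoneme in reconstruction

# (target phoneme, model reconstruction, label reported in the table)
ROWS = [
    ("e", "pɪp", "no"),       # /e/ blocked syllable
    ("kt", "antam", "no"),    # /kt/ medially, before nasals
    ("n", "ŋidɛm", "no"),     # /n/ before front vowels
    ("a", "pam", "yes"),      # /a/ free syllable
    ("kt", "aktam", "yes"),   # /kt/ medially, elsewhere
    ("l", "gɪlʊm", "yes"),    # /l/ before front vowels
]

predicted = ["yes" if phoneme_recovered(p, r) else "no" for p, r, _ in ROWS]
agrees = [pred == label for pred, (_, _, label) in zip(predicted, ROWS)]
```

For these rows the substring check agrees with the reported labels, including cases such as pam for Latin pa, where the reconstruction is not an exact match but still contains the examined phoneme.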