Sampling the Swadesh List to Identify Similar Languages with Tree Spaces

Abstract

Communication plays a vital role in human interaction. Studying language is a worthwhile task and has recently become more quantitative in nature with the development of fields like quantitative comparative linguistics and lexicostatistics. With respect to the authors' own native languages, the ancestry of the English language and the Latin alphabet are of primary interest. The Indo-European tree traces many modern languages back to the Proto-Indo-European root. Swadesh's cognates played a large role in developing that historical perspective, where some of the primary branches are Germanic, Celtic, Italic, and Balto-Slavic. This paper uses data analysis on open books, where the simplest singular space is the 3-spider (a union T3 of three rays with their endpoints glued at a point 0), which can represent these tree spaces for language clustering. These trees are built using a single-linkage method for clustering based on distances between samples from languages which use the Latin script. Taking three languages at a time, the barycenter is determined. Some initial results have found both non-sticky and sticky sample means. If the mean exhibits non-sticky properties, then one language may come from a different ancestor than the other two. If the mean is considered sticky, then the languages may share a common ancestor or all languages may have different ancestry.
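To make the sticky versus non-sticky distinction concrete, the following is a minimal sketch of a sample barycenter (Fréchet mean) on the 3-spider. It is an illustration under stated assumptions, not the paper's implementation: each sample is assumed to be already encoded as a pair (leg, distance from the point 0), and the function name `spider_mean`, the `Point` type, and the tolerance `tol` are hypothetical. How a language sample is mapped to a leg and a distance is not specified in the abstract and is left out here.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: (leg index in 0..2, distance r >= 0 from the point 0).
Point = Tuple[int, float]


def spider_mean(
    points: List[Point], legs: int = 3, tol: float = 1e-12
) -> Tuple[Tuple[Optional[int], float], str, List[float]]:
    """Sample Frechet mean on a spider with `legs` rays glued at 0.

    The distance between two points is |r - s| on the same leg and
    r + s on different legs. Restricted to leg k, the sum of squared
    distances is minimized at the "folded" average
        m_k = (sum of r on leg k - sum of r on the other legs) / n,
    provided m_k > 0. At most one m_k can be positive. If none is,
    the minimizer is the point 0.
    """
    n = len(points)
    if n == 0:
        raise ValueError("need at least one sample point")

    leg_sums = [0.0] * legs
    for leg, r in points:
        if not 0 <= leg < legs:
            raise ValueError(f"leg index {leg} out of range")
        if r < 0:
            raise ValueError("distances from 0 must be non-negative")
        leg_sums[leg] += r

    total = sum(leg_sums)
    # m_k = (S_k - (total - S_k)) / n for each leg k
    folded = [(2.0 * s - total) / n for s in leg_sums]

    for k, m in enumerate(folded):
        if m > tol:
            # The mean sits strictly inside leg k: non-sticky.
            return (k, m), "non-sticky", folded

    # No folded average is positive, so the mean is the point 0.
    # The boundary case m_k == 0 is lumped in with "sticky" here
    # (a simplification made for this sketch).
    return (None, 0.0), "sticky", folded


if __name__ == "__main__":
    # One sample far out on leg 0 pulls the mean onto that leg.
    print(spider_mean([(0, 5.0), (1, 1.0), (2, 1.0)]))
    # Balanced samples leave the mean stuck at the point 0.
    print(spider_mean([(0, 1.0), (1, 1.0), (2, 1.0)]))
```

In this sketch, a mean reported on leg k corresponds to the non-sticky case described above, while a mean at the point 0 corresponds to the sticky case; small perturbations of the samples do not move a sticky mean off 0 as long as every folded average stays negative.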
