On Language Clustering: A Non-parametric Statistical Approach

Abstract

Any approach aimed at characterizing and quantifying a particular phenomenon must include the use of robust statistical methodologies for data analysis. With this in mind, the purpose of this study is to present statistical approaches that may be employed in nonparametric, nonhomogeneous data frameworks, and to examine their application in the field of natural language processing and language clustering. Furthermore, this paper discusses the many uses of nonparametric approaches in linguistic data mining and processing. The data depth idea allows for the centre-outward ordering of points in any dimension, resulting in a new nonparametric multivariate statistical analysis that does not require any distributional assumptions. The concept of hierarchy is used in historical language categorisation and structuring, and it aims to organise and cluster languages into subfamilies using the same premise. In this regard, the current study presents a novel approach to language family structuring based on nonparametric approaches derived from a typological structure of words in various languages, which is then converted into a Cartesian framework using multidimensional scaling (MDS). This statistical-depth-based architecture allows for the use of data-depth-based methodologies for robust outlier detection, which is extremely useful in understanding the categorisation of diverse borderline languages and allows for the re-evaluation of existing classification systems. Other depth-based approaches are also applied to processes such as unsupervised and supervised clustering. This paper therefore provides an overview of procedures that can be applied to nonhomogeneous language classification systems in a nonparametric framework.
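
The pipeline summarised above (a distance structure mapped to Cartesian coordinates by MDS, then ordered centre-outward by a depth function) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy points stand in for language distances, classical (Torgerson) MDS is one of several MDS variants, and Mahalanobis depth is only one of many depth functions. The abstract does not say which variants the study uses, and the function names are hypothetical.

```python
import numpy as np


def classical_mds(dist, k=2):
    """Embed an n x n distance matrix into k Cartesian coordinates
    using classical (Torgerson) MDS. Assumed variant, for illustration."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:k]
    # Clip tiny negative eigenvalues that arise from rounding.
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))


def mahalanobis_depth(points):
    """Mahalanobis depth of each row with respect to the whole sample:
    1 / (1 + squared Mahalanobis distance to the sample mean)."""
    diff = points - points.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(points, rowvar=False))
    sq_dist = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return 1.0 / (1.0 + sq_dist)


# Hypothetical toy data: five clustered "languages" and one far-away point.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                   [1.0, 1.0], [0.5, 0.5], [6.0, 6.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(dist, k=2)      # Cartesian framework from distances
depths = mahalanobis_depth(coords)     # centre-outward ordering
outlier = int(np.argmin(depths))       # lowest depth = most outlying point
print(outlier, np.round(depths, 3))
```

Points with low depth lie far from the centre of the data cloud, which is the sense in which a depth ordering can flag borderline or outlying items. In practice the distance matrix would come from the typological word features rather than from synthetic coordinates.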
