A Review of Language and Speech Features for Cognitive-Linguistic Assessment

  • 2019-06-04 02:17:18
  • Rohit Voleti, Julie M. Liss, Visar Berisha

Abstract

It is widely accepted that information derived from analyzing speech (the acoustic signal) and language production (words and sentences) serves as a useful window into the health of an individual's cognitive ability. In fact, most neuropsychological batteries used in cognitive assessment have a component related to speech and language where clinicians elicit speech from patients for subjective evaluation across a broad set of dimensions. With advances in speech signal processing and natural language processing, there has been recent interest in developing tools to detect more subtle changes in cognitive-linguistic function. This work relies on extracting a set of features from recorded and transcribed speech for objective assessments of cognition, early diagnosis of neurological disease, and objective tracking of disease after diagnosis. In this paper we provide a review of existing speech and language features used in this domain, discuss their clinical application, and highlight their advantages and disadvantages. Broadly speaking, the review is split into two categories: language features based on natural language processing and speech features based on speech signal processing. Within each category, we consider features that aim to measure complementary dimensions of cognitive-linguistics, including language diversity, syntactic complexity, semantic coherence, and timing. We conclude the review with a proposal of new research directions to further advance the field.

 


R. Voleti is with the School of Electrical, Computer, & Energy Engineering, Arizona State University, Tempe, AZ, 85281 USA.

cognitive linguistics, clinical speech analytics, Alzheimer’s disease, schizophrenia, cognition
© 2019 IEEE.

I Introduction

Early detection of neurodegenerative disease and mental illness that impact cognitive function is a major goal of current research trends in speech and language processing. These afflictions have significant societal and economic impacts. It is estimated that approximately one in six adults in the United States lives with some form of mental illness, according to the National Institute of Mental Health (NIMH), totaling 44.6 million people in 2016 [1]. In the United States alone, some estimate that the economic burden of mental illness is approximately $1 trillion annually [2].

Many forms of neurodegenerative disease and mental illness have widespread effects on speech and language production, providing us with one useful mechanism with which to study these conditions. Speech and language production both require significant levels of neurological function. Therefore, information derived from analyzing speech (the acoustic signal) and language production (words and sentences) serves as a useful window into the health of an individual’s cognitive ability. As a result, most neuropsychological batteries that assess cognitive health include a language component. This has motivated current research trends in quantitative speech and language analytics, allowing for better diagnosis, prediction, and characterization of these conditions, with the objective of improving treatment outcomes and reducing economic burden. In this paper, we give an overview of several ways in which speech and language features serve as biomarkers of various forms of neurodegenerative disease and mental illness.

Fig. 1: Overview of the general process of using natural language processing and speech signal processing for automated extraction of speech and language features for clinical decision making. Example language features include lexical complexity, syntactic complexity, semantic coherence, etc. Example auditory speech features include pause rate, prosody, articulation, etc.

With access to large clinical speech and language databases along with recent developments in the fields of speech signal processing, computational linguistics, and machine learning, there is an increased potential for using computational methods to automate the analysis of speech and language datasets for clinical applications [2]. Objective analysis of this sort has the potential to overcome some of the limitations associated with the current state-of-the-art for improved diagnosis, prediction, and characterization of neurological disorders. A high-level block diagram of existing methods in clinical-speech analytics is shown in Figure 1. Patients provide speech samples via a speech elicitation task. This could be passively collected speech, patient interviews, or recorded neuropsychological batteries. The resulting speech is transcribed, using either automatic speech recognition (ASR) or manual transcription, and a set of speech and language features are extracted that aim to measure different aspects of cognitive-linguistic change. These features become the input of a machine learning model that aims to predict a dependent variable of interest, e.g. detection of clinical conditions or assessment of social skills [3, 4, 5].

Perhaps the most important part of the analysis framework in Figure 1 is the set of analytical methods used to extract clinically-relevant features from the samples. With a focus on cognitive-linguistic assessment, in this paper we provide a survey of the existing literature and review the most common speech and language features used in assessing cognition. A summary of the work reviewed in this paper can be seen in Table I and will be discussed in the subsequent sections. The review is split into two parts: natural language processing (NLP) features and speech signal processing features. With NLP, we can measure the complexity and coherence of language, and with speech signal processing, we can measure acoustic proxies related to cognition. The remainder of the review is organized as follows. In Section II, we review and discuss several common methods in NLP along with their clinical applications. In Section III, we review and discuss methods in speech signal processing which have been used in clinical applications. Finally, in Section IV, we discuss gaps in current research, propose future directions for studies in this area to expand our knowledge, and provide some concluding remarks.

Feature: Lexical Diversity; Lexical Density [4], [6]
Type of Feature: Syntactical
Disease/Mental Illness: Mild Cognitive Impairment [4]; Alzheimer’s Disease [6, 7]; Schizophrenia [8, 5]; CTE [9]
Papers: Roark et al. [4]; Fraser et al. [6]; Kayi et al. [8]; Berisha et al. [7, 9]; Voleti et al. [5]

Feature: Parse Tree Derived Measures:
- Yngve depth scoring [4], [6]
- Frazier scoring [4]
- Lexical Dependency Distance [4], [8]
Type of Feature: Syntactical
Disease/Mental Illness: Mild Cognitive Impairment [4]; Alzheimer’s Disease [6]; Schizophrenia [8]
Papers: Roark et al. [4]; Fraser et al. [6]; Kayi et al. [8]

Feature: Part-of-Speech Tag Measures:
- Cross entropy [4]
- Propositional density (P-density) [4]
- Content density [4]
- Frequency of use of particular tags [10], [11], [8], [6]
- Tag ratios (N/VB, etc.) [6]
Type of Feature: Lexical; Syntactical
Disease/Mental Illness: Mild Cognitive Impairment [4]; Alzheimer’s Disease [6]; Schizophrenia [8]; Psychosis/FTD [10], [11]
Papers: Roark et al. [4]; Fraser et al. [6]; Kayi et al. [8]; Bedi et al. [10]; Corcoran et al. [11]

Feature: Speech-Graph Attributes:
- Nodes, edges, parallel edges, loops, etc. [12], [13], [14]
Type of Feature: Lexical [12], [13], [14]; Syntactical [12], [13], [14]
Disease/Mental Illness: Schizophrenia [12], [13], [14]; Mania/Bipolar Disorder [12], [13]
Papers: Mota et al. (2012) [12]; Mota et al. (2014) [13]; Carrillo et al. [14]

Feature: Vector Word Embeddings:
- Latent Semantic Analysis (LSA) [15], [10], [11], [14]
- word2vec [16]
- GloVe [16], [8]
Type of Feature: Semantic
Disease/Mental Illness: Schizophrenia [15], [16], [8]; Psychosis/FTD [10], [11], [14]
Papers: Elvevåg et al. [15]; Bedi et al. [10]; Corcoran et al. [11]; Carrillo et al. [14]; Iter et al. [16]; Kayi et al. [8]

Feature: K-Means clustering of word embeddings using GloVe [8]
Type of Feature: Semantic
Disease/Mental Illness: Schizophrenia [8]
Papers: Kayi et al. [8]

Feature: Tangentiality of Coherence:
- Slope of regression line measuring cosine similarity over time using LSA [15] or neural embeddings (word2vec and GloVe) [16]
Type of Feature: Semantic
Disease/Mental Illness: Schizophrenia [15], [16]
Papers: Elvevåg et al. [15]; Iter et al. [16]

Feature: Incoherence Measures:
- First-order coherence: cosine similarity for average embedding of consecutive sentences or phrases [10], [11], [14], [16]
- Second-order coherence: cosine similarity for average embedding of sentences or phrases that are spaced apart by one sentence/phrase in between [10], [11], [14], [16]
- k inter-word coherence: cosine similarity computed at the word level by words spaced k words apart within a given response [11]
Type of Feature: Semantic
Disease/Mental Illness: Schizophrenia [16, 5]; Psychosis/FTD [10], [11], [14]
Papers: Bedi et al. [10]; Corcoran et al. [11]; Carrillo et al. [14]; Iter et al. [16]; Voleti et al. [5]

Feature: Ambiguous Pronoun Usage
Type of Feature: Lexical
Disease/Mental Illness: Schizophrenia [16]
Papers: Iter et al. [16]

Feature: Semantic Role Labeling
Type of Feature: Semantic
Disease/Mental Illness: Schizophrenia [8]
Papers: Kayi et al. [8]

Feature: Latent Dirichlet Allocation (LDA)
Type of Feature: Semantic
Disease/Mental Illness: Schizophrenia [8]
Papers: Kayi et al. [8]

Feature: Temporal Speech Features:
- Duration of voiced segments and pauses
- Duration of periodic and aperiodic segments
- Ratios of segments (i.e. continuity of speech)
- Phonation rate
- Pause rate
- Total Locution Time
- etc.
Type of Feature: Speech
Disease/Mental Illness: Alzheimer’s Disease; Mild Cognitive Impairment
Papers: König et al. [17]; Roark et al. [4]

Feature: Nonverbal Speech Cues:
- Interruptions
- Interjections
- Response time
- Natural turns
Type of Feature: Speech
Disease/Mental Illness: Schizophrenia
Papers: Tahir et al. [18]

Feature: Mel Frequency Cepstral Coefficients (MFCC) Features:
- Mean, variance, skewness, kurtosis of MFCCs
Type of Feature: Speech
Disease/Mental Illness: Alzheimer’s Disease
Papers: Fraser et al. [6]

TABLE I: Summary of features used to measure cognitive abilities in published speech and language research.

II Measuring Cognitive Function with Natural Language Processing

In this section, we review several families of natural language processing methods, ranging from simple lexical analysis to state-of-the-art language models, that can be utilized for clinical assessment.

The sections below present families of approaches in order of increasing complexity. In the first section, we describe methods based on subjective evaluation of speech and language; then we discuss methods that rely on lexeme-level information, followed by methods that rely on sentence-level information, and end with methods that rely on semantics. For each section, we provide a description of representative approaches and a review of how these methods are used in clinical applications. We end each section with a discussion of the advantages and disadvantages of the approaches in that section.

II-A Early Work

Simple analysis of written language samples has long been thought to provide valuable information regarding cognitive health. One of the best-known early examples of such work is the famous “nun study” by Snowdon et al. on linguistic ability as a predictor of Alzheimer’s disease (AD) [19]. In this work, manual evaluations of the linguistic abilities of 93 nuns were conducted by analysis of autobiographical essays they had written earlier in their lives. The researchers evaluated the linguistic structure of the essays by scoring the grammatical complexity and idea density in the writing samples. In particular, the study found that low idea density in early life was a particularly strong predictor of reduced cognitive ability or the presence of AD in later life. Roughly 80% of the subjects that were determined to lack linguistic complexity in their writings developed AD or had mental and cognitive disabilities in their older age.

This work was groundbreaking in showing that linguistic structure and complexity can serve as a strong predictor for the onset of AD and potentially other forms of cognitive impairment. However, it required tedious manual analysis of writing samples and, because the scoring was subjective, careful verification that the scores given by different evaluators were highly correlated.

These factors make in-clinic use prohibitive; as a result, these methods have received limited attention in follow-on work. The development of automated and quantitative metrics to analyze language complexity can potentially save several hours of research time in similar linguistic studies of neurodegenerative disease and mental illness. Several techniques devised in the NLP literature have been utilized to address the challenge of conducting quantitative analysis to replace traditionally subjective and task-dependent methods of measuring linguistic complexity.

II-B First Order Lexeme-Level Analysis

II-B1 Methods

Automated first-order lexical analysis, i.e. at the lexeme-level or word-level, can generate objective language metrics to provide valuable insight into cognitive function. One notable consideration is the concept of lexical diversity, referring to unique vocabulary usage. The type-to-token ratio (TTR), given in Equation (1), is a well-known measure of lexical diversity, in which the number of unique words (word types, V) is compared against the total number of words (word tokens, N).

\[ \mathrm{TTR} = \frac{V}{N} \tag{1} \]

However, TTR tends to decrease for longer utterances, as the number of unique words typically plateaus as the total number of words increases. The moving average type-to-token ratio (MATTR) [20] is one method which aims to reduce the dependence on text length by considering TTR over a sliding window of the text. This approach does not have a length-based bias, but is considerably more variable as the parameters are estimated on smaller speech samples. Brunét’s Index (BI) [21], defined in Equation (2), is another measure of lexical diversity that has a weaker dependence on text length, with a smaller value indicating a greater degree of lexical diversity,

\[ \mathrm{BI} = N^{V^{-0.165}}. \tag{2} \]

An alternative is also provided by Honoré’s Statistic (HS) [22], defined in Equation (3), which emphasizes the use of words that are spoken only once (denoted by V1),

\[ \mathrm{HS} = \frac{100 \log N}{1 - V_1/V}. \tag{3} \]

The exponential and logarithm in the BI and the HS reduce the dependence on the text length, while still allowing the user to use all samples to estimate the diversity measure, unlike the MATTR.
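As a minimal sketch of Equations (1) through (3) and the sliding-window idea behind MATTR, the measures can be computed from a list of word tokens as shown below. The function name, the whitespace tokenization, the window length, and the use of the natural logarithm in Equation (3) are all assumptions made for illustration; they are not taken from the cited works.

```python
import math
from collections import Counter


def lexical_diversity(tokens, window=10):
    """Return TTR, MATTR, Brunet's Index, and Honore's Statistic for a token list."""
    n = len(tokens)                                   # N: total word tokens
    if n == 0:
        raise ValueError("need at least one token")
    counts = Counter(tokens)
    v = len(counts)                                   # V: unique word types
    v1 = sum(1 for c in counts.values() if c == 1)    # V1: words used exactly once

    ttr = v / n                                       # Equation (1)

    # MATTR: mean TTR over every run of `window` consecutive tokens.
    # Texts no longer than the window fall back to the plain TTR.
    if n <= window:
        mattr = ttr
    else:
        ratios = [len(set(tokens[i:i + window])) / window
                  for i in range(n - window + 1)]
        mattr = sum(ratios) / len(ratios)

    bi = n ** (v ** -0.165)                           # Equation (2)

    # Equation (3) is undefined when every word occurs once (V1 == V).
    hs = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")

    return {"TTR": ttr, "MATTR": mattr, "BI": bi, "HS": hs}


sample = "the cat sat on the mat and the dog sat on the rug".split()
print(lexical_diversity(sample, window=5))
```

For this 13-token sample there are 8 word types, 5 of which occur once, so TTR is 8/13. A real pipeline would also need to settle on lemmatization and punctuation handling, which this sketch ignores.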

Measures of lexical density, which quantify the degree of information packaging within an utterance, may also be useful for cognitive assessment. Content words (i.e. nouns, verbs, adjectives, adverbs) tend to carry more information than function words (e.g. prepositions, conjunctions, interjections, etc.). Content words are also referred to as “open-class”, meaning new words are often added to and removed from this category as language changes over time; function words are also referred to as “closed-class”, since words are rarely added to or removed from these categories. These word classes can be used to compute notions of content density (CD) in written or spoken language, given in Equation (4),

\[ \mathrm{CD} = \frac{\#\ \text{of verbs} + \text{nouns} + \text{adjectives} + \text{adverbs}}{N}. \tag{4} \]

Part-of-speech (POS) tagging of text samples is one way in which the word categories can be automatically determined; individual word tokens within a sentence are identified and labeled as the part-of-speech that they represent, typically from the Penn Treebank tagset [23]. Several automatic algorithms and available implementations exist for rule-based and statistical taggers, e.g. using a hidden Markov model (HMM) or maximum entropy Markov model (MEMM) to determine POS tags with a statistical sequence model [24]. For example, the widely-used Stanford Tagger [25] uses a bidirectional MEMM to accurately assign POS tags to samples of text. Several notions of content density can be computed at the lexeme-level if POS tags can be automatically determined to reflect the role of each word in an utterance. Examples of these include: the propositional density (P-density), a measure of the number of expressed propositions (verbs, adjectives, adverbs, prepositions, and conjunctions) divided by the total number of words, and the alternative content density, which is a measure of the ratio of content words to function words. One important limitation of all these methods is that they rely only on lexeme-level information. As such, these methods make no distinction between words that are
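The density measures described above can be sketched as follows, assuming the text has already been tagged with Penn Treebank tags by some tagger. The function name, the tag-prefix groupings, and the hand-assigned tags in the sample are illustrative assumptions; in particular, treating every non-content token as a function word is a simplification.

```python
# Penn Treebank tag prefixes for the content-word classes in Equation (4):
# nouns (NN*), verbs (VB*), adjectives (JJ*), adverbs (RB*).
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")
# Tags counted as propositions for P-density: verbs, adjectives, adverbs,
# prepositions (IN), and coordinating conjunctions (CC).
PROPOSITION_PREFIXES = ("VB", "JJ", "RB", "IN", "CC")


def density_measures(tagged):
    """tagged: list of (word, Penn Treebank tag) pairs for one language sample."""
    n = len(tagged)
    if n == 0:
        raise ValueError("need at least one tagged token")
    content = sum(tag.startswith(CONTENT_PREFIXES) for _, tag in tagged)
    propositions = sum(tag.startswith(PROPOSITION_PREFIXES) for _, tag in tagged)
    function = n - content            # simplification: everything else is "function"
    return {
        "CD": content / n,            # Equation (4)
        "P-density": propositions / n,
        "content/function": content / function if function else float("inf"),
    }


# Hand-assigned tags, for illustration only.
sample = [("She", "PRP"), ("was", "VBD"), ("a", "DT"), ("cook", "NN"),
          ("in", "IN"), ("a", "DT"), ("school", "NN"), ("cafeteria", "NN")]
print(density_measures(sample))
```

Here four of the eight tokens are content words, giving CD = 0.5, and two tokens (the verb and the preposition) count toward P-density.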

II-B2 Clinical Applications

Several studies have utilized first-order lexical features to assess cognitive health by automated linguistic analysis. As an example, Roark et al. considered a variety of speech and language features to detect mild cognitive impairment (MCI), often a precursor to Alzheimer’s disease (AD) [4]. In this work, the authors compared the language of elderly healthy control subjects and patients with MCI on the Wechsler Logical Memory I/II Test [26], in which subjects are tested on their ability to retell a short narrative that has been told to them at different time points: they are asked to retell the story immediately (LM1) and after approximately 30 minutes (LM2). The features considered included multiple measures of lexical density. POS tagging was performed on the transcripts of clinical interviews of patients with MCI and healthy control subjects. Two measures of lexical density derived from the POS tags were the propositional density (P-density) and the alternative content density, which is a measure of the ratio of content words to function words. In particular, the alternative content density was a strong indicator of group differences between healthy controls and patients with MCI.

The automated language features were used in conjunction with speech features and clinical test scores to train a support vector machine (SVM) classifier that achieved good leave-pair-out cross validation results in classifying the two groups (AUC = 0.732, 0.703, and 0.861 when trained on language features, language features + speech features, and language + speech features + test scores, respectively). Additional language and speech features will be discussed later.

Bucks et al. [27] and Fraser et al. [6] both used several first-order lexical features in their analysis of patients with AD. In [27], the authors successfully discriminated between a small sample of healthy older control subjects (n=16) and patients with AD (n=8) using TTR (Equation (1)), HS (Equation (3)), and BI (Equation (2)) as measures of lexical diversity or vocabulary richness. They additionally considered the usage rates of other parts of speech (i.e. nouns, pronouns, adjectives, verbs). In particular, TTR, BI, verb-rate, and adjective-rate all indicated strong group differences between the subjects with AD and healthy controls, and groups could be classified with a cross-validation accuracy of 87.5%. In [6], Fraser et al. performed similar work but considered a much larger sample size of patients with AD (n=240) and healthy control subjects (n=233), using the DementiaBank database (https://dementia.talkbank.org/access/) to obtain a large number of patient transcripts. They also identified a large variety of additional language and speech features with which they could accurately classify patients with AD and healthy control subjects.

Berisha et al. performed a longitudinal analysis of non-scripted press conference transcripts from U.S. Presidents Ronald Reagan (who was diagnosed with AD late in life) and George H.W. Bush (no such diagnosis) [7]. Among the linguistic features that were tracked were the lexical diversity and lexical density for both presidents over several years’ worth of press conference transcripts. The study shows that the number of unique words used by Reagan steadily declined over the period of his presidency, while no such changes were seen for Bush. These declines predated Reagan’s diagnosis of AD in 1994 by 6 years, suggesting that these computed lexical features may be useful in predicting the onset of AD pre-clinically. A related study examined the language in interview transcripts of professional American football players in the National Football League (NFL) [9], who are at high risk for neurological damage in the form of chronic traumatic encephalopathy (CTE). The study longitudinally measured TTR (Equation (1)) and CD (Equation (4); the authors in [9] refer to CD simply as “lexical density” (LD)) in interview transcripts of NFL players (n=10) and NFL coaches/front office executives (n=18). To serve as a control in the language study, coaches and executives were limited to those who were not former players experiencing similar head trauma. Previous work has shown that TTR and CD are expected to increase as individuals age in typical cases [28, 29, 30]. However, this study demonstrated clear longitudinal declines in both variables for the NFL players while showing the expected increase in both variables for coaches and executives in similar contexts, indicating that tracking language production of this type can be a useful biomarker for predicting the onset of CTE.

II-B3 Advantages & Disadvantages

It is clear from the literature that first-order lexeme-level features, i.e. those related to lexical diversity and density, are useful biomarkers for detecting the presence or predicting the onset of a variety of conditions, such as MCI, AD, CTE, and potentially several others. POS tagging has several reliable and accurate implementations, and these features are simple and easy to compute. Additionally, these linguistic measures are easily clinically interpretable for measuring cognitive-linguistic ability.

However, lexeme-level features are limited in what information they provide alone, and many of the previously discussed works used these features in conjunction with several other speech and language features to build their models for classification and prediction of disease onset. Since these measures are based on counting particular word types and tokens, they tell us little about how individual lexical units interact with each other in a full sentence or phrase. Additionally, measures of lexical diversity and lexical density provide little insight regarding semantic similarity between words. For example, the words “car”, “vehicle”, and “SUV” are all counted as unique words, despite there being a clear semantic similarity between them. In the following sections, we will discuss more complex language measures that aim to address these issues.

II-C Sentence-Level Syntactical Analysis

Generating free-flowing speech requires that we not only determine which words best convey an idea, but also determine the order in which to sequence the words when forming sentences. The complexity of the sentences we structure provides a great deal of insight into cognitive-linguistic health. In this section we provide an overview of various methods used to measure syntactic complexity as a proxy for cognitive-linguistic health.

II-C1 Methods

(a) Constituency-based parsing of a sample sentence (i.e. top-down and left to right). In the diagram, S = sentence, NP = noun phrase, VP = verb phrase, PP = prepositional phrase, PRP = personal pronoun, AUX = auxiliary verb, DT = determiner, NN = noun, and IN = preposition. The figure contains examples of both Yngve scoring (Y) [31] and Frazier scoring (F) [32] for each branch of the tree. At the bottom is the total score of each type for each word token in the sentence, summed up to the root of the tree.
(b) Dependency-based parsing of the same sample sentence. Lexical dependency distances can be computed. In this example, there are 7 total links, a total lexical dependency distance of 11, and an average distance of 11/7 ≈ 1.57. Longer distances indicate greater linguistic complexity.
Fig. 2: (a) A constituency-based and (b) dependency-based parsing of a simple sentence. Both adapted from [4].

The ordering of words in sentences and sentences in paragraphs can also provide important insight into cognitive function. Many easy-to-compute and common structural metrics of language include the mean length of a clause, mean length of sentence, ratio of number of clauses to number of sentences, and other related statistics [6]. Additionally, several more complicated methods for syntactical analysis of natural language can also be used to gain better insight for assessing linguistic complexity and cognitive health.

A commonly used technique involves the parsing of naturally produced language based on language-dependent syntactical and grammatical rules. A constituency-based parse tree is generated to decompose a sentence or phrase into lexical units or tokens. In English, for example, sentences are read left to right and are often parsed this way. An example of a common constituency-based left to right parse tree can be seen in Figure 2(a) for the sentence “She was a cook in a school cafeteria”, adapted from [4]. At the root node, the sentence is split into a noun phrase (“she”) and a verb phrase (“was a cook in a school cafeteria”). Then, the phrases are further parsed into individual tokens with a grammatical assignment (nouns, verbs, determiners, etc.). Simple sentences in the English language are often right-branching when using constituency-based parse trees. This means that the subject typically appears first and is followed by the verb, object, and other modifiers. This is primarily the case for the sentence in Figure 2(a). By contrast, left-branching sentences place verbs, objects, or modifiers before the main subject of a sentence [33]. Left-branching sentences are often cognitively more taxing as they involve more complex constructions that require a speaker (and a listener) to remember more information about the subject before the subject is explicitly mentioned. In English, one measure of syntactic complexity of a sentence structure can be thought of as a measure of the degree of left-branching within a particular parsing of that sentence.

Once a parsing method has been implemented, various measures of lexical and syntactical complexity can be computed for each sentence or phrase. Yngve proposes one such method in [31]. Given the right-branching nature of simple English sentences, he proposes a measure of complexity based on the amount of left-branching in a given sentence. At each node in the parse tree, the rightmost branch is given a score of 0. Then, each branch to the left of it is given a score that is incremented by 1 when moving from right to left at a given node. The score for each token is the sum of scores up all branches to the root of the tree. An alternative scoring scheme for the same parse tree structure was proposed by Frazier [32]. Frazier notes that embedded clauses within a sentence are an additional modifier that can increase the complexity of the syntactical construction of that sentence. Therefore, just as with left-branching language, a listener would need to retain more information in order to properly interpret the full sentence. Frazier’s scoring method emphasizes the use of embedded clauses when evaluating the syntactic complexity. The scores are assigned to each lexeme as in Yngve’s scoring, but they are summed up to the root of the tree or the lowest node that is not the leftmost child of its parent node. Examples of both Yngve and Frazier scoring can be seen in Figure 2(a).
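The Yngve scoring rule described above can be sketched in a few lines, assuming the parse is already available as nested tuples. The tuple encoding, the function name, and the hand-written bracketing of the sample sentence are assumptions made for illustration; a real pipeline would take the tree from a constituency parser. Frazier scoring is not covered by this sketch.

```python
def yngve_scores(tree, inherited=0):
    """Return [(word, Yngve score)] for a parse tree of nested tuples.

    A leaf is (tag, word); an internal node is (label, child, child, ...).
    """
    _label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], inherited)]      # leaf: emit the accumulated score
    scores = []
    last = len(children) - 1
    for i, child in enumerate(children):
        # Rightmost branch scores 0; each step to the left adds 1.
        scores += yngve_scores(child, inherited + (last - i))
    return scores


# Hand-written bracketing of the sample sentence, for illustration only.
tree = ("S",
        ("NP", ("PRP", "She")),
        ("VP",
         ("AUX", "was"),
         ("NP",
          ("NP", ("DT", "a"), ("NN", "cook")),
          ("PP", ("IN", "in"),
           ("NP", ("DT", "a"), ("NN", "school"), ("NN", "cafeteria"))))))

scores = yngve_scores(tree)
print(scores)
values = [s for _, s in scores]
print("mean:", sum(values) / len(values), "max:", max(values))
```

For this bracketing the per-word scores are 1, 1, 2, 1, 1, 2, 1, 0, so the final word of a right-branching phrase scores 0 while words further to the left in wider phrases score higher. Per-sentence summaries such as the mean and maximum can then be used as features.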

Another type of syntactical parsing of a sentence is known as dependency parsing, in which all nodes are treated as terminal nodes (no phrase categories such as verb phrase or noun phrase) [34]. A dependency-based parse tree aims to map the dependency of each word in a sentence or phrase to another within the same utterance. Methods proposed by Lin [35] and Gibson [36] provide some ways by which the lexical dependency distances can be determined. The general idea behind these methods is that longer lexical dependency distances within a sentence indicate a more complex linguistic structure, as the speaker and listener must remember more information about the dependencies of one word on another within a sentence. An example of the same sentence is shown with a dependency-based parse tree in Figure 2(b), also adapted from [4].
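One simple way to turn a dependency parse into a distance measure is to take the absolute difference between the position of each word and the position of its head. The sketch below assumes this definition and a head-index encoding; the function name and the hand-assigned heads for the sample sentence are illustrative assumptions rather than the output of any particular parser or the exact procedures of [35] or [36].

```python
def dependency_distances(heads):
    """heads[i] is the 1-based position of the head of word i+1, or 0 for the root.

    Returns (number of links, total distance, average distance).
    """
    distances = [abs(h - (i + 1)) for i, h in enumerate(heads) if h != 0]
    if not distances:
        raise ValueError("need at least one dependency link")
    total = sum(distances)
    return len(distances), total, total / len(distances)


words = ["She", "was", "a", "cook", "in", "a", "school", "cafeteria"]
# Hand-assigned heads, for illustration only: "was" is the root,
# "cook" attaches to "was", "in" to "cook", "cafeteria" to "in", and so on.
heads = [2, 0, 4, 2, 4, 8, 8, 5]

links, total, average = dependency_distances(heads)
print(links, total, round(average, 2))
```

With these assumed heads the sentence has 7 links and a total distance of 11, giving an average of 11/7, which matches the counts quoted in the caption of Fig. 2.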

Mota et al. also propose a graph-theory based approach for analyzing language structure as a marker of cognitive ability with the construction of speech graphs [12, 13].

(a) Sample speech-graph representation of a spoken utterance. Each of the circular nodes represents a lexical unit (e.g. a single word) and the curved arrows represent edges which connect the relevant lexemes in the utterance. Attributes can be computed using the graph.
(b) Examples of computable speech graph attributes (SGAs). In this case, the largest connected component (LCC) for the graph in part (a) is the entire graph (not shown here). The largest strongly connected component (LSC) is instead the portion shown above (all nodes can be reached from all others when considering the directionality).
Fig. 3: (a) A sample speech-graph for a complete spoken utterance. (b) Example speech-graph attributes (SGAs). Both adapted from [13].

In this representation, the nodes are words that are connected to consecutive words in the sample text by edges representing lexical, grammatical, or semantic relationships between words in the text. Spoken language is first transcribed and tokenized into individual lexemes, with each unique word represented by a graph node. Directed edges then connect consecutive words. The researchers in this work suggest that structural graph features (e.g. loop density, distance between words of interest, etc.) serve as clinically relevant objective language measures that give insight into cognitive function. Some of the computed speech graph attributes (SGAs) consist of:

  • Nodes (N): Total number of nodes

  • Edges (E): Total number of edges

  • Parallel Edges (PE): Total number of edges linking the same pair of nodes more than one time (direction does not matter)

  • Repeated Edges (RE): Similar to PE, but edges must be in the same direction

  • Loops with 1 Node (L1): Total number of self-loops

  • Loops with 2 Nodes (L2): Total number of loops with 2 nodes

  • Loops with 3 Nodes (L3): Total number of loops with 3 nodes

  • Largest Connected Component (LCC): Total number of nodes that make up the longest path within the speech-graph network for an undirected graph

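  • Largest Strongly Connected Component (LSC): Total number of nodes in the largest group of nodes in which every node can be reached from every other node when edge direction is taken into account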

  • Average Total Degree (ATD): The mean of the total number of edges that either point to or depart from a node

  • Density (D): A global density measure that excludes self-loops and parallel edges, $D = (E - L1 - PE)/N^2$

  • Diameter (DI): A global measure that is the length of the longest shortest path between node pairs of the network

  • Average Shortest Path (ASP): A global measure that is the average length of the shortest path between node pairs of the network

The features extracted from the graphs provide indirect measures of lexical diversity and syntactic complexity. For example, N is the number of unique words, E is the total number of words, and repeated edges represent repeated words or phrases in text. An example speech-graph representation of an arbitrary utterance is shown in Figure 3(a), with sample SGAs shown in Figure 3(b).
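As a rough illustration of how some of these attributes can be computed, the following minimal sketch builds a speech graph from a transcript and counts N, E, L1, RE, PE, ATD, and D. It is not the implementation used in [12, 13]: the function name `speech_graph_attributes`, the whitespace tokenizer, and the counting convention for RE and PE (every edge belonging to a repeated pair is counted, and self-loops are excluded) are assumptions made for this example.

```python
from collections import Counter


def speech_graph_attributes(transcript):
    """Compute a few speech-graph attributes (SGAs) for one transcript."""
    tokens = transcript.lower().split()    # naive whitespace tokenizer (assumption)
    if not tokens:
        raise ValueError("transcript contains no words")
    edges = list(zip(tokens, tokens[1:]))  # one directed edge per consecutive word pair
    n = len(set(tokens))                   # N: one node per unique word
    e = len(edges)                         # E: total number of edges

    # Self-loops (L1) are counted separately and kept out of RE and PE (assumption).
    l1 = sum(1 for a, b in edges if a == b)
    non_loops = [(a, b) for a, b in edges if a != b]

    # RE: edges whose (source, target) pair occurs more than once.
    directed = Counter(non_loops)
    re_count = sum(c for c in directed.values() if c > 1)

    # PE: edges whose unordered node pair occurs more than once.
    undirected = Counter(frozenset(pair) for pair in non_loops)
    pe_count = sum(c for c in undirected.values() if c > 1)

    return {
        "N": n,
        "E": e,
        "L1": l1,
        "RE": re_count,
        "PE": pe_count,
        "ATD": 2 * e / n,                   # each edge adds one out-degree and one in-degree
        "D": (e - l1 - pe_count) / n ** 2,  # density formula from the list above
    }


sga = speech_graph_attributes("i saw the dog and the dog saw me")
print(sga)
```

In this toy utterance the phrase "the dog" occurs twice, so the two corresponding edges are counted in both RE and PE. An utterance such as "a b a" would instead contribute to PE only, because its two edges link the same pair of nodes in opposite directions.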

II-C2 Clinical Applications

The structural aspects of spoken language have been shown to have clinical relevance for understanding medical conditions that affect cognitive ability. The previously mentioned work by Roark et al. also utilized several of the aforementioned methods to analyze the language of individuals with MCI and healthy control subjects [4]. In addition to the lexeme-level features described in Section II-B, they also considered Yngve [31] and Frazier [32] scoring measures from constituency-based parsing of the transcripts of subject responses (using the Charniak parser [37]). Mean, maximum, and average Yngve and Frazier scores were computed for each subject’s language samples. Roark et al. also used dependency parsing and computed lexical dependency distances, similar to the example in Figure 1(b). Together with the lexical and speech features, these measures allowed subjects with MCI and healthy elderly control subjects to be classified successfully, as previously described in Section II-B.

The speech-graph approach is used by Mota et al. to study the language of patients with schizophrenia and bipolar disorder (mania) [12, 13]. The researchers were able to identify structural features of the generated graphs (such as loop density, distance between words of interest, etc.) that serve as objective language measures containing clinically relevant information (e.g. flight of thoughts, poverty of speech, etc.). Using these features, the researchers were able to visualize and quantify concepts such as the logorrhea (excessive wordiness and incoherence) associated with mania, evidenced by denser networks. Similarly, the alogia (poverty of speech) typical of schizophrenia was also visible in the generated speech-graph networks, as evidenced by a greater number of nodes per word and average total degree per node. Control subjects, subjects with schizophrenia, and subjects with mania were classified with over 90% accuracy, significantly improving over traditional clinical measures, such as the Positive and Negative Syndrome Scale (PANSS) and Brief Psychiatric Rating Scale (BPRS) [12].

II-C3 Advantages & Disadvantages

Consideration of sentence-level syntactical complexity offers several advantages that address some of the drawbacks of lexeme-level analysis. As the work discussed here reveals, sentence structure metrics obtained via syntactic parsing or speech-graph analysis offer powerful information for distinguishing healthy subjects from clinical subjects with schizophrenia, bipolar disorder/mania, mild cognitive impairment, and potentially several other conditions. Since sentence construction further taxes the cognitive-linguistic system beyond word finding, methods that capture sentence complexity provide more insight into the neurological health of the individual producing these utterances. This provides a multi-dimensional representation of cognitive-linguistics and allows for better characterization of different clinical conditions, as Mota et al. did with patients with schizophrenia and those with bipolar disorder/mania [12].

However, while offering the ability to analyze more complex sentence structures, sentence-level syntactical analysis is also prone to increased complexity and variable implementation. Many methods for parsing language have been developed over the years, and the tools for measuring complexity rely on different algorithmic implementations of these parsers. A thorough empirical evaluation of the various parsing methods is required to better characterize the performance of these methods in the context of clinical applications.

II-D Semantic Analysis

High cognitive function can also be characterized by one’s ability to convey organized and coherent thoughts through spoken or written language. Here, we will cover some of the fundamental methods in NLP and computational linguistics that have been used in clinical applications related to computing the semantic coherence of language.

II-D1 Methods

Semantic similarity in natural language is typically measured computationally by embedding text into a high-dimensional vector space that represents its semantic content. Then, a notion of distance between vectors can be used to quantify semantic similarity or difference between the words or sentences represented by the vector embeddings.

Word embeddings are motivated by the distributional hypothesis in linguistics, a concept proposed by English linguist John R. Firth who famously stated “You shall know a word by the company it keeps” [38], i.e. that the inherent meaning of words is derived from their contextual usage in natural language. One of the earliest developed word embedding methods is latent semantic analysis (LSA) [39], in which word embeddings are determined by co-occurrence. In LSA, each unit of text (such as a sentence, paragraph, document, etc.) within a corpus is modeled as a bag of words, meaning that the order of the words in that collection of text is not considered.

As per Firth’s hypothesis, a major assumption of LSA is that words which occur together within a group of words will be semantically similar. As seen in Figure 4, a matrix ($A$) is generated in which each row is a unique word in the text ($w_1, \dots, w_n$) and each column represents a document or collection of text as described above ($d_1, \dots, d_d$). The matrix entry values simply consist of the count of co-occurrence statistics, that is, the number of times each word appears in each document. Then a singular value decomposition (SVD) is performed on $A$, such that $A = U \Sigma V^T$. Here, $U$ and $V$ are orthogonal matrices consisting of the left-singular and right-singular vectors (respectively) and $\Sigma$ is a rectangular diagonal matrix of singular values. The diagonal elements of $\Sigma$ can be thought to represent semantic categories, the matrix $U$ represents a mapping from the words to the categories, and the matrix $V$ represents a mapping of documents to the same categories. A subset of the $r$ most significant singular values is typically chosen, as shown by the matrix $\hat{\Sigma}$ in Figure 4. This determines the dimension of the desired word embeddings (typically in the range of ~100-500). Similarly, the first $r$ columns of $U$ form the matrix $\hat{U}$ and the first $r$ rows of $V^T$ form the matrix $\hat{V}^T$. The $r$-dimensional word embeddings for the $n$ unique words in the corpus are given by the resulting rows of the matrix product $\hat{U}\hat{\Sigma}$. Similarly, $r$-dimensional document embeddings can be generated by taking the $d$ columns of the matrix product $\hat{\Sigma}\hat{V}^T$.

Fig. 4: A visual representation of latent semantic analysis (LSA) by singular value decomposition (SVD).
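The LSA procedure described above can be sketched in a few lines with NumPy. This is a toy illustration only: the four-sentence corpus, the variable names, and the choice of r = 2 are assumptions for the example, whereas practical LSA models are fit on large corpora and often reweight the raw counts before the SVD.

```python
import numpy as np

# Toy corpus (hypothetical); each string plays the role of one document.
docs = [
    "the doctor treated the patient",
    "the nurse treated the patient",
    "the dog chased the cat",
    "the cat chased the mouse",
]
vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}

# Word-by-document count matrix A (n words x d documents).
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for w in doc.split():
        A[row[w], j] += 1

# Thin SVD: A = U diag(s) Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2                                     # number of retained singular values
word_embeddings = U[:, :r] * s[:r]        # rows of U_hat Sigma_hat
doc_embeddings = s[:r, None] * Vt[:r, :]  # columns of Sigma_hat V_hat^T


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


doctor, nurse, dog = (word_embeddings[row[w]] for w in ("doctor", "nurse", "dog"))
print(cosine(doctor, nurse), cosine(doctor, dog))
```

Even on this tiny corpus, "doctor" and "nurse" end up closer to each other than "doctor" and "dog", because the first pair appears in documents that share more vocabulary.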

In recent years, several new word embedding methods based on neural networks have gained popularity, such as word2vec [40] or GloVe [41], which have shown improved performance over LSA for semantic modeling when sufficient training data is available [42]. As an example, we take a more detailed look at word2vec. Mikolov et al. proposed the word2vec embedding method at Google in 2013, in which they present an efficient method for predicting word vectors based on very large corpora of text. They present two versions of the word2vec algorithm, a continuous bag-of-words (CBOW) model and continuous skip-gram model, seen in Figure 5. In both models, every word in a corpus of text is one-hot encoded; i.e. in a corpus of $V$ unique words, each word is uniquely encoded as a $V$-dimensional vector in which all elements are 0 except for a single 1. In both models, the inputs, $\mathbf{x} \in \mathbb{R}^V$, are multiplied by a weight matrix, $W \in \mathbb{R}^{V \times N}$, to obtain a hidden latent representation, $\mathbf{h} = W^T \mathbf{x} \in \mathbb{R}^N$, with $N < V$ typically. The hidden representation is then multiplied by another weight matrix, $\tilde{W} \in \mathbb{R}^{N \times V}$, to obtain an output representation $\mathbf{u} = \tilde{W}^T \mathbf{h} \in \mathbb{R}^V$. The softmax operation, given in Equation (5), is then performed on the elements $u_j$, $j = 1, \dots, V$ of $\mathbf{u}$ to obtain an output vector, $\mathbf{y}$, which approximates a one-hot encoded output prediction.

$$\mathbf{y} = \mathrm{softmax}(u_j) = \frac{\exp(u_j)}{\sum_{i=1}^{V} \exp(u_i)}, \qquad \mathbf{u} = [u_1, \dots, u_V]^T \tag{5}$$

In the CBOW implementation (Figure 5(a)), the inputs are the context words in the particular neighborhood of a target center word, $w_t$. In the skip-gram implementation (Figure 5(b)), the input is the center word and the objective is to predict the context words at the output. In both models, the latent hidden representation given by $\mathbf{h} = W^T \mathbf{x} \in \mathbb{R}^N$ provides an $N$-dimensional embedding for the word represented by the one-hot encoded input word, $\mathbf{x}$. The training objective is to minimize the cross-entropy loss for the prediction outcomes. An interesting finding is that vectors trained in this manner inherently encode semantic information into the latent hidden representation for each word. The classic example is that the vectors used for the words “king”, “queen”, “man”, and “woman” exhibit the following relationship:

$$\text{king} - \text{man} \approx \text{queen} - \text{woman}$$
(a) Continuous bag-of-words (CBOW)
(b) Continuous skip-gram
Fig. 5: word2vec model architectures proposed in [40]. (a) In the CBOW model, the context words are inputs used to predict the center word. (b) In the skip-gram model, the center word is used to predict the context words.
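To make the forward pass concrete, the sketch below traces a single one-hot input through the two weight matrices and the softmax of Equation (5), then evaluates the cross-entropy loss for one target word. It is only an illustration under stated assumptions: the five-word vocabulary, the embedding size N = 3, and the random (untrained) weights are hypothetical, and no training step is shown.

```python
import math
import random


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec_T(M, x):
    """Return M^T x for a matrix M stored as a list of rows."""
    rows, cols = len(M), len(M[0])
    return [sum(M[i][j] * x[i] for i in range(rows)) for j in range(cols)]


def softmax(u):
    m = max(u)                            # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in u]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["king", "queen", "man", "woman", "crown"]  # hypothetical toy vocabulary
V, N = len(vocab), 3
rng = random.Random(0)
W = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]        # V x N
W_tilde = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]  # N x V


def forward(word):
    x = one_hot(vocab.index(word), V)
    h = matvec_T(W, x)          # h = W^T x: selects the input word's row of W
    u = matvec_T(W_tilde, h)    # u = W_tilde^T h: one score per vocabulary word
    return h, softmax(u)


def cross_entropy(y, target_word):
    return -math.log(y[vocab.index(target_word)])


h, y = forward("king")
print(h, y, cross_entropy(y, "queen"))
```

Because the input is one-hot, the hidden vector is simply the row of W belonging to the input word, which is why the rows of W can be read off as word embeddings after training.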

There are several other methods for word embeddings, each relying on the distributional hypothesis and each with various advantages and disadvantages. For example, word2vec and GloVe are simple to train and effective, but do not handle out-of-vocabulary (OOV) words. Some methods based on deep neural networks (DNNs), such as recurrent neural network (RNN) / long short-term memory (LSTM) networks (e.g. ELMo [43]) or transformer architectures (e.g. BERT [44]) utilize contextual information to generate embeddings for OOV words. In addition to individual words, embeddings can also be learned at the sentence level. The simplest forms of sentence embeddings involve unweighted averaging of LSA, word2vec, GloVe, or other embeddings. Weighted averages can also be computed, such as by using term frequency-inverse document frequency (tf-idf) generated weights or Smooth Inverse Frequency (SIF) [45]. Others have found success learning sentence representations directly, such as in sent2vec [46]. Whole sentence encoders, such as InferSent [47] and the Universal Sentence Encoder (USE) [48] offer the advantage of learning a full sentence encoding that considers word order within a sentence; e.g. the sentences “The man bites the dog” and “The dog bites the man” will each have different encodings though they contain the same words.
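The averaging approach to sentence embeddings, and its insensitivity to word order, can be illustrated with a short sketch. The three-dimensional vectors and the per-word weights below are made-up values (they stand in for pretrained embeddings and for tf-idf or SIF style weights, respectively), and `average_embedding` is a hypothetical helper name.

```python
def average_embedding(sentence, vectors, weights=None):
    """Average the word vectors of a sentence, optionally with per-word weights."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in sentence.lower().split():
        if word not in vectors:           # skip out-of-vocabulary words
            continue
        wt = 1.0 if weights is None else weights.get(word, 1.0)
        total = [t + wt * v for t, v in zip(total, vectors[word])]
        weight_sum += wt
    if weight_sum == 0.0:
        raise ValueError("no known words in sentence")
    return [t / weight_sum for t in total]


# Hypothetical toy vectors and weights, for illustration only.
toy_vectors = {
    "the": [0.1, 0.1, 0.1],
    "man": [0.9, 0.2, 0.1],
    "dog": [0.2, 0.9, 0.1],
    "bites": [0.1, 0.3, 0.8],
}
toy_weights = {"the": 0.1, "man": 1.0, "dog": 1.0, "bites": 1.0}

a = average_embedding("The man bites the dog", toy_vectors)
b = average_embedding("The dog bites the man", toy_vectors)
c = average_embedding("The man bites the dog", toy_vectors, toy_weights)
print(a, b, c)
```

The two sentences receive the same averaged embedding (up to floating-point rounding) even though their meanings differ, which is the limitation that whole-sentence encoders are meant to address. Down-weighting a frequent word such as "the" shifts the weighted average toward the content words.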

Once an embedding has been defined, a notion of semantic similarity or difference must also be defined. Several notions of distance can be computed for vectors in high-dimensional space, such as Manhattan distance ($\ell_1$ norm), Euclidean distance ($\ell_2$ norm), or many others. Empirically, the cosine similarity (cosine of the angle, $\theta$, between vectors) has been found to work well in defining semantic similarity between word and sentence vectors of many types. Cosine similarity can be computed using Equation (6) for vectors $\mathbf{w}_1$ and $\mathbf{w}_2$.

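$$\mathrm{CosSim}(\mathbf{w}_1, \mathbf{w}_2) = \cos\theta = \frac{\mathbf{w}_1^T \mathbf{w}_2}{\|\mathbf{w}_1\|_2 \, \|\mathbf{w}_2\|_2} \tag{6}$$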

For example, a cosine similarity of 1 indicates that the angle between the vectors is 0°, i.e. that the two vectors point in the same direction.
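A direct transcription of Equation (6) is shown below as a minimal sketch. The function name and the decision to raise an error for zero-length vectors are choices made for this example.

```python
import math


def cosine_similarity(w1, w2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Since the measure depends only on the angle, rescaling either vector leaves the similarity unchanged: the first pair differs in length but not in direction.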

In addition to word and sentence embedding semantic similarity measures, techniques such as topic modeling and semantic role labeling have also recently gained popularity in NLP and its applications to clinically relevant language samples. Latent Dirichlet allocation (LDA) is one such statistical topic modeling method which can be used to identify overarching themes in samples of text [49]. Another option that can be utilized is semantic role labeling, a probabilistic technique which automatically attempts to identify the semantic role a particular entity is playing in a sentence [50].

II-D2 Clinical Application

Many forms of mental illness result in a condition known as formal thought disorder (FTD), which impairs an individual’s ability to produce semantically coherent language. FTD is most commonly associated with schizophrenia but is often present in other forms of mental illness such as mania, depression, and several others [51, 52]. Some common symptoms include poverty of speech (alogia), derailment of speech, and semantically incoherent speech (word salad) [52, 53]. Language metrics that track semantic coherence are potentially useful in clinical applications, such as measuring the coherence of language as it relates to FTD in schizophrenia. One of the first studies to demonstrate this was conducted by Elvevåg et al. [15]. The language of patients with varying degrees of FTD (rated by standard clinical scales) was compared with that of a group of healthy control subjects. The experimental tasks consisted of single word associations, verbal fluency (naming as many words as possible within a specific category), long interview responses (~1-2 minutes per response), and storytelling. LSA was utilized to embed the word tokens in the transcripts. The semantic coherence for each of the tasks was computed as follows:

  • Word Associations: Cosine similarity between cue word and response word, with an average coherence score for each subject

  • Verbal Fluency: Cosine similarity was computed between the first and second word, the second and third word, etc., with an average coherence score computed per subject

  • Interviews: Cosine similarity was computed between the question and subject responses. An average word vector was computed for the prompt question from the interviewer. Then a moving window (of size 2-6 words) over the subject response was used to average all the word vectors within the window and compute a cosine similarity between the question and response. The window was moved over the entire subject response and a new cosine similarity was computed between the question and each response window until the window reached the end of the response. This method tracks how the cosine similarity behaves as the subject response moves farther from the question, with the expectation that the response becomes more tangential (less coherent) over time. A regression line was fit for each subject to measure the change in cosine similarity coherence over time, and the slope of the line was computed to measure the tangentiality of the response per subject.

  • Storytelling: Cosine similarity was computed between the subject’s response and the centroid of the responses for all narrative utterances of the same story. This was used to predict the clinical rating for thought disordered language samples when subjects were asked to tell the same story.

They demonstrated that the healthy control subjects had higher coherence scores when compared to the FTD groups across all tasks.
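The interview (tangentiality) measure described above can be sketched as follows. This is a simplified illustration, not the procedure of [15] in detail: the two-dimensional vectors stand in for LSA embeddings, the window size of 4 is one arbitrary value from the 2-6 range, and the helper names are hypothetical.

```python
import math


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def window_coherence(question_vectors, response_vectors, window=4):
    """Cosine similarity between the question and each sliding response window."""
    question = mean_vector(question_vectors)
    last_start = len(response_vectors) - window
    return [
        cosine(question, mean_vector(response_vectors[i:i + window]))
        for i in range(last_start + 1)
    ]


def slope(values):
    """Least-squares slope of values against their position 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two windows to fit a line")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    return sxy / sxx


# Hypothetical 2-D "embeddings": the response drifts away from the question topic.
question = [[1.0, 0.0], [0.9, 0.1]]
drifting = [[math.cos(t), math.sin(t)] for t in (i * math.pi / 18 for i in range(10))]
on_topic = [[1.0, 0.1]] * 10

print(slope(window_coherence(question, drifting)))
print(slope(window_coherence(question, on_topic)))
```

A clearly negative slope indicates a response that moves away from the question over time, whereas a slope near zero indicates a response that stays on topic.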

In a more recent study, predictive features of language for the onset of psychosis were studied by Bedi et al. Open-ended narrative-like interview transcripts of young individuals who were determined to be at clinical high-risk (CHR) for psychosis were collected and analyzed to predict which individuals would eventually develop psychosis [10]. Subjects were tracked and interviewed over a period of two and a half years. In this study, LSA was again used to generate word embeddings. An average vector for each phrase was then computed, and a cosine-similarity measure was computed to measure the semantic coherence between consecutive phrases (first-order coherence) and every other phrase (second-order coherence).
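A minimal sketch of first- and second-order coherence is given below. It assumes that second-order coherence compares phrases separated by one intervening phrase, uses made-up two-dimensional word vectors in place of LSA embeddings, and relies on hypothetical helper names.

```python
import math


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def coherence(phrase_vectors, order=1):
    """Cosine similarity between each phrase and the phrase `order` steps later."""
    return [
        cosine(phrase_vectors[i], phrase_vectors[i + order])
        for i in range(len(phrase_vectors) - order)
    ]


# Each phrase is a list of (hypothetical) word vectors; the third phrase changes topic.
phrases = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.8, 0.2]],
    [[0.0, 1.0]],
    [[0.8, 0.2]],
]
phrase_vectors = [mean_vector(p) for p in phrases]

first_order = coherence(phrase_vectors, order=1)
second_order = coherence(phrase_vectors, order=2)
print(first_order, second_order)
```

In this toy example the lowest first-order score occurs at the transition into the off-topic third phrase, which is the kind of discontinuity that a minimum-coherence statistic is designed to pick up.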

A distribution of the first and second-order coherence scores (cosine similarities) was compiled for each subject, and several statistics were computed based on the distribution of coherence scores, e.g. maximum, minimum, standard deviation, mean, median, 10th percentile, and 90th percentile. Each of these statistics was considered as a separating feature between the clinical and control samples. In addition to the semantic analysis, POS tagging was performed to compute the frequency of use of each part-of-speech to obtain information about the structure of the subjects’ naturally-produced language. The language features with the best predictive power in the classifier were the minimum coherence between consecutive phrases for each subject (maximum discontinuity) and the frequency of use of determiners (normalized by sentence length). This initial study only had 34 subjects total (only 5 CHR+ subjects) and was intended as a proof-of-principle exploration. In an expansion of this work, Corcoran et al. trained their classifier using two larger datasets, in which one group of subjects was questioned with a prompt-based protocol and another group of subjects was given a narrative protocol in which they were required to provide longer answers (similar to the previous work) [11]. They note that the first and second-order coherence metrics collected in the previous study were useful for determining semantic coherence with the narrative-style interview transcripts with longer responses. However, for the shorter prompt-based responses (often under 20 words), it is often difficult to obtain these metrics. Therefore, coherence was computed on the word level rather than the phrase level by computing the cosine similarity between word embeddings within a response with an inter-word distance of k, with k ranging from 5 to 8. As before, typical statistics were computed on the coherence values obtained for each subject response (maximum, minimum, mean, median, 90th percentile, 10th percentile, etc.). They were able to successfully predict the onset of schizophrenia by discriminating the speech of healthy controls and those with early onset schizophrenia with about 80% accuracy.
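The word-level variant and the summary statistics can be sketched as follows. The random four-dimensional vectors are placeholders for real word embeddings, the 20-word response length is arbitrary, and the helper names and the use of the "inclusive" percentile method are assumptions made for this example.

```python
import math
import random
import statistics


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def coherence_at_distance(word_vectors, k):
    """Cosine similarity between every pair of words that are k positions apart."""
    return [
        cosine(word_vectors[i], word_vectors[i + k])
        for i in range(len(word_vectors) - k)
    ]


def summarize(scores):
    """Summary statistics of one distribution of coherence scores."""
    deciles = statistics.quantiles(scores, n=10, method="inclusive")
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "std": statistics.stdev(scores),
        "p10": deciles[0],    # 10th percentile
        "p90": deciles[-1],   # 90th percentile
    }


# Placeholder embeddings for a 20-word response (random, for illustration only).
rng = random.Random(0)
response = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(20)]

features = {k: summarize(coherence_at_distance(response, k)) for k in range(5, 9)}
print(features[5])
```

Each value of k yields its own distribution of scores, so the statistics for k = 5 through 8 can be concatenated into one feature vector per response.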

Other studies make use of a variety of linguistic features to predict the presence of clinical conditions. For example, Kayi et al. identified predictive linguistic features of schizophrenia by analyzing laboratory writing samples of patients and controls for their semantic, syntactic, and pragmatic (sentiment) content [8]. A second dataset of social media messages from self-reporting individuals with schizophrenia, collected through the Twitter API, was also evaluated for the same types of content. The semantic content of the language was quantified by three methods. First, semantic role labeling was performed using the Semafor tool [50] to identify the role of individual words within a sentence or phrase. Then, LDA was used to identify overarching themes that separated the clinical and control writing samples [49]. LDA identifies topics in the text and also identifies the top vocabulary used in each topic. Finally, clusters of word embeddings within the writing were generated using the k-means algorithm and GloVe word vector embeddings [41]. The frequency of each cluster was computed per document by checking the use of each word of the document in each cluster. The syntactic features used in this study again were obtained by computing the frequency of use of parts of speech (found by POS tagging) and by generating parse trees, using different tools optimized for the lab writing samples and the social media data. Lastly, pragmatic features were found by performing sentiment analysis to classify the sentiment of the writing samples into distinct groups (very negative, negative, neutral, positive, very positive). Again, different tools that were optimized for the different data sets were used for sentiment analysis. They successfully showed a distinct set of predictive features that could accurately separate subjects with schizophrenia from healthy controls in all the language analysis categories. However, when using a combination of features and various machine learning classifiers (random forest and support vector machine), they found that utilizing a combination of the semantic and pragmatic features led to the most promising accuracy (81.7%) in classification of control subjects and those with schizophrenia.

II-D3 Advantages and Disadvantages

While these studies have been successful in measuring the semantic coherence of language as it relates to mental illness, there are still some limitations. Recent work by Iter et al. identifies and attempts to address some of these shortcomings when measuring semantic coherence for FTD in schizophrenia [16]. Interviews with a small sample of patients were collected and just the subject responses (of ~300 words each) were analyzed for their semantic content. They noted that when using the tangentiality model of semantic coherence (i.e. regression of the coherence over time with the sliding window) of Elvevåg et al. [15] and the incoherence model of semantic coherence of Bedi et al. [10], they were unable to convincingly separate their clinical and control subjects based on language analysis. One reason for this was the presence of verbal fillers, such as "um" or "uh", and many stop words without meaningful semantic content. Another reason is that longer sentences (or long moving windows) tend to be scored as more coherent due to a larger overlap of words. The third reason they identified (but did not address) is that repetitive sentences and phrases would be scored as highly coherent, even though repetition of ideas is common in FTD and should be scored negatively. The authors proposed a series of improvements to address some of these limitations; however, the sample sizes in this study were small (9 clinical subjects and 5 control subjects), as the authors note.

Another issue with semantic coherence computation in clinical practice is difficulty with interpretability of computed metrics. Recent work [5] attempted to address this issue by computing semantic coherence measures (using word2vec, InferSent, and SIF embeddings), lexical density and diversity measures, and syntactic complexity measures as they relate to the language of patients with schizophrenia, patients with bipolar disorder, and healthy controls undergoing a validated clinical social skills assessment [3]. Linear regression was used to determine a subset of language features across all categories that could effectively model the scores assigned by clinicians during the social skills performance assessment, in which participants were required to act out various role-playing conversational scenes with clinical assessors and were scored for cognitive performance. Then, these features were used to train simple binary classifiers (both naïve Bayes and logistic regression), for which leave-one-out cross-validation was used to determine their effectiveness at classifying groups of interest. For classifying clinical (patients with schizophrenia and bipolar I disorder) subjects and healthy control subjects, the selected feature subset achieved receiver operating characteristic (ROC) area under curve (AUC) performance of $\mathrm{AUC} \approx 0.90$; for classifying within the clinical group (to separate subjects with schizophrenia and bipolar disorder), the classifier performance achieved $\mathrm{AUC} \approx 0.80$.

III Measuring Cognitive Function with Speech Signal Processing

While cognitive health is highly correlated with complex language production, additional information can be derived by acoustical speech signal analysis of individuals with cognitive impairments. Typically, the information derived from the speech signal is used in conjunction with many of the previously described methods to assess cognitive health. Impairment in cognition and thought disorders lead to detectable irregularities in speech production, such as in prosody (intonation, rhythm, etc.). In this section, we will see how audio signal processing of an individual’s speech samples lends additional insight into detection of neurodegenerative diseases and mental health disorders that affect cognition.

III-A Methods

Speech signal features that are indicative of cognitive function largely consist of temporal (time-domain) measures and spectral or time-frequency analysis. The simplest techniques for cognitive assessment involve computing temporal features directly from the recorded speech signals. Among these are duration of voiced segments, duration of silent segments, measures of periodicity, phonation rate, and many other similar features [4, 17]. These measures can indicate irregularities in the rhythm and timing of speech. Additionally, nonverbal speech cues (e.g. the number of interruptions, interjections, and natural turns, as well as response times) can also reveal irregular speech patterns [18].
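As a simple illustration of such temporal measures, the sketch below groups frames into voiced and silent segments and derives a few timing features. Several simplifications are assumed: voicing is decided by a fixed energy threshold rather than a proper voice activity detector, the frame energies are made-up numbers, and phonation rate is taken to be the fraction of total time that is voiced, which may differ from the definitions used in [4, 17].

```python
from itertools import groupby


def segment_durations(frame_energy, threshold, frame_ms):
    """Group consecutive frames into voiced/silent runs; return durations in ms."""
    voiced, silent = [], []
    for is_voiced, run in groupby(frame_energy, key=lambda e: e >= threshold):
        duration = len(list(run)) * frame_ms
        (voiced if is_voiced else silent).append(duration)
    return voiced, silent


def timing_features(voiced, silent):
    total_voiced, total_silent = sum(voiced), sum(silent)
    features = {
        "total_voiced_ms": total_voiced,
        "total_silent_ms": total_silent,
        "pause_count": len(silent),
        # Assumed definition: share of the recording that is voiced.
        "phonation_rate": total_voiced / (total_voiced + total_silent),
    }
    if voiced and silent:
        mean_voiced = total_voiced / len(voiced)
        mean_silent = total_silent / len(silent)
        features["mean_voiced_to_silent_ratio"] = mean_voiced / mean_silent
    return features


# Hypothetical per-frame energies for ten 10 ms frames.
energy = [0.0, 0.0, 0.8, 0.9, 0.7, 0.1, 0.0, 0.6, 0.8, 0.0]
voiced, silent = segment_durations(energy, threshold=0.5, frame_ms=10)
features = timing_features(voiced, silent)
print(voiced, silent, features)
```

Durations are kept in integer milliseconds so that segment lengths stay exact; on real recordings the threshold and frame length would need to be tuned to the recording conditions.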

Spectral analysis of speech offers additional insight for cognitive impairment detection from a time-frequency perspective. Computation of the Mel-frequency cepstral coefficients (MFCCs) provides a compressed estimate of the spectral envelope of a speech signal’s spectrogram representation [54]. These features are often used as inputs into an automatic speech recognition (ASR) system, but can also be monitored over time to identify irregularities in speech due to cognitive impairments. As an example, the mean, variance, skewness, and kurtosis of the MFCCs over time can be tracked to identify differences between healthy individuals and those with some cognitive impairment [6].
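The per-coefficient statistics mentioned above can be computed as in the following sketch. It assumes the MFCC frames have already been extracted by some external toolkit, uses toy numbers rather than real MFCCs, and adopts the population-moment convention (skewness as m3 / m2^1.5, non-excess kurtosis as m4 / m2^2); other conventions exist.

```python
def moment_stats(values):
    """Mean, variance, skewness, and kurtosis of one coefficient track."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0.0:  # constant track: skewness and kurtosis are undefined
        skewness = kurtosis = float("nan")
    else:
        skewness = m3 / m2 ** 1.5
        kurtosis = m4 / m2 ** 2
    return {"mean": mean, "variance": m2, "skewness": skewness, "kurtosis": kurtosis}


def mfcc_summary(frames):
    """frames: list of per-frame coefficient lists -> one stats dict per coefficient."""
    return [moment_stats(list(track)) for track in zip(*frames)]


# Toy "MFCC" frames (5 frames x 2 coefficients); values are illustrative only.
frames = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 10.0]]
stats = mfcc_summary(frames)
print(stats)
```

The first coefficient varies symmetrically around its mean and therefore has zero skewness, while the single large value in the second coefficient produces a positive skew.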

III-B Clinical Application

Conditions such as MCI and AD are associated with a general slowing of thoughts in affected individuals; researchers have discovered that this likely has detectable effects on speech production. For example, König et al. show that MCI and AD can affect several acoustic characteristics of speech production [17]. Subjects were recorded as they were asked to perform various tasks, such as counting backwards, image description, sentence repeating, and verbal fluency testing. The durations of voiced segments, silent segments, periodic segments, and aperiodic segments were all computed. Then, features such as the ratio of the mean duration of voiced segments to the mean duration of silent segments were computed to express the continuity of speech. As expected, healthy control subjects showed greater continuity in these metrics compared with those with MCI or AD. These quantifiable alterations of speech in individuals with MCI and AD allowed the researchers to successfully separate patients with AD from healthy controls (approx. 87% accuracy), patients with MCI from controls (approx. 80% accuracy), and patients with MCI from patients with AD (approx. 80% accuracy).

Auditory speech analysis can also be successful in classifying patients with mental illness that affects cognition, as seen in work by Tahir et al.  [18]. In this study, patients with severe schizophrenia, receiving Cognitive Remediation Therapy (CRT), were differentiated from control subjects with less severe schizophrenia (no CRT recommended) by non-verbal speech analysis. They note that nonverbal cues in speech often play a crucial role in communication, and that it is expected that individuals with schizophrenia would have a muted display of these features of speech. Conversational cues, such as interruptions, interjections, natural turns, response time, etc. were used as features in the classification. Preliminary results from this study indicate that these nonverbal cues show approximately 90% accuracy in classifying control subjects from those with more severe forms of schizophrenia.

Hybrid approaches that utilize both speech (audio) and language (textual) data to study neurodegenerative disease and mental illness have also been explored with promising results. As an example, the previously mentioned work by Roark et al. (in Section II) also made use of acoustic speech samples to aid in the detection of MCI from naturally-produced spoken language. The researchers used manual and automated methods to estimate features related to the duration of speech during each utterance, including the quantity and duration of pause segments. Some of the features that were computed include phonation rate, total phonation time, total pause time, pauses per sample, total locution time (both phonation and pauses), verbal rate, and several others [4]. They conclude that automated speech analysis produces very similar results to manually computing these metrics from the speech samples, demonstrating the potential of automated speech signal processing for detecting MCI. Additionally, they found that a combination of linguistic complexity metrics and speech duration metrics leads to improved classification results. In a later study, Fraser et al. use another hybrid approach with speech and language metrics to show good classification separating patients with Alzheimer’s disease from healthy controls [6]. The data for this analysis were drawn from the DementiaBank corpus (https://dementia.talkbank.org/access). Over 370 distinctive features were considered in this study. The linguistic features include grammatical features (from part-of-speech tagging), syntactic complexity (e.g. mean length of sentences, T-units, clauses, and maximum Yngve depth scoring for the parse tree, as described above), information content (specific and nonspecific word use), repetitiveness of meaningful words, and many more. Acoustic features associated with pathological speech were also identified by computation of the first 42 MFCCs. To differentiate the clinical and control group, they considered mean, variance, skewness, and kurtosis of the MFCCs over time. After collecting these features and performing factor analysis, they show that the majority of variance between the control subjects and those with AD could be explained by semantic impairment, acoustic abnormalities, syntactic impairment, and information impairment. Auditory speech characteristics and linguistic characteristics provide separate but complementary metrics about the progression and severity of the disease, leading to a large feature set from which classification results can be improved.

III-C Advantages & Disadvantages

When assessment of cognitive function is the end goal, auditory speech signals alone provide useful, but limited, information for early detection of neurodegenerative disease and mental illness. As we noted previously, the neuropsychological assessment focuses more on language construction (in terms of lexical diversity, lexical density, semantic coherence, language complexity, etc.). However, speech does still provide additional insight that can be used in conjunction with this other information to strengthen classifier performance for detection of cognitive-linguistic decline.

In addition to strengthening classification performance for computational models, speech signal analysis offers the advantage of being easily interpretable in a clinical setting. Objective measures of the periodicity and rhythm of speech, for example, are easy to understand and simple to compute, providing clinicians with useful metrics on which to base their decisions.

IV Concluding Remarks and Future Work

A review of the existing literature reveals a set of future research directions to help advance the state of the art in this area. In this section, we provide an overview of these directions and highlight some of the important open questions in this space.

IV-A Characterizing Inter and Intra-Speaker Variability in Healthy Populations

There is a great deal of variability to be expected in speech and language data. Extensive work on the language variables influencing inter- and intra-speaker variation suggests that any level of language (i.e., phonology, phonetics, semantics, syntax, morphology) is subject to both conscious/explicit and completely unconscious/subtle variation within a speaker. These conscious and unconscious sources of variability are conditioned by pragmatics, style-shifting, or register-shifting [55, 56]. Similarly, speech acoustics are impacted by speaker identity, context, background noise, spoken language, etc. [57]. These various sources of variability have yet to be fully characterized quantitatively. A more complete understanding of this variability in healthy populations would help to interpret changes observed in clinical populations. For example, this knowledge can help determine how typical or atypical a particular semantic coherence score is (e.g., in which percentile of the healthy distribution does the score fall?). Furthermore, this understanding can inform stratified sampling schemes that allow experimenters to match healthy and clinical cohorts on relevant aspects of speech/language production.
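The percentile question raised above can be made concrete with a short sketch. It assumes that a set of scores from a healthy reference cohort is available as a plain list; the function name and the mid-rank convention for ties are illustrative choices, not an established clinical procedure.

```python
import numpy as np

def percentile_rank(healthy_scores, score):
    """Percentile of `score` within a healthy reference sample.

    Uses the mid-rank convention: scores tied with `score` count as
    half below and half above.
    """
    healthy = np.asarray(healthy_scores, dtype=float)
    if healthy.size == 0:
        raise ValueError("healthy_scores must not be empty")
    below = np.sum(healthy < score)
    tied = np.sum(healthy == score)
    return 100.0 * (below + 0.5 * tied) / healthy.size
```

In practice, the usefulness of such a percentile depends on how well the reference sample captures the inter- and intra-speaker variability described in this section; a small or unrepresentative cohort would yield misleading ranks.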

IV-B Joint Optimization of Speech Elicitation and Speech & Language Analytics

Algorithms published in the literature typically make use of previously collected speech and language samples. These samples are often collected for other reasons and are only used by algorithm designers because they are available. As a result, published results are potentially biased because these data sets are usually small and collected on a limited set of elicitation tasks. Deeper collaborations between speech neuroscientists, neuropsychologists, and speech technologists are required to push the state of the art forward. There is an extensive literature on how to efficiently and reliably elicit speech to tax various aspects of cognitive-linguistics [58]. The algorithms for extracting clinically relevant information from speech and audio have been developed independently from this work. We posit that joint exploration of the elicitation-analytics space has the potential to result in improved sensitivity in detecting cognitive-linguistic change.

IV-C Robustness to Noisy Data

The sensitivity of the features we describe herein, and of the follow-on models they drive, is not well understood under noisy conditions. Our definition of noise is rather loose here. For example, noise may arise from imperfect transcripts produced by an automatic speech recognition (ASR) engine, background noise that corrupts the acoustics, or feature distribution mismatch between training and test data in supervised settings. Nuisance parameters that are unimportant for clinical applications (e.g., idiosyncratic features related to different speakers) are especially problematic in acoustic analysis [57]. A better characterization of the sensitivity of the features to these nuisance parameters can inform the development of new representations that are robust to various sources of noise. Such representations can improve the generalization ability of the algorithms and can help us understand the fundamental limits of speech as a diagnostic.
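One simple way to probe this kind of sensitivity is to corrupt a clean transcript at a controlled error rate and measure how far a feature moves. The sketch below does this for the type-token ratio, using random word substitutions as a crude stand-in for ASR errors. The substitution model, the function names, and the confusion vocabulary are all assumptions made for illustration; real ASR errors are not uniformly random and also include insertions and deletions.

```python
import random

def type_token_ratio(tokens):
    """Number of unique words divided by total number of words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def corrupt_transcript(tokens, error_rate, vocabulary, rng):
    """Replace each token with a random vocabulary word with
    probability `error_rate` (a simplified substitution-only model)."""
    return [rng.choice(vocabulary) if rng.random() < error_rate else tok
            for tok in tokens]

def feature_sensitivity(tokens, error_rates, vocabulary,
                        n_trials=100, seed=0):
    """Mean absolute change in type-token ratio at each error rate."""
    rng = random.Random(seed)  # fixed seed for repeatable simulations
    clean_value = type_token_ratio(tokens)
    sensitivity = {}
    for rate in error_rates:
        deviations = []
        for _ in range(n_trials):
            noisy = corrupt_transcript(tokens, rate, vocabulary, rng)
            deviations.append(abs(type_token_ratio(noisy) - clean_value))
        sensitivity[rate] = sum(deviations) / n_trials
    return sensitivity
```

The same loop could wrap any other transcript-based feature in place of `type_token_ratio`, which would allow features to be compared by how quickly they degrade as the simulated error rate increases.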

IV-D Data-Driven and Interpretable Features

The features described herein are readily interpretable and it is reasonable to posit that they have clinical utility. However, as clinical speech data becomes available on a large scale, we expect that data-driven artificial intelligence (AI) systems will replace some of the domain-expert features described herein. For example, it is reasonable to expect that features that are optimized for a specific application (e.g., diagnosis of schizophrenia) would outperform the general-purpose features described here. This improved performance likely comes at the expense of reduced feature interpretability. An area ripe for further exploration in clinical-speech analytics, and clinical analytics in general, is the development of AI models that provide interpretable outputs when interrogated, such as in [59]. This area has received some attention recently and will continue to become more important as AI systems are deployed in healthcare.

References

  • [1] “NIMH » Mental Illness,” https://www.nimh.nih.gov/health/statistics/mental-illness.shtml.
  • [2] G. A. Cecchi, V. Gurev, S. J. Heisig, R. Norel, I. Rish, and S. R. Schrecke, “Computing the structure of language for neuropsychiatric evaluation,” IBM Journal of Research and Development, vol. 61, no. 2/3, pp. 1:1–1:10, Mar. 2017.
  • [3] T. L. Patterson, S. Moscona, C. L. McKibbin, K. Davidson, and D. V. Jeste, “Social skills performance assessment among older patients with schizophrenia,” Schizophrenia Research, vol. 48, no. 2-3, pp. 351–360, Mar. 2001.
  • [4] B. Roark, M. Mitchell, J.-P. Hosom, K. Hollingshead, and J. Kaye, “Spoken Language Derived Measures for Detecting Mild Cognitive Impairment,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2081–2090, Sep. 2011.
  • [5] R. Voleti, S. Woolridge, J. M. Liss, M. Milanovic, C. R. Bowie, and V. Berisha, “Objective Assessment of Social Skills Using Automated Language Analysis for Identification of Schizophrenia and Bipolar Disorder,” arXiv preprint arXiv:1904.10622, Apr. 2019.
  • [6] K. C. Fraser, J. A. Meltzer, and F. Rudzicz, “Linguistic Features Identify Alzheimer’s Disease in Narrative Speech,” Journal of Alzheimer’s Disease, vol. 49, no. 2, pp. 407–422, Oct. 2015.
  • [7] V. Berisha, S. Wang, A. LaCross, and J. Liss, “Tracking Discourse Complexity Preceding Alzheimer’s Disease Diagnosis: A Case Study Comparing the Press Conferences of Presidents Ronald Reagan and George Herbert Walker Bush,” Journal of Alzheimer’s Disease, vol. 45, no. 3, pp. 959–963, Mar. 2015.
  • [8] E. S. Kayi, M. Diab, L. Pauselli, M. Compton, and G. Coppersmith, “Predictive Linguistic Features of Schizophrenia,” in Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), 2017, pp. 241–250.
  • [9] V. Berisha, S. Wang, A. LaCross, J. Liss, and P. Garcia-Filion, “Longitudinal changes in linguistic complexity among professional football players,” Brain and Language, vol. 169, pp. 57–63, Jun. 2017.
  • [10] G. Bedi, F. Carrillo, G. A. Cecchi, D. F. Slezak, M. Sigman, N. B. Mota, S. Ribeiro, D. C. Javitt, M. Copelli, and C. M. Corcoran, “Automated analysis of free speech predicts psychosis onset in high-risk youths,” npj Schizophrenia, vol. 1, p. 15030, 2015.
  • [11] C. M. Corcoran, F. Carrillo, D. Fernández-Slezak, G. Bedi, C. Klim, D. C. Javitt, C. E. Bearden, and G. A. Cecchi, “Prediction of psychosis across protocols and risk cohorts using automated language analysis,” World Psychiatry, vol. 17, no. 1, pp. 67–75, Feb. 2018.
  • [12] N. B. Mota, N. A. P. Vasconcelos, N. Lemos, A. C. Pieretti, O. Kinouchi, G. A. Cecchi, M. Copelli, and S. Ribeiro, “Speech Graphs Provide a Quantitative Measure of Thought Disorder in Psychosis,” PLoS ONE, vol. 7, no. 4, p. e34928, Apr. 2012.
  • [13] N. B. Mota, R. Furtado, P. P. C. Maia, M. Copelli, and S. Ribeiro, “Graph analysis of dream reports is especially informative about psychosis,” Scientific Reports, vol. 4, no. 1, Jan. 2014.
  • [14] F. Carrillo, N. Mota, M. Copelli, S. Ribeiro, M. Sigman, G. Cecchi, and D. Fernandez Slezak, “Automated Speech Analysis for Psychosis Evaluation,” in Machine Learning and Interpretation in Neuroimaging, I. Rish, G. Langs, L. Wehbe, G. Cecchi, K.-m. K. Chang, and B. Murphy, Eds.   Cham: Springer International Publishing, 2016, vol. 9444, pp. 31–39.
  • [15] B. Elvevåg, P. W. Foltz, D. R. Weinberger, and T. E. Goldberg, “Quantifying incoherence in speech: An automated methodology and novel application to schizophrenia,” Schizophrenia Research, vol. 93, no. 1-3, pp. 304–316, Jul. 2007.
  • [16] D. Iter, J. Yoon, and D. Jurafsky, “Automatic Detection of Incoherent Speech for Diagnosing Schizophrenia,” in Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, 2018, pp. 136–146.
  • [17] A. König, A. Satt, A. Sorin, R. Hoory, O. Toledo-Ronen, A. Derreumaux, V. Manera, F. Verhey, P. Aalten, P. H. Robert, and R. David, “Automatic speech analysis for the assessment of patients with predementia and Alzheimer’s disease,” Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring, vol. 1, no. 1, pp. 112–124, Mar. 2015.
  • [18] Y. Tahir, D. Chakraborty, J. Dauwels, N. Thalmann, D. Thalmann, and J. Lee, “Non-verbal speech analysis of interviews with schizophrenic patients,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference On.   IEEE, 2016, pp. 5810–5814.
  • [19] D. A. Snowdon, S. J. Kemper, J. A. Mortimer, L. H. Greiner, D. R. Wekstein, and W. R. Markesbery, “Linguistic Ability in Early Life and Cognitive Function and Alzheimer’s Disease in Late Life: Findings From the Nun Study,” JAMA, vol. 275, no. 7, pp. 528–532, Feb. 1996.
  • [20] M. A. Covington and J. D. McFall, “Cutting the Gordian Knot: The Moving-Average Type–Token Ratio (MATTR),” Journal of Quantitative Linguistics, vol. 17, no. 2, pp. 94–100, May 2010.
  • [21] E. Brunét, Le Vocabulaire de Jean Giraudoux. Structure et Évolution.   Slatkine, 1978, no. 1.
  • [22] A. Honoré, “Some Simple Measures of Richness of Vocabulary,” Association for Literary and Linguistic Computing Bulletin, vol. 7, no. 2, pp. 172–177, 1979.
  • [23] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, “Building a Large Annotated Corpus of English: The Penn Treebank,” Defense Technical Information Center, Fort Belvoir, VA, Tech. Rep., Apr. 1993.
  • [24] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (DRAFT), 3rd ed., Aug. 2017.
  • [25] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer, “Feature-rich part-of-speech tagging with a cyclic dependency network,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - NAACL ’03, vol. 1.   Edmonton, Canada: Association for Computational Linguistics, 2003, pp. 173–180.
  • [26] D. Wechsler, “Wechsler Memory Scale–Third Edition Manual,” San Antonio, TX: The Psychological Corp., 1997.
  • [27] R. S. Bucks, S. Singh, J. M. Cuerden, and G. K. Wilcock, “Analysis of spontaneous, conversational speech in dementia of Alzheimer type: Evaluation of an objective technique for analysing lexical performance,” Aphasiology, vol. 14, no. 1, pp. 71–91, Jan. 2000.
  • [28] S. Kemper, “Adults’ diaries: Changes made to written narratives across the life span,” Discourse Processes, vol. 13, no. 2, pp. 207–223, Apr. 1990.
  • [29] S. Kemper and A. Sumner, “The structure of verbal abilities in young and older adults.” Psychology and Aging, vol. 16, no. 2, pp. 312–322, 2001.
  • [30] N. E. Carlozzi, N. L. Kirsch, P. A. Kisala, and D. S. Tulsky, “An Examination of the Wechsler Adult Intelligence Scales, Fourth Edition (WAIS-IV) in Individuals with Complicated Mild, Moderate and Severe Traumatic Brain Injury (TBI),” The Clinical Neuropsychologist, vol. 29, no. 1, pp. 21–37, Jan. 2015.
  • [31] V. H. Yngve, “A Model and an Hypothesis for Language Structure,” Proceedings of the American Philosophical Society, vol. 104, no. 5, 1960.
  • [32] L. Frazier, “Syntactic Complexity,” in Natural Language Parsing.   Cambridge, U.K.: Cambridge University Press, 1985.
  • [33] T. Berg, Structure in Language: A Dynamic Perspective, 1st ed., ser. Routledge Studies in Linguistics.   New York, NY: Routledge, 2009, no. 10.
  • [34] D. M. Magerman, “Statistical decision-tree models for parsing,” in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics.   Cambridge, Massachusetts: Association for Computational Linguistics, 1995, pp. 276–283.
  • [35] D. Lin, “On the structural complexity of natural language sentences,” in Proceedings of the 16th Conference on Computational Linguistics, vol. 2.   Copenhagen, Denmark: Association for Computational Linguistics, 1996, p. 729.
  • [36] E. Gibson, “Linguistic complexity: Locality of syntactic dependencies,” Cognition, vol. 68, no. 1, pp. 1–76, Aug. 1998.
  • [37] E. Charniak, “A Maximum-entropy-inspired Parser,” in Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, ser. NAACL 2000.   Association for Computational Linguistics, 2000, pp. 132–139.
  • [38] E. Haugen and J. R. Firth, “Papers in linguistics 1934-1951,” Language, vol. 34, no. 4, p. 498, Oct. 1958.
  • [39] T. K. Landauer and S. T. Dumais, “A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge.” Psychological Review, vol. 104, no. 2, pp. 211–240, 1997.
  • [40] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
  • [41] J. Pennington, R. Socher, and C. Manning, “GloVe: Global Vectors for Word Representation.”   Association for Computational Linguistics, 2014, pp. 1532–1543.
  • [42] E. Altszyler, M. Sigman, S. Ribeiro, and D. F. Slezak, “Comparative study of LSA vs Word2vec embeddings in small corpora: A case study in dreams database,” arXiv preprint arXiv:1610.01520, 2016.
  • [43] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” arXiv preprint arXiv:1802.05365, 2018.
  • [44] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 [cs], Oct. 2018.
  • [45] S. Arora, Y. Liang, and T. Ma, “A Simple but Tough-to-Beat Baseline for Sentence Embeddings,” in Proceedings of 5th International Conference on Learning Representations, Toulon, France, 2017, p. 16.
  • [46] M. Pagliardini, P. Gupta, and M. Jaggi, “Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features,” arXiv:1703.02507 [cs], Mar. 2017.
  • [47] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised Learning of Universal Sentence Representations from Natural Language Inference Data,” arXiv:1705.02364 [cs], May 2017.
  • [48] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar et al., “Universal Sentence Encoder,” arXiv preprint arXiv:1803.11175, 2018.
  • [49] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
  • [50] D. Das, N. Schneider, D. Chen, and N. A. Smith, “Probabilistic Frame-Semantic Parsing,” p. 9.
  • [51] A. M. Colman, A Dictionary of Psychology.   Oxford University Press, 2015.
  • [52] S. C. Yudofsky, R. E. Hales, and A. P. Publishing, Eds., The American Psychiatric Publishing Textbook of Neuropsychiatry and Clinical Neurosciences, 4th ed.   Washington, DC: American Psychiatric Pub, 2002.
  • [53] S. L. Videbeck, Psychiatric-Mental Health Nursing, 2014.
  • [54] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, Aug. 1980.
  • [55] N. Coupland, Style: Language Variation and Identity, ser. Key Topics in Sociolinguistics.   Cambridge University Press, 2007.
  • [56] N. Schilling, “Investigating stylistic variation,” The handbook of language variation and change, pp. 325–349, 2013.
  • [57] M. Benzeghiba, R. De Mori, O. Deroo, S. Dupont, T. Erbes, D. Jouvet, L. Fissore, P. Laface, A. Mertins, C. Ris, R. Rose, V. Tyagi, and C. Wellekens, “Automatic speech recognition and speech variability: A review,” Speech Communication, vol. 49, no. 10-11, pp. 763–786, Oct. 2007.
  • [58] K. D. Mueller, R. L. Koscik, L. R. Clark, B. P. Hermann, S. C. Johnson, and L. S. Turkstra, “The Latent Structure and Test–Retest Stability of Connected Language Measures in the Wisconsin Registry for Alzheimer’s Prevention (WRAP),” Archives of Clinical Neuropsychology, vol. 33, no. 8, pp. 993–1005, Dec. 2018.
  • [59] M. Tu, V. Berisha, and J. Liss, “Interpretable Objective Assessment of Dysarthric Speech Based on Deep Neural Networks,” in Interspeech 2017.   ISCA, Aug. 2017, pp. 1849–1853.