In this article we describe our experiences with computational text analysis.We hope to achieve three primary goals. First, we aim to shed light on thornyissues not always at the forefront of discussions about computational textanalysis methods. Second, we hope to provide a set of best practices forworking with thick social and cultural concepts. Our guidance is based on ourown experiences and is therefore inherently imperfect. Still, given ourdiversity of disciplinary backgrounds and research practices, we hope tocapture a range of ideas and identify commonalities that will resonate formany. And this leads to our final goal: to help promote interdisciplinarycollaborations. Interdisciplinary insights and partnerships are essential forrealizing the full potential of any computational text analysis that involvessocial and cultural concepts, and the more we are able to bridge these divides,the more fruitful we believe our work will be.
Quick Read (beta)
How we do things with words: Analyzing text as social and cultural data
In June 2015, the operators of the online discussion site Reddit banned several communities under new anti-harassment rules. Chandrasekharan et al. (2017) used this opportunity to combine rich online data with computational methods to study a current question: Does eliminating these “echo chambers” diminish the amount of hate speech overall? Exciting opportunities like these, at the intersection of “thick” cultural and societal questions on the one hand, and the computational analysis of rich textual data on larger-than-human scales on the other, are becoming increasingly common.
Indeed, computational analysis is opening new possibilities for exploring challenging questions at the heart of some of the most pressing contemporary cultural and social issues. While a human reader is better equipped to make logical inferences, resolve ambiguities, and apply cultural knowledge than a computer, human time and attention are limited. Moreover, many patterns are not obvious in any specific context, but only stand out in the aggregate. For example, in a landmark study, Mosteller and Wallace (1963) analyzed the authorship of The Federalist Papers using a statistical text analysis by focusing on style, based on the distribution of function words, rather than content. As another example, Long and So (2016) studied what defines English haiku and showed how computational analysis and close reading can complement each other. Computational approaches are valuable precisely because they help us identify patterns that would not otherwise be discernible.
Yet these approaches are not a panacea. Examining thick social and cultural questions using computational text analysis carries significant challenges. For one, texts are culturally and socially situated. They reflect the ideas, values and beliefs of both their authors and their target audiences, and such subtleties of meaning and interpretation are difficult to incorporate in computational approaches. For another, many of the social and cultural concepts we seek to examine are highly contested — hate speech is just one such example. Choices regarding how to operationalize and analyze these concepts can raise serious concerns about conceptual validity and may lead to shallow or obvious conclusions, rather than findings that reflect the depth of the questions we seek to address.
These are just a small sample of the many opportunities and challenges faced in computational analyses of textual data. New possibilities and frustrating obstacles emerge at every stage of research, from identification of the research question to interpretation of the results. In this article, we take the reader through a typical research process that involves measuring social or cultural concepts using computational methods, discussing both the opportunities and complications that often arise. In the Reddit case, for example, hate speech is measured, however imperfectly, by the presence of particular words semi-automatically extracted from a machine learning algorithm. Operationalizations are never perfect translations, and are often refined over the course of an investigation, but they are crucial.
We begin our exploration with the identification of research questions, proceed through data selection, conceptualization, and operationalization, and end with analysis and the interpretation of results. The research process sounds more or less linear this way, but each of these phases overlaps, and in some instances turns back upon itself. The analysis phase, for example, often feeds back into the original research questions, which may continue to evolve for much of the project. At each stage, our discussion is critically informed by insights from the humanities and social sciences, fields that have focused on, and worked to tackle, the challenges of textual analysis—albeit at smaller scales—since their inception.
In describing our experiences with computational text analysis, we hope to achieve three primary goals. First, we aim to shed light on thorny issues not always at the forefront of discussions about computational text analysis methods. Second, we hope to provide a set of best practices for working with thick social and cultural concepts. Our guidance is based on our own experiences and is therefore inherently imperfect. Still, given our diversity of disciplinary backgrounds and research practices, we hope to capture a range of ideas and identify commonalities that will resonate for many. And this leads to our final goal: to help promote interdisciplinary collaborations. Interdisciplinary insights and partnerships are essential for realizing the full potential of any computational text analysis that involves social and cultural concepts, and the more we are able to bridge these divides, the more fruitful we believe our work will be.
Context and intended audience
This article is a result of a workshop organized in 2017 by Maria Liakata and Dong Nguyen at the Alan Turing Institute on “Bridging disciplines in analysing text as social and cultural data”. The primary audience of this article are most likely readers who are interested in doing interdisciplinary text analysis, with a computer science, humanities, or social sciences background. We expect that this text may be especially useful for courses on text as data.
2 Research questions
We typically start by identifying the questions we wish to explore. Can text analysis provide a new perspective on a “big question” that has been attracting interest for years? Or can we raise new questions that have only recently emerged, for example about social media? For social scientists working in computational analysis, the questions are often grounded in theory, asking: How can we explain what we observe? These questions are also influenced by the availability and accessibility of data sources. For example, the choice to work with data from a particular social media platform may be partly determined by the fact that it is freely available, and this will in turn shape the kinds of questions that can be asked. A key output of this phase are the concepts to measure, for example: influence; copying and reproduction; the creation of patterns of language use; hate speech.
Computational analysis of text motivated by these questions is insight driven: we aim to describe a phenomenon or explain how it came about. For example, what can we learn about how and why hate speech is used or how this changes over time? Is hate speech one thing, or does it comprise multiple forms of expression? Is there a clear boundary between hate speech and other types of speech, and what features make it more or less ambiguous? In these cases, it is critical to communicate high-level patterns in terms that are recognizable.
This contrasts with much of the work in computational text analysis, which tends to focus on automating tasks that humans perform inefficiently. These tasks range from core linguistically motivated tasks that constitute the backbone of natural language processing, such as part-of-speech tagging and parsing, to filtering spam and detecting sentiment. Many tasks are motivated by applications, for example to automatically block online trolls. Success, then, is often measured by performance, and communicating why a certain prediction was made—for example, why a document was labeled as positive sentiment, or why a word was classified as a noun—is less important than the accuracy of the prediction itself. The approaches we use and what we mean by ‘success’ are thus guided by our research questions.
Domain experts and fellow researchers can provide feedback on questions and help with dynamically revising them. For example, they may say “we already think we know that”, “that’s too naïve”, “that doesn’t reflect social reality” (negative); “two major camps in the field would give different answers to that question” (neutral); “we tried to look at that back in the 1960s, but we didn’t have the technology” (positive); and “that sounds like something that people who made that archive would love”, “that’s a really fundamental question” (very positive).
Sometimes we also hope to connect to multiple disciplines. For example, while focusing on the humanistic concerns of an archive, we could also ask social questions such as “is this archive more about collaborative processes, culture-building or norm creation?” or “how well does this archive reflect the society in which it is embedded?” Murdock et al. (2017) used quantitative methods to tell a story about Darwin’s intellectual development—an essential biographical question for a key figure in the history of science. At the same time, their methods connected Darwin’s development to the changing landscape of Victorian scientific culture, allowing them to contrast Darwin’s “foraging” in the scientific literature of his time to the ways in which that literature was itself produced. Finally, their methods provided a case study, and validation of technical approaches, for cognitive scientists who are interested in how people explore and exploit sources of knowledge.
Questions about potential “dual use” may also arise. Returning to our introductory example, Chandrasekharan et al. (2017) started with a deceptively simple question: if an internet platform eliminates forums for hate speech, does this impact hate speech in other forums? The research was motivated by the belief that a rising tide of online hate speech was (and is) making the internet increasingly unfriendly for disempowered groups, including minorities, women, and LBGTQ individuals. Yet the possibility of dual use troubled the researchers from the onset. Could the methodology be adopted to target the speech of groups like Black Lives Matter? Could it be adopted by repressive governments to minimize online dissent? While these concerns remained, they concluded that hypothetical dual use scenarios did not outweigh the tangible contribution this research could offer towards making the online environment more equal and just.
The next step involves deciding on the data sources, collecting and compiling the dataset, and inspecting its metadata.
3.1 Data acquisition
Many scholars in the humanities and the social sciences work with sources that are not available in digital form, and indeed may never be digitized. Others work with both analogue and digitized materials, and the increasing digitization of archives has opened opportunities to study these archives in new ways. We can go to the canonical archive or open up something that nobody has studied before. For example, we might focus on major historical moments (French Revolution, post-Milosevic Serbia) or critical epochs (Britain entering the Victorian era, the transition from Latin to proto-Romance). Or, we could look for records of how people conducted science, wrote and consumed literature, and worked out their philosophies.
3.1.1 Born-digital data
A growing number of researchers work with born-digital sources or data11 1 See Salganik (2017) for an extensive discussion.. Born-digital data, e.g., from social media, generally do not involve direct elicitation from participants and therefore enable unobtrusive measurements (Tangherlini, 2016; Webb et al., 1966). In contrast, methods like surveys sometimes elicit altered responses from participants, who might adapt their responses to what they think is expected. Moreover, born-digital data is often massive, enabling large-scale studies of language and behavior in a variety of social contexts.
Still, many scholars in the social sciences and humanities work with multiple data sources. The variety of sources typically used means that more than one data collection method is often required. For example, a project examining coverage of a UK General Election, could draw data from traditional media, web archives, Twitter and Facebook, campaign manifestos, etc. and might combine textual analysis of these materials with surveys, laboratory experiments, or field observations offline. In contrast, many computational studies based on born-digital data have focused on one specific source, such as Twitter.
The use of born-digital data raises ethical concerns. Although early studies often treated privacy as a binary construct, many now acknowledge its complexity (danah boyd and Crawford, 2012). Conversations on private matters can be posted online, visible for all, but social norms regarding what should be considered public information may differ from the data’s explicit visibility settings. Often no informed consent has been obtained, raising concerns and challenges regarding publishing content and potentially harmful secondary uses (Williams et al., 2017; Salganik, 2017).
Recently, concerns about potential harms stemming from secondary uses have led a number of digital service providers to restrict access to born-digital data. Facebook and Twitter, for example, have reduced or eliminated public access to their application programming interfaces (APIs) and expressed hesitation about allowing academic researchers to use data from their platforms to examine certain sensitive or controversial topics. Despite the seeming abundance of born-digital data, we therefore cannot take its availability for granted.
3.1.2 Data quality
Working with data that someone else has acquired presents additional problems related to provenance and contextualisation. It may not always be possible to determine the criteria applied during the creation process. For example, why were certain newspapers digitized but not others, and what does this say about the collection? Similar questions arise with the use of born-digital data. For instance, when using the Internet Archive’s Wayback Machine to gather data from archived web pages, we need to consider what pages were captured, which are likely missing, and why.
We must often repurpose born-digital data (e.g., Twitter was not designed to measure public opinion), but data biases may lead to spurious results and limit justification for generalization.22 2 Olteanu et al. (2016) give an overview of biases in social data. In particular, data collected via black box APIs designed for commercial, not research, purposes are likely to introduce biases into the inferences we draw, and the closed nature of these APIs means we rarely know what biases are introduced, let alone how severely they might impact our research (Tromble et al., 2017). These, however, are not new problems. Historians, for example, have always understood that their sources were produced within particular contexts and for particular purposes, which are not always apparent to us.
Non-representative data can still be useful for making comparisons within a sample. In the introductory example on hate speech (Chandrasekharan et al., 2017), the Reddit forums do not present a comprehensive or balanced picture of hate speech: the writing is almost exclusively in English, the targets of hate speech are mainly restricted (e.g., to black people, or women), and the population of writers is shaped by Reddit’s demographics, which skew towards young white men. These biases limit the generalizability of the findings, which cannot be extrapolated to other languages, other types of hate speech, and other demographic groups. However, because the findings are based on measurements on the same sort of hate speech and the same population of writers, as long as the collected data are representative of this specific population, these biases do not pose an intractable validity problem if claims are properly restricted.
The size of many newly available datasets is one of their most appealing characteristics. Bigger datasets often make statistics more robust. The size needed for a computational text analysis depends on the research goal: When it involves studying rare events, bigger datasets are needed. However, larger is not always better. Some very large archives are “secretly” collections of multiple and distinct processes that no in-field scholar would consider related. For example, Google Books is frequently used to study cultural patterns, but the over-representation of scientific articles in Google books can be problematic (Pechenick et al., 2015). Even very large born-digital datasets usually cover limited timespans compared to, e.g., the Gutenberg archive of British novels.
This stage of the research also raises important questions about fairness. Are marginalized groups, for example, represented in the tweets we have collected? If not, what types of biases might result from analyses relying on those tweets?
Local experts and “informants” can help navigate the data. They can help understand the role an archive plays in the time and place. They might tell us: Is this the central archive, or a peripheral one? What makes it unusual? Or they might tell us how certain underrepresented communities use a social media platform and advise us on strategies for ensuring our data collection includes their perspectives.
However, when it is practically infeasible to navigate the data in this way—for instance, when we cannot determine what is missing from Twitter’s Streaming API or what webpages are left out of the Internet Archive—we should be open about the limitations of our analyses, acknowledging the flaws in our data and drawing cautious and reasonable conclusions from them. In all cases, we should report the choices we have made when creating or re-using any dataset.
3.2 Compiling data
After identifying the data source(s), the next step is compiling the data. This step is fundamental: if the sources cannot support a convincing result, no result will be convincing. In many cases, this involves defining a “core” set of documents and a “comparison” set. We often have a specific set of documents in mind: an author’s work, a particular journal, a time period. But if we want to say that this “core” set has some distinctive property, we need a “comparison” set. Expanding the collection beyond the documents that we would immediately think of has the beneficial effect of increasing our sample size. Having more sources increases the chance that we will notice something consistent across many individually varying contexts.
Comparing sets of documents can sometimes support causal inference, presented as a contrast between a treatment group and a control. In Chandrasekharan et al. (2017), the treatment consisted of the text written in the two forums that were eventually closed by Reddit. However, identifying a control group required a considerable amount of time and effort. Reddit is a diverse platform, with a wide variety of interactional and linguistic styles; it would be pointless to compare hate speech forums against forums dedicated to, say, pictures of wrecked bicycles.33 3 https://www.reddit.com/r/bustedcarbon/ Chandrasekharan et al. used a matching design, populating the control group with forums that were as similar as possible to the treatment group, but were not banned from Reddit. The goal is to estimate the counterfactual scenario: in this case, what would have happened had the site not taken action against these specific forums? An ideal control would make it possible to distinguish the effect of the treatment — closing the forums — from other idiosyncratic properties of texts that were treated.
We also look for categories of documents that might not be useful. We might remove documents that are meta-discourse, like introductions and notes, or documents that are in a language that is not the primary language of the collection, or duplicates when we are working with archived web pages. However, we need to carefully consider the potential consequences of information we remove. Does its removal alter the data, or the interpretation of the data, we are analyzing? Are we losing anything that might be valuable at a later stage?
3.3 Labels and metadata
Sometimes all we have is documents, but often we want to look at documents in the context of some additional information, or metadata. This additional information could tell us about the creation of documents (date, author, forum), or about the reception of documents (flagged as hate speech, helpful review). Information about text segments can be extremely valuable, but it is also prone to errors, inconsistencies, bias, and missing information. Examining metadata is a good way to check a collection’s balance and representativeness. Are sources disproportionately of one form? Is the collection missing a specific time window? This type of curation can be extremely time consuming as it may require expert labeling, but it often leads to the most compelling results. Sometimes metadata are also used as target labels to develop machine learning models. But using them as a “ground truth” requires caution. Labels sometimes mean something different than we expect. For example, a down vote for a social media post could indicate that the content is offensive, or that the voter simply disagreed with the expressed view.
A core step in many analyses is translating social and cultural concepts (such as hate speech, rumor, or conversion) into measurable quantities. Before we can develop measurements for these concepts (the operationalization step, or the “implementation” step as denoted by Piper (2017)), we need to define them. In the conceptualization phase we often start with questions such as: who are the domain experts, and how have they approached the topic? We are looking for a definition of the concept that is flexible enough to apply on our dataset, yet formal enough for computational research. For example, our introductory study on hate speech (Chandrasekharan et al., 2017) used a statement on hate speech produced by the European Union Court of Human Rights. The goal was not to implement this definition directly in software but to use it as a reference point to anchor subsequent analyses.
If we want to move beyond the use of ad hoc definitions, it can be useful to distinguish between what political scientists Adcock and Collier call the “background concept” and the “systematized concept” (Adcock and Collier, 2001). The background concept comprises the full and diverse set of meanings that might be associated with a particular term. This involves delving into theoretical, conceptual, and empirical studies to assess how a concept has been defined by other scholars and, most importantly, to determine which definition is most appropriate for the particular research question and the theoretical framework in which it is situated. That definition, in turn, represents the systematized concept: the formulation that is adopted for the study.
It is important to consider that for social and cultural concepts there is no absolute ground truth. There are often multiple valid definitions for a concept (the “background” concept in the terms of Adcock and Collier), and definitions might be contested over time. This may be uncomfortable for computer scientists, whose primary measure of success is often based on comparing a model’s output against “ground truth” or a “gold standard”, e.g., by comparing a sentiment classifier’s output against manual annotations. However, the notion of ground truth is uncommon in the humanities and the social sciences and it is often taken too far in machine learning. Kirschenbaum (2007, p. 1) notes that in literary criticism and the digital humanities more broadly “interpretation, ambiguity, and argumentation are prized far above ground truth and definitive conclusions”. Hammond et al. (2013, p. 2) draw attention to the different attitudes of literary scholars and computational linguists towards ambiguity, stating that “In Computational Linguistics [..] ambiguity is almost uniformly treated as a problem to be solved; the focus is on disambiguation, with the assumption that one true, correct interpretation exists.” The latter is probably true for tasks such as spam filtering, but in the social sciences and the humanities many relevant concepts are fundamentally unobservable, such as latent traits of political actors (Lowe and Benoit, 2013) or cultural fit in organizations (Srivastava et al., 2018), leading to validation challenges. Moreover, when the ground truth comes from people, it may be influenced by ideological priors, priming, simple differences of opinion or perspective, and many other factors (DiMaggio, 2015). We return to this issue in our discussions on validation and analysis.
In this phase we develop measures (or, “operationalizations”, or “indicators”) for the concepts of interest, a process called “operationalization”. Regardless of whether we are working with computers, the output produced coincides with Adcock and Collier’s “scores”—the concrete translation and output of the systematized concept into numbers or labels (Adcock and Collier, 2001). Choices made during this phase are always tied to the question “Are we measuring what we intend to measure?” Does our operationalization match our conceptual definition? To ensure validity we must recognize gaps between what is important and what is easy to measure. We first discuss modeling considerations. Next, we describe several frequently used computational approaches and their limitations and strengths.
5.1 Modeling considerations
The variables (both predictors and outcomes) are rarely simply binary or categorical. For example, a study on language use and age could focus on chronological age (instead of, e.g., social age (Eckert, 1997)). However, even then, age can be modeled in different ways. Discretization44 4 Reducing a continuous variable to a discrete variable. can make the modeling easier and various NLP studies have modeled age as a categorical variable (Nguyen et al., 2016). But any discretization raises questions: How many categories? Where to place the boundaries? Fine distinctions might not always be meaningful for the analysis we are interested in, but categories that are too broad can threaten validity. Other interesting variables include time, space, and even the social network position of the author. It is often preferable to keep the variable in its most precise form. For example, Nguyen and Eisenstein (2017) perform exploration in the context of hypothesis testing by using latitude and longitude coordinates — the original metadata attached to geotagged social media such as tweets — rather than aggregating into administrative units such as counties or cities. This is necessary when such administrative units are unlikely to be related to the target concept, as is the case in their analysis of dialect differences. Focusing on precise geographical coordinates also makes it possible to recognize fine-grained effects, such as language variation across the geography of a city.
Using a particular classification scheme means deciding which variations are visible, and which ones are hidden (Bowker and Star, 1999). We are looking for a categorization scheme for which it is feasible to collect a large enough labeled document collection (e.g., to train supervised models), but which is also fine-grained enough for our purposes. Classification schemes rarely exhibit the ideal properties, i.e., that they are consistent, their categories are mutually exclusive, and that the system is complete (Bowker and Star, 1999). Borderline cases are challenging, especially with social and cultural concepts, where the boundaries are often not clear-cut. The choice of scheme can also have ethical implications (Bowker and Star, 1999). For example, gender is usually represented as a binary variable in NLP and computational models tend to learn gender-stereotypical patterns. The operationalization of gender in NLP has been challenged only recently (Bamman et al., 2014; Koolen and van Cranenburgh, 2017; Nguyen et al., 2014).
Supervised vs. unsupervised
Supervised and unsupervised learning are the most common approaches to learning from data. With supervised learning, a model learns from labeled data (e.g., social media messages labeled by sentiment) to infer (or predict) these labels from unlabeled texts. In contrast, unsupervised learning uses unlabeled data. Supervised approaches are especially suitable when we have a clear definition of the concept of interest and when labels are available (either annotated or native to the data). Unsupervised approaches, such as topic models, are especially useful for exploration. In this setting, conceptualization and operationalization may occur simultaneously, with theory emerging from the data (Baumer et al., 2017). Unsupervised approaches are also used when there is a clear way of measuring a concept, often based on strong assumptions. For example, Murdock et al. (2017) measure “surprise” in an analysis of Darwin’s reading decisions based on the divergence between two probability distributions.
Units of interest
From an analysis perspective, the unit of text that we are labeling (or annotating, or coding), either automatic or manual, can sometimes be different than one’s final unit of analysis. For example, if in a study on media frames in news stories, the theoretical framework and research question point toward frames at the story level (e.g., what is the overall causal analysis of the news article?), the story must be the unit of analysis. Yet it is often difficult to validly and reliably code a single frame at the story level. Multiple perspectives are likely to sit side-by-side in a story. Thus, an article on income inequality might point to multiple causes, such as globalization, education, and tax policies. Coding at the sentence level would detect each of these causal explanations individually, but this information would need to be somehow aggregated to determine the overall story-level frame. Sometimes scholars solve this problem by only examining headlines and lead paragraphs, arguing that based on journalistic convention, the most important information can be found at the beginning of a story. However, this leads to a return to a shorter, less nuanced analysis.
From a computational perspective, the unit of text can also make a huge difference, especially when we are using bag-of-words models, where word order within a unit does not matter. Small segments, like tweets, sometimes do not have enough information to make their semantic context clear. In contrast, larger segments, like novels, have too much variation, making it difficult to train focused models. Finding a good segmentation sometimes means combining short documents and subdividing long documents. The word “document” can therefore be misleading. But it is so ingrained in the common NLP lexicon that we use it anyway in this article.
For insight-driven text analysis, it is often critical that high-level patterns can be communicated. Furthermore, interpretable models make it easier to find spurious features, to do error analysis, and to support interpretation of results. Some approaches are effective for prediction, but harder to interpret. The value we place on interpretability can therefore influence the approach we choose. There is an increasing interest in developing interpretable or transparent models in the NLP and machine learning communities.55 5 e.g. the Fairness, Accountability, and Transparency (FAT) conference.
Many studies involve human coders. Sometimes the goal is to fully code the data, but in a computational analysis we often use the labels (or annotations) to train machine learning models to automatically recognize them, and to identify language patterns that are associated with these labels. For example, for a project analyzing rumors online (Zubiaga et al., 2016b), conversation threads were annotated along different dimensions, including rumor versus non-rumor and stance towards a rumor.
The collection of annotation choices make up an annotation scheme (or “codebook”). Existing schemes and annotations can be useful as starting points. Usually settling on an annotation scheme requires several iterations, in which the guidelines are updated and annotation examples are added. For example, a political scientist could use a mixed deductive-inductive strategy for developing a codebook. She starts by laying out a set of theory-driven deductive coding rules, which means that the broad principles of the coding rules are laid out without examining examples first. These are then tested (and possibly adjusted) based on a sample of the data. In line with Adcock and Collier’s notion of “content validity” (Adcock and Collier, 2001), the goal is to assess whether the codebook adequately captures the systematized concept. By looking at the data themselves, she gains a better sense of whether some things have been left out of the coding rules and whether anything is superfluous, misleading, or confusing. Adjustments are made and the process is repeated, often with another researcher involved.
The final annotations can be collected using a crowdsourcing platform, a smaller number of highly-trained annotators, or a group of experts. Which type of annotator to use should be informed by the complexity and specificity of the concept. For more complex concepts, highly-trained or expert annotators tend to produce more reliable results. However, complex concepts can sometimes be broken down into micro-tasks that can be performed independently in parallel by crowdsourced annotators. Concepts from highly specialized domains may require expert annotators. In all cases, however, some training will be required, and the training phase should involve continual checks of inter-annotator agreement (i.e. intercoder reliability) or checks against a gold standard (e.g. quizzes in crowdsourcing platforms).
We also need to decide how inter-annotator agreement will be measured and what an acceptable level of agreement would be. Krippendorff’s alpha is frequently used in the social sciences, but the right measure depends on the type of data and task. For manual coding, we can continually check inter-annotator agreement and begin introducing checks of intra-annotator agreement, too. For most communication scholars using only manual content analysis, an acceptable rate of agreement is achieved when Krippendorf’s alpha reaches 0.80 or above. When human-coded data are used to validate machine learning algorithms, the reliability of the human-coded data is even more important. Disagreement between annotators can signal weaknesses of the annotation scheme, or highlight the inherent ambiguity in what we are trying to measure. Disagreement itself can be meaningful and can be integrated in subsequent analyses (Aroyo and Welty, 2013; Demeester et al., 2016).
5.3 Data pre-processing
Preparing the data can be a complex and time-consuming process, often involving working with partially or wholly unstructured data. The pre-processing steps have a big impact on the operationalizations, subsequent analyses and reproducibility efforts (Fokkens et al., 2013), and they are usually tightly linked to what we intend to measure. Unfortunately, these steps tend to be underreported, but documenting the pre-processing choices made is essential and is analogous to recording the decisions taken during the production of a scholarly edition or protocols in biomedical research. Data may also vary enormously in quality, depending on how it has been generated. Many historians, for example, work with text produced from an analogue original using Optical Character Recognition (OCR). Often, there will be limited information available regarding the accuracy of the OCR, and the degree of accuracy may even vary within a single corpus (e.g. where digitized text has been produced over a period of years, and the software has gradually improved). The first step, then, is to try to correct for common OCR errors. These will vary depending on the type of text, the date at which the ‘original’ was produced, and the nature of the font and typesetting.
One step that almost everyone takes is to tokenize the original character sequence into the words and word-like units. Tokenization is a more subtle and more powerful process than people expect. It is often done using regular expressions or scripts that have been circulating within the NLP community. Tokenization heuristics, however, can be badly confused by emoticons, creative orthography (e.g., U$A, sh!t), and missing whitespace. Multi-word terms are also challenging. Treating them as a single unit can dramatically alter the patterns in text. Many words that are individually ambiguous have clear, unmistakable meanings as terms, like “black hole” or “European Union”. However, deciding what constitutes a multi-word term is a difficult problem. In writing systems like Chinese, tokenization is a research problem in its own right.
Beyond tokenization, common steps include lowercasing, removing punctuation, stemming (removing suffixes), lemmatization (converting inflections to a base lemma), and normalization, which has never been clearly defined66 6 (Sproat et al., 2001) make a good first step, but this work precedes the wave of research on social media, and does not include many things that are considered normalization today., but often includes grouping abbreviations like “U.S.A.” and “USA”, ordinals like “1st” and “first”, and variant spellings like “noooooo”. The main goal of these steps is to improve the ratio of tokens (individual occurrences) to types (the distinct things in a corpus). Each step requires making additional assumptions about which distinctions are relevant: is “apple” different from “Apple”? Is “burnt” different from “burned”? Is “cool” different from “coooool”? Sometimes these steps can actively hide useful patterns, like social meaning (Eisenstein, 2013). Some of us therefore try do as little modification as possible.
From a multilingual perspective, English and Chinese have an unusually simple inflectional system, and so it is statistically reasonable to treat each inflection as a unique word type. Romance languages have considerably more inflections than English; many indigenous North American languages have still more. For these languages, unseen data is far more likely to include previously-unseen inflections, and therefore, dealing with inflections is more important. On the other hand, the resources for handling inflections vary greatly by language, with European languages dominating the attention of the computational linguistics community thus far.
We sometimes also remove words that are not relevant to our goals, for example by calculating vocabulary frequencies. We construct a “stoplist” of words that we are not interested in. If we are looking for semantic themes we might remove function words like determiners and prepositions. If we are looking for author-specific styles, we might remove all words except function words. Some words are generally meaningful but too frequent to be useful within a specific collection. We sometimes also remove very infrequent words. Their occurrences are too low for robust patterns and removing them helps reducing the vocabulary size.
The choice of processing steps can be guided by theory or knowledge about the domain as well as experimental investigation. When we have labels, predictive accuracy of a model is a way to assess the effect of the processing steps. In unsupervised settings, it is more challenging to understand the effects of different steps. Inferences drawn from unsupervised settings can be sensitive to pre-processing choices (Denny and Spirling, 2018). Stemming has been found to provide little measurable benefits for topic modeling and can sometimes even be harmful (Schofield and Mimno, 2016). All in all, this again highlights the need to document these steps.
Finally, we can also mark up the data, e.g., by identifying entities (people, places, organizations, etc.) or parts of speech. Although many NLP tools are available for such tasks, they are often challenged by linguistic variation, such as orthographic variation in historical texts (Piotrowski, 2012) and social media (Eisenstein, 2013). Moreover, the performance of NLP tools often drops when applying them outside the training domain, such as applying tools developed on newswire texts to texts written by younger authors (Hovy and Søgaard, 2015). Problems (e.g., disambiguation in named entity recognition) are sometimes resolved using considerable manual intervention. This combination of the automated and the manual, however, becomes more difficult as the scale of the data increases, and the ‘certainty’ brought by the latter may have to be abandoned.
5.4 Dictionary-based approaches
Dictionaries are frequently used to code texts in content analyses (Neuendorf, 2002). Dictionaries consist of one or more categories (i.e. word lists). Sometimes the output is simply the number of category occurrences (e.g., positive sentiment), thus weighting words within a category equally. In some other cases, words are assigned continuous scores. The high transparency of dictionaries makes them sometimes more suitable than supervised machine learning models. However, dictionaries should only be used if the scores assigned to words match how the words are used in the data (see Grimmer and Stewart (2013) for a detailed discussion on limitations). There are many off-the-shelf dictionaries available (e.g., LIWC (Tausczik and Pennebaker, 2010)). These are often well-validated, but applying them on a new domain may not be appropriate without additional validation. Corpus- or domain-specific dictionaries can overcome limitations of general-purpose dictionaries.
The dictionaries are often manually compiled, but increasingly they are constructed semi-automatically (e.g., Fast et al. (2016)). When we semi-automatically create a word list, we use automation to identify an initial word list, and human insight to filter it. By automatically generating the initial words lists, words can be identified that human annotators might have difficulty intuiting. By manually filtering the lists, we use our theoretical understanding of the target concept to remove spurious features.
In the introduction study, SAGE (Eisenstein et al., 2011) was used to obtain a list of words that distinguished the text in the treatment group (subreddits that were closed by Reddit) from text in the control group (similar subreddits that were not closed). The researchers then returned to the hate speech definition provided by the European Court of Human Rights, and manually filtered the top SAGE words based on this definition. Not all identified words fitted the definition. The others included: the names of the subreddits themselves, names of related subreddits, community-specific jargon that was not directly related to hate speech, and terms such as IQ and welfare, which were frequently used in discourses of hate speech, but had significant other uses. The word lists provided the measurement instrument for their main result, which is that the use of hate speech throughout Reddit declined after the two treatment subreddits were closed.
5.5 Supervised models
Supervised learning is frequently used to scale up analyses. For example, Nguyen et al. (2015) wanted to analyze the motivations of Movember campaign participants. By developing a classifier based on a small set of annotations, they were able to expand the analysis to over 90k participants.
The choice of supervised learning model is often guided by the task definition and the label types. For example, to identify stance towards rumors based on sequential annotations, an algorithm for learning from sequential (Zubiaga et al., 2016a) or time series data (Lukasik et al., 2016) could be used. The features (sometimes called variables or predictors) are used by the model to make the predictions. They may vary from content-based features such as single words, sequences of words, or information about their syntactic structure, to meta-information such as user or network information. Deciding on the features requires experimentation and expert insight and is often called feature engineering. For insight-driven analysis, we are often interested in why a prediction has been made and features that can be interpreted by humans may be preferred. Recent neural network approaches often use simple features as input (such as word embeddings or character sequences), which requires less feature engineering but make interpretation more difficult.
Supervised models are powerful, but they can latch on to spurious features of the dataset. This is particularly true for datasets that are not well-balanced, and for annotations that are noisy. In our introductory example on hate speech in Reddit (Chandrasekharan et al., 2017), the annotations are automatically derived from the forum in which each post appears, and indeed, many of the posts in the forums (subreddits) that were banned by Reddit would be perceived by many as hate speech. But even in banned subreddits, not all of the content is hate speech (e.g., some of the top features were self-referential like the name of the subreddit) but a classifier would learn a high weight for these features.
Even when expert annotations are available on the level of individual posts, spurious features may remain. Waseem and Hovy (2016) produced expert annotations of hate speech on Twitter. They found that one of the strongest features for sexism is the name of an Australian TV show, because people like to post sexist comments about the contestants. If we are trying to make claims about what inhibits or encourages hate speech, we would not want those claims to be tied to the TV show’s popularity. Such problems are inevitable when datasets are not well-balanced over time, across genres, topics, etc. Especially with social media data, we lack a clear and objective definition of ‘balance’ at this time.
The risk of supervised models latching on to spurious features reinforces the need for interpretability. Although the development of supervised models is usually performance driven, placing more emphasis on interpretability could increase the adoption of these models in insight-driven analyses. One way would be to only use models that are already somewhat interpretable, for example models that use a small number of human-interpretable features. Rather than imposing such restrictions, there is also work on generating post-hoc explanations for individual predictions (e.g., Ribeiro et al. (2016)), even when the underlying model itself is very complex.
5.6 Topic modeling
Topic models (e.g., LDA (Blei et al., 2003)) are usually unsupervised and therefore less biased towards human-defined categories. They are especially suited for insight-driven analysis, because they are constrained in ways that make their output interpretable. Although there is no guarantee that a “topic” will correspond to a recognizable theme or event or discourse, they often do so in ways that other methods do not. Their easy applicability without supervision and ready interpretability make topic models good for exploration. Topic models are less successful for many performance-driven applications. Raw word features are almost always better than topics for search and document classification. LSTMs and other neural network models are better as language models. Continuous word embeddings have more expressive power to represent fine-grained semantic similarities between words.
A topic model provides a different perspective on a collection. It creates a set of probability distributions over the vocabulary of the collection, which, when combined together in different proportions, best match the content of the collection. We can sort the words in each of these distributions in descending order by probability, take some arbitrary number of most-probable words, and get a sense of what (if anything) the topic is “about”. Each of the text segments also has its own distribution over the topics, and we can sort these segments by their probability within a given topic to get a sense of how that topic is used.
One of the most common questions about topic models is how many topics to use, usually with the implicit assumption that there is a “right” number that is inherent in the collection. We prefer to think of this parameter as more like the scale of a map or the magnification of a microscope. The “right” number is determined by the needs of the user, not by the collection. If the analyst is looking for a broad overview, a relatively small number of topics may be best. If the analyst is looking for fine-grained phenomena, a larger number is better.
After fitting the model, it may be necessary to circle back to an earlier phase. Topic models find consistent patterns. When authors repeatedly use a particular theme or discourse, that repetition creates a consistent pattern. But other factors can also create similar patterns, which look as good to the algorithm. We might notice a topic that has highest probability on French stopwords, indicating that we need to do a better job of filtering by language. We might notice a topic of word fragments, such as “ing”, “tion”, “inter”, indicating that we are not handling end-of-line hyphenation correctly. We may need to add to our stoplist or change how we curate multi-word terms.
The output of our measurement procedures (in the social sciences often called the “scores”) must now be assessed in terms of their reliability and validity with regard to the (systemized) concept. Reliability aims to capture repeatability, i.e. the extent to which a given tool provides consistent results.
Validity assesses the extent to which a given measurement tool measures what it is supposed to measure. In NLP and machine learning, most models are primarily evaluated by comparing the machine-generated labels against an annotated sample. This approach presumes that the human output is the “gold standard” against which performance should be tested. In contrast, when the reliability is measured based on the output of different annotators, no coder is taken as the standard and the likelihood of coders reaching agreement by chance (rather than because they are “correct”) is factored into the resulting statistic. Comparing against a “gold standard” suggests that the threshold for human inter- and intra-coder reliability should be particularly high.
Accuracy, as well as other measures such as precision, recall and F-score, are sometimes presented as a measure of validity, but if we do not have a genuinely objective determination of what something is supposed measure—as is often the case in text analysis—then accuracy is perhaps a better indication of reliability than of validity. In that case, validity needs to be assessed based on other techniques like those we discuss later in this section. It is also worth asking what level of accuracy is sufficient for our analysis and to what extent there may be an upper bound, especially when the labels are native to the data or when the notion of a “gold standard” is not appropriate.
For some in the humanities, validation takes the form of close reading, not designed to confirm whether the model output is correct, but to present what Piper (2015, pp. 67-68) refers to as a form of “further discovery in two directions”. Model outputs tell us something about the texts, while a close reading of the texts alongside those outputs tells us something about the models that can be used for more effective model building. Applying this circular, iterative process to 450 18th-century novels written in three languages, Piper was able to uncover a new form of “conversional novel” that was not previously captured in “literary history’s received critical categories” (Piper, 2015, p. 92).
Along similar lines, we can subject both the machine-generated output and the human annotations to another round of content validation. That is, take a stratified random sample, selecting observations from the full range of scores, and ask: Do these make sense in light of the systematized concept? If not, what seems to be missing? Or is something extraneous being captured? This is primarily a qualitative process that requires returning to theory and interrogating the systematized concept, indicators, and scores together. This type of validation is rarely done in NLP, but it is especially important when it is difficult to assess what drives a given machine learning model. If there is a mismatch between the scores and systematized concept at this stage, the codebook may need to be adjusted, human coders retrained, more training data prepared, algorithms adjusted, or in some instances, even a new analytical method adopted.
Other types of validation are also possible, such as comparing with other approaches that aim to capture the same concept, or comparing the output with external measures (e.g., public opinion polls, the occurrence of future events). We can also go beyond only evaluating the labels (or point estimates). Lowe and Benoit (2013) used human judgments to not only assess the positional estimates from a scaling method of latent political traits but also to assess uncertainty intervals. Using different types of validation can increase our confidence in the approach, especially when there is no clear notion of ground truth.
Besides focusing on rather abstract evaluation measures, we could also assess the models in task-based settings using human experts. Furthermore, for insight-driven analyses, it can be more useful to focus on improving explanatory power than making small improvements in predictive performance.
In this phase, we use our models to explore or answer our research questions. For example, given a topic model we can look at the connection between topics and metadata elements. Tags such as “hate speech” or metadata information imply a certain way of organizing the collection. Computational models provide another organization, which may differ in ways that provide more insight into how these categories manifest themselves, or fail to do so.
Moreover, when using a supervised approach, the “errors”, i.e. disagreement between the system output and human-provided labels, can point towards interesting cases for closer analysis and help us reflect on our conceptualizations. In the words of Long and So (2016), they can be “opportunities for interpretation”. Other types of “failures” can be insightful as well. Sometimes there is a “dog that didn’t bark” (Doyle, 1892)–i.e., something that everyone thinks we should have found, but we did not. Or, sometimes the failures are telling us about the existence of something in the data that nobody noticed, or thought important, until then (e.g., the large number of travel journals in Darwin’s reading lists).
Computational text analysis is not a replacement for but rather an addition to the approaches one can take to analyze social and cultural phenomena using textual data. By moving back and forth between large-scale computational analyses and small-scale qualitative analyses, we can combine their strengths so that we can identify large-scale and long-term trends, but also tell individual stories. For example, the Reddit study on hate speech (Chandrasekharan et al., 2017) raised various follow-up questions: Can we distinguish hate speech from people talking about hate speech? Did people find new ways to express hate speech? If so, did the total amount of online hate speech decrease after all? As possible next steps, a qualitative discourse analyst might examine a smaller corpus to investigate whether commenters were indeed expressing hate speech in new ways; a specialist in interview methodologies might reach out to commenters to better understand the role of online hate speech in their lives. Computational text analysis represents a step towards better understanding social and cultural phenomena, and it is in many cases better suited towards opening questions rather than closing them.
Insight-driven computational analysis of text is becoming increasingly common. It not only helps us see more broadly, it helps us see subtle patterns more clearly and allows us to explore radical new questions about culture and society. In this article we have consolidated our experiences, as scholars from very different disciplines, in analyzing text as social and cultural data and described how the research process often unfolds. Each of the steps in the process is time-consuming and labor-intensive. Each presents challenges. And especially when working across disciplines, the research often involves a fair amount of discussion—even negotiation—about what means of operationalization and approaches to analysis are appropriate and feasible. And yet, with a bit of perseverance and mutual understanding, conceptually sound and meaningful work results so that we can truly make use of the exciting opportunities rich textual data offers.
This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. Dong Nguyen is supported with an Alan Turing Institute Fellowship (TU/A/000006). Maria Liakata is a Turing fellow at 40%. We would also like to thank the participants of the “Bridging disciplines in analysing text as social and cultural data” workshop held at the Turing Institute (2017) for insightful discussions. The workshop was funded by a Turing Institute seed funding award to Nguyen and Liakata.
- Adcock and Collier  Robert Adcock and David Collier. Measurement validity: A shared standard for qualitative and quantitative research. American Political Science Review, 95(3):529–546, 2001.
- Aroyo and Welty  Lora Aroyo and Chris Welty. Crowd truth: Harnessing disagreement in crowdsourcing a relation extraction gold standard. In Proceedings of WebSci’13, 2013.
- Bamman et al.  David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18(2):135–160, 2014.
- Baumer et al.  Eric P. S. Baumer, David Mimno, Shion Guha, Emily Quan, and Geri K. Gay. Comparing grounded theory and topic modeling: Extreme divergence or unlikely convergence? Journal of the Association for Information Science and Technology, 68(6):1397–1410, 2017.
- Blei et al.  David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of machine Learning research, 3:993–1022, 2003.
- Bowker and Star  Geoffrey C. Bowker and Susan Leigh Star. Sorting Things Out: Classification and Its Consequences. MIT Press, 1999.
- Chandrasekharan et al.  Eshwar Chandrasekharan, Umashanthi Pavalanathan, Anirudh Srinivasan, Adam Glynn, Jacob Eisenstein, and Eric Gilbert. You can’t stay here: The effectiveness of Reddit’s 2015 ban through the lens of hate speech. Proceedings of the ACM on Human-Computer Interaction, 1(2), 2017.
- danah boyd and Crawford  danah boyd and Kate Crawford. Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, 15(5):662–679, 2012.
- Demeester et al.  Thomas Demeester, Robin Aly, Djoerd Hiemstra, Dong Nguyen, and Chris Develder. Predicting relevance based on assessor disagreement: Analysis and practical applications for search evaluation. Information Retrieval Journal, 19(3):284–312, 2016.
- Denny and Spirling  Matthew J. Denny and Arthur Spirling. Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it. Political Analysis, 26(2):168–189, 2018.
- DiMaggio  Paul DiMaggio. Adapting computational text analysis to social science (and vice versa). Big Data & Society, 2015.
- Doyle  A. Conan Doyle. Adventures of Sherlock Homes: The adventure of Silver Blaze. Strand Magazine, Vol. IV, December 1892:291–306, 1892.
- Eckert  Penelope Eckert. Age as a sociolinguistic variable. In Florian Coulmas, editor, The Handbook of Sociolinguistics, pages 151–167. Blackwell Publishers, 1997.
- Eisenstein  Jacob Eisenstein. What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 359–369, 2013.
- Eisenstein et al.  Jacob Eisenstein, Amr Ahmed, and Eric P. Xing. Sparse additive generative models of text. In Proceedings of the International Conference on Machine Learning (ICML), pages 1041–1048, 2011.
- Fast et al.  Ethan Fast, Binbin Chen, and Michael S. Bernstein. Empath: Understanding topic signals in large-scale text. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 4647–4657, 2016.
- Fokkens et al.  Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen, and Nuno Freire. Offspring from reproduction problems: What replication failure teaches us. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1691–1701, 2013.
- Grimmer and Stewart  Justin Grimmer and Brandon M. Stewart. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3):267–297, 2013.
- Hammond et al.  Adam Hammond, Julian Brooke, and Graeme Hirst. A tale of two cultures: Bringing literary analysis and computational linguistics together. In Proceedings of the Workshop on Computational Linguistics for Literature, pages 1–8, 2013.
- Hovy and Søgaard  Dirk Hovy and Anders Søgaard. Tagging performance correlates with author age. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 483–488, Beijing, China, 2015.
- Kirschenbaum  Matthew G. Kirschenbaum. The remaking of reading: Data mining and the digital humanities. In The National Science Foundation Symposium on Next Generation of Data Mining and Cyber-Enabled Discovery for Innovation, Baltimore, MD, 2007.
- Koolen and van Cranenburgh  Corina Koolen and Andreas van Cranenburgh. These are not the stereotypes you are looking for: Bias and fairness in authorial gender attribution. In Proceedings of the First Workshop on Ethics in Natural Language Processing, pages 12–22, 2017.
- Long and So  Hoyt Long and Richard Jean So. Literary pattern recognition: Modernism between close reading and machine learning. Critical Inquiry, 42(2):235–267, 2016.
- Lowe and Benoit  Will Lowe and Kenneth Benoit. Validating estimates of latent traits from textual data using human judgment as a benchmark. Political Analysis, 21(3):298–313, 2013.
- Lukasik et al.  Michal Lukasik, P. K. Srijith, Duy Vu, Kalina Bontcheva, Arkaitz Zubiaga, and Trevor Cohn. Hawkes processes for continuous time sequence classification: an application to rumour stance classification in Twitter. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 393–398, Berlin, Germany, 2016.
- Mosteller and Wallace  Frederick Mosteller and David L. Wallace. Inference in an authorship problem. Journal of the American Statistical Association, 58(302):275–309, 1963.
- Murdock et al.  Jaimie Murdock, Colin Allen, and Simon DeDeo. Exploration and exploitation of Victorian Science in Darwin’s reading notebooks. Cognition, 159:117–126, 2017.
- Neuendorf  Kimberly A. Neuendorf. The Content Analysis Guidebook. Sage, 2002.
- Nguyen and Eisenstein  Dong Nguyen and Jacob Eisenstein. A kernel independence test for geographical language variation. Computational Linguistics, 43(3):567–592, 2017.
- Nguyen et al.  Dong Nguyen, Dolf Trieschnigg, A. Seza Doğruöz, Rilana Gravel, Mariët Theune, Theo Meder, and Franciska de Jong. Why gender and age prediction from tweets is hard: Lessons from a crowdsourcing experiment. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1950–1961, Dublin, Ireland, 2014.
- Nguyen et al.  Dong Nguyen, Thijs van den Broek, Claudia Hauff, Djoerd Hiemstra, and Michel Ehrenhard. #Supportthecause: Identifying motivations to participate in online health campaigns. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2570–2576, Lisbon, Portugal, 2015.
- Nguyen et al.  Dong Nguyen, A. Seza Doğruöz, Carolyn P. Rosé, and Franciska de Jong. Computational sociolinguistics: A survey. Computational Linguistics, 42(3):537–593, 2016.
- Olteanu et al.  Alexandra Olteanu, Carlos Castillo, Fernando Diaz, and Emre Kiciman. Social data: Biases, methodological pitfalls, and ethical boundaries. Available at SSRN: https://ssrn.com/abstract=2886526, 2016.
- Pechenick et al.  Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. Characterizing the Google books corpus: Strong limits to inferences of socio-cultural and linguistic evolution. PLoS ONE, 10(10):e0137041, 2015.
- Piotrowski  Michael Piotrowski. Natural language processing for historical texts. Synthesis Lectures on Human Language Technologies, 2012.
- Piper  Andrew Piper. Novel devotions: Conversional reading, computational modeling, and the modern novel. New Literary History, 46(1):63–98, 2015.
- Piper  Andrew Piper. Think small: On literary modeling. PMLA, 132(3):651–658, 2017.
- Ribeiro et al.  Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of KDD ’16, pages 1135–1144, 2016.
- Salganik  Matthew J. Salganik. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2017.
- Schofield and Mimno  Alexandra Schofield and David Mimno. Comparing apples to apple: The effects of stemmers on topic models. Transactions of the Association for Computational Linguistics, 4:287–300, 2016.
- Sproat et al.  Richard Sproat, Alan W. Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. Normalization of non-standard words. Computer Speech & Language, 15(3):287 – 333, 2001.
- Srivastava et al.  Sameer B. Srivastava, Amir Goldberg, V. Govind Manian, and Christopher Potts. Enculturation trajectories: Language, cultural adaptation, and individual outcomes in organizations. Management Science, 64(3):983–1476, 2018.
- Tangherlini  Timothy R. Tangherlini. Big folklore: A special issue on computational folkloristics. The Journal of American Folklore, 129(511):5–13, 2016.
- Tausczik and Pennebaker  Yla R. Tausczik and James W. Pennebaker. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1):24–54, 2010.
- Tromble et al.  Rebekah Tromble, Andreas Storz, and Daniela Stockmann. We don’t know what we don’t know: When and how the use of Twitter’s public APIs biases scientific inference. Available at SSRN: https://ssrn.com/abstract=3079927, 2017.
- Waseem and Hovy  Zeerak Waseem and Dirk Hovy. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93, San Diego, California, 2016.
- Webb et al.  Eugene J. Webb, Donald Thomas Campbell, Richard D. Schwartz, and Lee Sechrest. Unobtrusive Measures: Nonreactive Research in the Social Sciences. Chicago, IL: Rand McNally, 1966.
- Williams et al.  Matthew L. Williams, Pete Burnap, and Luke Sloan. Towards an ethical framework for publishing Twitter data in social research: Taking into account users’ views, online context and algorithmic estimation. Sociology, 51(6):1149–1168, 2017.
- Zubiaga et al. [2016a] Arkaitz Zubiaga, Elena Kochkina, Maria Liakata, Rob Procter, and Michal Lukasik. Stance classification in rumours as a sequential task exploiting the tree structure of social media conversations. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2438–2448, Osaka, Japan, 2016a.
- Zubiaga et al. [2016b] Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Peter Tolmie. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PloS one, 11(3):e0150989, 2016b.