Hahahahaha, Duuuuude, Yeeessss!: A two-parameter characterization of stretchable words and the dynamics of mistypings and misspellings

  • 2019-07-09 00:44:23
  • Tyler J. Gray, Christopher M. Danforth, Peter Sheridan Dodds
  • 19

Abstract

Stretched words like `heellllp' or `heyyyyy' are a regular feature of spokenlanguage, often used to emphasize or exaggerate the underlying meaning of theroot word. While stretched words are rarely found in formal written languageand dictionaries, they are prevalent within social media. In this paper, weexamine the frequency distributions of `stretchable words' found in roughly 100billion tweets authored over an 8 year period. We introduce two centralparameters, `balance' and `stretch', that capture their main characteristics,and explore their dynamics by creating visual tools we call `balance plots' and`spelling trees'. We discuss how the tools and methods we develop here could beused to study the statistical patterns of mistypings and misspellings, alongwith the potential applications in augmenting dictionaries, improving languageprocessing, and in any area where sequence construction matters, such asgenetics.

 

Quick Read (beta)

Hahahahaha, Duuuuude, Yeeessss!: A two-parameter characterization of stretchable words and the dynamics of mistypings and misspellings

Tyler J. Gray [email protected] Vermont Complex Systems Center, Computational Story Lab, Department of Mathematics & Statistics, The University of Vermont, Burlington, VT 05401.    Christopher M. Danforth [email protected] Vermont Complex Systems Center, Computational Story Lab, Department of Mathematics & Statistics, The University of Vermont, Burlington, VT 05401.    Peter Sheridan Dodds [email protected] Vermont Complex Systems Center, Computational Story Lab, Department of Mathematics & Statistics, The University of Vermont, Burlington, VT 05401.
July 11, 2019
Abstract

Stretched words like ‘heellllp’ or ‘heyyyyy’ are a regular feature of spoken language, often used to emphasize or exaggerate the underlying meaning of the root word. While stretched words are rarely found in formal written language and dictionaries, they are prevalent within social media. In this paper, we examine the frequency distributions of ‘stretchable words’ found in roughly 100 billion tweets authored over an 8 year period. We introduce two central parameters, ‘balance’ and ‘stretch’, that capture their main characteristics, and explore their dynamics by creating visual tools we call ‘balance plots’ and ‘spelling trees’. We discuss how the tools and methods we develop here could be used to study the statistical patterns of mistypings and misspellings, along with the potential applications in augmenting dictionaries, improving language processing, and in any area where sequence construction matters, such as genetics.

pacs:
89.65.-s,89.75.Da,89.75.Fb,89.75.-k

I Introduction

Watch a soccer match, and you are likely to hear an announcer shout ‘GOOOOOOOOOAAAAAAAAL!!!!!!’. Stretched words, sometimes called elongated words elongated , are an integral part of spoken language, often used to modify the meaning of the base word in some way, such as to strengthen the meaning (e.g., ‘huuuuuge’), imply sarcasm (e.g., ‘suuuuure’), show excitement (e.g., ‘yeeeessss’), or communicate danger (e.g., ‘nooooooooooooo’). We will refer to words that are amenable to such lengthening as ‘stretchable words’.

However, despite their being a fundamental part of spoken language, stretched words are rarely found in literature and lexicons: There is no ‘hahahahahahaha’ in the Oxford English Dictionary OED1989 . With the advent and rise of social media, stretched words have finally found their way into large-scale written text.

With the increased use of social media comes rich datasets of a linguistic nature, granting science an unprecedented opportunity to study the everyday linguistic patterns of society. As such, in recent years there have been a number of papers published that have used data from social media platforms, such as Twitter, to study different aspects of language grieve2016a ; gray2018a ; goncalves2014a ; goncalves2018a ; donoso2017a ; eisenstein2010a ; eisenstein2014a .

In this paper, we use an extensive set of social media messages collected from Twitter—tweets—to investigate the characteristics of stretchable words used in this particular form of written language. The tools and approach we introduce here have many potential applications, including the possible use by dictionaries to formally include this intrinsic part of language. The online dictionary Wiktionary has already discussed the inclusion of some stretched words and made a policy on what to include wiktionary:talk_cute ; wiktionary:criteria_for_inclusion . Other potential applications include the use by natural language processing software and toolkits, and by Twitter to build better spam filters.

We structure our paper as follows: In Sec. II, we detail our dataset and our method of collecting stretchable words and distilling them down to their ‘kernels’. In Sec. III.1, we examine the frequency distributions for lengths of stretchable words. We quantify two independent properties of stretchable words: Their ‘balance’ in Sec. III.2 and ‘stretch’ in Sec. III.3. In Sec. III.4, we develop an investigative tool, ‘spelling trees’, as a means of visualizing stretchable words involving a two character repeated element. We comment on mistypings and misspellings in Sec. III.5. Finally, in Sec. IV, we provide concluding remarks.

II Description of the dataset and method for extracting stretched words

1. hahhahahaahahaa
(ha)
2. gooooooaaaaaaal
g[o][a]l
3. ggggoooooaaaaallllll
[g][o]aaaaallllll
[g][o][a][l]
4. bbbbbaaaaaabbbbbbyyyyyyy
[b][a][b]yyyyyyy
[b][a][b][y]
5. awawawaaawwwwwesssssommmmmeeeeee
(aw)esssssommmmmeeeeee
(aw)essssso[m][e]
(aw)e[s]o[m][e]
Table 1: Examples of distilling tokens down to their kernels. The first line of each cell is the example token. The following lines show the result after every time a replacement of characters by the corresponding single letter element(s) or double letter element is made by the code, in order. The final line of each cell gives the resulting kernel for each example.

The Twitter dataset we use in this study comprises a random sample of approximately 10% of all tweets (the ‘gardenhose’ API) from 9 September 2008 to 31 December 2016. We limited our scope to tweets that either were flagged as an English tweet or not flagged for any language. All tweets in this time period have a maximum length of 140 characters. To collect stretchable words, we begin by making all text lowercase and collecting all tokens within our dataset from calendar year 2016 that match the Python regular expression r‘(\b\w*(\w)(\w)(?:\2|\3){28,}\w*\b)’. This pattern will collect any token with at least 30 characters that has a single character repeated at least 29 times consecutively, or two different characters that are repeated in any order at least 28 times, for a total of at least 30 consecutive repeated occurrences of the two characters. The choice of 28 in the regular expression is a threshold we chose with the goal of limiting our collection to tokens of words that really do get stretched in practice.

After collecting these tokens, we remove any that contains a character that is not a letter ([a-z]), and distill each remaining token down to its ‘kernel’. Table 1 gives a few examples of this distillation process. Proceeding along the token from left to right, whenever any pair of distinct letters, l1 and l2, occur in the token where 1. l1 occurs followed by any sequence of l1 and l2 of total length at least three, and 2. such that l1 and l2 each occur at least twice in the sequence, we replace the sequence with the ‘two letter element’ (l1l2). For example, see the first cell in Table 1.

Exceptions to the preceding are: 1. The case where the sequence is a series of l1 followed by a series of l2, which is replaced with the pair of ‘single letter elements’ [l1][l2]. For example, see the second cell in Table 1. And 2., the case where the sequence is a series of l1 followed by a series of l2 followed by a series of l1, which is replaced with [l1][l2][l1]. For example, see the first step in the fourth cell of Table 1 where ‘bbbbbaaaaaabbbbbb’ is replaced with [b][a][b].

Following this process, whenever a single letter, l3, occurs two or more times in a row, we replace the sequence with the single letter element [l3]. For example, see the last step of the fourth cell in Table 1 where ‘yyyyyyy’ is replaced with [y], or the last step in the fifth cell where ‘sssss’ is replaced with [s].

Figure 1: Token count distribution for the kernel (to). The horizontal axis represents the length (number of characters) of the token and the vertical axis gives the total number of tokens of a given length that match this kernel. The included statistics give the kernel rank, r (see Sec. II), the value of the balance parameter (normalized entropy, H; see Sec. III.2), and the value of the stretch parameter (Gini coefficient, G; see Sec. III.3) for this kernel. The large drop between the second and third points denotes the change from ‘unstretched’ versions of (to), located to the left of this drop, to ‘stretched’ versions of (to), located to the right of this drop.
Figure 2: Total token counts for stretched versions of all kernels. Kernels are ranked by their descending total token count along the horizontal axis. The diagonal line gives the regression line calculated using the values between ranks 10 and 103. The vertical dashed line denotes the first location after rank 103 where the distribution drops below 1/10 of the corresponding value of the regression line, denoted by the red interval, giving the cutoff rank for the final threshold to decide which kernels to include in the study.

We collected tokens in batches of seven consecutive days at a time throughout 2016 (with the last batch being only two days). If a kernel is not found in more than one batch, or within the same batch but from at least two distinct stretched words, then it is removed from consideration.

Different but related stretched words (that is, different stretched words, but both stretched out versions of the same base word) may distill to different kernels. We combine these into a single kernel for each word such that it covers all cases observed in the collected tokens. For example, the kernels g[o]a[l] and go[a][l] would be combined as g[o][a][l]. The kernels h[a] and (ha) would be combined as (ha).

After processing our dataset, we obtained a collection of 7,526 kernels. We then represented each kernel with a corresponding regular expression and collected all tokens in our entire gardenhose dataset that matched the regular expressions. To go from the kernel to the regular expression, we replaced ] with ]+, replaced (l1l2) with l1[l1l2]*l2[l1l2]*, and we surrounded the kernel with word boundary characters \b. So, for example, the kernel g[o][a][l] goes to the Python regular expression r‘\bg[o]+[a]+[l]+\b’ and the kernel (ha) goes to the Python regular expression r‘\bh[ha]*a[ha]*\b’.

Once we collected all tokens matching our kernels, we carried out a final round of thresholding on our kernel list, removing those with the least amount of data and least likely to represent a bona fide stretchable word. For each kernel, we calculated the token count as a function of token length (number of letters) for all tokens matching that kernel. For example, Fig. 1 gives the plot of the token count distribution for the kernel (to). Then, with the token counts in order by increasing token length, as in Fig. 1, we found the location of the largest drop in the log10 of token counts between two consecutive values within the first 10 values. We call the words with lengths coming before the location of the drop ‘unstretched’ versions of the kernel and those that come after ‘stretched’ versions. For most kernels, the largest drop will be between the first and second value. However, for some kernels this drop occurs later. For example, in Fig. 1 we see that for the kernel (to), which covers both the common words ‘to’ and ‘too’, this drop is between the second and third value (between tokens of length three and four). Thus, the unstretched versions of (to) are represented by the first two points in Fig. 1, with the remaining points representing stretched versions of (to).

We then ranked the kernels by the sum of the token counts for their stretched versions. Fig. 2 shows this sum as a function of rank for each kernel. Inspired by the idea of a cutoff frequency wiki:cutoff_frequency , we estimate a cutoff rank for the kernels. Using the values between rank 10 and 103, we found the regression line between the log10 of the ranks and the log10 of the summed token counts (straight line, Fig. 2). We calculated the cutoff as the first rank (after 103) where the summed token count is less than 1/10 of the corresponding value of the regression line. This occurs at rank 5,164, which is shown by the vertical dashed line in Fig. 2. For the remainder of this study, we used the kernels with rank preceding this cutoff, giving us a total of 5,163 kernels, and, unless otherwise specified, a kernel’s ‘rank’, r, refers to the rank found here. See Online Appendix A at http://compstorylab.org/stretchablewords/ for a full list of kernels meeting our thresholds, along with their regular expressions and other statistics discussed throughout the remainder of this paper.

III Analysis and Results

III.1 Distributions

For each kernel, we plotted the corresponding distribution of token counts as a function of token length. Most of the distributions largely follow a roughly power-law shape. For example, Fig. 3 gives the frequency distribution for the kernel [g][o][a][l]. From the elevated frequency of the first dot, we can see that the unstretched word ‘goal’ is used about two orders of magnitude more frequently than any stretched version. After the first point, we see a rollover in the distribution, showing that if users are going to stretch the word, they are more likely to include a few extra characters rather than just one. We also see that there are some users who indeed fill the 140 character limit with a stretched version of the word ‘goal’, and the elevated dot there suggests that if users get close to the character limit, they are more likely to fill the available space. The other dots elevated above the trend represent tokens that likely appear in tweets that have a small amount of other text at the beginning or end, such as a player name or team name, or, more generally, a link or a user handle.

Figure 3: Token count distribution for the kernel [g][o][a][l]. The horizontal axis represents the length (number of characters) of the token and the vertical axis gives the total number of tokens of a given length that match this kernel. See Fig. 1 caption for details on the included statistics. The base version of the word appears roughly 100 times more frequently than the most common stretched version.
Figure 4: Token count distribution for the kernel (ha). The horizontal axis represents the length (number of characters) of the token and the vertical axis gives the total number of tokens of a given length that match this kernel. See Fig. 1 caption for details on the included statistics.

In Fig. 4, we show the frequency distribution for the kernel (ha) as an example of a distribution for a two character repeated element. For this distribution we observe an alternating up and down in frequency for even length tokens and odd length tokens. This behaviour is typical of distributions with a two character repeated element, likely resulting from an intent for these tokens to be a perfect alternating repetition of ‘h’ and ‘a’, hahaha…, to represent laughter. Under this assumption, the correct versions will be even length. Then, any incorrect version could be odd or even length depending on the number of mistakes. We look at mistakes further in Sec. III.5.

We note that there is also an initial rollover in this distribution, showing that the four character token, with dominant contributor ‘haha’, is the most common version for this kernel. We also again see some elevated counts near the tail, including for 140 characters, along with some depressed counts just short of 140, which again suggests that when users approach the character limit with stretched versions of (ha), they will most likely fill the remaining space. We did not perform a detailed analysis of this area, but it is likely that the other elevated points near the end are again due to the inclusion of a link or user handle, etc. Similarly, the general flattening of the distribution’s right tail is likely a result of random lengths of short other text combined with a stretched word that fills the remaining space.

Similar distributions for each kernel can be found in Online Appendix B at http://compstorylab.org/stretchablewords/.

III.2 Balance

For each kernel, we measure two quantities: 1. The balance of the stretchiness across characters, and 2. the overall stretchiness of the kernel. To measure balance, we calculate the average stretch of each character in the kernel across all the tokens within a bin of token lengths. By average stretch of a character, we mean the average number of times that character appears. Fig. 5 shows the balance for the kernel [g][o][a][l] partitioned into bins of logarithmically increasing sizes of length. The horizontal dashed lines represent the bin edges. The distance between the solid diagonal lines represents the average stretch, or average number of times each character was repeated, and are plotted in the same order that they appear in the kernel. From this figure we see that ‘g’ is not stretched much on average, ‘o’ is stretched the most, and ‘a’ and ‘l’ are both stretched around 2/3 as much as ‘o’.

Figure 5: Balance plot for the kernel [g][o][a][l]. The vertical axis represents the length (number of characters) of tokens, and is broken into bins of lengths, with boundaries denoted by horizontal dashed lines, which increase in size logarithmically. For all the tokens that match the kernel and fall within a bin of lengths, the average number of times each character was stretched in those tokens was calculated, and is shown on the plot as the distance between two solid lines in the same order as in the kernel. Thus, for a given bin, the distance between the vertical axis and the first solid line is the average stretch for the letter ‘g’, the distance between that first line and the second line is the average stretch for the letter ‘o’, and so on. For example, the last bin contains tokens with lengths in the interval [131,140], with average length roughly 137. On average, tokens falling in this most celebratory bin contain roughly 3 ‘g’s, 57 ‘o’s, 41 ‘a’s, and 36 ‘l’s.
Figure 6: Balance plot for the kernel (ha). See the Fig. 5 caption for plot details. For two letter elements, even though the letters can alternate within a given token, we still count the number of occurrences for each letter separately and display the average number of total repetitions in the same order as the letters appear in the kernel. Thus, for a given bin, the distance between the vertical axis and the first line is the average number of times the letter ‘h’ occurred in the tokens, and the distance between that first line and the second line is the average number of times the letter ‘a’ occurred in the token. This plot clearly shows that (ha) is well balanced across all bins of token lengths.

When part of the kernel is a two letter element of the form (l1l2), we still count the number of occurrences of l1 and l2 corresponding to this element in the kernel separately, even though the letters can be intermingled in the stretched word. When we display the results, we display it in the same order that the letters appear in the kernel. So in Fig. 6, which shows the results for the kernel (ha), the first space represents the average stretch for ‘h’ and the second space is for ‘a’. From this figure, we can see that the stretch is almost perfectly balanced between the two letters on average.

Figure 7: Jellyfish plots for kernel balance for (A) all kernels, and (B) excluding kernels with entropy exactly 0. Corresponding histograms are given at the top of each plot. Kernels are plotted vertically by their rank, r, and horizontally by their balance as given by normalized entropy, H, where larger entropy denotes increased balance. The deciles 0.1,0.2,,0.9 are calculated for rolling bins of 500 kernels and are plotted as the ‘tentacles’.

Similar balance plots can be found for each kernel in Online Appendix C at http://compstorylab.org/stretchablewords/. In general, for these balance plots, we stop plotting at the first bin with no tokens, even if later bins may be nonempty.

For each kernel, we also calculate an overall measure of balance. To do this, we begin by binning the tokens by length. Then, for each bin (containing tokens longer than the kernel) we calculate the average stretch for each character across tokens within the bin as before. Then, we subtract one from each of these values (removing the contribution from each base character; counting just the number of times each character was repeated) and normalize the values so they sum to 1 and can be thought of like probabilities. We then average the probabilities across the bins, weighing each bin equally, and compute the normalized entropy, H, of the averaged probabilities as our overall measure of balance. This measure is such that if each character stretches the same on average, the normalized entropy is 1, and if only one character in the kernel stretches, the normalized entropy is 0. Thus, higher entropy corresponds with more balanced words. (For a comparison with an alternate entropy measure where tokens contribute equally rather than equally weighing each length bin, and an explanation of the different corresponding views, see Appendix A.)

Fig. 7 shows two ‘jellyfish plots’ kloumann2012a for balance. Fig. 7A is the version containing all words and for Fig. 7B we remove the words that have a value of 0 for entropy. The top of the left plot in Fig. 7 shows the frequency histogram of the normalized entropy for each kernel. The spike containing value 0 comes largely from kernels where only one character stretches, giving that kernel an entropy of exactly 0. The main plot shows the normalized entropy values as a function of word rank, where rank is given, as before, by the sum of stretched token counts. The ‘tentacles’ give rolling deciles. That is, for rolling bands of 500 words by rank, the deciles 0.1,0.2,,0.9 are calculated for the entropy values, and are represented by the solid lines. These plots allow us to see how stable the distribution is across word ranks.

We can see from Fig. 7A that the distribution largely shifts towards smaller entropy values with increasing rank, mostly drawn in that direction by the kernels with only a single repeated letter and entropy exactly 0. For Fig. 7B, we remove all kernels with entropy 0. Everything else remains the same, including the rank of each kernel (we skip over ranks of removed kernels) and the rolling bands of 500 kernels for percentile calculations still have 500 kernels, and thus tend to be visually wider bands. In contrast to Fig. 7A, we now see a small left-shift in the earlier ranks, and then the distribution tends to stabilize for lower ranks. This shows that the highest ranked kernels tend to have a larger entropy, meaning the stretch of the kernel is more equally balanced across all characters. We also see that not many of the high ranked words stretch with just one character. It appears that these kernels that stretch in only a single character become more prevalent in the lower ranks.

Table 2 shows the kernels with the ten largest entropies and Table 3 shows those with the ten smallest nonzero entropies. We observe that the kernels with largest entropies are mostly of the form (l1l2) and are almost perfectly balanced. The least balanced kernels tend to be more recognizable English or Spanish words and names, with one exclamation also appearing in the bottom ten.

H Kernel Example token
1 0.99998 (kd) kdkdkdkdkdkdkd
2 0.99998 (ha) hahahahahaha
3 0.99997 [i][d] iiiiiiiddddd
4 0.99997 (ui) uiuiuiuiuiui
5 0.99997 (ml) mlmlmlmlmlmlml
6 0.99995 (js) jsjsjsjsjsj
7 0.99990 [e][t] eeeeetttttt
8 0.99988 (ox) oxoxoxoxoxox
9 0.99980 (xq) xqxqxqxqxqxqxq
10 0.99971 (xa) xaxaxaxaxaxa
Table 2: Top 10 kernels by normalized entropy, H.
H Kernel Example token
1 0.01990 [b][o][b]ies booooooobies
2 0.02526 [d][o][d]e doooooooode
3 0.03143 infini[t][y] infinityyyyy
4 0.03342 che[l]se[a] chelseaaaaaa
5 0.03587 tay[l]o[r] taylorrrrrr
6 0.03803 f(re) freeeeeeeeeeeee
7 0.03930 [f]ai[r] fairrrrrrrr
8 0.05270 regr[e][s][e] regreseeeeee
9 0.05271 herm[a][n][a] hermanaaaaaaaa
10 0.05323 sq[u][e] squueeeeeeee
Table 3: Bottom 10 kernels by normalized entropy, H.

III.3 Stretch

To measure overall stretchiness for a kernel we calculated the Gini coefficient, G, of the kernel’s token length frequency distribution. (For a comparison with another possible measure of stretch, see Appendix B.) If the distribution has most of its weight on the short versions and not much on stretched out versions, then the Gini coefficient will be closer to 0. If more tokens are long and the kernel is stretched longer more often, the Gini coefficient will be closer to 1. Fig. 8 gives the jellyfish plot for the Gini coefficient for each kernel. The horizontal axis has a logarithmic scale, and the histogram bins have logarithmic widths. From this plot, we see that the distribution for stretch is quite stable across word ranks, except for perhaps a slight shift towards higher Gini coefficient (more stretchiness) for the highest ranked kernels.

Figure 8: Jellyfish plots for kernel stretch as measured by the Gini coefficient, G, of its token count distribution, where higher Gini coefficient denotes increased stretch. The histogram is given at the top of the plot (with logarithmic width bins). Kernels are plotted vertically by their rank, r, and horizontally (on a logarithmic scale) by their stretch. The deciles 0.1,0.2,,0.9 are calculated for rolling bins of 500 kernels and are plotted as the ‘tentacles’.
Figure 9: Kernels plotted in Balance-Stretch parameter space. Each kernel is plotted horizontally by the value of its balance parameter, given by normalized entropy, H, and vertically (on a logarithmic scale) by its stretch parameter, given by the Gini coefficient, G, of its token count distribution. Larger entropy implies greater balance and larger Gini coefficient implies greater stretch.
G Kernel Example token
1 0.66472 [k] kkkkkkkkkkkkkkk
2 0.63580 [w][v][w] wwwwwwwwwwvwwww
3 0.62843 [m][n][m] mmmmmmmmmmmmnm
4 0.53241 [o][c][o] oooooooooco
5 0.52577 wa(ki) wakikikikkkikikik
6 0.51706 (go)[l] goooooooooool
7 0.51273 [m][w][m] mmmmmwmmmmmmmmm
8 0.50301 galop[e]ir[a] galopeeeeira
9 0.50193 [k][j][k] kkkkkjjkkkkkkkkkk
10 0.49318 [i][e][i] iiiiiieeiiiiiii
Table 4: Top 10 kernels by Gini coefficient, G.
G Kernel Example token
1 0.00001 am[p] amppppppppp
2 0.00002 m[a]kes maaaaaaaaakes
3 0.00002 fr[o]m frooooooooooom
4 0.00002 watch[i]ng watchiiiiiing
5 0.00003 w[i]th wiiiiiiiith
6 0.00004 pla[y]ed playyyyyyed
7 0.00004 s[i]nce siiiiiiiince
8 0.00006 eve[r]y everrrrrrrrrry
9 0.00006 manage[r] managerrrrr
10 0.00007 learnin[g] learninggggg
Table 5: Bottom 10 kernels by Gini coefficient, G.

Table 4 shows the top 10 kernels ranked by Gini coefficient and Table 5 shows the bottom 10. The top kernel is [k], which represents laughter in Portuguese, similar to (ha) in English (and other languages). Containing a single letter, [k] is easier to repeat many times, and does not have an unstretched version that is a common word. We also see (go)[l] on the list, where ‘gol’ is Spanish and Portuguese for ‘goal’. Interestingly, (go)[l] has a much higher Gini coefficient (G=0.5171) than [g][o][a][l] does (G=0.1080). The kernels with lowest Gini coefficient all represent regular words and all allow just one letter to stretch, which does not get stretched much.

Figure 10: Spelling tree for the kernel (ha). The root node represents ‘h’. From there, branching to the left (light gray edge) is equivalent to appending an ‘h’. Branching to the right (dark gray edge) is equivalent to appending an ‘a’. The edge width is logarithmically related to the number of tokens that pass along that edge when spelled out. A few example words are annotated, and their corresponding nodes are denoted with a star. This tree was trimmed by only including words with a token count of at least 10,000. The code used to create the figures for these spelling trees is largely based on the algorithm presented by Wetherel and Shannon wetherell1979a . We note that Mill has written a more recent paper based largely on this earlier work specialized for Python mill2008a , and an implementation for it as well mill2008b , but they both contain algorithmic bugs (detailed in Appendix C).

In Fig. 9, we show a scatter plot of each kernel where the horizontal axis is given by the measure of balance of the kernel using normalized entropy, and the vertical coordinate is given by the measure of stretch for the kernel using the Gini coefficient. Thus, this plot positions each kernel in the two dimensional space of balance and stretch. We see that the kernels spread out across this space and that these two dimensions capture two independent characteristics of each kernel.

We do note that there are some structures visible in Fig. 9. There is some roughly vertical banding. In particular, the vertical band at H=0 is from kernels that only allow one character to stretch and the vertical band near H=1 is from kernels where all characters are allowed to stretch and do so roughly equally, which especially occurs with kernels that are a single two letter element. Fainter banding around H.43, H.5, and H.63 can also be seen. This largely comes from kernels of length 5, 4, and 3, respectively, that allow exactly two characters to stretch and those characters stretch roughly equally. If the stretch was perfectly equal, then the normalized entropy in each respective case would be H=1/log2(5).43, H=1/log2(4)=.5, and H=1/log2(3).63.

Figure 11: A collection of example spelling trees. From left to right, top to bottom, trees for the kernels (to), (ja), (aw), (do), h(er), (fu), (mo), and (xo).

III.4 Spelling trees

So far we have considered frequency distributions for kernels by token length, combining the token counts for all the different words of the same length matching the kernel. However, different tokens of the same length may of course be different words—different stretched versions—of the same kernel. For kernels that contain only single letter elements, these different versions may just have different amounts of the respective stretched letters, but all the letters are in the same order. However, for kernels that have two letter elements, the letters can change order in myriad ways, and the possible number of different stretched versions of the same length becomes much larger and potentially more interesting.

In order to further investigate these intricacies, we introduce ‘spelling trees’ to give us a visual method of studying the ways in which kernels with two letter elements are generally expanded. Fig. 10 gives the spelling tree for the kernel (ha). The root node is the first letter of the two letter element, which in this case is ‘h’. Then, recursively, if the next letter in the word matches the first letter of the pair, it branches left, represented by a lighter gray edge, and if it matches the second letter of the pair then it branches right, represented by a darker gray edge. This branching continues until the word is finished. The first few nodes are highlighted with the letter corresponding to that point of the tree. The edge weights are logarithmically related to the number of tokens flowing through them. In Fig. 10, a few nodes, denoted by stars, are annotated with the exact word to which they correspond. The annotated nodes are all leaf nodes, but words can, and most do, stop at nodes that are not leaves. We also trimmed the tree by only including words that have a token count of at least 10,000. This threshold of pruning reveals the general pattern while avoiding making the spelling tree cluttered.

The spelling tree for (ha) has a number of interesting properties. Most notable among them is the self-similar, fractal-like structure. The main branch line dropping down just right of center represents the perfect alternating sequence ‘hahahahaha…’, as shown by the annotated example at the leaf of this line. There are also many similar looking subtrees that branch off from this main branch that each have their own similar looking main branch. These paths that follow the main branch, break off at one location, and then follow the main branch of a subtree represent words that are similar to the perfect alternating laugh, but either have one extra ‘h’ (if the branch veers left) or one extra ‘a’ (if the branch leads right). For example, the middle left annotation shows that the fourth letter was an extra ‘h’, and then the rest of the word retained an accurate alternating pattern. This word, ‘hahhahahahahahaha’, appeared 13,894 times in our dataset.

The tree also shows that ‘haaaaa…’ is a strong pattern, as can be seen farthest right in the (ha) spelling tree. The subtrees on the right show that users also start with the back and forth pattern for a stretch, and then finish the word with trailing ‘a’s. Many other patterns also appear in this tree, and additional patterns are occluded by our trimming of the tree, but likely most of these come from users trying to follow one of the patterns we have already highlighted and introducing mistypings.

We made similar trees for every kernel that had a single occurrence of a two letter element, where the tree represents just the section of word that matches the two letter element. These trees are trimmed by only including words that have a token count of at least the fourth root of the total token count for the stretched tokens.

Fig. 11 gives eight more examples of these spelling trees. The trees for (ja) and (xo) have many of the same characteristics as the tree for (ha), as do most of the trees for kernels that are a two letter element where tokens predominantly alternate letters back and forth. For the tree for (xo), the pattern where the first letter of the two letter element is stretched, followed by the second letter being stretched, such as ‘xxxxxooooo’, is more apparent. This type of pattern is even more notable in the trees for (aw), and especially (fu). The tree for (mo) has stretched versions for both ‘mom’ and ‘moo’. Similarly, the tree for h(er) shows stretched versions of both ‘her’ and ‘here’, where we see that both ‘e’s and the ‘r’ all get stretched. In the tree for (to), the word ‘totoo’ has a much larger token count then words stretched beyond that (noticeable by the fact that the edges leaving that node are much smaller than the edge coming in). The word ‘totoo’ is Tagalog for ‘true’. Finally, every example tree here does show the back and forth pattern to at least some extent. All of the trees created are available for viewing in Online Appendix D at http://compstorylab.org/stretchablewords/.

Figure 12: (A) Token count distribution, (B) balance plot, and (C) spelling tree for the kernel n[o](io). In general, these types of plots offer diagnostic help when studying mistypings. In this case, they provide evidence towards the conclusion that the words that match this kernel were likely meant to be stretched versions of the word ‘no’ with a few mistaken ‘i’s included. Note that ‘i’ is next to ‘o’ on a standard QWERTY keyboard.

III.5 Mistypings

Mistypings appear often in tweets and we see evidence of them in stretched words. For example, the kernel n[o](io) is likely a result of mistypings of n[o]. On at least some platforms, holding down the key for a letter does not make that letter repeat, so one must repeatedly press the same key. For the standard QWERTY keyboard layout, the letter ‘i’ is next to the letter ‘o’, so it would be easy to accidentally press the letter ‘i’ occasionally instead of ‘o’ when trying to repeat it many times, especially on the small keyboards accompanying mobile phones. This sort of thing could lead to a kernel like n[o](io) when users try to stretch the word ‘no’. Similarly, the letters ‘a’ and ‘s’ are next to each other on a QWERTY keyboard, so a kernel like (ha)s(ha)(sh)(ah) likely comes from mistypings of the much simpler kernel (ha).

However, it is not always clearly apparent if a kernel is from mistypings or on purpose, or perhaps comes as a result of both. For example, the letter ‘b’ is close to the letter ‘h’, so the kernel (ha)b(ah) could come from mistypings of (ha). But, this form could also be intentional, and meant to represent a different kind of laughter. For example, (ba)(ha) is a highly ranked kernel (rank 211) representing a comedically sinister kind of laughter. Similarly, (ja) is a core component of laughter in Spanish, but ‘j’ is next to ‘h’ on the QWERTY keyboard, so it is not apparent if a kernel like (ha)j(ah)(ja)(ha) comes from mistypings or from switching back and forth between English and Spanish as the word stretches.

Our methodology may enable further study of mistypings. For example, Fig. 12 shows the distribution, balance plot, and spelling tree for the kernel n[o](io). The distribution shows that it is not a strong kernel, with the lower rank of 4,858, compared to a rank of 8 for (no). The balance plot shows that the letter ‘i’ is not stretched much, and the spelling tree shows that the word is mostly just a repetition of ‘o’s. On the whole, the evidence suggests that the kernel n[o](io) is mainly a result of mistypings.

These tools can also be used to help study what are likely misspellings, rather than mistypings. For example, Fig. 13 shows the spelling tree for the kernel hear(ta)ck (which does not actually fall within our rank cutoff, as described in Sec. II, but provides a good example). The word ‘attack’ has two ‘t’s. Thus, the word ‘heartattack’ (if written as one word; usually it is two) should, under normal spelling, have a double ‘t’ after the second ‘a’. From Fig. 13 we can see from the weights of the branches that it is often written as ‘heartatack’, with a single ‘t’ instead of the double ‘t’.

Figure 13: Spelling tree for the kernel hear(ta)ck. From this tree, we can see the relative number of times the word ‘heartatack’ is written rather than ‘heartattack’, indicating a common misspelling.

IV Concluding remarks

In this paper, we have studied stretched words, which are often used in spoken language. Until the advent of social media, stretched words were not prevalent in written language and largely absent from dictionaries. The area of stretchable language is rich, and we have discovered that these words span at least the two dimensional parameter space of balance and stretch.

The tools we have developed not only help uncover the hidden dynamics of stretchable words, but can be further applied to study phenomena such as mistypings and misspellings, and possibly more. Online dictionaries, such as the Wiktionary wiktionary , could use our kernels as a general entry for each type of stretchable word, and include the balance and stretch parameters as part of their structured word information, as they do, for example, with part of speech. Natural language processing software and toolkits could use the techniques we developed to help with processing stretched words, e.g., in their approach to stemming. Similarly, spell checking software may be able to use our methods to help prevent marking stretched words as misspellings. Our procedures could also be used to help prevent typosquatting wiki:typosquatting . Twitter could use our methods to help improve their spam filter, looking for slight variations of tweets. Also, spelling trees could more generally be used to analyze the construction of any sequence, such as genome sequences.

However, much more could be done. We have restricted our study to words containing only Latin letters. Future work could extend this to include all characters, including punctuation and emojis. We also limited the way we constructed kernels, focusing only on one and two letter elements. This can be expanded to three letter elements and possibly beyond to capture the characteristics of words like ‘omnomnomnom’. Furthermore, our methodology for creating kernels leads to situations where, for example, we have both (ha)g(ah) and (ha)(ga)(ha) as kernels. Expanding to three letter elements and beyond in the future could collapse these forms, and related kernels, into a kernel like (hag).

Along with more advanced kernels, similar but more advanced spelling trees could be developed. We only created spelling trees for kernels with a single two letter element. Future work could explore kernels with more than two letter elements. They could also be created for every kernel, where the branching of even the single letter elements is shown, where one branch would signify the repetition of that letter and the other branch would signify moving onto the next letter of the kernel. Furthermore, to go with three letter elements, ternary trees could be developed. Among other things, this would reveal mistypings like (ha)(hs), for example, if this became a kernel with a three letter element like (has), and we assume that the ‘s’ is mostly a mistyping of the letter ‘a’ in the kernel (ha). This situation should be discernible from the case where the word ‘has’ is stretched.

Another interesting phenomenon to look at is the distinction between phonetic and visual stretching. When verbally stretching a word, only certain sounds can be stretched out, whereas when typing a word, any letter can be repeated. For example, compare ‘gooooaaaaal’ with ‘ggggggooooooooaaaaaalllll’. Both stretched words can be typed, but only the first can be said because of the plosive ‘g’. Relatedly, looking at what parts of words, such as the end of words, or which letters get stretched more could be interesting.

Finally, our methodology could be used to explore linguistic and behavioral responses to changes in Twitter’s protocol (e.g., character length restrictions) and platform (e.g., mobile vs. laptop). For example, what are the effects of auto-correct, auto-complete, and spell check technologies? And what linguistic changes result from platform restrictions such as when a single key cannot be held down anymore to repeat a character? Also, we only considered tweets before the shift from the 140 to 280 character limit on Twitter. Some initial work indicates that the doubling of tweet length has removed the edge effect that the character limit creates gligoric2018a_conference . Further work could study how this change has affected stretchable words, and in particular, the tail of their distributions.

Acknowledgements.
CMD and PSD were supported by NSF Grant No. IIS-1447634, and TJG, CMD, and PSD were supported by a gift from MassMutual.

References

  • [1] Appendix: Glossary — Wiktionary, the free dictionary. https://en.wiktionary.org/w/index.php?title=Appendix:Glossary&oldid=51610328. Accessed: 2019-03-24.
  • [2] J. A. Simpson and E. S. C. Weiner, editors. The Oxford English Dictionary. Oxford University Press, Oxford, 2nd edition, 1989.
  • [3] Yuan Huang, Diansheng Guo, Alice Kasakoff, and Jack Grieve. Understanding US regional linguistic variation with Twitter data analysis. Computers, Environment and Urban Systems, 59:244–255, 2016.
  • [4] Tyler J. Gray, Andrew J. Reagan, Peter Sheridan Dodds, and Christopher M. Danforth. English verb regularization in books and tweets. PLOS ONE, 13(12):1–17, 12 2018.
  • [5] Bruno Gonçalves and David Sánchez. Crowdsourcing dialect characterization through Twitter. PLOS ONE, 9(11):1–6, 11 2014.
  • [6] Bruno Gonçalves, Lucía Loureiro-Porto, José J. Ramasco, and David Sánchez. Mapping the Americanization of English in space and time. PLOS ONE, 13(5):1–15, 05 2018.
  • [7] Gonzalo Donoso and David Sanchez. Dialectometric analysis of language variation in Twitter. CoRR, abs/1702.06777, 2017.
  • [8] Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing. A latent variable model for geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ‘10, pages 1277–1287, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
  • [9] Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing. Diffusion of lexical change in social media. PLOS ONE, 9(11):1–13, 11 2014.
  • [10] Talk: cuuute — Wiktionary, the free dictionary. https://en.wiktionary.org/w/index.php?title=Talk:cuuute&oldid=51216685. Accessed: 2019-03-24.
  • [11] Wiktionary: Criteria for inclusion — Wiktionary, the free dictionary. https://en.wiktionary.org/w/index.php?title=Wiktionary:Criteria_for_inclusion&oldid=52749064. Accessed: 2019-05-12.
  • [12] Cutoff frequency — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Cutoff_frequency&oldid=873937426. Accessed: 2019-05-07.
  • [13] Isabel M. Kloumann, Christopher M. Danforth, Kameron Decker Harris, Catherine A. Bliss, and Peter Sheridan Dodds. Positivity of the English language. PLOS ONE, 7(1):1–7, 01 2012.
  • [14] Charles Wetherell and Alfred Shannon. Tidy drawings of trees. IEEE Transactions on Software Engineering, (5):514–520, 1979.
  • [15] Bill Mill. Drawing presentable trees. Python Magazine, 2(8), 08 2008.
  • [16] Bill Mill. Github — llimllib/pymag-trees: Code from the article “Drawing good-looking trees” in Python Magazine. https://github.com/llimllib/pymag-trees/tree/9acfb8d52a09a495f25af91dcbf438499546748b. Accessed: 2019-01-21.
  • [17] Wiktionary, the free dictionary. https://en.wiktionary.org/wiki/Wiktionary:Main_Page. Accessed: 2019-05-12.
  • [18] Typosquatting — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Typosquatting&oldid=884561229. Accessed: 2019-05-12.
  • [19] Kristina Gligorić, Ashton Anderson, and Robert West. How constraints affect content: The case of Twitter’s switch from 140 to 280 characters. International AAAI Conference on Web and Social Media, 06 2018.

Appendix A Alternate Balance Measure

Figure A1: Balance plot for the kernel (pa). See the Fig. 6 caption for plot details. Even though Halt=1.00000 for (pa), this plot clearly shows perfect balance is not sustained as tokens increase in length.
Figure A2: Jellyfish plots for kernel balance based on an alternate entropy measure for (A) all kernels, and (B) excluding kernels with entropy exactly 0. Corresponding histograms are given at the top of each plot. Kernels are plotted vertically by their rank, r, and horizontally by their balance as given by an alternate normalized entropy, Halt, where larger entropy denotes increased balance. The deciles 0.1,0.2,,0.9 are calculated for rolling bins of 500 kernels and are plotted as the ‘tentacles’.

As a comparison to our normalized entropy measure for balance discussed in Sec. III.2, we also compute an alternate normalized entropy measure, Halt, that measures balance from a different view.

To compute Halt, we first calculate the overall average stretch for each character as before, but now do so across all tokens at once. Then, we subtract one from each of these values and normalize them so they sum to 1 and can be thought of like probabilities. We then compute the normalized entropy, Halt, of these values as a measure of overall balance. Halt is similar to H in that if each character stretches the same on average, the normalized entropy is 1, and if only one character in the kernel stretches, the normalized entropy is 0. Again, higher entropy corresponds with more balanced words.

The difference is the view, and what is meant by ‘on average’. For Halt, each token is weighted equally when calculating balance. Thus, this measure corresponds to the view of if one randomly samples tokens and looks at how balanced they are on average.

By contrast, for H, as calculated in Sec. III.2, tokens are grouped by length, and then each group gets an equal weight regardless of the group size. This view looks at how well balance is sustained across lengths, and corresponds to sampling tokens by first randomly picking a length, and then randomly picking a token from all tokens of that length, and then looking at how balanced the sampled tokens are on average.

For example, for the kernel (pa), Halt=1.00000, signifying nearly perfect balance. However, looking at the balance plot for (pa) in Fig. A1, we see that perfect balance is not sustained across lengths. Because most of the tokens are short, and short stretched versions of (pa) are well balanced, all of the weight is on the well balanced short ones when randomly picking tokens. However, as people create longer stretched versions of (pa), they tend to use more ‘a’s than ‘p’s, and near perfect balance is not maintained. This is better captured by the measure H=0.80982.

As our main measure of balance, we chose the view better representing how well balanced tokens are as they are stretched, equally weighing lengths. This does have the limitation that groups of tokens with different lengths have different sizes, and some of them may contain a single token, possibly increasing the variance of the measure. It is possible this could be improved in the future by only including lengths that have a certain number of examples, or possibly creating larger bins of lengths for the longer tokens like we do in the balance plots.

Halt Kernel Example token
1 1.00000 (ba) baaaaaaaaaaa
2 1.00000 (pa) ppppppppppppa
3 1.00000 (uo) uouuuuuuuuuuuuu
4 0.99998 (pr) prrrrrrrrrrr
5 0.99998 (du) duduudduududududuuu
6 0.99995 (xa) xaxaxaxaxxa
7 0.99995 (ai) aaaaaaaaaaaaaaaai
8 0.99993 (he) hehehheheheh
9 0.99986 (bi) biiiiiiiiii
10 0.99985 (wq) wqwqwqwqwqw
Table A1: Top 10 kernels by an alternate normalized entropy, Halt.
Halt Kernel Example token
1 0.00115 [t][e][t]h teeeeeeeeeth
2 0.00119 f[e]l[i]ng feeeeeeling
3 0.00170 c[a][l]ing calllllling
4 0.00196 a[c]ep[t] accepttttttt
5 0.00197 fa[l][i]ng falllllling
6 0.00217 hi[l]ar[y] hilllllaryy
7 0.00227 m[i][s][i]ng missssssssssing
8 0.00271 ba[n]e[d] baneddddddddd
9 0.00277 t[h][r][e] threeeeeeeee
10 0.00302 th(er) therrrreeeee
Table A2: Bottom 10 (nonzero) kernels by an alternate normalized entropy, Halt.
Figure A3: Kernels plotted in Balance-Stretch parameter space using an alternate measure of normalized entropy for balance. Each kernel is plotted horizontally by the value of its balance parameter, given by an alternate normalized entropy, Halt, and vertically (on a logarithmic scale) by its stretch parameter, given by the Gini coefficient, G, of its token count distribution. Larger entropy implies greater balance and larger Gini coefficient implies greater stretch.

We include the same plots and tables for Halt as we did with H, and many of the observations are similar. Fig. A2 shows the two jellyfish plots for Halt. Similar to before, Fig. A2A is the version containing all words and for Fig. A2B we remove the words that have a value of 0 for entropy. The top of the plots in Fig. A2 shows the frequency histograms in each case. As before, after removing kernels with an entropy of 0, we see a small left-shift in the highest ranked kernels, and then the distribution largely stabilizes. Again, the highest ranked kernels tend to be more equally balanced, and kernels only stretching a single character tend to be lower ranked.

Table A1 shows the kernels with the ten largest entropies and Table A2 shows those with the ten smallest nonzero entropies as measured in this alternate way. We observe that the kernels with largest entropies are all of the form (l1l2) and are almost perfectly balanced given the view of equally weighing all tokens. The kernels with lowest entropies all expand to regular words that when spelled in the standard way contain a letter that is repeated, plus these kernels allow other letters to stretch.

Finally, Fig. A3 shows the scatter plot of each kernel where the horizontal axis is given by this alternate measure of balance, Halt, and the vertical coordinate is again given by the measure of stretch for the kernel using the Gini coefficient, G. We again see that the kernels span the two dimensional space.

We still get the same kind of rough vertical banding that we saw in Fig. 9 for the same reason, but we also see a curved dense band at lower entropy values, which seems to mostly contain kernels whose base word is spelled with a double letter, like ‘summer’ (with kernel [s][u][m][e][r]).

Appendix B Stretch Ratio

Figure B1: Jellyfish plots for kernel stretch ratio, ρ, as given by the ratio of the sum of the kernel’s stretched tokens to the sum or its unstretched tokens. The histogram is given at the top of the plot (with logarithmic width bins). Kernels are plotted vertically by their rank and horizontally (on a logarithmic scale) by their stretch ratio. The deciles 0.1,0.2,,0.9 are calculated for rolling bins of 500 kernels and are plotted as the ‘tentacles’.
Figure B2: Scatter plot of measures of stretch for each kernel. For each kernel, the horizontal axis gives its stretch as measured by the Gini coefficient, G, of its token count distribution and the vertical axis gives its stretch ratio, ρ. Both axes have a logarithmic scale.

For each kernel, we also measure a ‘stretch ratio’, ρ. This is simply the ratio of the total number of stretched tokens, ns, to the total number of unstretched tokens, nu, for that kernel. That is,

ρ=nsnu. (1)

Fig. B1 gives the jellyfish plot for the stretch ratio. Like Fig. 8, the horizontal axis has a logarithmic scale and the histogram bins have logarithmic widths. The stretch ratio distribution stays fairly stable across ranks, except for the highest ranked kernels, which tend to have a larger ratio.

This stretch ratio can be thought of as a simple measure for the stretchiness of a kernel, with a larger ratio representing stretchier words. As stretched versions of the word are used more, the numerator increases and the ratio value increases. Conversely, as unstretched versions of the kernel are used more, the denominator increases, and the ratio value decreases. However, this simpler measure uses less information from the full distribution than a measure like the Gini coefficient does, so we would expect some differences between the two. Indeed, Fig. B2 shows that there are some kernels for which the two measures seem to disagree. Yet, Fig. B2 shows that the stretch ratio and Gini coefficient are quite well correlated, with Pearson correlation coefficient 0.89 (p<10-100), so there is not much gained by including both. We choose to use the Gini coefficient as our main measure of stretchiness both because of its wide usage and because of the fact that it uses more information from the full distribution than the simpler stretch ratio.

ρ Kernel Example token
1 76.04717 s[o][c][o][r][o][k] socorrokkkkkk
2 29.94863 mou(ha) mouhahahaha
3 21.93369 p[f](ha) pffhahahaha
4 19.82821 bu(ha) buhahahahaha
5 15.15702 (ha)j(ah)(ja)(ha) hahahahajahajaha
6 10.32701 pu(ha) puhahahahaa
7 8.63055 (ha)(ba)(ha) habahahhaha
8 8.47429 (ha)b(ha) hahahhahabha
9 8.13269 (ah)j(ah) ahahahjahah
10 7.72953 a[e]h[o] aehooooooooooooo
Table B1: Top 10 kernels by stretch ratio, ρ.
ρ Kernel Example token
1 0.00002 am[p] amppppppppp
2 0.00004 fr[o]m froooooooom
3 0.00004 m[a]kes maaaaaaakes
4 0.00007 w[i]th wiiiiiiiiiiiiith
5 0.00009 eve[r]y everrrrrrrrrry
6 0.00011 p[r]a prrrrrrrrrrra
7 0.00011 watch[i]ng watchiiiing
8 0.00011 s[i]nce siiiiiiiince
9 0.00012 pla[y]ed playyyyyyyed
10 0.00012 vi[a] viaaaaaaaaaaaaaaa
Table B2: Bottom 10 kernels by stretch ratio, ρ.

Table B1 shows the top 10 kernels by stretch ratio and Table B2 gives the bottom 10. The correlation between stretch ratio and Gini coefficient, at least for the least stretchy kernels, can be seen further when comparing this to Table 5. Many of the kernels that show up as the least stretchy words (lowest Gini coefficients) also show up here in the list of kernels with smallest stretch ratio.

Appendix C “Drawing presentable trees” algorithmic bugs

Wetherel and Shannon presented an algorithm for drawing large trees in a nice way in their paper “Tidy drawing of trees” [14]. The article “Drawing presentable trees” [15] by Mill and related code [16], based largely on the earlier work of Wetherel and Shannon, provide a version of the algorithm written in the Python syntax, but both the article and the code contain algorithmic bugs. In the following, we present the bugs we found.

We will discuss Listing 5 in Mill’s paper [15], as that is the version that most closely resembles Algorithm 3 of Wetherel’s and Shannon’s paper [14], which is what our code to create the spelling trees is based off of.

In Listing 5, the definition of setup contains the code:

elif len(tree.children) == 1:
    place = tree.children[0].x - 1

This needs to be split into a left case and a right case. If the only child node is a left child, then the parent should be placed to the right by one, and if the only child node is a right child, then the parent should be placed to the left by one. The DrawTree class needs a way to tell if a node has a left or right child. Let us assume the class DrawTree has an attribute left properly implemented that is set to True iff the node has a left child. Then the code should be something more like the following:

elif len(tree.children) == 1:
    if tree.left:
        place = tree.children[0].x + 1
    else:
        place = tree.children[0].x - 1

Compare the above fix to the corresponding code in the right_visit case in the first while loop in Algorithm 3 in “Tidy drawing of trees” [14]:

elseif current.left_son = nil
    then place := current.right_son.x - 1;
elseif current.right_son = nil
    then place := current.left_son.x + 1;

Later in Listing 5 in the definition of setup is the following line:

nexts[depth] += 2

However, we want the next available spot, recorded in nexts, to be two spots to the right of the current placement, and the current placement is sometimes different from the current next available spot. Thus the line should look something like the following:

nexts[depth] = tree.x + 2

Again, compare this to the corresponding code found near the end of the right_visit case of the first while loop of Algorithm 3 in “Tidy drawing of trees”:

next_pos[h] := current.x + 2;

The final bug in Listing 5 is not an algorithmic bug, but merely a typo. In the definition of addmods is the line of code:

modsum += tree.offset

However, tree does not have the attribute offset. Instead the mod attribute should be added to the accumulated sum as follows:

modsum += tree.mod