Deconstructing and reconstructing word embedding algorithms

  • 2019-11-29 18:27:36
  • Edward Newell, Kian Kenyon-Dean, Jackie Chi Kit Cheung
  • 2

Abstract

Uncontextualized word embeddings are reliable feature representations ofwords used to obtain high quality results for various NLP applications. Giventhe historical success of word embeddings in NLP, we propose a retrospective onsome of the most well-known word embedding algorithms. In this work, wedeconstruct Word2vec, GloVe, and others, into a common form, unveiling some ofthe necessary and sufficient conditions required for making performant wordembeddings. We find that each algorithm: (1) fits vector-covector dot productsto approximate pointwise mutual information (PMI); and, (2) modulates the lossgradient to balance weak and strong signals. We demonstrate that these twoalgorithmic features are sufficient conditions to construct a novel wordembedding algorithm, Hilbert-MLE. We find that its embeddings obtain equivalentor better performance against other algorithms across 17 intrinsic andextrinsic datasets.

 

Quick Read (beta)

Deconstructing and reconstructing word embedding algorithms

Edward Newell   Kian Kenyon-Dean11footnotemark: 1   Jackie Chi Kit Cheung
Mila - Québec AI Institute, McGill University
Montreal, Canada
{edward.newell,kiankd}@gmail.com[email protected]
These authors contributed equally.
Abstract

Uncontextualized word embeddings are reliable feature representations of words used to obtain high quality results for various NLP applications. Given the historical success of word embeddings in NLP, we propose a retrospective on some of the most well-known word embedding algorithms. In this work, we deconstruct Word2vec, GloVe, and others, into a common form, unveiling some of the necessary and sufficient conditions required for making performant word embeddings. We find that each algorithm: (1) fits vector-covector dot products to approximate pointwise mutual information (PMI); and, (2) modulates the loss gradient to balance weak and strong signals. We demonstrate that these two algorithmic features are sufficient conditions to construct a novel word embedding algorithm, Hilbert-MLE. We find that its embeddings obtain equivalent or better performance against other algorithms across 17 intrinsic and extrinsic datasets.

\stackMath

Deconstructing and reconstructing word embedding algorithms


Edward Newellthanks: These authors contributed equally.   Kian Kenyon-Dean11footnotemark: 1   Jackie Chi Kit Cheung Mila - Québec AI Institute, McGill University Montreal, Canada {edward.newell,kiankd}@gmail.com[email protected]

1 Introduction

Word embeddings have been established as standard feature representations for words in most contemporary NLP tasks (kim2014convolutional; huang2015bidirectional; goldberg2016primer). Their incorporation into larger models – from CNNs and LSTMs for sentiment analysis (zhang2018deep), to sequence-to-sequence models for machine translation (qi2018and), to the input layer of deep contextualized embedders (peters2018deep) – enables high quality performance across a wide variety of problems.

Being the building blocks for many modern NLP applications, we argue that it is worthwhile to subject word embedding algorithms to close theoretical inspection. This work can be considered a retrospective analysis of the ground-breaking word embedding algorithms of the past, which simultaneously offers theoretical insights for how future, deeper models can be developed and understood. Indeed, analogous to a watchmaker who curiously scrutinizes the mechanical components comprising her watches’ oscillators, so too do we aim to uncover what makes word embeddings “tick”.

It is well-known that word embedding algorithms train two sets of embeddings: the vectors (“input” vectors) and the covectors (“output”, or, “context” vectors). However, the covectors tend to be regarded as an afterthought when used by NLP practitioners, either being thrown away (mikolov2013distributed), or averaged into the vectors (pennington2014glove; levy2015improving).

Nonetheless, recent work has found that separately incorporating pretrained covectors into downstream models can improve performance in specific tasks. This includes lexical substitution (melamud2015simple; roller2016pic), information retrieval (nalisnick2016improving), state-of-the-art metaphor detection (mao2018word) and generation (yu2019avoid), and more (press2017using; asr2018querying; deugirmenci2019waste). In this work, we contribute an engaged theoretical treatment of covectors, and later elucidate the different relationships learned separately by vectors and covectors (§6.5).

Training these vectors and covectors can be done by a variety of high-performing algorithms: the sampling-based shallow neural network of SGNS (mikolov2013distributed), GloVe’s weighted least squares over global corpus statistics (pennington2014glove), and matrix factorization methods (levy2014neural; levy2015improving; shazeer2016swivel). In this work, we propose a framework for understanding these algorithms from a common vantage point. We deconstruct each algorithm into its constituent parts, and find that, despite their many different hyperparameters, the algorithms collectively intersect upon the following two key design features:

  1. 1.

    vector-covector dot products are learned to approximate pointwise mutual information (PMI) statistics in the corpus; and,

  2. 2.

    modulation of the loss gradient, directly or indirectly, to balance weak and strong signals arising from the highly imbalanced distribution of corpus statistics.

Finding these commonalities across algorithms, we beg the question of whether or not these features are sufficient for reconstructing a new word embedding algorithm. Indeed, we derive and implement a novel embedding algorithm, Hilbert-MLE 11 1 The name is inspired by the intuitions behind this work concerning Hilbert spaces, and the maximum likelihood estimate that defines the model’s loss function., by following these two principles to derive their corresponding global matrix factorization loss function based on the maximum likelihood estimate of the multinomial distribution of corpus statistics.

Figure 1: Abstract depiction of the direction of algorithmic derivation presented in this work, in contrast to levy2014neural.

However, due to the infeasibility of matrix factorization objectives for large vocabulary sizes, we further derive a local sampling-based formulation by algebraically deconstructing Hilbert-MLE’s global objective function. As we abstractly depict in Figure 1, this derivation can be seen as a mirrored derivation of that which is presented by levy2014neural, who derived the global matrix factorization for SGNS from the original local sampling formulation (mikolov2013distributed).

We find that Hilbert-MLE produces word embeddings that earn equivalent or better performance against SGNS and GloVe across 17 intrinsic and extrinsic datasets, therefore demonstrating the sufficiency of the two principles for designing a word embedding algorithm.

To summarize, this work offers the following contributions:

  • Theoretical deconstruction of the existing dominant word embedding algorithms toward two common features (§4);

  • Theoretical reconstruction of a novel embedding algorithm, Hilbert-MLE, based solely on these two features (§5) which algorithmically derives a sampling-based implementation in a novel manner (§5.1);

  • Empirically demonstrating the sufficiency of the two common principles for designing a word embedding algorithm (§6).

2 Fundamental concepts

In this section, we introduce notation and concepts that we will draw upon throughout this paper. This includes formally defining embeddings, their vectors and covectors, and pointwise mutual information (PMI).

Embedding.

In general topology, an embedding is understood as an injective “structure preserving map”, f:XY, between two mathematical structures X and Y. A word embedding algorithm (f) learns an inner-product space (Y) to preserve a linguistic structure within a reference corpus of text, 𝒟 (X), based on the words in a vocabulary, 𝒱. The structure in 𝒟 is analyzed in terms of the relationships between words induced by their appearances in the corpus. In such an analysis, each word figures dually: (1) as a focal element inducing a local context; and (2) as elements of the local contexts induced by focal elements. To make these dual roles explicit, we distinguish two copies of the vocabulary: the focal words 𝒱T (or, terms), and the context words 𝒱C.

An embedding consists of two maps:

𝒱C1×d  𝒱Td×1ii|     j|j.

We use Dirac notation to distinguish vectors |j, associated to focal words, from covectors i|, associated to context words. In matrix notation, |j corresponds to a column vector and i| to a row vector. Their inner product is i|j; this inner product completely characterizes the learned vector space. We will later demonstrate that many word embedding algorithms, intentionally or not, learn a vector space where the inner product between a focal word j and context word i aims to approximate their PMI in the reference corpus: i|jPMI(i,j).

Pointwise mutual information (PMI).

PMI is a commonly used measure of association in computational linguistics, and has been shown to be consistent and reliable for many tasks (terra2003frequency). It measures the deviation of the cooccurrence probability between two words i and j from the product of their marginal probabilities:

PMI(i,j):=lnpijpipj=lnNNijNiNj, (1)

where pij is the probability of word i and word j cooccurring (for some notion of cooccurrence), and where pi and pj are marginal probabilities of words i and j occurring. The empirical PMI can be found by replacing probabilities with corpus statistics. Words are typically considered to cooccur if they are separated by no more than w words; Nij is the number of counted cooccurrences between a context i and a term j; Ni, Nj, and N are computed by marginalizing over the Nij statistics.

3 Word embedding algorithms

We will now introduce the low rank embedder framework for deconstructing word embedding algorithms, inspired by the theory of generalized low rank models (udell2016generalized). We unify several word embedding algorithms by observing them all from the common vantage point of their global loss function. Note that this framework is only used for theoretical analysis, not practical implementation.

The global loss function for a low rank embedder takes the form:

=(i,j)𝒱C×𝒱Tfij(ψ(i|,|j),ϕ(i,j)), (2)

and satisfies

fijψij=0atψij=ϕij, (3)

where ψ(i|,|j) is a kernel function, and ϕ(i,j) is some scalar function (such as a measure of association based on how i and j appear in the corpus); ψij and ϕij are abbreviations for the same.

The design variable ϕij is some function of corpus statistics, and its purpose is to quantitatively measure some relationship between word i and j. In apposition, the design variable ψij is a function of model parameters, and its purpose is to learn a succinct approximation:

ψijϕij,

and so represent the relationship measured by ϕij. Intuitively, we can think of a low rank embedder as trying to directly fit a kernel function of model parameters ψij to some (statistical) relationship between words ϕij of our choosing. For example, SGNS takes ϕij:=PMI(i,j)-lnk and ψij:=i|j, and then learns parameter values that approximate i|jPMI(i,j)-lnk.

Though the specific choice of ϕij varies slightly, existing low rank embedders generally base ϕij on cooccurrence of words within a linear window w words wide. But it is worth pointing out that ϕij can in principle be any pairwise relationship encoded as a scalar function of corpus statistics.

As for the kernel function ψij, one simple choice is to take ψij=i|j. But the framework allows any function that is symmetric and positive definite. This allows the framework to include the use of bias parameters (e.g. in GloVe) and subword parameterization (e.g. in FastText).

To understand the range of models encompassed, it is helpful to see how the framework relates (but is not limited) to matrix factorization. We can think of ϕij and ψij as providing the entries of two matrices:

𝐌:=[ϕij]ij𝐌^:=[ψij]ij.

For models that take ψij=i|j, we can write 𝐌^=𝐖𝐕, where 𝐖 is defined as having row i equal to i|, and 𝐕 as having column j equal to |j. Then, the loss function can be rewritten as:

=(i,j)𝒱C×𝒱Tfij((𝐖𝐕)ij,𝐌ij).

This loss function can be interpreted as matrix reconstruction error, because the constraint in Eq. 3 means that the gradient goes to zero as 𝐖𝐕𝐌.

Selecting a particular low rank embedder instance requires key design choices to be made: we must chose the embedding dimension d, the form of the loss terms fij, the kernel function ψij, and the association function ϕij. Only the gradient of fij actually affects the algorithm. The derivative of fij with respect to ψij, which we call the characteristic gradient, helps compare models because it exhibits the action of the gradient yet is symmetric in the parameters. Thus, we address a specific embedder by the tuple (d,fijψij,ψij,ϕij,).

In the following subsections, we present the derivations of fijψij, ψij, and ϕij for each algorithm. We later (§4) compare the algorithms and summarize their derivations in Table 1.

3.1 SGNS as a low rank embedder

levy2014neural provided the important result that skip gram with negative sampling (SGNS) (mikolov2013distributed) was implicitly factorizing the PMI-lnk matrix. However, levy2014neural did not derive the loss function needed to explicitly pose SGNS as matrix factorization, and required additional assumptions for their derivation to hold. Moreover, empirically, they used SVD on the related (but different) positive-PMI matrix, and this did not reproduce the results of SGNS. In other work, li2015word, provided an explicit MF formulation of SGNS from a “representation learning” perspective. This derivation diverges from levy2014neural’s result, and masks the connection between SGNS and other low rank embedders. In this work, we derive the complete global loss function for SGNS, free of additional assumptions.

The loss function of SGNS is as follows:

=-(i,j)D2{lnσi|j+=1k𝔼[ln(1-σi|j)]},

where σ is the logistic sigmoid function, D2 is a list containing each cooccurrence of a context-word i with a focal-word j in the corpus, and the expectation is taken by drawing i from the (smoothed) unigram distribution to generate k “negative samples” for a given focal-word. (mikolov2013distributed).

We rewrite this by counting the number of times each pair occurs in the corpus, Nij, and the number of times each pair is drawn as a negative sample, Nij-, while indexing the sum over the set 𝒱C×𝒱T:

=-(i,j)𝒱C×𝒱T{Nijlnσi|j+Nij-ln(1-σi|j)}.

We can now observe that the global loss is almost in the required form for a low rank embedder (Eq. 2), and that the appropriate setting for the model approximation function is ψij=i|j. The characteristic gradient is derived as, using the identity a(a+b)σ(lnab):

fijψij =i|j=Nij-σi|j-Nij(1-σi|j)
=(Nij+Nij-)[σ(i|j)-σ(lnNijNij-)].

This provides that the association function for SGNS is ϕij=ln(Nij/Nij-), since the derivative will be equal to zero at that point (Eq. 3). However, recall that negative samples are drawn according to the unigram distribution (or a smoothed variant (levy2015improving)). This means that Nij-=kNiNj/N. Therefore, in agreement with levy2014neural, we find that:

ϕij=lnNijNNiNjk=PMI(i,j)-lnk. (4)

3.2 GloVe as a low rank embedder

GloVe (global vectors) was proposed as a method that strikes a halfway point between local sampling and global matrix factorization, taking the best parts from both solution methods (pennington2014glove). Its efficiency came from the fact that it only performs a partial factorization of the lnNij matrix, only considering samples where Nij>0. We will demonstrate that GloVe is not so different from SGNS, and that it too implicitly factorizes the PMI matrix.

GloVe’s loss function is defined as follows:

=ijh(Nij)(i|j+bi+bj-lnNij)2h(Nij)=min(1,(NijNmax)α), (5)

where bi and bj are learned bias parameters; Nmax and α are empirically tuned hyperparameters for the weighting function h(Nij), which has h(Nij)=0 when Nij=0.

GloVe can be cast as a low rank embedder by using the model approximation function as a kernel with bias parameters, and setting the association measure to simply be the objective of the loss function:

ψij=[i|1i|dbi 1] [|j1|jd  1bj],
andϕij =lnNij.

Let us observe the optimal solution to the loss function, when fijψij=0:

fijψij =2h(Nij)[i|j+bi+bj-lnNij]=0
i|j+bi+bj=lnNij.

Multiplying the log operand by 1:

i|j+bi +bj=ln(NiNjNNNiNjNij) (6)
=lnNiN+lnNjN+PMI(i,j). (7)

On the right side, we have two terms that depend respectively only on i and j, which are candidates for the bias terms. Based on this equation alone, we cannot draw any conclusions. However, empirically the bias terms are in fact very near NiN and NjN, and PMI is nearly centered at zero, as can be seen in Fig. 2. This means that Eq. 7 provides i|jPMI(i,j).

Figure 2: A) Histogram of PMI(i,j) values, for all pairs (i,j) with Nij>0, from the corpus described in §6. B) Scatter plot of GloVe’s learned biases after 10 epochs, using default hyperparameter settings.

Analyzing the optimum of GloVe’s loss function yields important insights. First, GloVe can be added to the list of low rank embedders that learn a bilinear parameterization of PMI. Second, we can see why such a parameterization is advantageous. Generally, it helps to standardize features of low rank models (udell2016generalized), and this is essentially what transforming cooccurrence counts into PMI achieves. Thus, PMI can be viewed as a parameterization trick, providing an approximately normal target association to be modelled.

3.3 Other algorithms as low rank embedders

We now present additional algorithms that can be cast as low rank embedders: LDS (arora2016latent) and FastText (joulin2017bag). The derivations for SVD (levy2014neural; levy2015improving) and Swivel (shazeer2016swivel) as low rank embedders are trivial, as both are already posed as matrix factorizations of PMI statistics.

LDS.

arora2016latent introduced an embedding perspective based on generative modelling with random walks through a latent discourse space (LDS). While their only experiments were on analogy completion tasks (which do not correlate well with downstream performance (linzen2016issues; faruqui2016problems; rogers2017too)) LDS provided a theoretical basis for the surprisingly well-performing SIF document embedding algorithm soon afterwards (arora2017simple). We now demonstrate that LDS is also a low-rank embedder.

The low rank learning objective for LDS follows directly from Corollary 2.3, in arora2016latent:

PMI(i,j)=i|jd+γ+O(ϵ).

fijψij can be found by straightforward differentiation of LDS’s loss function:

=ijh(Nij)[lnNij-i|+|j2-C]2,

where h(Nij) is as defined by GloVe. The quadratic term is a valid kernel function because:

fijψij=i|+|j2=i~|j~,

where

i|~ =[2i|12i|di|i|    1  ],
|j~ =[2|j12|jd     1|j|j].

FastText.

Proposed by joulin2017bag, FastText’s motivation is orthogonal to the present work. It’s purpose is to provide subword-based representation of words to improve vocabulary coverage and generalizability of word embeddings. Nonetheless, it can also be understood as a low rank embedder.

In FastText, the vector for each word is taken as the sum of embeddings for its character n-grams, 3n6. Then the vector |j is given by the feature function |j=gz(j)|g, where |g is the vector for n-gram g, and z(j) is the set of n-grams in word j. Meanwhile covectors are accorded to words directly, rather than using n-gram covector embeddings. This provides ψij=i|j, and, by virtue of using the SGNS loss function, ϕij=PMI(i,j)-lnk.

Model fijψij ψij ϕij i|j
\Xhline3 SGNS (Nij+Nij-)[σ(ψij)-σ(ϕij)] i|j lgNijNij- PMI(i,j)-lnk
GloVe   2h(Nij)[ψij-ϕij]   i|j+bi+bj lgNij PMI(i,j)
LDS   4h(Nij)[ψij-ϕij+C] i|+|j2 lgNij dPMI(i,j)-dγ
Swivel   Nij[ψij-ϕij] i|j PMI(i,j) PMI(i,j)
  1σ(ψij-ϕij) PMI*(i,j) PMI*(i,j)
\Xhline3 Hilbert-MLE   (pipj)1τ[eψij-eϕij] i|j PMI(i,j) PMI(i,j)
Table 1: Comparison of low rank embedders. Final column shows the value of i|j at fijψij=0. GloVe and LDS set fij=0 when Nij=0. Swivel takes one form when Nij>0 (first row) and another when Nij=0 (second row). Nij- is the number of negative samples. For other symbols see: SGNS (mikolov2013distributed), GloVe (pennington2014glove), LDS (arora2016latent), Swivel (shazeer2016swivel).

4 Deconstructing the algorithms

Table 1 presents a summary of our derivations of existing algorithms as low rank embedders.

We observe several common features between each of the algorithms. In each case, fijψij takes the form (multiplier)(difference). The multiplier is always a “tempered” version of Nij (or NiNj), by which we mean that it increases sublinearly with Nij (or NiNj)22 2 In SGNS, Nij-NiNj; Nij and Nij- are tempered by undersampling and unigram smoothing..

Furthermore, for each algorithm, ϕij is equal to PMI or a scaled log of Nij. Yet, the choice of ψij in combination with ϕij provides that every model is optimized when i|j tends toward PMI(i,j) (with or without a constant shift or scaling). We have already seen that the optimum for SGNS is equivalent to the shifted PMI (§3.1). For GloVe, we theoretically and empirically showed that incorporation of the bias terms captures the unigram counts needed for PMI (§3.2). We observe this property similarly with regards to LDS’s incorporation of the L2 norm into its learning objective, where we suspect that the unigram probability is implicitly captured in the norms of the respective vectors and covectors (§3.3).

Therefore, we observe that these embedders converge on two key points: (1) an optimum in which model parameters are bilinearly related to PMI, and (2) the weighting of fijψij by some tempered form of Nij (or NiNj). In the next section, we introduce Hilbert-MLE, which is derived based on the shared principles observed between the algorithms in Table 1.

5 Reconstructing an algorithm

If the two basic principles that we have identified are sufficient, then the simplest low rank embedder should be one that derives from them without any other assumptions.

We begin with principle (1), which prescribes a bilinear parameterization of PMI. The definition of PMI (Eq. 1) provides a log-bilinear parameterization of cooccurrence probability, p^ij, if we presuppose that the aim of our model is to approximate the PMI with vector-covector dot products:

i|jPMI(i,j)=lnpijpipjp^ij=pipjei|j. (8)

In the expression above, p^ij represents the model’s estimate of the cooccurrence probability, provided by the parameterization which includes the unigram probabilities pi and pj.

Accordingly, given the matrix of covectors 𝐖 and vectors 𝐕, the likelihood of the observed cooccurrence statistics, 𝒟={Nij}ij, is distributed like the multinomial, Mult({p^ij}ij,N):

Pr(𝒟|𝐕,𝐖)=N!ijp^ijNijNij!,

where p^ij depends on 𝐕 and 𝐖 (whose rows and columns are respectively i| and |j) through Eq. 8. Taking the negative log likelihood as the loss:

=-ijNijlnp^ij, (9)

where we have dropped constant terms that do not affect the gradient.

The unitarity axiom of probability requires that ijp^ij=1. Including this constraint with a Lagrange multiplier, we obtain:

=(-ijNijlnp^ij)+λ(1-ijp^ij). (10)

At the feasible optimum, the original loss and constraint gradients should balance:

p^ij=-Nijp^ij-λ=0 (11)
λp^ij=-Nij. (12)

Eq. 12 represents |𝒱C×𝒱T| equations, one for each pair (i,j). Summing these equations together,

λijp^ij =-ijNijλ=-N. (13)

The constrained loss function is therefore,

=(-ijNijlnp^ij)-N(1-ijp^ij). (14)

Reintroducing the bilinear parameterization (Eq. 8), and dividing through by N to eliminate dependence on corpus size:

=ij(pipjei|j-piji|j), (15)

where, again, we have dropped constant terms that do not affect the gradient. Finally by differentiating we obtain the characteristic gradient:

fijψij=fiji|j=pipj[ei|j-ePMI(i,j)]=p^ij-pij. (16)

This yields a loss gradient closely resembling other members of the low rank embedders. Empirically, its performance is on par with the other low rank embedders (see §6).

The multiplier, pipj, determines how errors in fitting individual (i,j) pairs trade off. While it appropriately favors fitting statistics with lower standard error, the signal from rarer pairs will be weak for any non-divergent learning rate because pipj spans orders of magnitude. This slows down training. So, we apply a gradient conditioning measure as is done for the other low rank embedders: we apply a temperature parameter, τ, that reduces differences in magnitude of the multiplier:

fijψij =(pipj)1/τ[ei|j-ePMI] (17)

5.1 Solving the objective function

The objective function presented in Equation 15 is most straightforwardly solved via dense matrix factorization. This can be done relatively efficiently by using the sharding method presented by shazeer2016swivel for matrix factorization. Such a solution is acceptable given a small vocabulary size, but does not scale to large vocabularies, due to the quadratic dependency. GloVe (pennington2014glove) handled this problem by only training on statistics where Nij>0. levy2014neural; levy2015improving avoided the quadratic dependency by implementing sparse SVD on the positive-PMI matrix. However, both of these solutions may be missing out on important information that can be gained by “noticing what’s missing” (shazeer2016swivel).

Yet, SGNS (mikolov2013distributed) was never confronted with the vocabulary size problem due to the fact that it uses local sampling over the corpus. While this yields a linear time complexity on the corpus size, this is generally preferable to a quadratic memory complexity on the vocabulary size. levy2014neural derived the global matrix factorization formulation of SGNS by moving in the algorithmic direction of local to global. Conversely, we will now move in the direction of global to local and derive the local sampling formulation of the Hilbert-MLE loss function.

Locally sampling Hilbert-MLE.

Note how if we differentiate the loss function of Hilbert-MLE (Equation 15) relative to an arbitrary model parameter θ, we obtain a difference between two expectations:

θ=ijpipjei|ji|jθ-ijpiji|jθ=𝔼(i,j)p^ij[i|jθ]-𝔼(i,j)pij[i|jθ]. (18)

In words, the derivative of the loss function is the difference between the expected value of i|jθ when taken under the model distribution on one hand and under the corpus distribution on the other.

This leads to a remarkably simple training algorithm. Draw a sample of word pairs (i,j) from the corpus (using a local sampling approach as in SGNS), and draw a sample of pairs from the model distribution. Compute 𝔼[i|j] for both samples, and take their difference. The gradient of the result estimates the gradient of .

In the context of autodifferentiation libraries such as PyTorch (paszke2017automatic) it is adequate to use a simplified loss function ~,

~=𝔼(i,j)p^ij[i|j]-𝔼(i,j)pij[i|j], (19)

because in the first term, the autodifferential operator will ignore the appearance of model parameters in the distribution according to which the expectation 𝔼(ij)p^ij is taken, but will recognize model parameters in the expectation’s operand i|j. Thus autoθ~=θ.

Like SGNS, this uses positive samples drawn from the corpus, balanced against negative samples. But unlike SGNS, which draws negative samples according to the (distorted) unigram distribution, here we draw negative samples from model distribution p^ij. This can be done efficiently using Gibbs sampling, making this a form of contrastive divergence (hinton2002training; carreira2005contrastive). To approximately sample a cooccurring pair (i,j) from the model distribution, we start from a corpus-derived pair, and repeatedly perform Gibbs sampling steps: randomly fix either i or j and re-sample the other from the model distribution conditioned on the fixed variable. E.g. if we fix i, then we draw a new j from jp^(j|i)=pjei|j. Sampling from the conditional distribution can be done in constant time using an adaptive softmax (grave2017efficient). In theory, the model distribution is approximated after taking many Gibbs sample steps, but consistent with Hinton’s findings for contrastive divergence (hinton2002training; carreira2005contrastive), we find that a single Gibbs sampling step supports efficient training.

6 Experiments

We provide a simple set of experiments comparing the two characteristic models for word embeddings with ours: SGNS and GloVe against Hilbert-MLE. Our aim in these experiments is simply to verify the sufficiency of the principles we used to derive Hilbert-MLE (§5). In other words, we are testing the following hypothesis: if the principles we have proposed are sufficient for designing a word embedding algorithm, then Hilbert-MLE should perform equivalently or better than SGNS and GloVe, which were proposed with different motivating principles (mikolov2013distributed; pennington2014glove).

In our experiments, we use a matrix factorization implementation of Hilbert-MLE as the characteristic form of the model. During experimentation, we found that the Gibbs sampling implementation of Hilbert-MLE (§5.1) performed equivalently, as expected. We present results on word similarity (§6.1), analogical reasoning (§6.2), text classification (§6.3), and sequence labelling (§6.4).

Our reference corpus combines Gigaword 3 (graff2007english) with a Wikipedia 2018 dump, lower-cased, yielding 5.4 billion tokens. We limit 𝒱C and 𝒱T to be the 50,000 most frequent tokens in 𝒟. We use a 5-token context window, and d=300. We use the released implementations and hyperparameter choices of SGNS and GloVe. Our implementation of Hilbert-MLE uses PyTorch to take advantage of GPU-acceleration, automatic differentiation (paszke2017automatic), and the Adam gradient descent optimizer (kingma2014adam). Practically, Hilbert-MLE was implemented by using sharding (shazeer2016swivel). We use a single 12-GB GPU, and load 12500×12500-element shards to calculate each update. Training embeddings took less than 3 hours for a 50,000 word vocabulary.

Intrinsic eval. B143 MENd/t RMT RARE SE17 S999 WS-R/-S Y130 Analogy
SGNS .453 .753/.763 .680 .515 .656 .364 .589/.760 .514 .763/.312
GloVe .347 .760/.771 .663 .509 .662 .391 .602/.727 .541 .739/.310
Hilbert-MLE .397 .751/.761 .684 .579 .680 .462 .593/.765 .514 .705/.343
Table 2: Performance on intrinsic evaluation datasets; first 8 columns are on word similarity tasks (§6.1), final column is analogy tasks (Google analogies/BATS) (§6.2). The best result is in bold.

6.1 Word similarity

A word similarity task involves interpreting the cosine similarity between embeddings as a measure of similarity or relatedness between words. Performance is computed with the Spearman rank correlation coefficient between the model’s scoring of all pairs of words versus the gold standard human scoring. These tasks can reflect the degree of linear structure captured in the embeddings, which can provide useful insights into differences between models. However, they do not always correlate with performance in downstream tasks (chiu2016intrinsic; faruqui2016problems).

We used the following word similarity datasets: Simlex999 (S999) (hill2015simlex); Wordsim353 (finkelstein2002placing) divided into similarity (WS-S) and relatedness (WS-R) (agirre2009study); the SemEval 2017 task (SE17) (camacho2017semeval); Radinsky Mechanical Turk (RMT) (radinsky2011word); Baker Verbs 143 (B143) (baker2014unsupervised); Yang Powers Verbs 130 (Y130) (yang2006verb); MEN divided into a 2000-sample development set (MENd) and 1000-sample test set (MENt) (bruni2012distributional); Rare Words (RARE) (luong2013better). We had an average of 96% coverage over all word-pairs in each dataset, excepting RARE; we had 31% coverage over RARE, yielding 620 word pairs (i.e., more samples than SE17, RMT, WS-S/-R, and Y130). Results are computed on these covered word-pairs.

Word similarity results.

Table 2 presents results across the 10 word similarity tasks. We observe that Hilbert-MLE obtains the best performance in 5 out of 10 tasks. In particular, Hilbert-MLE obtains substantially better scores on S999 than the other models, earning a Spearman correlation coefficient of 0.462, an 18% relative improvement over the next best (SGNS). Note that S999 has been shown to have a high correlation with performance in extrinsic tasks such as Named Entity Recognition and NP-chunking, unlike the other word similarity datasets (chiu2016intrinsic). On the tasks with worse performance, we observe that the differences between the three algorithms are relatively marginal. Base on these experiments, and the ones that follow, we can therefore conclude that our hypothesis (§6) is valid.

6.2 Analogical reasoning

We performed intrinsic evaluation of our embeddings using standard analogy tasks (e.g., “man” is to “woman” as “king” is to X). We evaluated on the Google Analogy dataset (Google) (mikolov2013efficient) and the Balanced Analogy Test Set (BATS) (gladkova2016analogy). We observed 86% and 69% coverage of the words in each dataset, respectively. Preliminary experiments using 3CosAdd and 3CosMul (levy2014linguistic) as selection rules, showed 3CosMul was always superior, consistent with the findings of levy2014linguistic.

Analogy results.

Table 2 presents results on the two analogy datasets in the final column. Hilbert-MLE performs somewhat worse than the other models on the Google Analogy dataset. However, there has been a considerable amount of work finding that performance on these tasks does not necessarily provide a reliable judgment for embedding quality (faruqui2016problems; linzen2016issues; rogers2017too). Indeed, we can see that performance on the Google Analogy dataset does not correspond with performance on the other larger analogy dataset (BATS), where Hilbert-MLE gets the best performance.

Classification IMDB AGNews
SGNS .910 ± .001 .812 ± .003
GloVe .905 ± .001 .807 ± .003
Hilbert-MLE .911 ± .002 .812 ± .003
Table 3: Classification results when using a BiLSTM-max encoder. Best is bold.
Seq. labelling Semcor WSJ Brown
Baseline .6126 .8905 .9349
SGNS .6638 .9615 .9762
GloVe .6550 .9609 .9750
Hilbert-MLE .6663 .9617 .9767
Table 4: Sequence labelling results when using a BiLSTM. Best is bold.

6.3 Text classification

We performed extrinsic evaluation for classification tasks on two benchmark NLP classification datasets. First, the IMDB movie reviews dataset for sentiment analysis (maas2011learning), divided into train and test sets of 25,000 samples each. Second, the AGNews news classification dataset, as divided into 8 approximately 12,000-sample classes (such as Sports, Health, and Business) by kenyon2018clustering; here, we separate 30% of the samples as the final test set. On each, we separate 10% of the training set for validation tuning.

We use a standard BiLSTM-max sequence encoder for these tasks (conneau2017supervised). This model produces a sequence representation by max-pooling over the forward and backward hidden states produced a bidirectional LSTM. This representation is then passed through a MLP before final prediction. We found that validation performance was optimized with a 1-layer 128-d BiLSTM, followed by a 512-d MLP using a ReLU activation, a minibatch size of 64, dropout rate of 0.5, and normalizing the embeddings before input. We use a learning rate of 0.001 with Adam, and divide the learning rate by a factor of 10 if validation performance does not improve in 3 epochs, similar to conneau2017supervised; we schedule the learning rate in the same way for sequence labelling (§6.4).

Classification results.

In Table 3 we present the test set results from our classification experiments. We trained each BiLSTM-max 10 times with different random seeds for weight initialization and present the mean test accuracy plus/minus the standard deviation. These results show that each embedding model is similar, although GloVe is slightly worse than the others. Meanwhile, SGNS and Hilbert-MLE perform approximately the same, obtaining high quality results on both tasks.

6.4 Sequence labelling

Our final extrinsic evaluations are sequence labelling tasks on three datasets. The first task is supersense tagging (SST) (ciaramita2006broad) on the Semcor 3.0 corpus (miller1993semantic). SST is a coarse-grained semantic sequence labelling problem with 83 unique labels; we report results using the micro-F-score without the O-tag score due to the skew of label distribution, as is standard (alonso2017multitask; changpinyo2018multi). We divide Semcor into a 70-30% train-test split, and use 10% of the training set for validation tuning. The second task is syntactic part-of-speech tagging (POS); we use the Penn TreeBank Wall Street Journal corpus (WSJ) (marcus1993building) and the Brown corpus (as distributed by NLTK33 3 https://www.nltk.org/book/ch02.html). For the WSJ, we use the given 44-tag tagset, and for Brown we map the original tags to the 12-tag “universal tagset” (petrov2012universal). We use sections 22, 23, and 24 of the WSJ corpus as its test set, and separate out 30% of the sentences in Brown as its test set.

On each dataset, we train a standard sequence labelling model inspired by huang2015bidirectional: a 2-layer, 128-d bidirectional LSTM, using a minibatch size of 16, and a dropout rate of 0.5. Interestingly, we found that normalizing the embeddings substantially reduced validation performance, so we keep them in their original form.

Sequence labelling results.

To accompany our results in Table 4, we include results from a trivial most-frequent-tag baseline. This baseline returns that the tag of a token is the most frequently occurring tag for that token within the training set. In SST it is standard to include results from a most-frequent-supersense baseline, being inspired from the tradition of word sense disambiguation, which uses the most-frequent-sense baseline.

The results for the BiLSTMs are the mean test set score across 10 different runs with different random seeds for weight initialization. The low standard deviations were approximately the same for each model. As in the classification tasks, we find that the embeddings produced by each model obtain very similar results. Nonetheless, we observe that Hilbert-MLE offers marginal improvements over the others. Note that our performance on WSJ and Brown is expected since we use vanilla BiLSTMs that do not include any hand engineered character- or context-based features. Indeed, huang2015bidirectional report results of 96.04% on the WSJ with their vanilla BiLSTM, which suggests that our embeddings possess strong syntactic properties.

6.5 Qualitative analysis

Argmaxi Top most similar embeddings
|i|𝑐𝑎𝑡 kittens, cats, kitten, poodle
i|𝑐𝑎𝑡 burglar, siamese, schrödinger
|i|𝑚𝑜𝑛𝑒𝑦 funds, monies, billions, cash
i|𝑚𝑜𝑛𝑒𝑦 laundering, launder, extort
|i|𝑐𝑢𝑏𝑎 cubans, cuban, anti-castro
i|𝑐𝑢𝑏𝑎 gooding, guantanamo, havana
Table 5: Qualitative analysis of the difference between the embedding recovery between vectors and vectors (|i|j) versus between vectors and covectors (i|j).

We provide a final set of qualitative results in Table 5. Here, we use the vectors and covectors trained by the Hilbert-MLE model used in our experiments. These results elucidate the difference between using vector-vector similarity versus vector-covector dot product similarity (results are practically the same when using cosine similarity). The vector-vector similarity is well known as a way to measure semantic similarity between two concepts captured in word embeddings. As expected, we see recoveries like “cat” is similar to “kitten”, “money” with “funds”, etc.

However, when we instead obtain the most similar covector to the corresponding vector, the results are dramatically different. We see that the vector for “cat” is most similar to covectors for words with which it forms multi-word expressions: “cat burglar”, ”siamese cat”, ”schrödinger’s cat”. We see that “cuba” is most similar to the covector for “gooding” – this is because Cuba Gooding Jr. is a famous American actor whose Wikipedia page appears in our corpus. Indeed, the vector-covector recoveries are directly correlated to the PMIs between the terms in the corpus.

Overall, we see that vector-vector dot products recover semantic similarity, while vector-covector dot products recover co-occurrence similarity. melamud2015simple and asr2017artificial discuss these different statistical recoveries as paradigmatic (target-to-target) and syntagmatic (target-to-context) recoveries, respectively. However, to our knowledge, previous work has not explicitly explained the reason for these two different types of recoveries; i.e., because the learning objective for word embeddings is to approximate PMI. Therefore, these results qualitatively demonstrate exactly what our hypothesis anticipates: the dot product between trained vectors and covectors approximates the PMI between their corresponding words in the original corpus.

7 Discussion

In the past, probabilistic distributional semantic models for creating word embeddings surpassed the traditional count-based models (turney2010frequency) that preceded them, which was well-established by baroni2014don. At the same time, models like Word2vec (SGNS), GloVe, and SVD of PPMI (mikolov2013efficient; mikolov2013distributed; pennington2014glove; levy2015improving) offered strong improvements (in terms of performance and efficiency) over other probabilistically-motivated embedding models (collobert2008unified; mnih2009scalable; turian2010word; mnih2013learning).

Today, NLP seems to be orienting toward deep contextualized models (peters2018deep; devlin2018bert). Nonetheless, pretrained word embeddings are still highly relevant. Indeed, they have been used recently to greatly assist solving problems in materials science (tshitoyan2019unsupervised), biomedical text mining (zhang2019biowordvec), and law (chalkidis2019deep). Moreover, word embeddings are used in dynamic meta-embeddings to obtain state-of-the-art results (kiela2018dynamic), are used as inputs to ELMO (peters2018deep), and are crucial in memory-constrained NLP contexts (such as in mobile devices, which cannot store large deep neural networks (shu2017compressing)).

We believe a robust understanding of the “shallow” (or, non-deep), uncontextualized embedding models presented in this work is a prerequisite for informed development of deeper models. In this work, we advanced the theoretical understanding of word embeddings by proposing the low rank embedder framework. Cast under this framework, the similarities between many existing algorithms become apparent. After isolating two key principles shared by the low rank embedders—a probabilistically informed bilinear parameterization of PMI and a “tempered” gradient conditioning measure—we demonstrate that these ideas are sufficient to derive Hilbert-MLE, a model that is simpler yet has performance that is equivalent or better than the other models. This provides a parsimonious explanation for the success of the low rank embedders, demonstrating that many idiosyncratic features of other embedders are unnecessary.

The design parameters of our framework have not yet been fully explored. Our framework could be used to model more complex linguistic phenomena, by following the blueprint provided by our derivation of Hilbert-MLE. Moreover, based on our findings concerning the importance of covectors in parameterizing the model’s approximation of PMI, we believe that methods such as retrofitting (faruqui2015retrofitting), subword-based embedding decomposition (stratos2017reconstruction), and dynamic meta-embedding (kiela2018dynamic) could benefit by incorporating covectors into their modelling designs. As well, similar to how the theoretical basis of LDS (arora2016latent) was the grounding for the widely-used SIF document embedding algorithm (arora2017simple), we believe that the theoretical basis provided in this work can inform future development of document embedding techniques.

References