Abstract
Uncontextualized word embeddings are reliable feature representations ofwords used to obtain high quality results for various NLP applications. Giventhe historical success of word embeddings in NLP, we propose a retrospective onsome of the most wellknown word embedding algorithms. In this work, wedeconstruct Word2vec, GloVe, and others, into a common form, unveiling some ofthe necessary and sufficient conditions required for making performant wordembeddings. We find that each algorithm: (1) fits vectorcovector dot productsto approximate pointwise mutual information (PMI); and, (2) modulates the lossgradient to balance weak and strong signals. We demonstrate that these twoalgorithmic features are sufficient conditions to construct a novel wordembedding algorithm, HilbertMLE. We find that its embeddings obtain equivalentor better performance against other algorithms across 17 intrinsic andextrinsic datasets.
Quick Read (beta)
Deconstructing and reconstructing word embedding algorithms
Abstract
Uncontextualized word embeddings are reliable feature representations of words used to obtain high quality results for various NLP applications. Given the historical success of word embeddings in NLP, we propose a retrospective on some of the most wellknown word embedding algorithms. In this work, we deconstruct Word2vec, GloVe, and others, into a common form, unveiling some of the necessary and sufficient conditions required for making performant word embeddings. We find that each algorithm: (1) fits vectorcovector dot products to approximate pointwise mutual information (PMI); and, (2) modulates the loss gradient to balance weak and strong signals. We demonstrate that these two algorithmic features are sufficient conditions to construct a novel word embedding algorithm, HilbertMLE. We find that its embeddings obtain equivalent or better performance against other algorithms across 17 intrinsic and extrinsic datasets.
Deconstructing and reconstructing word embedding algorithms
Edward Newell^{†}^{†}thanks: These authors contributed equally. Kian KenyonDean^{1}^{1}footnotemark: 1 Jackie Chi Kit Cheung Mila  Québec AI Institute, McGill University Montreal, Canada {edward.newell,kiankd}@gmail.com [email protected]
1 Introduction
Word embeddings have been established as standard feature representations for words in most contemporary NLP tasks (kim2014convolutional; huang2015bidirectional; goldberg2016primer). Their incorporation into larger models – from CNNs and LSTMs for sentiment analysis (zhang2018deep), to sequencetosequence models for machine translation (qi2018and), to the input layer of deep contextualized embedders (peters2018deep) – enables high quality performance across a wide variety of problems.
Being the building blocks for many modern NLP applications, we argue that it is worthwhile to subject word embedding algorithms to close theoretical inspection. This work can be considered a retrospective analysis of the groundbreaking word embedding algorithms of the past, which simultaneously offers theoretical insights for how future, deeper models can be developed and understood. Indeed, analogous to a watchmaker who curiously scrutinizes the mechanical components comprising her watches’ oscillators, so too do we aim to uncover what makes word embeddings “tick”.
It is wellknown that word embedding algorithms train two sets of embeddings: the vectors (“input” vectors) and the covectors (“output”, or, “context” vectors). However, the covectors tend to be regarded as an afterthought when used by NLP practitioners, either being thrown away (mikolov2013distributed), or averaged into the vectors (pennington2014glove; levy2015improving).
Nonetheless, recent work has found that separately incorporating pretrained covectors into downstream models can improve performance in specific tasks. This includes lexical substitution (melamud2015simple; roller2016pic), information retrieval (nalisnick2016improving), stateoftheart metaphor detection (mao2018word) and generation (yu2019avoid), and more (press2017using; asr2018querying; deugirmenci2019waste). In this work, we contribute an engaged theoretical treatment of covectors, and later elucidate the different relationships learned separately by vectors and covectors (§6.5).
Training these vectors and covectors can be done by a variety of highperforming algorithms: the samplingbased shallow neural network of SGNS (mikolov2013distributed), GloVe’s weighted least squares over global corpus statistics (pennington2014glove), and matrix factorization methods (levy2014neural; levy2015improving; shazeer2016swivel). In this work, we propose a framework for understanding these algorithms from a common vantage point. We deconstruct each algorithm into its constituent parts, and find that, despite their many different hyperparameters, the algorithms collectively intersect upon the following two key design features:

1.
vectorcovector dot products are learned to approximate pointwise mutual information (PMI) statistics in the corpus; and,

2.
modulation of the loss gradient, directly or indirectly, to balance weak and strong signals arising from the highly imbalanced distribution of corpus statistics.
Finding these commonalities across algorithms, we beg the question of whether or not these features are sufficient for reconstructing a new word embedding algorithm. Indeed, we derive and implement a novel embedding algorithm, HilbertMLE ^{1}^{1} 1 The name is inspired by the intuitions behind this work concerning Hilbert spaces, and the maximum likelihood estimate that defines the model’s loss function., by following these two principles to derive their corresponding global matrix factorization loss function based on the maximum likelihood estimate of the multinomial distribution of corpus statistics.
However, due to the infeasibility of matrix factorization objectives for large vocabulary sizes, we further derive a local samplingbased formulation by algebraically deconstructing HilbertMLE’s global objective function. As we abstractly depict in Figure 1, this derivation can be seen as a mirrored derivation of that which is presented by levy2014neural, who derived the global matrix factorization for SGNS from the original local sampling formulation (mikolov2013distributed).
We find that HilbertMLE produces word embeddings that earn equivalent or better performance against SGNS and GloVe across 17 intrinsic and extrinsic datasets, therefore demonstrating the sufficiency of the two principles for designing a word embedding algorithm.
To summarize, this work offers the following contributions:
2 Fundamental concepts
In this section, we introduce notation and concepts that we will draw upon throughout this paper. This includes formally defining embeddings, their vectors and covectors, and pointwise mutual information (PMI).
Embedding.
In general topology, an embedding is understood as an injective “structure preserving map”, $f:X\to Y$, between two mathematical structures $X$ and $Y$. A word embedding algorithm ($f$) learns an innerproduct space ($Y$) to preserve a linguistic structure within a reference corpus of text, $\mathcal{D}$ ($X$), based on the words in a vocabulary, $\mathcal{V}$. The structure in $\mathcal{D}$ is analyzed in terms of the relationships between words induced by their appearances in the corpus. In such an analysis, each word figures dually: (1) as a focal element inducing a local context; and (2) as elements of the local contexts induced by focal elements. To make these dual roles explicit, we distinguish two copies of the vocabulary: the focal words ${\mathcal{V}}_{T}$ (or, terms), and the context words ${\mathcal{V}}_{C}$.
An embedding consists of two maps:
$\begin{array}{cc}\hfill {\mathcal{V}}_{C}& \u27f6{\mathbb{R}}^{1\times d}\mathit{\hspace{1em}\hspace{1em}}{\mathcal{V}}_{T}\u27f6{\mathbb{R}}^{d\times 1}\hfill \\ \hfill i& \u27fc\u27e8i\hspace{1em}\hspace{1em}\hspace{1em}\hspace{0.5em}\hspace{1em}j\u27fcj\u27e9.\hfill \end{array}$ 
We use Dirac notation to distinguish vectors $j\u27e9$, associated to focal words, from covectors $\u27e8i$, associated to context words. In matrix notation, $j\u27e9$ corresponds to a column vector and $\u27e8i$ to a row vector. Their inner product is $\u27e8ij\u27e9$; this inner product completely characterizes the learned vector space. We will later demonstrate that many word embedding algorithms, intentionally or not, learn a vector space where the inner product between a focal word $j$ and context word $i$ aims to approximate their PMI in the reference corpus: $\u27e8ij\u27e9\approx \mathrm{PMI}(i,j)$.
Pointwise mutual information (PMI).
PMI is a commonly used measure of association in computational linguistics, and has been shown to be consistent and reliable for many tasks (terra2003frequency). It measures the deviation of the cooccurrence probability between two words $i$ and $j$ from the product of their marginal probabilities:
$$\begin{array}{c}\hfill \mathrm{PMI}(i,j):=\mathrm{ln}\frac{{p}_{ij}}{{p}_{i}{p}_{j}}=\mathrm{ln}\frac{N{N}_{ij}}{{N}_{i}{N}_{j}},\end{array}$$  (1) 
where ${p}_{ij}$ is the probability of word $i$ and word $j$ cooccurring (for some notion of cooccurrence), and where ${p}_{i}$ and ${p}_{j}$ are marginal probabilities of words $i$ and $j$ occurring. The empirical PMI can be found by replacing probabilities with corpus statistics. Words are typically considered to cooccur if they are separated by no more than $w$ words; ${N}_{ij}$ is the number of counted cooccurrences between a context $i$ and a term $j$; ${N}_{i}$, ${N}_{j}$, and $N$ are computed by marginalizing over the ${N}_{ij}$ statistics.
3 Word embedding algorithms
We will now introduce the low rank embedder framework for deconstructing word embedding algorithms, inspired by the theory of generalized low rank models (udell2016generalized). We unify several word embedding algorithms by observing them all from the common vantage point of their global loss function. Note that this framework is only used for theoretical analysis, not practical implementation.
The global loss function for a low rank embedder takes the form:
$\mathcal{L}$  $={\displaystyle \sum _{(i,j)\in {\mathcal{V}}_{C}\times {\mathcal{V}}_{T}}}{f}_{ij}(\psi (\u27e8i,j\u27e9),\varphi (i,j)),$  (2) 
and satisfies
$\frac{\partial {f}_{ij}}{\partial {\psi}_{ij}}}=0\mathit{\hspace{1em}}\text{at}\mathit{\hspace{1em}}{\psi}_{ij}={\varphi}_{ij},$  (3) 
where $\psi (\u27e8i,j\u27e9)$ is a kernel function, and $\varphi (i,j)$ is some scalar function (such as a measure of association based on how $i$ and $j$ appear in the corpus); ${\psi}_{ij}$ and ${\varphi}_{ij}$ are abbreviations for the same.
The design variable ${\varphi}_{ij}$ is some function of corpus statistics, and its purpose is to quantitatively measure some relationship between word $i$ and $j$. In apposition, the design variable ${\psi}_{ij}$ is a function of model parameters, and its purpose is to learn a succinct approximation:
${\psi}_{ij}\approx {\varphi}_{ij},$ 
and so represent the relationship measured by ${\varphi}_{ij}$. Intuitively, we can think of a low rank embedder as trying to directly fit a kernel function of model parameters ${\psi}_{ij}$ to some (statistical) relationship between words ${\varphi}_{ij}$ of our choosing. For example, SGNS takes ${\varphi}_{ij}:=\mathrm{PMI}(i,j)\mathrm{ln}k$ and ${\psi}_{ij}:=\u27e8ij\u27e9$, and then learns parameter values that approximate $\u27e8ij\u27e9\approx \mathrm{PMI}(i,j)\mathrm{ln}k$.
Though the specific choice of ${\varphi}_{ij}$ varies slightly, existing low rank embedders generally base ${\varphi}_{ij}$ on cooccurrence of words within a linear window $w$ words wide. But it is worth pointing out that ${\varphi}_{ij}$ can in principle be any pairwise relationship encoded as a scalar function of corpus statistics.
As for the kernel function ${\psi}_{ij}$, one simple choice is to take ${\psi}_{ij}=\u27e8ij\u27e9$. But the framework allows any function that is symmetric and positive definite. This allows the framework to include the use of bias parameters (e.g. in GloVe) and subword parameterization (e.g. in FastText).
To understand the range of models encompassed, it is helpful to see how the framework relates (but is not limited) to matrix factorization. We can think of ${\varphi}_{ij}$ and ${\psi}_{ij}$ as providing the entries of two matrices:
$\mathbf{M}:={\left[{\varphi}_{ij}\right]}_{ij}\mathit{\hspace{1em}}\widehat{\mathbf{M}}:={\left[{\psi}_{ij}\right]}_{ij}.$ 
For models that take ${\psi}_{ij}=\u27e8ij\u27e9$, we can write $\widehat{\mathbf{M}}=\mathrm{\mathbf{W}\mathbf{V}}$, where $\mathbf{W}$ is defined as having row $i$ equal to $\u27e8i$, and $\mathbf{V}$ as having column $j$ equal to $j\u27e9$. Then, the loss function can be rewritten as:
$\mathcal{L}$  $={\displaystyle \sum _{(i,j)\in {\mathcal{V}}_{C}\times {\mathcal{V}}_{T}}}{f}_{ij}({(\mathrm{\mathbf{W}\mathbf{V}})}_{ij},{\mathbf{M}}_{ij}).$ 
This loss function can be interpreted as matrix reconstruction error, because the constraint in Eq. 3 means that the gradient goes to zero as $\mathrm{\mathbf{W}\mathbf{V}}\approx \mathbf{M}$.
Selecting a particular low rank embedder instance requires key design choices to be made: we must chose the embedding dimension $d$, the form of the loss terms ${f}_{ij}$, the kernel function ${\psi}_{ij}$, and the association function ${\varphi}_{ij}$. Only the gradient of ${f}_{ij}$ actually affects the algorithm. The derivative of ${f}_{ij}$ with respect to ${\psi}_{ij}$, which we call the characteristic gradient, helps compare models because it exhibits the action of the gradient yet is symmetric in the parameters. Thus, we address a specific embedder by the tuple $(d,\frac{\partial {f}_{ij}}{\partial {\psi}_{ij}},{\psi}_{ij},{\varphi}_{ij},)$.
In the following subsections, we present the derivations of $\frac{\partial {f}_{ij}}{\partial {\psi}_{ij}}$, ${\psi}_{ij}$, and ${\varphi}_{ij}$ for each algorithm. We later (§4) compare the algorithms and summarize their derivations in Table 1.
3.1 SGNS as a low rank embedder
levy2014neural provided the important result that skip gram with negative sampling (SGNS) (mikolov2013distributed) was implicitly factorizing the $\mathrm{PMI}\mathrm{ln}k$ matrix. However, levy2014neural did not derive the loss function needed to explicitly pose SGNS as matrix factorization, and required additional assumptions for their derivation to hold. Moreover, empirically, they used SVD on the related (but different) positivePMI matrix, and this did not reproduce the results of SGNS. In other work, li2015word, provided an explicit MF formulation of SGNS from a “representation learning” perspective. This derivation diverges from levy2014neural’s result, and masks the connection between SGNS and other low rank embedders. In this work, we derive the complete global loss function for SGNS, free of additional assumptions.
The loss function of SGNS is as follows:
$$\mathcal{L}=\sum _{(i,j)\in {D}_{2}}\left\{\mathrm{ln}\sigma \u27e8ij\u27e9+\sum _{\mathrm{\ell}=1}^{k}\mathbb{E}\left[\mathrm{ln}(1\sigma \u27e8{i}_{\mathrm{\ell}}^{\prime}j\u27e9)\right]\right\},$$ 
where $\sigma $ is the logistic sigmoid function, ${D}_{2}$ is a list containing each cooccurrence of a contextword $i$ with a focalword $j$ in the corpus, and the expectation is taken by drawing ${i}_{\mathrm{\ell}}^{\prime}$ from the (smoothed) unigram distribution to generate $k$ “negative samples” for a given focalword. (mikolov2013distributed).
We rewrite this by counting the number of times each pair occurs in the corpus, ${N}_{ij}$, and the number of times each pair is drawn as a negative sample, ${N}_{ij}^{}$, while indexing the sum over the set ${\mathcal{V}}_{C}\times {\mathcal{V}}_{T}$:
$$\mathcal{L}=\sum _{(i,j)\in {\mathcal{V}}_{C}\times {\mathcal{V}}_{T}}\left\{{N}_{ij}\mathrm{ln}\sigma \u27e8ij\u27e9+{N}_{ij}^{}\mathrm{ln}(1\sigma \u27e8ij\u27e9)\right\}.$$ 
We can now observe that the global loss is almost in the required form for a low rank embedder (Eq. 2), and that the appropriate setting for the model approximation function is ${\psi}_{ij}=\u27e8ij\u27e9$. The characteristic gradient is derived as, using the identity $a\equiv (a+b)\sigma (\mathrm{ln}\frac{a}{b})$:
$\frac{\partial {f}_{ij}}{\partial {\psi}_{ij}}$  $={\displaystyle \frac{\partial \mathcal{L}}{\partial \u27e8ij\u27e9}}={N}_{ij}^{}\sigma \u27e8ij\u27e9{N}_{ij}(1\sigma \u27e8ij\u27e9)$  
$=({N}_{ij}+{N}_{ij}^{})\left[\sigma \left(\u27e8ij\u27e9\right)\sigma \left(\mathrm{ln}{\displaystyle \frac{{N}_{ij}}{{N}_{ij}^{}}}\right)\right].$ 
This provides that the association function for SGNS is ${\varphi}_{ij}=\mathrm{ln}({N}_{ij}/{N}_{ij}^{})$, since the derivative will be equal to zero at that point (Eq. 3). However, recall that negative samples are drawn according to the unigram distribution (or a smoothed variant (levy2015improving)). This means that ${N}_{ij}^{}=k{N}_{i}{N}_{j}/N$. Therefore, in agreement with levy2014neural, we find that:
${\varphi}_{ij}=\mathrm{ln}{\displaystyle \frac{{N}_{ij}N}{{N}_{i}{N}_{j}k}}=\mathrm{PMI}(i,j)\mathrm{ln}k.$  (4) 
3.2 GloVe as a low rank embedder
GloVe (global vectors) was proposed as a method that strikes a halfway point between local sampling and global matrix factorization, taking the best parts from both solution methods (pennington2014glove). Its efficiency came from the fact that it only performs a partial factorization of the $\mathrm{ln}{N}_{ij}$ matrix, only considering samples where ${N}_{ij}>0$. We will demonstrate that GloVe is not so different from SGNS, and that it too implicitly factorizes the $\mathrm{PMI}$ matrix.
GloVe’s loss function is defined as follows:
$\begin{array}{c}\hfill \mathcal{L}={\displaystyle \sum _{ij}}h({N}_{ij}){\left(\u27e8ij\u27e9+{b}_{i}+{b}_{j}\mathrm{ln}{N}_{ij}\right)}^{2}\\ \hfill h({N}_{ij})=\mathrm{min}(1,{\left({\displaystyle \frac{{N}_{ij}}{{N}_{\mathrm{max}}}}\right)}^{\alpha}),\end{array}$  (5) 
where ${b}_{i}$ and ${b}_{j}$ are learned bias parameters; ${N}_{\mathrm{max}}$ and $\alpha $ are empirically tuned hyperparameters for the weighting function $h({N}_{ij})$, which has $h({N}_{ij})=0$ when ${N}_{ij}=0$.
GloVe can be cast as a low rank embedder by using the model approximation function as a kernel with bias parameters, and setting the association measure to simply be the objective of the loss function:
${\psi}_{ij}=\left[{\u27e8i}_{1}\mathrm{\cdots}{\u27e8i}_{d}{b}_{i}\mathrm{\hspace{0.33em}1}\right]$  $\cdot {\left[{j\u27e9}_{1}\mathrm{\cdots}{j\u27e9}_{d}\mathrm{\hspace{0.25em}\hspace{0.25em}1}{b}_{j}\right]}^{\u22ba},$  
$\text{and}\mathit{\hspace{1em}}{\varphi}_{ij}$  $=\mathrm{ln}{N}_{ij}.$ 
Let us observe the optimal solution to the loss function, when $\frac{\partial {f}_{ij}}{\partial {\psi}_{ij}}=0$:
$\frac{\partial {f}_{ij}}{\partial {\psi}_{ij}}$  $=2h({N}_{ij})\left[\u27e8ij\u27e9+{b}_{i}+{b}_{j}\mathrm{ln}{N}_{ij}\right]=0$  
$\u27f9\u27e8ij\u27e9+{b}_{i}+{b}_{j}=\mathrm{ln}{N}_{ij}.$ 
Multiplying the log operand by $1$:
$\u27e8ij\u27e9+{b}_{i}$  $+{b}_{j}=\mathrm{ln}\left({\displaystyle \frac{{N}_{i}{N}_{j}}{N}}{\displaystyle \frac{N}{{N}_{i}{N}_{j}}}{N}_{ij}\right)$  (6)  
$=\mathrm{ln}{\displaystyle \frac{{N}_{i}}{\sqrt{N}}}+\mathrm{ln}{\displaystyle \frac{{N}_{j}}{\sqrt{N}}}+\mathrm{PMI}(i,j).$  (7) 
On the right side, we have two terms that depend respectively only on $i$ and $j$, which are candidates for the bias terms. Based on this equation alone, we cannot draw any conclusions. However, empirically the bias terms are in fact very near $\frac{{N}_{i}}{\sqrt{N}}$ and $\frac{{N}_{j}}{\sqrt{N}}$, and PMI is nearly centered at zero, as can be seen in Fig. 2. This means that Eq. 7 provides $\u27e8ij\u27e9\approx \mathrm{PMI}(i,j)$.
Analyzing the optimum of GloVe’s loss function yields important insights. First, GloVe can be added to the list of low rank embedders that learn a bilinear parameterization of PMI. Second, we can see why such a parameterization is advantageous. Generally, it helps to standardize features of low rank models (udell2016generalized), and this is essentially what transforming cooccurrence counts into PMI achieves. Thus, PMI can be viewed as a parameterization trick, providing an approximately normal target association to be modelled.
3.3 Other algorithms as low rank embedders
We now present additional algorithms that can be cast as low rank embedders: LDS (arora2016latent) and FastText (joulin2017bag). The derivations for SVD (levy2014neural; levy2015improving) and Swivel (shazeer2016swivel) as low rank embedders are trivial, as both are already posed as matrix factorizations of PMI statistics.
LDS.
arora2016latent introduced an embedding perspective based on generative modelling with random walks through a latent discourse space (LDS). While their only experiments were on analogy completion tasks (which do not correlate well with downstream performance (linzen2016issues; faruqui2016problems; rogers2017too)) LDS provided a theoretical basis for the surprisingly wellperforming SIF document embedding algorithm soon afterwards (arora2017simple). We now demonstrate that LDS is also a lowrank embedder.
The low rank learning objective for LDS follows directly from Corollary 2.3, in arora2016latent:
$$\begin{array}{cc}\hfill \mathrm{PMI}(i,j)& =\frac{\u27e8ij\u27e9}{d}+\gamma +O(\u03f5).\hfill \end{array}$$ 
$\frac{\partial {f}_{ij}}{\partial {\psi}_{ij}}$ can be found by straightforward differentiation of LDS’s loss function:
$$\mathcal{L}=\sum _{ij}h({N}_{ij}){\left[\mathrm{ln}{N}_{ij}{\parallel {\u27e8i+j\u27e9}^{\u22ba}\parallel}^{2}C\right]}^{2},$$ 
where $h({N}_{ij})$ is as defined by GloVe. The quadratic term is a valid kernel function because:
$\frac{\partial {f}_{ij}}{\partial {\psi}_{ij}}}={\parallel {\u27e8i+j\u27e9}^{\u22ba}\parallel}^{2}=\u27e8\stackrel{~}{i}\stackrel{~}{j}\u27e9,$ 
where
$\stackrel{~}{\u27e8i}$  $=[\sqrt{2}\u27e8i{}_{1}\mathrm{\cdots}\sqrt{2}\u27e8i{}_{d}\u27e8i\u27e8i{}^{\u22ba}\mathrm{\hspace{0.33em}\hspace{0.33em}\hspace{0.33em}\hspace{0.33em}1}\mathit{\hspace{1em}\hspace{1em}}],$  
$\stackrel{~}{j\u27e9}$  $={[\sqrt{2}{j\u27e9}_{1}\mathrm{\cdots}\sqrt{2}{j\u27e9}_{d}\mathrm{\hspace{0.33em}\hspace{0.33em}\hspace{0.33em}\hspace{0.17em}\hspace{0.33em}1}\mathit{\hspace{1em}}{j\u27e9}^{\u22ba}j\u27e9]}^{\u22ba}.$ 
FastText.
Proposed by joulin2017bag, FastText’s motivation is orthogonal to the present work. It’s purpose is to provide subwordbased representation of words to improve vocabulary coverage and generalizability of word embeddings. Nonetheless, it can also be understood as a low rank embedder.
In FastText, the vector for each word is taken as the sum of embeddings for its character $n$grams, $3\le n\le 6$. Then the vector $j\u27e9$ is given by the feature function $j\u27e9={\sum}_{g\in z(j)}g\u27e9$, where $g\u27e9$ is the vector for $n$gram $g$, and $z(j)$ is the set of $n$grams in word $j$. Meanwhile covectors are accorded to words directly, rather than using $n$gram covector embeddings. This provides ${\psi}_{ij}=\u27e8ij\u27e9$, and, by virtue of using the SGNS loss function, ${\varphi}_{ij}=\mathrm{PMI}(i,j)\mathrm{ln}k$.
Model  $\frac{\partial {f}_{ij}}{\partial {\psi}_{ij}}$  ${\psi}_{ij}$  ${\varphi}_{ij}$  $\u27e8ij\u27e9\approx $ 
\Xhline3 SGNS  $({N}_{ij}+{N}_{ij}^{})\cdot \left[\sigma ({\psi}_{ij})\sigma ({\varphi}_{ij})\right]$  $\u27e8ij\u27e9$  $\mathrm{lg}\frac{{N}_{ij}}{{N}_{ij}^{}}$  $\mathrm{PMI}(i,j)\mathrm{ln}k$ 
GloVe  $2h({N}_{ij})\cdot \left[{\psi}_{ij}{\varphi}_{ij}\right]\mathit{\hspace{1em}\hspace{0.25em}}$  $\u27e8ij\u27e9+{b}_{i}+{b}_{j}$  $\mathrm{lg}{N}_{ij}$  $\mathrm{PMI}(i,j)$ 
LDS  $4h({N}_{ij})\cdot \left[{\psi}_{ij}{\varphi}_{ij}+C\right]$  ${\parallel {\u27e8i+j\u27e9}^{\u22ba}\parallel}^{2}$  $\mathrm{lg}{N}_{ij}$  $d\mathrm{PMI}(i,j)d\gamma $ 
Swivel  $\sqrt{{N}_{ij}}\cdot \left[{\psi}_{ij}{\varphi}_{ij}\right]$  $\u27e8ij\u27e9$  $\mathrm{PMI}(i,j)$  $\mathrm{PMI}(i,j)$ 
$1\cdot \sigma \left({\psi}_{ij}{\varphi}_{ij}\right)$  ${\mathrm{PMI}}^{*}(i,j)$  ${\mathrm{PMI}}^{*}(i,j)$  
\Xhline3 HilbertMLE  ${\left({p}_{i}{p}_{j}\right)}^{\frac{1}{\tau}}\cdot \left[{e}^{{\psi}_{ij}}{e}^{{\varphi}_{ij}}\right]$  $\u27e8ij\u27e9$  $\mathrm{PMI}(i,j)$  $\mathrm{PMI}(i,j)$ 
4 Deconstructing the algorithms
Table 1 presents a summary of our derivations of existing algorithms as low rank embedders.
We observe several common features between each of the algorithms. In each case, $\frac{\partial {f}_{ij}}{\partial {\psi}_{ij}}$ takes the form $(\mathrm{multiplier})\cdot (\mathrm{difference})$. The multiplier is always a “tempered” version of ${N}_{ij}$ (or ${N}_{i}{N}_{j}$), by which we mean that it increases sublinearly with ${N}_{ij}$ (or ${N}_{i}{N}_{j}$)^{2}^{2} 2 In SGNS, ${N}_{ij}^{}\propto {N}_{i}{N}_{j}$; ${N}_{ij}$ and ${N}_{ij}^{}$ are tempered by undersampling and unigram smoothing..
Furthermore, for each algorithm, ${\varphi}_{ij}$ is equal to PMI or a scaled log of ${N}_{ij}$. Yet, the choice of ${\psi}_{ij}$ in combination with ${\varphi}_{ij}$ provides that every model is optimized when $\u27e8ij\u27e9$ tends toward $\mathrm{PMI}(i,j)$ (with or without a constant shift or scaling). We have already seen that the optimum for SGNS is equivalent to the shifted PMI (§3.1). For GloVe, we theoretically and empirically showed that incorporation of the bias terms captures the unigram counts needed for PMI (§3.2). We observe this property similarly with regards to LDS’s incorporation of the L2 norm into its learning objective, where we suspect that the unigram probability is implicitly captured in the norms of the respective vectors and covectors (§3.3).
Therefore, we observe that these embedders converge on two key points: (1) an optimum in which model parameters are bilinearly related to PMI, and (2) the weighting of $\frac{\partial {f}_{ij}}{\partial {\psi}_{ij}}$ by some tempered form of ${N}_{ij}$ (or ${N}_{i}{N}_{j}$). In the next section, we introduce HilbertMLE, which is derived based on the shared principles observed between the algorithms in Table 1.
5 Reconstructing an algorithm
If the two basic principles that we have identified are sufficient, then the simplest low rank embedder should be one that derives from them without any other assumptions.
We begin with principle (1), which prescribes a bilinear parameterization of PMI. The definition of PMI (Eq. 1) provides a logbilinear parameterization of cooccurrence probability, ${\widehat{p}}_{ij}$, if we presuppose that the aim of our model is to approximate the PMI with vectorcovector dot products:
$\begin{array}{cc}\hfill \u27e8ij\u27e9& \approx \mathrm{PMI}(i,j)=\mathrm{ln}{\displaystyle \frac{{p}_{ij}}{{p}_{i}{p}_{j}}}\hfill \\ \hfill \u27f9\mathit{\hspace{1em}}{\widehat{p}}_{ij}& ={p}_{i}{p}_{j}{e}^{\u27e8ij\u27e9}.\hfill \end{array}$  (8) 
In the expression above, ${\widehat{p}}_{ij}$ represents the model’s estimate of the cooccurrence probability, provided by the parameterization which includes the unigram probabilities ${p}_{i}$ and ${p}_{j}$.
Accordingly, given the matrix of covectors $\mathbf{W}$ and vectors $\mathbf{V}$, the likelihood of the observed cooccurrence statistics, $\mathcal{D}={\{{N}_{ij}\}}_{ij}$, is distributed like the multinomial, $\mathrm{Mult}({\{{\widehat{p}}_{ij}\}}_{ij},N)$:
$\text{Pr}(\mathcal{D}\mathbf{V},\mathbf{W})=N!{\displaystyle \prod _{ij}}{\displaystyle \frac{{\widehat{p}}_{ij}^{{N}_{ij}}}{{N}_{ij}!}},$ 
where ${\widehat{p}}_{ij}$ depends on $\mathbf{V}$ and $\mathbf{W}$ (whose rows and columns are respectively $\u27e8i$ and $j\u27e9$) through Eq. 8. Taking the negative log likelihood as the loss:
$\mathcal{L}={\displaystyle \sum _{ij}}{N}_{ij}\mathrm{ln}{\widehat{p}}_{ij},$  (9) 
where we have dropped constant terms that do not affect the gradient.
The unitarity axiom of probability requires that ${\sum}_{ij}{\widehat{p}}_{ij}=1$. Including this constraint with a Lagrange multiplier, we obtain:
$\mathcal{L}=\left({\displaystyle \sum _{ij}}{N}_{ij}\mathrm{ln}{\widehat{p}}_{ij}\right)+\lambda \left(1{\displaystyle \sum _{ij}}{\widehat{p}}_{ij}\right).$  (10) 
At the feasible optimum, the original loss and constraint gradients should balance:
$\frac{\partial \mathcal{L}}{\partial {\widehat{p}}_{ij}}}={\displaystyle \frac{{N}_{ij}}{{\widehat{p}}_{ij}}}\lambda =0$  (11)  
$\u27f9\lambda {\widehat{p}}_{ij}={N}_{ij}.$  (12) 
Eq. 12 represents ${\mathcal{V}}_{C}\times {\mathcal{V}}_{T}$ equations, one for each pair $(i,j)$. Summing these equations together,
$\lambda {\displaystyle \sum _{ij}}{\widehat{p}}_{ij}$  $={\displaystyle \sum _{ij}}{N}_{ij}\u27f9\lambda =N.$  (13) 
The constrained loss function is therefore,
$\mathcal{L}=\left({\displaystyle \sum _{ij}}{N}_{ij}\mathrm{ln}{\widehat{p}}_{ij}\right)N\left(1{\displaystyle \sum _{ij}}{\widehat{p}}_{ij}\right).$  (14) 
Reintroducing the bilinear parameterization (Eq. 8), and dividing through by $N$ to eliminate dependence on corpus size:
$\mathcal{L}={\displaystyle \sum _{ij}}\left({p}_{i}{p}_{j}{e}^{\u27e8ij\u27e9}{p}_{ij}\u27e8ij\u27e9\right),$  (15) 
where, again, we have dropped constant terms that do not affect the gradient. Finally by differentiating we obtain the characteristic gradient:
$\begin{array}{cc}\hfill {\displaystyle \frac{\partial {f}_{ij}}{\partial {\psi}_{ij}}}={\displaystyle \frac{\partial {f}_{ij}}{\partial \u27e8ij\u27e9}}& ={p}_{i}{p}_{j}\left[{e}^{\u27e8ij\u27e9}{e}^{\mathrm{PMI}(\mathrm{i},\mathrm{j})}\right]\hfill \\ & ={\widehat{p}}_{ij}{p}_{ij}.\hfill \end{array}$  (16) 
This yields a loss gradient closely resembling other members of the low rank embedders. Empirically, its performance is on par with the other low rank embedders (see §6).
The multiplier, ${p}_{i}{p}_{j}$, determines how errors in fitting individual $(i,j)$ pairs trade off. While it appropriately favors fitting statistics with lower standard error, the signal from rarer pairs will be weak for any nondivergent learning rate because ${p}_{i}{p}_{j}$ spans orders of magnitude. This slows down training. So, we apply a gradient conditioning measure as is done for the other low rank embedders: we apply a temperature parameter, $\tau $, that reduces differences in magnitude of the multiplier:
$\frac{\partial {f}_{ij}}{\partial {\psi}_{ij}}$  $={({p}_{i}{p}_{j})}^{1/\tau}\left[{e}^{\u27e8ij\u27e9}{e}^{\mathrm{PMI}}\right]$  (17) 
5.1 Solving the objective function
The objective function presented in Equation 15 is most straightforwardly solved via dense matrix factorization. This can be done relatively efficiently by using the sharding method presented by shazeer2016swivel for matrix factorization. Such a solution is acceptable given a small vocabulary size, but does not scale to large vocabularies, due to the quadratic dependency. GloVe (pennington2014glove) handled this problem by only training on statistics where ${N}_{ij}>0$. levy2014neural; levy2015improving avoided the quadratic dependency by implementing sparse SVD on the positivePMI matrix. However, both of these solutions may be missing out on important information that can be gained by “noticing what’s missing” (shazeer2016swivel).
Yet, SGNS (mikolov2013distributed) was never confronted with the vocabulary size problem due to the fact that it uses local sampling over the corpus. While this yields a linear time complexity on the corpus size, this is generally preferable to a quadratic memory complexity on the vocabulary size. levy2014neural derived the global matrix factorization formulation of SGNS by moving in the algorithmic direction of local to global. Conversely, we will now move in the direction of global to local and derive the local sampling formulation of the HilbertMLE loss function.
Locally sampling HilbertMLE.
Note how if we differentiate the loss function of HilbertMLE (Equation 15) relative to an arbitrary model parameter $\theta $, we obtain a difference between two expectations:
$$\begin{array}{cc}\hfill \frac{\partial \mathcal{L}}{\partial \theta}& =\sum _{ij}{p}_{i}{p}_{j}{e}^{\u27e8ij\u27e9}\frac{\partial \u27e8ij\u27e9}{\partial \theta}\sum _{ij}{p}_{ij}\frac{\partial \u27e8ij\u27e9}{\partial \theta}\hfill \\ & =\hspace{1em}\underset{(i,j)\sim {\widehat{p}}_{ij}}{\mathbb{E}}\hspace{1em}\left[\frac{\partial \u27e8ij\u27e9}{\partial \theta}\right]\hspace{1em}\underset{(i,j)\sim {p}_{ij}}{\mathbb{E}}\hspace{1em}\left[\frac{\partial \u27e8ij\u27e9}{\partial \theta}\right].\hfill \end{array}$$  (18) 
In words, the derivative of the loss function is the difference between the expected value of $\frac{\partial \u27e8ij\u27e9}{\partial \theta}$ when taken under the model distribution on one hand and under the corpus distribution on the other.
This leads to a remarkably simple training algorithm. Draw a sample of word pairs $(i,j)$ from the corpus (using a local sampling approach as in SGNS), and draw a sample of pairs from the model distribution. Compute $\mathbb{E}[\u27e8ij\u27e9]$ for both samples, and take their difference. The gradient of the result estimates the gradient of $\mathcal{L}$.
In the context of autodifferentiation libraries such as PyTorch (paszke2017automatic) it is adequate to use a simplified loss function $\stackrel{~}{\mathcal{L}}$,
$$\stackrel{~}{\mathcal{L}}=\mathit{\hspace{1em}}\underset{(i,j)\sim {\widehat{p}}_{ij}}{\mathbb{E}}\mathit{\hspace{1em}}\left[\u27e8ij\u27e9\right]\mathit{\hspace{1em}}\underset{(i,j)\sim {p}_{ij}}{\mathbb{E}}\mathit{\hspace{1em}}\left[\u27e8ij\u27e9\right],$$  (19) 
because in the first term, the autodifferential operator will ignore the appearance of model parameters in the distribution according to which the expectation ${\mathbb{E}}_{(ij)\sim {\widehat{p}}_{ij}}$ is taken, but will recognize model parameters in the expectation’s operand $\u27e8ij\u27e9$. Thus $\frac{{\partial}_{\text{auto}}}{\partial \theta}\stackrel{~}{\mathcal{L}}=\frac{\partial}{\partial \theta}\mathcal{L}$.
Like SGNS, this uses positive samples drawn from the corpus, balanced against negative samples. But unlike SGNS, which draws negative samples according to the (distorted) unigram distribution, here we draw negative samples from model distribution ${\widehat{p}}_{ij}$. This can be done efficiently using Gibbs sampling, making this a form of contrastive divergence (hinton2002training; carreira2005contrastive). To approximately sample a cooccurring pair $(i,j)$ from the model distribution, we start from a corpusderived pair, and repeatedly perform Gibbs sampling steps: randomly fix either $i$ or $j$ and resample the other from the model distribution conditioned on the fixed variable. E.g. if we fix $i$, then we draw a new $j$ from $j\sim \widehat{p}(ji)={p}_{j}{e}^{\u27e8ij\u27e9}$. Sampling from the conditional distribution can be done in constant time using an adaptive softmax (grave2017efficient). In theory, the model distribution is approximated after taking many Gibbs sample steps, but consistent with Hinton’s findings for contrastive divergence (hinton2002training; carreira2005contrastive), we find that a single Gibbs sampling step supports efficient training.
6 Experiments
We provide a simple set of experiments comparing the two characteristic models for word embeddings with ours: SGNS and GloVe against HilbertMLE. Our aim in these experiments is simply to verify the sufficiency of the principles we used to derive HilbertMLE (§5). In other words, we are testing the following hypothesis: if the principles we have proposed are sufficient for designing a word embedding algorithm, then HilbertMLE should perform equivalently or better than SGNS and GloVe, which were proposed with different motivating principles (mikolov2013distributed; pennington2014glove).
In our experiments, we use a matrix factorization implementation of HilbertMLE as the characteristic form of the model. During experimentation, we found that the Gibbs sampling implementation of HilbertMLE (§5.1) performed equivalently, as expected. We present results on word similarity (§6.1), analogical reasoning (§6.2), text classification (§6.3), and sequence labelling (§6.4).
Our reference corpus combines Gigaword 3 (graff2007english) with a Wikipedia 2018 dump, lowercased, yielding 5.4 billion tokens. We limit ${\mathcal{V}}_{C}$ and ${\mathcal{V}}_{T}$ to be the 50,000 most frequent tokens in $\mathcal{D}$. We use a $5$token context window, and $d=300$. We use the released implementations and hyperparameter choices of SGNS and GloVe. Our implementation of HilbertMLE uses PyTorch to take advantage of GPUacceleration, automatic differentiation (paszke2017automatic), and the Adam gradient descent optimizer (kingma2014adam). Practically, HilbertMLE was implemented by using sharding (shazeer2016swivel). We use a single 12GB GPU, and load $12500\times 12500$element shards to calculate each update. Training embeddings took less than 3 hours for a 50,000 word vocabulary.
Intrinsic eval.  B143  MENd/t  RMT  RARE  SE17  S999  WSR/S  Y130  Analogy 
SGNS  .453  .753/.763  .680  .515  .656  .364  .589/.760  .514  .763/.312 
GloVe  .347  .760/.771  .663  .509  .662  .391  .602/.727  .541  .739/.310 
HilbertMLE  .397  .751/.761  .684  .579  .680  .462  .593/.765  .514  .705/.343 
6.1 Word similarity
A word similarity task involves interpreting the cosine similarity between embeddings as a measure of similarity or relatedness between words. Performance is computed with the Spearman rank correlation coefficient between the model’s scoring of all pairs of words versus the gold standard human scoring. These tasks can reflect the degree of linear structure captured in the embeddings, which can provide useful insights into differences between models. However, they do not always correlate with performance in downstream tasks (chiu2016intrinsic; faruqui2016problems).
We used the following word similarity datasets: Simlex999 (S999) (hill2015simlex); Wordsim353 (finkelstein2002placing) divided into similarity (WSS) and relatedness (WSR) (agirre2009study); the SemEval 2017 task (SE17) (camacho2017semeval); Radinsky Mechanical Turk (RMT) (radinsky2011word); Baker Verbs 143 (B143) (baker2014unsupervised); Yang Powers Verbs 130 (Y130) (yang2006verb); MEN divided into a 2000sample development set (MENd) and 1000sample test set (MENt) (bruni2012distributional); Rare Words (RARE) (luong2013better). We had an average of 96% coverage over all wordpairs in each dataset, excepting RARE; we had 31% coverage over RARE, yielding 620 word pairs (i.e., more samples than SE17, RMT, WSS/R, and Y130). Results are computed on these covered wordpairs.
Word similarity results.
Table 2 presents results across the 10 word similarity tasks. We observe that HilbertMLE obtains the best performance in 5 out of 10 tasks. In particular, HilbertMLE obtains substantially better scores on S999 than the other models, earning a Spearman correlation coefficient of 0.462, an $18\%$ relative improvement over the next best (SGNS). Note that S999 has been shown to have a high correlation with performance in extrinsic tasks such as Named Entity Recognition and NPchunking, unlike the other word similarity datasets (chiu2016intrinsic). On the tasks with worse performance, we observe that the differences between the three algorithms are relatively marginal. Base on these experiments, and the ones that follow, we can therefore conclude that our hypothesis (§6) is valid.
6.2 Analogical reasoning
We performed intrinsic evaluation of our embeddings using standard analogy tasks (e.g., “man” is to “woman” as “king” is to $X$). We evaluated on the Google Analogy dataset (Google) (mikolov2013efficient) and the Balanced Analogy Test Set (BATS) (gladkova2016analogy). We observed $86\%$ and $69\%$ coverage of the words in each dataset, respectively. Preliminary experiments using 3CosAdd and 3CosMul (levy2014linguistic) as selection rules, showed 3CosMul was always superior, consistent with the findings of levy2014linguistic.
Analogy results.
Table 2 presents results on the two analogy datasets in the final column. HilbertMLE performs somewhat worse than the other models on the Google Analogy dataset. However, there has been a considerable amount of work finding that performance on these tasks does not necessarily provide a reliable judgment for embedding quality (faruqui2016problems; linzen2016issues; rogers2017too). Indeed, we can see that performance on the Google Analogy dataset does not correspond with performance on the other larger analogy dataset (BATS), where HilbertMLE gets the best performance.
Classification  IMDB  AGNews 
SGNS  .910 $\pm $ .001  .812 $\pm $ .003 
GloVe  .905 $\pm $ .001  .807 $\pm $ .003 
HilbertMLE  .911 $\pm $ .002  .812 $\pm $ .003 
Seq. labelling  Semcor  WSJ  Brown 
Baseline  .6126  .8905  .9349 
SGNS  .6638  .9615  .9762 
GloVe  .6550  .9609  .9750 
HilbertMLE  .6663  .9617  .9767 
6.3 Text classification
We performed extrinsic evaluation for classification tasks on two benchmark NLP classification datasets. First, the IMDB movie reviews dataset for sentiment analysis (maas2011learning), divided into train and test sets of 25,000 samples each. Second, the AGNews news classification dataset, as divided into 8 approximately 12,000sample classes (such as Sports, Health, and Business) by kenyon2018clustering; here, we separate 30% of the samples as the final test set. On each, we separate 10% of the training set for validation tuning.
We use a standard BiLSTMmax sequence encoder for these tasks (conneau2017supervised). This model produces a sequence representation by maxpooling over the forward and backward hidden states produced a bidirectional LSTM. This representation is then passed through a MLP before final prediction. We found that validation performance was optimized with a 1layer $128$d BiLSTM, followed by a $512$d MLP using a ReLU activation, a minibatch size of 64, dropout rate of $0.5$, and normalizing the embeddings before input. We use a learning rate of $0.001$ with Adam, and divide the learning rate by a factor of $10$ if validation performance does not improve in 3 epochs, similar to conneau2017supervised; we schedule the learning rate in the same way for sequence labelling (§6.4).
Classification results.
In Table 3 we present the test set results from our classification experiments. We trained each BiLSTMmax 10 times with different random seeds for weight initialization and present the mean test accuracy plus/minus the standard deviation. These results show that each embedding model is similar, although GloVe is slightly worse than the others. Meanwhile, SGNS and HilbertMLE perform approximately the same, obtaining high quality results on both tasks.
6.4 Sequence labelling
Our final extrinsic evaluations are sequence labelling tasks on three datasets. The first task is supersense tagging (SST) (ciaramita2006broad) on the Semcor 3.0 corpus (miller1993semantic). SST is a coarsegrained semantic sequence labelling problem with 83 unique labels; we report results using the microFscore without the Otag score due to the skew of label distribution, as is standard (alonso2017multitask; changpinyo2018multi). We divide Semcor into a 7030% traintest split, and use 10% of the training set for validation tuning. The second task is syntactic partofspeech tagging (POS); we use the Penn TreeBank Wall Street Journal corpus (WSJ) (marcus1993building) and the Brown corpus (as distributed by NLTK^{3}^{3} 3 https://www.nltk.org/book/ch02.html). For the WSJ, we use the given 44tag tagset, and for Brown we map the original tags to the 12tag “universal tagset” (petrov2012universal). We use sections 22, 23, and 24 of the WSJ corpus as its test set, and separate out 30% of the sentences in Brown as its test set.
On each dataset, we train a standard sequence labelling model inspired by huang2015bidirectional: a 2layer, $128$d bidirectional LSTM, using a minibatch size of 16, and a dropout rate of 0.5. Interestingly, we found that normalizing the embeddings substantially reduced validation performance, so we keep them in their original form.
Sequence labelling results.
To accompany our results in Table 4, we include results from a trivial mostfrequenttag baseline. This baseline returns that the tag of a token is the most frequently occurring tag for that token within the training set. In SST it is standard to include results from a mostfrequentsupersense baseline, being inspired from the tradition of word sense disambiguation, which uses the mostfrequentsense baseline.
The results for the BiLSTMs are the mean test set score across 10 different runs with different random seeds for weight initialization. The low standard deviations were approximately the same for each model. As in the classification tasks, we find that the embeddings produced by each model obtain very similar results. Nonetheless, we observe that HilbertMLE offers marginal improvements over the others. Note that our performance on WSJ and Brown is expected since we use vanilla BiLSTMs that do not include any hand engineered character or contextbased features. Indeed, huang2015bidirectional report results of 96.04% on the WSJ with their vanilla BiLSTM, which suggests that our embeddings possess strong syntactic properties.
6.5 Qualitative analysis
Argmax${}_{i}$  Top most similar embeddings 
$i\u27e9\cdot \text{\mathit{c}\mathit{a}\mathit{t}}\u27e9$  kittens, cats, kitten, poodle 
$\u27e8i\text{\mathit{c}\mathit{a}\mathit{t}}\u27e9$  burglar, siamese, schrödinger 
$i\u27e9\cdot \text{\mathit{m}\mathit{o}\mathit{n}\mathit{e}\mathit{y}}\u27e9$  funds, monies, billions, cash 
$\u27e8i\text{\mathit{m}\mathit{o}\mathit{n}\mathit{e}\mathit{y}}\u27e9$  laundering, launder, extort 
$i\u27e9\cdot \text{\mathit{c}\mathit{u}\mathit{b}\mathit{a}}\u27e9$  cubans, cuban, anticastro 
$\u27e8i\text{\mathit{c}\mathit{u}\mathit{b}\mathit{a}}\u27e9$  gooding, guantanamo, havana 
We provide a final set of qualitative results in Table 5. Here, we use the vectors and covectors trained by the HilbertMLE model used in our experiments. These results elucidate the difference between using vectorvector similarity versus vectorcovector dot product similarity (results are practically the same when using cosine similarity). The vectorvector similarity is well known as a way to measure semantic similarity between two concepts captured in word embeddings. As expected, we see recoveries like “cat” is similar to “kitten”, “money” with “funds”, etc.
However, when we instead obtain the most similar covector to the corresponding vector, the results are dramatically different. We see that the vector for “cat” is most similar to covectors for words with which it forms multiword expressions: “cat burglar”, ”siamese cat”, ”schrödinger’s cat”. We see that “cuba” is most similar to the covector for “gooding” – this is because Cuba Gooding Jr. is a famous American actor whose Wikipedia page appears in our corpus. Indeed, the vectorcovector recoveries are directly correlated to the PMIs between the terms in the corpus.
Overall, we see that vectorvector dot products recover semantic similarity, while vectorcovector dot products recover cooccurrence similarity. melamud2015simple and asr2017artificial discuss these different statistical recoveries as paradigmatic (targettotarget) and syntagmatic (targettocontext) recoveries, respectively. However, to our knowledge, previous work has not explicitly explained the reason for these two different types of recoveries; i.e., because the learning objective for word embeddings is to approximate PMI. Therefore, these results qualitatively demonstrate exactly what our hypothesis anticipates: the dot product between trained vectors and covectors approximates the PMI between their corresponding words in the original corpus.
7 Discussion
In the past, probabilistic distributional semantic models for creating word embeddings surpassed the traditional countbased models (turney2010frequency) that preceded them, which was wellestablished by baroni2014don. At the same time, models like Word2vec (SGNS), GloVe, and SVD of PPMI (mikolov2013efficient; mikolov2013distributed; pennington2014glove; levy2015improving) offered strong improvements (in terms of performance and efficiency) over other probabilisticallymotivated embedding models (collobert2008unified; mnih2009scalable; turian2010word; mnih2013learning).
Today, NLP seems to be orienting toward deep contextualized models (peters2018deep; devlin2018bert). Nonetheless, pretrained word embeddings are still highly relevant. Indeed, they have been used recently to greatly assist solving problems in materials science (tshitoyan2019unsupervised), biomedical text mining (zhang2019biowordvec), and law (chalkidis2019deep). Moreover, word embeddings are used in dynamic metaembeddings to obtain stateoftheart results (kiela2018dynamic), are used as inputs to ELMO (peters2018deep), and are crucial in memoryconstrained NLP contexts (such as in mobile devices, which cannot store large deep neural networks (shu2017compressing)).
We believe a robust understanding of the “shallow” (or, nondeep), uncontextualized embedding models presented in this work is a prerequisite for informed development of deeper models. In this work, we advanced the theoretical understanding of word embeddings by proposing the low rank embedder framework. Cast under this framework, the similarities between many existing algorithms become apparent. After isolating two key principles shared by the low rank embedders—a probabilistically informed bilinear parameterization of PMI and a “tempered” gradient conditioning measure—we demonstrate that these ideas are sufficient to derive HilbertMLE, a model that is simpler yet has performance that is equivalent or better than the other models. This provides a parsimonious explanation for the success of the low rank embedders, demonstrating that many idiosyncratic features of other embedders are unnecessary.
The design parameters of our framework have not yet been fully explored. Our framework could be used to model more complex linguistic phenomena, by following the blueprint provided by our derivation of HilbertMLE. Moreover, based on our findings concerning the importance of covectors in parameterizing the model’s approximation of PMI, we believe that methods such as retrofitting (faruqui2015retrofitting), subwordbased embedding decomposition (stratos2017reconstruction), and dynamic metaembedding (kiela2018dynamic) could benefit by incorporating covectors into their modelling designs. As well, similar to how the theoretical basis of LDS (arora2016latent) was the grounding for the widelyused SIF document embedding algorithm (arora2017simple), we believe that the theoretical basis provided in this work can inform future development of document embedding techniques.