Abstract
We propose algorithms to train productionquality ngram language modelsusing federated learning. Federated learning is a distributed computationplatform that can be used to train global models for portable devices such assmart phones. Federated learning is especially relevant for applicationshandling privacysensitive data, such as virtual keyboards, because training isperformed without the users' data ever leaving their devices. While theprinciples of federated learning are fairly generic, its methodology assumesthat the underlying models are neural networks. However, virtual keyboards aretypically powered by ngram language models for latency reasons. We propose to train a recurrent neural network language model using thedecentralized FederatedAveraging algorithm and to approximate this federatedmodel serverside with an ngram model that can be deployed to devices for fastinference. Our technical contributions include ways of handling largevocabularies, algorithms to correct capitalization errors in user data, andefficient finite state transducer algorithms to convert word language models towordpiece language models and vice versa. The ngram language models trainedwith federated learning are compared to ngrams trained with traditionalserverbased algorithms using A/B tests on tens of millions of users of virtualkeyboard. Results are presented for two languages, American English andBrazilian Portuguese. This work demonstrates that highquality ngram languagemodels can be trained directly on client mobile devices without sensitivetraining data ever leaving the devices.
Quick Read (beta)
Federated Learning of Ngram Language Models
Abstract
We propose algorithms to train productionquality ngram language models using federated learning. Federated learning is a distributed computation platform that can be used to train global models for portable devices such as smart phones. Federated learning is especially relevant for applications handling privacysensitive data, such as virtual keyboards, because training is performed without the users’ data ever leaving their devices. While the principles of federated learning are fairly generic, its methodology assumes that the underlying models are neural networks. However, virtual keyboards are typically powered by ngram language models for latency reasons.
We propose to train a recurrent neural network language model using the decentralized FederatedAveraging algorithm and to approximate this federated model serverside with an ngram model that can be deployed to devices for fast inference. Our technical contributions include ways of handling large vocabularies, algorithms to correct capitalization errors in user data, and efficient finite state transducer algorithms to convert word language models to wordpiece language models and vice versa. The ngram language models trained with federated learning are compared to ngrams trained with traditional serverbased algorithms using A/B tests on tens of millions of users of a virtual keyboard. Results are presented for two languages, American English and Brazilian Portuguese. This work demonstrates that highquality ngram language models can be trained directly on client mobile devices without sensitive training data ever leaving the devices.
Federated Learning of Ngram Language Models
Mingqing Chen, Ananda Theertha Suresh, Rajiv Mathews, Adeline Wong, Cyril Allauzen, Françoise Beaufays, Michael Riley Google, Inc. {mingqing,theertha,mathews,adelinew,allauzen,fsb,riley} @google.com
1 Introduction
1.1 Virtual keyboard applications
Virtual keyboards for mobile devices provide a host of functionalities from decoding noisy spatial signals from tap and glide typing inputs to providing autocorrections, word completions, and nextword predictions. These features must fit within tight RAM and CPU budgets, and operate under strict latency constraints. A key press should result in visible feedback within about 20 milliseconds (Ouyang17Mobile; nsm). Weighted finitestate transducers have been used successfully to decode keyboard spatial signals using a combination of spatial and language models (Ouyang17Mobile; Hellsten17Transliterated). Figure 1 shows the glide trails of two spatiallysimilar words. Because of the similarity of the two trails, the decoder must rely on the language model to discriminate between viable candidates.
For memory and latency reasons, especially on lowend devices, the language models are typically based on ngrams and do not exceed ten megabytes. A language model (LM) is a probabilistic model on words. Given previous words ${x}_{1},{x}_{2},\mathrm{\dots},{x}_{m1}$, an LM assigns a probability to the new words, i.e. $p({x}_{m}{x}_{m1},\mathrm{\dots},{x}_{1})$. An ngram LM is a Markovian distribution of order $n1$, defined by
$$p({x}_{m}{x}_{m1},\mathrm{\dots},{x}_{1})=p({x}_{m}{x}_{m1},\mathrm{\dots},{x}_{mn+1}),$$ 
where $n$ is the order of the ngram. For computation and memory efficiency, keyboard LMs typically have higherorder ngrams over a subset of the vocabulary, e.g. the most frequent $64$K words, and the rest of the vocabulary only has unigrams. We consider ngram LMs that do not exceed $1.5$M ngrams and include fewer than $200$K unigrams.
Ngram models are traditionally trained by applying a smoothing method to ngram counts from a training corpus (chen1999empirical). The highest quality ngram models are trained over data that are wellmatched to the desired output (intelligent_training_data). For virtual keyboards, training over users’ typed text would lead to the best results. Of course, such data are very personal and need to be handled with care.
1.2 Federated learning
We propose to leverage Federated Learning (FL) (konevcny2016federated; konevcny2016federatedb), a technique where machine learning models are trained in a decentralized manner on endusers’ devices, so that raw data never leaves these devices. Only targeted and ephemeral parameter updates are aggregated on a centralized server. Figure 2 provides an illustration of the process.
Federated learning for keyboard input was previously explored in hard2018federated, in which a federated recurrent neural network (RNN) was trained for nextword prediction. However, latency constraints prevent the direct use of an RNN for decoding. To overcome this problem, we propose to derive an ngram LM from a federated RNN LM model and use that ngram LM for decoding. Specifically, the approximation algorithm is based on SampleApprox , which was recently proposed in suresh2018approx; suresh2019distilling. The proposed approach has several advantages:
Improved model quality: Since the RNN LM is trained directly on domainmatched user data, its predictions are more likely to match actual user behavior. In addition, as shown in suresh2018approx, an ngram LM approximated from such an RNN LM is of higher quality than an ngram LM trained on user data directly.
Minimum information transmission: In FL, only the minimal information necessary for model training (the model parameter deltas) is transmitted to centralized servers. The model updates contain much less information than the complete training data.
Additional privacypreserving techniques: FL can be further combined with privacypreserving techniques such as secure multiparty computation (Bonawitz2017) and differential privacy (McMahan2017LearningDP; agarwal2018cpsgd; abadi2016deep). By the postprocessing theorem, if we train a single differentially private recurrent model and use it to approximate ngram models, all the distilled models will also be differentially private with the same parameters (dwork2014algorithmic).
For the above reasons, we have not proposed to learn ngram models directly using FederatedAveraging of ngram counts for all orders.
2 Outline
The paper is organized along the lines of challenges associated with converting RNN LMs to ngram LMs for virtual keyboards: the feasibility of training neural models with a large vocabulary, inconsistent capitalization in the training data, and data sparsity in morphologically rich languages. We elaborate on each of these challenges below.
Large vocabulary: Keyboard ngram models are typically based on a carefully handcurated vocabulary to eliminate misspellings, erroneous capitalizations, and other artifacts. The vocabulary size often numbers in the hundreds of thousands. However, training a neural model directly over the vocabulary is memory intensive as the embedding and softmax layers require space $\left\mathcal{V}\right\times {N}_{e}$, where $\left\mathcal{V}\right$ is the vocabulary size and ${N}_{e}$ is the embedding dimension. We propose a way to handle large vocabularies for federated models in Section 3.
Incorrect capitalization: In virtual keyboards, users often type with incorrect casing (e.g. “She lives in new york” instead of “She lives in New York”). It would be desirable to decode with the correct capitalization even though the usertyped data may be incorrect. Before the discussion of capitalization, the SampleApprox algorithm is reviewed in Section 4. We then modify SampleApprox to infer capitalization in Section 5.
Language morphology: Many words are composed of root words and various morpheme components, e.g. “crazy”, “crazily”, and “craziness”. These linguistic features are prominent in morphologically rich languages such as Russian. The presence of a large number of morphological variants increases the vocabulary size and data sparsity ultimately making it more difficult to train neural models. Algorithms to convert between word and wordpiece models are discussed in Section 6.
Finally, we compare the performance of word and wordpiece models and present the results of A/B experiments on real users of a virtual keyboard in Section 7.
3 Unigram distributions
Among the $200$K words in the vocabulary, our virtual keyboard models only use the top $64$K words in the higherorder ngrams. We train the neural models only on these most frequent words and train a separate unigram model over the entire vocabulary. We interpolate the two resulting models to obtain the final model for decoding.
3.1 Collection
Unigrams are collected via a modified version of the FederatedAveraging algorithm. No models are sent to client devices. Instead of returning gradients to the server, counting statistics are compiled on each device and returned. In our experiments, we aggregate over groups of approximately 500 devices per training round. We count a unigram distribution $U$ from a whitelist vocabulary by $U={\sum}_{i}{w}_{i}{C}_{i}$, where $i$ is the index over devices, ${C}_{i}$ are the raw unigram counts collected from a single device $i$, and ${w}_{i}$ is a weight applied to device $i$.
To prevent users with large amounts of data from dominating the unigram distribution, we apply a form of L1clipping:
$${w}_{i}=\frac{\lambda}{\mathrm{max}(\lambda ,\sum {C}_{i})},$$  (1) 
where $\lambda $ is a threshold that caps each device’s contribution. When $\lambda =1$, L1clipping is equivalent to equal weighting. The limit $\lambda \to \mathrm{\infty}$ is equivalent to collecting the true counts, since ${w}_{i}\to 1$.
3.2 Convergence
Convergence of the unigram distribution is measured using the unbiased chisquared statistic (for simplicity, referred to as the $Z$statistic) defined in bhattacharya2015similarity, the number of unique unigrams seen, and a moving average of the number of rounds needed to observe new unigrams.
Figure 3 shows the overall distributional convergence based on the $Z$statistic. At round $k$, unigram counts after $k/2$ and $k$ rounds are compared.
Figure 3 plots the number of whitelist vocabulary words seen and a moving average of the number of rounds containing new unigrams. New unigrams are determined by comparing a round $k$ with all rounds through $k1$ and noting if any new words are seen. The shaded bands range from the LM’s unigram capacity to the size of the whitelist vocabulary.
3.3 Experiments
Since the whitelist vocabulary is uncased, capitalization normalization is applied based on an approach similar to Section 5. We then replace the unigram part of an ngram model with this distribution to produce the final LM.
In A/B experiments, unigram models with different L1clipping thresholds are compared against a baseline unigram model gathered from centralized log data. Results are presented in Table 1. Accuracy is unchanged and OOV rate is improved at $\lambda =1$ and $\lambda =1K$.
Model  [email protected] [%]  OOV rate [%] 

baseline  8.14  18.08 
$\lambda =1$  $+0.19\pm 0.21$  $1.33\pm 0.75$ 
$\lambda =1K$  $+0.11\pm 0.24$  $1.06\pm 0.66$ 
$\lambda =5K$  $0.08\pm 0.26$  $0.78\pm 0.93$ 
Before we discuss methods to address inconsistent capitalization and data sparsity in morphologically rich languages, we review SampleApprox .
4 Review of SampleApprox
SampleApprox , proposed in suresh2018approx; suresh2019distilling, can be used to approximate a RNN as a weighted finite automaton such as an ngram model. A weighted finite automaton (WFA) $A=(\mathrm{\Sigma},Q,E,i,F)$ over ${\mathbb{R}}_{+}$ (probabilities) is given by a finite alphabet $\mathrm{\Sigma}$ (vocabulary words), a finite set of states $Q$ (ngram contexts), an initial state $i\in Q$ (sentence start state), a set of final states $F\in Q$ (sentence end states), and a set of labeled transitions $E$ and associated weights that represent the conditional probability of labels (from $\mathrm{\Sigma}$) given the state (list of ngrams and their probabilities). WFA models allow a special backoff label $\phi $ for succinct representation as follows. Let $L[q]$ be the set of labels on transitions from state $q$. For $x\in L[q]$, let ${w}_{q}[x]$, be the weight of the transition of $x$ at state $q$ and ${d}_{q}[x]$ be the destination state. For a label $x$ and a state $q$,
$p(xq)$  $={w}_{q}[x]$  $\text{if}x\in L[q],$  
$={w}_{q}[\phi ]\cdot p(x{d}_{q}[\phi ])$  $\text{otherwise}.$ 
In other words, $\phi $ is followed if $x\notin L[q]$. The definition above is consistent with that of backoff ngram models (chen1999empirical). Let $B(q)$ denote the set of states from which $q$ can be reached by a path of backoff labels and let $q[x]$ be the first state at which label $x$ can be read by following a backoff path from $q$.
Given an unweighted finite automaton $A$ and a neural model, SampleApprox finds the probability model on $A$ that minimizes the KullbackLeibler (KL) divergence between the neural model and the WFA. The algorithm has two steps: a counting step and a KL minimization step. For the counting step, let $\overline{x}(1),\overline{x}(2),\mathrm{\dots},\overline{x}(k)$ be $k$ independent samples from the neural model. For a sequence $\overline{x}$, let ${x}_{i}$ denote the ${i}^{\text{th}}$ label and ${\overline{x}}^{i}={x}_{1},{x}_{2},\mathrm{\dots},{x}_{i}$ denote the first $i$ labels. For every $q\in Q$ and $x\in \mathrm{\Sigma}$, the algorithm computes $C(x,q)$ given by
$$\sum _{{q}^{\prime}\in B(q)}\sum _{j=1}^{m}\sum _{i\ge 0}{1}_{q({\overline{x}}^{i}(j))={q}^{\prime},q={q}^{\prime}[x]}\cdot {p}_{\text{nn}}(x{\overline{x}}^{i}(j)).$$ 
We illustrate this counting with an example. Suppose we are interested in the count of the bigram New York. Given a bigram LM, SampleApprox generates $m$ sentences and computes
$$C(\text{York},\text{New})=\sum _{j,i:{x}_{i}(j)=\text{New}}{p}_{\text{nn}}(\text{York}{\overline{x}}^{i}(j)).$$ 
In other words, it finds all sentences that have the word New, observes how frequently York appears subsequently, and computes the conditional probability. After counting, it uses a difference of convex (DC) programming based algorithm to find the KL minimum solution. If $\mathrm{\ell}$ is the average number of words per sentence, the computational complexity of counting is $\stackrel{~}{\mathcal{O}}(k\cdot \mathrm{\ell}\cdot \mathrm{\Sigma})$ ^{1}^{1} 1 ${a}_{n}=\stackrel{~}{\mathcal{O}}({b}_{n})$, means ${a}_{n}\le {b}_{n}\cdot \text{poly}\mathrm{log}(n),\forall n\ge {n}_{0}$. and the computational complexity of the KL minimization is $\stackrel{~}{\mathcal{O}}(E+Q)$ per iteration of DC programming.
5 Capitalization
As mentioned in Section 2, users often type with incorrect capitalization. One way of handling incorrect capitalization is to store an ondevice capitalization normalizer (beaufays2013language) to correctly capitalize sentences before using them to train the neural model. However, capitalization normalizers have large memory footprints and are not suitable for ondevice applications. To overcome this, the neural model is first trained on uncased user data. SampleApprox is then modified to approximate cased ngram models from uncased neural models.
As before, let $\overline{x}(1),\overline{x}(2),\mathrm{\dots},\overline{x}(k)$ be $k$ independent (uncased) samples from the neural model. We capitalize them correctly at the server using beaufays2013language. Let $\overline{y}(1),\overline{y}(2),\mathrm{\dots}\overline{y}(k)$ represent the corresponding $k$ correctly capitalized samples. Let ${p}_{\text{cap}}$ be another probability model on nonuser data that approximates the ratio of uncased to cased probabilities given a context. Given a label $y$, let $u(y)$ be the uncased symbol. For example, if $y$ is York, then $u(y)$ is york. With the above definitions, we modify the counting step of SampleApprox as follows:
$$\sum _{{q}^{\prime}\in B(q)}\sum _{j=1}^{m}\sum _{i\ge 0}{1}_{q({\overline{y}}^{i}(j))={q}^{\prime},q={q}^{\prime}[y]}\cdot \stackrel{~}{p}(y{\overline{y}}^{i}(j)),$$ 
where $\stackrel{~}{p}(y{\overline{y}}^{i}(j))$ is given by
$${p}_{\text{nn}}(u(y)u({\overline{y}}^{i}(j)))\cdot \frac{{p}_{\text{cap}}(y{\overline{y}}^{i}(j))}{{\sum}_{{y}^{\prime}:u({y}^{\prime})=u(y)}{p}_{\text{cap}}({y}^{\prime}{\overline{y}}^{i}(j))}.$$ 
We refer to this modified algorithm as CapSampleApprox . We note that wordpiece to word approximation incurs an additional computation cost of $\stackrel{~}{\mathcal{O}}((E+Q+\mathrm{\Delta})\mathrm{\ell})$, where $\mathrm{\Delta}$ is the number of words, $E$ and $Q$ are the set of arcs and set of states in the word ngram model, and $\mathrm{\ell}$ is the maximum number of wordpieces per word.
6 Morphologically rich languages
To train neural models on morphologically rich languages, subword segments such as bytepair encodings or wordpieces (shibata1999byte; schuster2012japanese; taku2018) are typically used. This approach assigns conditional probabilities to subword segments, conditioned on prior subword segments. It has proved successful in the context of speech recognition (chiu2018state) and machine translation (wu2016google). Following these successes, we propose to train RNN LMs with wordpieces for morphologically rich languages.



(a)  (b) 


(c) 
We apply the wordpiece approach of taku2018, which computes a wordpiece unigram LM using a wordpiece inventory ${\mathcal{V}}_{\mathcal{P}}$. Each wordpiece ${x}_{i}\in {\mathcal{V}}_{\mathcal{P}}$ is associated with a unigram probability $p({x}_{i})$. For a given word $y$ and its possible segmentation candidates, the word is encoded with the segmentation that assigns the highest probability.
Throughout this paper we apply $4$K, $16$K, and $30$K as the wordpiece inventory sizes. These values lie within a range that provides good tradeoff between the LSTM embedding size and the richness of the language morphology. We apply $100\%$ character coverage to include all the symbols that appeared in the unigram distribution (Section 3), including the common English letters, accented letters e.g. é, ô, and digits. Accented letters are important for languages like Portuguese. For fast decoding, the ngram models still need to be at the wordlevel, since wordpiece ngram models increase the depth of the beamsearch during decoding. We convert the word ngram topology to an equivalent wordpiece WFA topology and use SampleApprox to approximate the neural wordpiece model on the wordpiece WFA topology. We then convert the resulting wordpiece WFA LM to the equivalent ngram LM. The remainder of this section outlines efficient algorithms for converting between word and wordpiece WFA models.
A natural way to represent the transduction from wordpiece sequences to word sequences is with a finitestate transducer. Given the properties of our wordpiece representation, that transducer can be made sequential (i.e., input deterministic).
A sequential weighted finitestate transducer (WFST) is a deterministic WFA where each transition has an output label in addition to its (input) label and weight. We will denote by ${o}_{q}[x]$ the output label of the transition at state $q$ with input label $x$, ${o}_{q}[x]\in \mathrm{\Delta}\cup \{\u03f5\}$, where $\mathrm{\Delta}$ denotes the output alphabet of the transducer and $\u03f5$ the empty string/sequence.
Let $M$ be the minimal sequential (unweighted) finitestate transducer (FST) lexicon from wordpiece sequences in ${\mathrm{\Sigma}}^{*}$ to word sequences in ${\mathrm{\Delta}}^{*}$, where $\mathrm{\Sigma}$ denotes our wordpiece inventory, $\mathrm{\Delta}$ denotes our vocabulary, and $*$ is Kleene closure.
A wordpiece topology $B$ equivalent to the word topology $A$ can be obtained by composing the wordpiecetoword transducer $M$ with $A$:
$$B=M\circ A.$$ 
Since $A$ has backoff transitions, the generic composition algorithm of (allauzen11) is used with a custom composition filter that ensures the result, $B$, is deterministic with a wellformed backoff structure, and hence is suitable for the counting step of SampleApprox . We give an explicit description of the construction of $B$, from which readers familiar with allauzen11 can infer the form of the custom composition filter.
The states in $B$ are pairs $({q}_{1},{q}_{2})$, with ${q}_{1}\in {Q}_{M}$ and ${q}_{2}$ in ${Q}_{A}$, initial state ${i}_{B}=({i}_{M},{i}_{A})$, and final state ${f}_{B}=({f}_{M},{f}_{A})$. Given a state $({q}_{1},{q}_{2})\in {Q}_{B}$, the outgoing transitions and their destination states are defined as follows. If $x\in L[{q}_{1}]$, then an $x$labeled transition is created if one of two conditions holds:

1.
if ${o}_{{q}_{1}}[x]\in L[{q}_{2}]$, then
${d}_{({q}_{1},{q}_{2})}[x]$ $=({d}_{{q}_{1}}[x],{d}_{{q}_{2}}[{o}_{{q}_{1}}[x]])\text{and}$ ${o}_{({q}_{1},{q}_{2})}[x]$ $={o}_{{q}_{1}}[x];$ 
2.
if ${o}_{{q}_{1}}[x]=\u03f5$ and $R[{d}_{{q}_{1}}[x]]\cap L[{q}_{2}]\ne \mathrm{\varnothing}$, then
${d}_{({q}_{1},{q}_{2})}[x]$ $=({d}_{{q}_{1}}[x],{d}_{{q}_{2}}[{o}_{{q}_{1}}[x]])\text{and}$ ${o}_{({q}_{1},{q}_{2})}[x]$ $=\u03f5$
where $R[q]$ denotes the set of output non$\u03f5$ labels that can be emitted after following an output$\u03f5$ path from $q$. Finally if $\phi \in L[{q}_{1}]$, a backoff transition is created:
$${d}_{({q}_{1},{q}_{2})}[\phi ]=({q}_{1},{d}_{{q}_{2}}[\phi ])\text{and}{o}_{{q}_{1},{q}_{2}}[\phi ]=\u03f5.$$ 
The counting step of SampleApprox is applied to $B$, and transfers the computed counts from $B$ to $A$ by relying on the following key property of $M$. For every word $y$ in $\mathrm{\Delta}$, there exists a unique state ${q}_{y}\in {Q}_{M}$ and unique wordpiece ${x}_{y}$ in $\mathrm{\Sigma}$ such that ${o}_{{q}_{y}}[{x}_{y}]=y$. This allows us to transfer the counts from $B$ to $A$ as follows:
$${w}_{q}[y]={w}_{({q}_{y},q)}[{x}_{y}]$$ 
The KL minimization step of SampleApprox to $A$ is applied subsequently.
As an alternative, the unweighted word automaton $A$ could be used to perform the counting step directly. Each sample $\overline{x}(j)$ could be mapped to a corresponding word sequence $\overline{y}(j)$, mapping outofvocabulary wordpiece sequences to an unknown token. However, the counting steps would have become much more computationally expensive, since ${p}_{\text{nn}}(y{\overline{y}}^{i}(j))$ would have to be evaluated for all $i$, $j$ and for all words $y$ in the vocabulary, where ${p}_{\text{nn}}$ is now a wordpiece RNN.
7 Experiments
7.1 Neural language model
LSTM models (hochreiter1997long) have been successfully used in a variety of sequence processing tasks. LSTM models usually have a large number of parameters and are not suitable for ondevice learning. In this work, we use various techniques to reduce the memory footprint and to improve model performance.
We use a variant of LSTM with a Coupled Input and Forget Gate (CIFG) (greff2017lstm) for the federated neural language model. CIFG couples the forget and input decisions together, which reduces the number of LSTM parameters by 25%. We also use groupLSTM (GLSTM) (kuchaiev2017factorization) to reduce the number of trainable variables of an LSTM matrix by the number of feature groups, $k$. We set $k=5$ in experiments.
Model  ${N}_{l}$  ${N}_{h}$  ${N}_{e}$  ${S}_{e}$  ${S}_{\text{total}}$ 

${\text{W}}_{\text{30K}}$  1  670  96  2.91M  3.40M 
${\text{P}}_{\text{4KS}}$  1  670  96  0.38M  0.85M 
${\text{P}}_{\text{4KL}}$  2  1080  140  0.56M  2.70M 
${\text{P}}_{\text{4KG}}$  2  1080  280  1.12M  2.71M 
${\text{P}}_{\text{16KS}}$  1  670  96  1.54M  2.00M 
${\text{P}}_{\text{16KL}}$  1  670  160  2.56M  3.33M 
${\text{P}}_{\text{30K}}$  1  670  96  2.91M  3.40M 
Table 2 lists the parameter settings of the word (W) and wordpiece (P) models used in this study. Due to the memory limitations of ondevice training, all models use fewer than $3.5M$ parameters. For each vocabulary size, we first start with a base architecture consisting of one LSTM layer, a $96$dimensional embedding, and $670$ hidden state units. We then attempt to increase the representational power of the LSTM cell by increasing the number of hidden units and using multilayer LSTM cells (sutskever2014sequence). Residual LSTM (kim2017residual) and layer normalization (lei2016layer) are used throughout experiments, as these techniques were observed to improve convergence. To avoid the restriction that ${N}_{h}={N}_{e}$ in the output, we apply a projection step at the output gate of the LSTM (sak2014long). This step reduces the dimension of the LSTM hidden state from ${N}_{h}$ to ${N}_{e}$. We also share the embedding matrix between the input embedding and output softmax layer, which reduces the memory requirement by $\left\mathcal{V}\right\times {N}_{e}$. We note that other recurrent neural models such as gated recurrent units (chung2014empirical) can also be used instead of CIFG LSTMs.
The federated RNN LMs are trained on two language settings of the virtual keyboard: American English (en_US) and Brazilian Portuguese (pt_BR). Following mcmahan2016communication, 500 reporting clients are used to compute the gradient updates for each round. A serverside learning rate of 1.0, a clientside learning rate of 0.5, and Nesterov momentum of 0.9 are used. Both the word and wordpiece models are trained over the same time range and with the same hyperparameters. Prior to federated training of the RNN LM, the wordpiece inventory is constructed from the unigram distribution collected via the federated approach introduced in Section 3.
A common evaluation metric for both word and wordpiece models is desirable during federated training. Such a metric can be used to monitor the training status and select models to be used for the CapSampleApprox algorithm. Neither crossentropy nor accuracy serves this need due to the mismatch in vocabularies used. Wordlevel accuracy is hard to compute for the wordpiece model, since it requires hundreds of inference calls to traverse all combinations of a word from the wordpiece vocabulary. In this study, we apply sentence log likelihood (SLL) in the evaluation. Given a sentence ${\overline{x}}^{m}=\{{x}_{1},{x}_{2},\mathrm{\dots},{x}_{m}\}$ composed of $m$ units (either words or wordpieces), SLL is evaluated as ${\sum}_{i=1}^{m}\mathrm{log}({p}_{nn}({x}_{i}{\overline{x}}^{i1}))$. One issue that arises is the handling of outofvocabulary (OOV) words. The OOV probability of the word model is about $8\%$. The comparable probability of an OOV word (according to $\mathcal{V}$) for wordpiece models is the product of the corresponding wordpiece conditional probabilities, which is much smaller than $8\%$. To mitigate this issue, we define SLL excluding OOV as:
$${\text{SLL}}^{e}=\sum _{i:{x}_{i}\ne \text{OOV}}^{m}\mathrm{log}({p}_{nn}({x}_{i}{\overline{x}}^{i1})),$$ 
where the OOV in the equation includes wordpieces that are components of OOV words. In the following, ${\text{SLL}}^{e}$ is used as model selection metric.
7.2 Approximated ngram model
Algorithm 1 illustrates the workflow we use to generate different ngram models for evaluation. Recall that CapSampleApprox takes a RNN LM, an ngram topology, and a reweighting FST for capitalization normalization. The ngram topology is empty under selfinference mode. suresh2018approx showed that inferring topology from the RNN LM does not perform as well as using the true ngram topology obtained from the training corpus. Hence, we supplement the neuralinferred topology with the topology obtained by a large external large corpus denoted by ${A}_{W}$. We use CapSampleApprox on four topologies and compare the resulting models: an ngram model obtained from an external corpus’s topology ${A}_{e}$, an ngram model obtained from a neural inferred topology ${A}_{i}$, an ngram model obtained by interpolating (merging) the two models above ${A}_{m}$, and an ngram model obtained by approximating on the interpolated topology ${A}_{r}$. We repeat this experiment for both word and wordpiece RNN LMs and use subscripts $W$ and $P$, respectively. We evaluate all eight produced ngram models directly on the traffic of a production virtual keyboard, where prediction accuracy is evaluated over usertyped words.
7.3 Results
Model  en_US  pt_BR 

Baseline  $10.03\%$  $8.55\%$ 
${A}_{{W}_{e}}$  $10.52\pm 0.03\%$  $9.66\pm 0.02\%$ 
${A}_{{W}_{i}}$  $10.47\pm 0.02\%$  $9.67\pm 0.02\%$ 
${A}_{{W}_{m}}$  $10.27\pm 0.03\%$  $9.40\pm 0.02\%$ 
${A}_{{W}_{r}}$  $10.49\pm 0.03\%$  $9.65\pm 0.02\%$ 
Model  top1 

Baseline  $10.03\%$ 
${A}_{{P}_{e}}$  $10.49\pm 0.03\%$ 
${A}_{{P}_{i}}$  $10.46\pm 0.03\%$ 
${A}_{{P}_{m}}$  $10.48\pm 0.04\%$ 
${A}_{{P}_{r}}$  $10.53\pm 0.03\%$ 
Figure 5 shows the ${\text{SLL}}^{e}$ metric for all the experiments listed in Table 2. In general, larger models generate better results than smaller baseline models. For the baseline architectures with same RNN size, having a larger vocabulary leads to some gains. For the larger architectures that have similar total numbers of parameters, 4K wordpiece models are shown to be superior to 16K and 30K. For 4K wordpiece models, GLSTM is in general onpar with its ${\text{P}}_{\text{4KL}}$ counterpart. The word model is better than all the wordpiece models in both languages in ${\text{SLL}}^{e}$. We were surprised by this result, and hypothesize that it is due to the ${\text{SLL}}^{e}$ metric discounting wordpiece models’ ability to model the semantics of OOV words. The solid lines are the best models we pick for A/B experiment evaluation for the virtual keyboard (${\text{P}}_{\text{4KL}}$ and ${\text{W}}_{\text{30K}}$).
Table 3 shows the A/B evaluation result on both en_US and pt_BR populations. The baseline model is an ngram model trained directly from centralized logs. All of the federated trained models perform better than the baseline model. We repeated the A/B evaluation with wordpiece models on en_US and the results are in Table 4. The performance of wordpiece models is similar to that of word models. Among the federated models for en_US, ${A}_{{P}_{r}}$ has the best result. This meets our expectation that the supplemental corpus helps improve the performance of the topology inferred from the RNN LM.
8 Conclusion
We have proposed methods to train productionquality ngram language models using federated learning, which allows training models without usertyped text ever leaving devices. The proposed methods are shown to perform better than traditional serverbased algorithms in A/B experiments on real users of a virtual keyboard.
Acknowledgments
The authors would like to thank colleagues in Google Research for providing the federated learning framework and for many helpful discussions.