Membership Inference Attacks on Sequence-to-Sequence Models

  • 2019-04-11 02:53:21
  • Sorami Hisamoto, Matt Post, Kevin Duh


Data privacy is an important issue for "machine learning as a service" providers. We focus on the problem of membership inference attacks: given a data sample and black-box access to a model's API, determine whether the sample existed in the model's training data. Our contribution is an investigation of this problem in the context of sequence-to-sequence models, which are important in applications such as machine translation and video captioning. We define the membership inference problem for sequence generation, provide an open dataset based on state-of-the-art machine translation models, and report initial results on whether these models leak private information against several kinds of membership inference attacks.



Sorami Hisamoto   Matt Post   Kevin Duh
Johns Hopkins University
[email protected],   {post, kevinduh}


1 Motivation

There are many situations where private entities are worried about the privacy of their data. For example, many companies provide black-box training services where users are able to upload their data and have customized models built for them, without requiring machine learning expertise on the client’s part. A common concern in these “machine learning as a service” offerings is that the uploaded data be visible only to the client that owns it.

Currently, these entities are in the position of having to trust that service providers abide by the terms of their agreements. While trust is an important component in relationships of all kinds, it has its limitations. In particular, it falls short of a well-known security maxim, originating in a Russian proverb, that translates as “trust, but verify”.[1] Ideally, customers would be able to verify that their private data was not being slurped up by the serving company, whether by design or accident.

[1] Popularized by Ronald Reagan in the context of nuclear disarmament.

This problem has been formalized as the so-called membership inference problem, first introduced by Shokri et al. (2017) and defined as: “Given a machine learning model and a record, determine whether this record was used as part of the model’s training dataset or not.” The problem can be tackled in an adversarial framework: the attacker is interested in answering this question with high accuracy, while the defender would like this question to be unanswerable (see Figure 1). Since then, machine learning and security/privacy researchers have proposed many ways to attack and defend the privacy of various types of models. However, the work so far has focused only on standard classification problems, where the output space of the model is a fixed set of labels.



Figure 1: Membership Inference Attack

In this paper, we propose to investigate membership inference for sequence generation problems, where the output space can be viewed as a chained sequence of classifications. Prime examples of sequence generation include machine translation and text summarization: in these problems, the output is a sequence of words whose length is undetermined a priori. Other examples include speech synthesis and video caption generation. Sequence generation problems are more complex than classification problems, and it is unclear whether the methods and results developed for membership inference in classification problems will transfer. For example, one might imagine that while a flat classification model might leak private information when the output is a single label, a recurrent sequence generation model might obfuscate this leakage when labels are generated successively with complex dependencies.

We focus on machine translation (MT) as the example sequence generation problem. Recent advances in neural sequence-to-sequence models have improved the quality of MT systems significantly, and many commercial service providers are deploying these models via public APIs. We pose the main question in the following form:

Given black-box access to an MT model, is it possible to determine whether a particular sentence pair was in the training set for that model?

and investigate it in the context of an adversarial arms race: first attempting to answer the question under various situations, then asking what measures can be taken to make answering it more difficult, and so on.

In the following, we first define the problem of membership inference for sequence generation in Section 2. Then we contrast this with prior work on classification problems (Section 3). Next in Section 4 we describe a novel dataset and the evaluation protocol using MT as an example. The goal is to provide a public dataset built on current state-of-the-art sequence-to-sequence models to encourage further research in this new problem. Finally, we propose several attack methods in Section 5 and present a series of experiments evaluating their ability to answer the membership inference question (Section 6). Our conclusion is that simple one-off attacks based on shadow models, which proved successful in classification problems, are not successful on sequence generation problems; this is a result that favors the defender. Nevertheless, we describe the specific conditions where sequence-to-sequence models still leak private information, and discuss the possibility of more powerful attacks (Section 7).

2 Problem Definition

2.1 General Framework

We now define the membership inference attack problem for sequence-to-sequence models in detail. Following tradition in the security research literature, we introduce three characters:

  • Alice (the service provider) builds a sequence-to-sequence model based on an undisclosed dataset 𝒜train and provides a public API. Using MT as the running example, this API takes a foreign sentence f as input and returns an English translation ê.

  • Bob (the attacker) is interested in discerning whether a data sample was included in Alice’s training data 𝒜train by exploiting Alice’s public API. This sample is called a “probe” and consists of a foreign sentence f and its reference English translation, e. Together with the API’s output ê, Bob has to make a binary decision using a membership inference classifier g(·), whose goal is to predict:[2]

    $$ g(f, e, \hat{e}) = \begin{cases} \textbf{in} & \text{if probe} \in \mathcal{A}_{\text{train}} \\ \textbf{out} & \text{otherwise} \end{cases} \tag{1} $$

    We term in-probes to be those probes where the true class is in, and out-probes to be those whose true class is out. Importantly, note that Bob has access not only to f but also to e in the probe. Intuitively, if ê is equivalent to e, then Bob may believe that the probe was contained in 𝒜train; however, it may also be possible that Alice’s model generalizes well to new samples and translates this probe correctly. The challenge for Bob is to make this distinction; the challenge for Alice is to prevent Bob from doing so.

    [2] In the experiments, we will also consider extending the information available to Bob. For example, if Alice additionally provides the translation probabilities ρ in the API, then Bob can exploit that in the classifier as g(f, e, ê, ρ).

  • Carol (the neutral third party) is in charge of setting up the experiment between Alice and Bob. She decides which data samples should be used as in-probes and out-probes and evaluates Bob’s classification accuracy. Carol is introduced only to clarify the exposition and to set up a fair experiment for research purposes. In practical scenarios, Carol does not exist: Bob decides his own probes, and Alice decides her own 𝒜train.

2.2 Detailed Specification

In order to be precise about how Carol sets up the experiment, we will explain in terms of machine translation, but note that the problem definition applies to any sequence-to-sequence problem. A training set for MT consists of a set of sentence pairs {(f_i^(d), e_i^(d))}. We use a label d ∈ {1, 2, …} to indicate the domain (the subcorpus or the data source), and an index i ∈ {1, 2, …, I(d)} to indicate the sample id in the domain (subcorpus). For example, e_i^(d) with d=1 and i=1 might refer to the first sentence in the Europarl subcorpus, while e_i^(d) with d=2 and i=1 might refer to the first sentence in the CommonCrawl subcorpus. I(d) is the maximum number of sentences in the subcorpus with label d. The distinction among subcorpora is not necessary in the abstract definition of membership inference attacks, but is important in practice. Generally, it is beneficial to distinguish among domains or subcorpora in the training set when the differences in data distribution may reveal some signals in membership inference.

Without loss of generality, in this section let us assume that Carol has a finite number of samples from two subcorpora d ∈ {1, 2}. First, Carol creates an out-probe of k samples from subcorpus 1:

$$ \mathcal{A}_{\text{out\_probe}} = \{ (f_i^{(d)}, e_i^{(d)}) : d = 1;\ i = 1, \dots, k \} \tag{2} $$

Then Carol creates the data for Alice to train Alice’s MT model, using subcorpora 1 and 2:

$$ \mathcal{A}_{\text{train}} = \{ (f_i^{(d)}, e_i^{(d)}) : d \in \{1, 2\};\ i = k+1, \dots, I(d) \} \tag{3} $$

Importantly, the two sets are totally disjoint, i.e., 𝒜out_probe ∩ 𝒜train = ∅. By definition, out-probes are sentence pairs that are not in Alice’s training data.

Finally, Carol creates the in-probe of k samples by drawing from 𝒜train, i.e., 𝒜in_probe ⊂ 𝒜train, which is defined to be samples that are included in training:

$$ \mathcal{A}_{\text{in\_probe}} = \{ (f_i^{(d)}, e_i^{(d)}) : d = 1;\ i = k+1, \dots, 2k \} \tag{4} $$

Note that both 𝒜in_probe and 𝒜out_probe are sentence pairs that come from the same subcorpus; the only difference is that the former is included in 𝒜train while the latter is not.

There are several ways in which Bob’s data can be created. For this work, we will assume that Bob also has some data to train MT models, in order to mimic Alice and design his attacks. This data could either be disjoint from 𝒜train, or contain parts of 𝒜train. We choose the latter, which assumes that there might be some public data that is accessible to both Alice and Bob. This scenario slightly favors Bob. In the case of MT, parallel data can be hard to come by, and datasets like Europarl are widely accessible to anyone, so presumably both Alice and Bob would use them. However, we expect that Alice has an in-house dataset (e.g., crawled data) which Bob does not have access to. Thus, Carol creates data for Bob by:

$$ \mathcal{B}_{\text{all}} = \{ (f_i^{(d)}, e_i^{(d)}) : d = 1;\ i = 2k+1, \dots, I(d) \} \tag{5} $$

Note that this dataset ℬall is like 𝒜train but with two exceptions: all samples from subcorpus 2 and all samples from 𝒜in_probe are discarded. One can view subcorpus 2 as Alice’s own in-house corpus which Bob has no knowledge of or access to, and subcorpus 1 as the shared corpus where membership inference attacks are performed.

To summarize, Carol gives 𝒜train to Alice, who uses it in whatever way she chooses to build a sequence-to-sequence model M[𝒜train,Θ]. The model is trained on 𝒜train with hyperparameters Θ (e.g., neural network architecture) known only to Alice. In parallel, Carol gives Bob his dataset ℬall, which he uses to design various attack strategies, resulting in a classifier g(·) (see Section 5). When it is time for evaluation, Carol provides both probes 𝒜in_probe and 𝒜out_probe to Bob in randomized order and asks Bob to classify each sample as in or out. For each probe (f_i^(d), e_i^(d)), Bob is allowed to make one call to Alice’s API to obtain ê_i^(d). In more elaborate versions of the attack, Bob will be allowed multiple API calls per classification decision.

As an additional evaluation, Carol creates a third probe based on data from an entirely new subcorpus 3. We call this the “out-of-domain (ood) probe”:

$$ \mathcal{A}_{\text{ood}} = \{ (f_i^{(d)}, e_i^{(d)}) : d = 3;\ i = 1, \dots, k \} \tag{6} $$

Both 𝒜out_probe and 𝒜ood should be classified as out by Bob’s classifier. However, it is known that sequence-to-sequence models behave very differently on data from domains or genres that are significantly different from the training data (Koehn and Knowles, 2017). The goal of having two out probes is to quantify the difficulty or ease of membership inference in different situations.
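The index ranges in Equations (2) to (6) can be made concrete with a small sketch. The function name `carol_split`, the toy sentence pairs, and the 0-based list slicing are illustrative assumptions, not part of the released dataset tooling; the sketch only mirrors the index arithmetic of the definitions above.

```python
def carol_split(sub1, sub2, sub3, k):
    """Split three subcorpora following Equations (2)-(6).

    sub1: shared subcorpus (d=1), sub2: Alice's in-house subcorpus (d=2),
    sub3: out-of-domain subcorpus (d=3). Each is a list of (f, e) pairs.
    Paper indices are 1-based; Python slices here are 0-based.
    """
    return {
        "out_probe": sub1[:k],             # Eq. (2): d=1, i=1..k
        "a_train": sub1[k:] + sub2[k:],    # Eq. (3): d in {1,2}, i=k+1..I(d)
        "in_probe": sub1[k:2 * k],         # Eq. (4): d=1, i=k+1..2k
        "b_all": sub1[2 * k:],             # Eq. (5): d=1, i=2k+1..I(d)
        "ood": sub3[:k],                   # Eq. (6): d=3, i=1..k
    }


def toy_corpus(name, n):
    # Hypothetical placeholder sentence pairs, for illustration only.
    return [(f"{name}-src-{i}", f"{name}-ref-{i}") for i in range(1, n + 1)]


splits = carol_split(toy_corpus("sub1", 10), toy_corpus("sub2", 8),
                     toy_corpus("sub3", 5), k=2)
for name, pairs in splits.items():
    print(name, len(pairs))
```

In this toy run the out-probe and Alice's training set share no pairs, the in-probe is a subset of Alice's training set, and Bob's data excludes both the in-probe and everything from subcorpus 2.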



Figure 2: Illustration of data splits for Alice and Bob. There are k samples each for 𝒜in_probe, 𝒜out_probe, and 𝒜ood. Alice’s training data 𝒜train excludes 𝒜out_probe and subcorpus 3, while including 𝒜in_probe. Bob’s data ℬall is a subset of Alice’s data, excluding 𝒜in_probe and subcorpus 2.

2.3 Summary and Alternative Definitions

Figure 2 summarizes the problem definition. The probes 𝒜out_probe and 𝒜ood are by construction outside of Alice’s training data 𝒜train, while the probe 𝒜in_probe is included. Bob’s goal is to produce a classifier that can make this distinction. He has at his disposal a smaller dataset ℬall, which he can use in whatever way he desires.

In our problem definition, for each sample in the probe, Bob can make one call to Alice’s API to obtain ê, which he then feeds into his membership inference classifier g(f, e, ê). Bob’s accuracy is measured on a per-sample basis.

There are alternative definitions of this membership inference problem. For example, one can allow Bob to make multiple API calls to Alice’s model for each probe. This enlarges the repository of potential attack strategies for Bob. Or, one could evaluate Bob’s accuracy not on a per-sample basis, but at a coarser granularity where Bob can aggregate inferences over multiple samples. There is also a distinction between white-box and black-box attacks: we focus on the black-box case where Bob has no access to the internal parameters of Alice’s model, but can only guess at likely model architectures. In the white-box case, Bob would have access to Alice’s model internals, so different attacks would be possible (e.g., backpropagation of gradients). In these respects, our problem definition makes the problem more challenging for Bob the attacker.

Finally, we note that Bob is not necessarily always the “bad guy” in this story. Here are some practical examples of who Alice and Bob might be in the MT space:

  • Organizations (Bob) that provide bitext training data under license restrictions might be interested to determine whether their licenses are being complied with in published models and public APIs (Alice).

  • The Conference on Machine Translation organizes an annual bakeoff, in which teams submit systems in both constrained and unconstrained settings, which are then evaluated and compared with automatic metrics and a human evaluation. The goal of the evaluation is to help direct research, but victory often confers bragging rights, with stakes increasing as the competition becomes more fierce. Organizers (Bob) might wish to establish or confirm that the participants (Alice) are following the rules.

  • “MT as a service” providers may provide customized engines if users upload their own bitext training data, promising to deliver superior translation results. This is usually accomplished by continued training of a general-domain model across the user-supplied data. The provider needs to provide guarantees that either (a) the user-supplied data will not be used in the customized engines of other users, or (b) if the data is used to improve performance for everyone, the privacy of the data is not leaked. In this case, the provider may play the role of both Alice and Bob, attacking its own model to provide quantitative guarantees to the user.

Alice, Bob, and Carol are fictitious entities created to clarify what information each party holds, and the long-term goal of this kind of adversarial framework is to drill down into the problem as both attackers and defenders successively improve their respective abilities.

3 Related Work

Shokri et al. (2017) introduced the problem of membership inference attacks on machine learning models. They showed that with shadow models trained on either realistic or synthetic datasets, Bob can build classifiers that can discriminate 𝒜in_probe and 𝒜out_probe with high accuracy. They focus on classification problems such as CIFAR image recognition and Kaggle’s Purchase benchmark, and demonstrate successful attacks on both convolutional neural net models as well as the models provided by Amazon ML and Google Prediction API.

Why do these attacks work? The main information exploited by Bob’s classifier is the output distribution of class labels returned by Alice’s API. The prediction uncertainty differs for data samples inside and outside the model training data, and this can be exploited. Shokri et al. (2017) propose defense strategies for Alice, such as restricting the prediction vector to top-k classes, coarsening the values of the output probabilities, and increasing the entropy of the prediction vector. The crucial difference between their work and ours, besides our focus on sequence generation problems, is the availability of this kind of output distribution provided by Alice. While it is common to provide the whole distribution of output probabilities in classification problems, this is not possible in sequence generation problems because the output space of sequences is exponential in the output length. At most, sequence models can provide a score for the output prediction ê_i^(d), for example with a beam search procedure, but this is only one number and is not normalized. We do experiment with having Bob exploit this score (Table 3), but it appears far inferior to the use of the whole distribution available in classification problems.

Subsequent work on membership inference has focused on different angles of the problem. Salem et al. (2018) investigated the effect of training the shadow model on datasets that match or do not match the distribution of 𝒜train, and compared training a single shadow model as opposed to many. Truex et al. (2018) present a comprehensive evaluation of different model types, training data, and attack strategies. Borrowing ideas from adversarial learning and minimax games, Hayes et al. (2017) propose attack methods based on generative adversarial networks, while Nasr et al. (2018) provide adversarial regularization techniques for the defender. Nasr et al. (2019) extend the analysis of membership inference to white-box attacks and a federated learning setting. Pyrgelis et al. (2018) provide an empirical study on location data. Veale et al. (2018) discuss the membership inference problem, along with the related model inversion problem, in the broader context of data protection laws like GDPR.

As noted by Shokri et al. (2017), there is a synergistic connection between the goals of learning and the goals of privacy in the case of membership inference: the goal of learning is to generalize to data outside the training set (e.g., so that 𝒜out_probe and 𝒜ood are translated well), while the goal of privacy is to prevent leaking information about data in the training set. The common enemy of both goals is overfitting. Yeom et al. (2017) analyze how overfitting by Alice increases the risk of privacy leakage; Long et al. (2018) showed that even a well-generalized model holds such risks in classification problems, implying that overfitting by Alice is a sufficient but not necessary condition for privacy leakage.

A large body of work exists in differential privacy (Dwork, 2008; Machanavajjhala et al., 2017). Differential privacy provides guarantees that a model trained on some dataset 𝒜train will produce statistically similar predictions as a model trained on another dataset which differs in exactly one sample. This is one way in which Alice can defend her model (Rahman et al., 2018), but note that differential privacy is a stronger notion and often involves a cost in Alice’s model accuracy. Membership inference assumes that the content of the data is known to Bob and is concerned only with whether it was used. Differential privacy also protects the content of the data (i.e., the actual words in (f_i^(d), e_i^(d)) should not be inferred).
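For readers who want the guarantee in symbols, the standard ε-differential privacy condition can be written as below. The notation is an assumption introduced here for illustration and does not appear in the paper: $\mathcal{M}$ denotes a randomized training-and-prediction mechanism, $D$ and $D'$ are any two datasets differing in exactly one sample, $S$ is any set of possible outputs, and $\varepsilon \geq 0$ is the privacy budget.

```latex
\Pr\left[ \mathcal{M}(D) \in S \right] \;\leq\; e^{\varepsilon} \, \Pr\left[ \mathcal{M}(D') \in S \right]
\qquad \text{for all neighboring } D, D' \text{ and all } S \subseteq \mathrm{Range}(\mathcal{M})
```

A smaller ε forces the two output distributions closer together, which is what makes the presence or absence of a single training sample hard to detect.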

4 Data and Evaluation Protocol

Based on the problem definition introduced in Section 2, we construct the actual dataset and design an evaluation protocol to investigate the possibility of membership inference attacks on MT models. We need to carefully design a benchmark that is fair to both Alice and Bob, so that we can correctly measure the effectiveness of a given attack. We will release the dataset in the future so that others can evaluate their own attacks in the same setting and compare their performance.

4.1 Data: subcorpora and splits

MT models, especially neural MT models, need a large amount of training data to work reasonably well. We would also like to consider the effect of different domains, so we chose data from multiple subcorpora to construct a dataset for Alice and Bob. Some of the subcorpora are included in Alice’s training data and are also available to Bob, while others are treated as out-of-domain subcorpora that neither Alice nor Bob ever sees.

We used corpora from the Third Conference on Machine Translation (WMT18) (Bojar et al., 2018). We chose the German–English setting for MT because it has a reasonably large amount of training data, and previous work shows that models perform reasonably well on it compared to other language pairs. The attack is expected to be more difficult if the model generalizes well; therefore, if the attack succeeds on such a model, we may expect it to be effective for weaker models as well. We also chose German as the source and English as the target language because it is easier for English speakers to interpret the model output.

We now describe how Carol prepares the data for Alice and Bob. First, Carol treats a filtered version of ParaCrawl (described in the next section) as a subcorpus which is available only to Alice and never to Bob (subcorpus 2 in Section 2.2). We can think of it as in-house data that the service provider has. Carol additionally selects 4 subcorpora for Alice’s training data, namely CommonCrawl, Europarl v7, News Commentary v13, and Rapid 2016. A subset of samples from these 4 subcorpora is also available to Bob (subcorpus 1 in Section 2.2).

For all these subcorpora, Carol first performs basic preprocessing: (a) tokenization of both the German and English sides using the Moses tokenizer, (b) de-duplication of sentence pairs so that only unique pairs are present, and (c) randomly shuffling all sentences prior to splitting into probes and MT training data.[3]

[3] These are all design decisions that balance between having a simple experiment setup vs. having a realistic condition. For example, Carol doing a common tokenization removes some of the MT-specific complexity for researchers who want to focus on the Alice or Bob models on this data. However, in a real-world public API, Alice’s tokenization is likely to be unknown to Bob. In this case, we decided on a middle point to have Carol perform a common tokenization, but Alice and Bob do their own subword segmentation, if desired.



Figure 3: Illustration of how Carol splits the subcorpora for Alice and Bob. 𝒜train does not contain 𝒜out_probe, and ℬall is a subset of 𝒜train with 𝒜in_probe and ParaCrawl excluded. In addition to these subcorpora, we have out-of-domain (ood) subcorpora 𝒜ood which are included in neither 𝒜train nor ℬall.

Figure 3 illustrates how Carol splits subcorpora for Alice and Bob. Carol splits each subcorpus to create the probes 𝒜in_probe and 𝒜out_probe, as well as 𝒜train and ℬall. Carol sets k=5,000, meaning each probe set per subcorpus has 5,000 samples. For each subcorpus, Carol selects 5,000 samples to create 𝒜out_probe. She then uses the rest as 𝒜train and selects 5,000 from it as 𝒜in_probe. She excludes 𝒜in_probe and ParaCrawl from 𝒜train to create a dataset for Bob, ℬall.[4]

[4] We prepared two different pairs of 𝒜in_probe and 𝒜out_probe. Thus ℬall has 10k fewer samples than 𝒜train per shared subcorpus, and not 5k fewer. For the experiment we used only one pair, and kept the other for future use.

In addition to these subcorpora, Carol uses 4 other domains to create the out-of-domain probe set 𝒜ood, namely EMEA and Subtitles 18 (Tiedemann, 2012), Koran (Tanzil), and TED (Duh, 2018). These subcorpora correspond to subcorpus 3 in Section 2.2. The size of 𝒜ood is 5,000 per subcorpus, the same as 𝒜in_probe and 𝒜out_probe.

The number of samples for each set and subcorpus is summarized in Table 1.[5]

[5] After the data was prepared, Carol also removed samples where the source or reference sentence is empty. Carol removed 38 Europarl samples from the Alice probe set, 182 Europarl samples from the Bob probe set, and 1 EMEA sample from the OOD probe set. We explain Bob’s probe in Section 5.1.

Subcorpus      𝒜out_probe   𝒜in_probe       𝒜train        ℬall     𝒜ood
ParaCrawl           5,000       5,000    4,518,029           0      N/A
CommonCrawl         5,000       5,000    2,389,123   2,379,123      N/A
Europarl            5,000       5,000    1,865,271   1,855,271      N/A
News                5,000       5,000      273,702     263,702      N/A
Rapid               5,000       5,000    1,062,214   1,052,214      N/A
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
EMEA                  N/A         N/A          N/A         N/A    5,000
Koran                 N/A         N/A          N/A         N/A    5,000
Subtitles             N/A         N/A          N/A         N/A    5,000
TED                   N/A         N/A          N/A         N/A    5,000
TOTAL              25,000      25,000   10,108,339   5,550,310   20,000

Table 1: Number of sentences per set and subcorpus. For each subcorpus, 𝒜train includes 𝒜in_probe and does not include 𝒜out_probe. ℬall is a subset of 𝒜train, excluding 𝒜in_probe and ParaCrawl. 𝒜ood is for evaluation only, and only Carol has access to it.
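As a quick sanity check on Table 1, the column totals and the per-subcorpus gap between Alice's training data and Bob's data can be recomputed. The numbers below are copied from the table; the variable names are illustrative only.

```python
# Sentence counts copied from Table 1: (Alice's training set, Bob's data).
table1 = {
    "ParaCrawl":   (4_518_029, 0),
    "CommonCrawl": (2_389_123, 2_379_123),
    "Europarl":    (1_865_271, 1_855_271),
    "News":        (273_702, 263_702),
    "Rapid":       (1_062_214, 1_052_214),
}

total_train = sum(train for train, _ in table1.values())
total_bob = sum(bob for _, bob in table1.values())

# For subcorpora shared with Bob, the gap is the data withheld from him.
gaps = {name: train - bob for name, (train, bob) in table1.items() if bob > 0}

print(total_train, total_bob, gaps)
```

The totals match the TOTAL row, and each shared subcorpus differs by 10,000 sentences, which agrees with the note that two pairs of 5,000-sample probes were prepared.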

In Shokri et al. (2017) the attacker generates training data, by either using the target model itself, or by statistics-based synthesis assuming the attacker knows some statistical information about the population from which the target model’s data was drawn, or from noisy real data. On the other hand, in our setting Bob has actual subsets of the data Alice uses to train her model.

4.2 Alice MT Architecture

Alice uses her dataset 𝒜train to train her own MT model. Neither the full training data nor the details of the model are available to Bob.

The ParaCrawl subcorpus is available only to Alice and never to Bob. NMT is highly susceptible to noise in the training data (Khayrallah and Koehn, 2018), and ParaCrawl is very noisy, so Alice applied dual conditional cross-entropy filtering (Junczys-Dowmunt, 2018), retaining roughly the top 4.5 million lines. After filtering, Alice concatenated all the training data (𝒜train in Table 1). Alice then applied normalization and tokenization from the Moses toolkit (Koehn et al., 2007), and trained a joint BPE subword model (Sennrich et al., 2016) using 32,000 merge operations. No recasing was applied.

Alice’s model is a six-layer Transformer (Vaswani et al., 2017) using default parameters in Sockeye (Hieber et al., 2017).[6] The model was trained until perplexity on newstest2017 (Bojar et al., 2017) had not improved for five consecutive checkpoints, computed every 5,000 batches.

[6] Three-way tied embeddings, model and embedding size 512, eight attention heads, 2,048 hidden states in the feed forward layers, layer normalization applied before each self-attention layer, and dropout and residual connections applied afterward, word-based batch size of 4,096.

The BLEU score (Papineni et al., 2002) on newstest2018 was 42.6, computed using sacreBLEU (Post, 2018) with the default settings.[7]

[7] Version 1.2.12, case-sensitive, “13a” tokenization for comparability with WMT.

4.3 Evaluation Protocol

Bob uses his dataset ℬall to create his own MT models and uses them to construct a membership inference classifier. We explain the details of the attacks in Section 5. Note that the details of the MT models, such as their architecture or training algorithm, are not necessarily the same for Alice and Bob, as Bob does not have that information about Alice’s model. For our experiments, the Alice and Bob models were created by different authors of this paper, without either knowing the details of the other’s model. One may also consider a scenario where Bob knows something about Alice’s model. Shokri et al. (2017) is closer to such a scenario, as the attacker either knows the model architecture and training algorithm, or has access to the same machine learning as a service API used to train the model.

For a fair evaluation, we use the following procedure to attack the model and compute the attack accuracy. First, Bob sends sentences to Alice. Alice uses her model to translate the given sentences and returns the output to Bob. Given the translations, Bob then uses his classifier to infer their membership and sends the results to Carol, the evaluator. Finally, Carol computes the accuracy of the membership inference and reports it to Bob.

We define accuracy as the percentage of samples for which the classification result is correct. More formally, given a probe set P containing a list of tuples (f, e, ê, l), where the symbols represent the source sentence, the reference sentence, the translation of the source sentence by the target MT model, and the label (in or out), respectively, the accuracy of the attack using membership inference classifier g(·) is defined as follows, where the bracket [·] equals 1 if the condition inside holds and 0 otherwise:

$$ \mathrm{accuracy}(g, P) = \frac{1}{|P|} \sum_{(f, e, \hat{e}, l) \in P} \left[ g(f, e, \hat{e}) = l \right] \tag{7} $$

If the accuracy is 50%, then the binary classification is no better than random guessing, and Alice is safe. If it is above 50%, this suggests that some private information is leaked and Bob has some chance of inferring membership.
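The accuracy definition in Equation (7) can be sketched in a few lines. The classifier `exact_match_g` and the toy probes are hypothetical stand-ins chosen only to exercise the formula; they are not one of the attacks evaluated in the paper.

```python
def accuracy(g, probes):
    """Fraction of probes (f, e, e_hat, label) that g classifies correctly."""
    correct = sum(1 for f, e, e_hat, label in probes if g(f, e, e_hat) == label)
    return correct / len(probes)


def exact_match_g(f, e, e_hat):
    # Hypothetical rule: guess "in" only when the output equals the reference.
    return "in" if e_hat == e else "out"


# Toy probes: (source, reference, model output, true label).
toy_probes = [
    ("guten morgen", "good morning", "good morning", "in"),
    ("danke", "thank you", "thanks", "out"),
    ("wie geht es", "how are you", "how is it going", "out"),
    ("bis bald", "see you soon", "see you soon", "out"),  # generalized well
]

print(accuracy(exact_match_g, toy_probes))
```

The last probe shows why exact matching is not a reliable signal: a model that generalizes well can reproduce the reference for a sentence it never saw, so the rule misclassifies it and the toy accuracy is 0.75 rather than 1.0.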

5 Membership Inference Attacks

5.1 Shadow Model Framework

Bob’s initial approach for the attack is a straightforward one, using “shadow models”, similar to that of Shokri et al. (2017). The idea is that Bob creates models that mimic the target model using his own data, trains a membership inference classifier on these shadow models, and then applies it to Alice’s model.

Bob splits his data ℬall into his own versions of the in-probe, out-probe, and training set in multiple ways to train MT models. He then translates these probe sentences and uses the resulting (f, e, ê) with its in or out label to train a binary classifier g(f, e, ê).

Bob first selects 10 sets of 5,000 sentences per subcorpus in ℬall. He then chooses 2 sets and uses one as the in-probe and the other as the out-probe, and combines the in-probe and the rest (ℬall minus the 10 sets) as a training set. We use the notation ℬin_probe^{1+}, ℬout_probe^{1+}, and ℬtrain^{1+} for the first group of in-probe, out-probe, and training set. Bob then swaps the in-probe and out-probe to create another group, which we denote ℬin_probe^{1−}, ℬout_probe^{1−}, and ℬtrain^{1−}. With 10 sets of 5,000 sentences, Bob can create 10 different groups of in-probe, out-probe, and training set. Figure 4 illustrates the data splits.



Figure 4: Illustration of how Bob splits ℬall for each shadow model. Blue boxes are the in-probe ℬin_probe and training data ℬtrain, where the small box is the in-probe and the small and large boxes combined are the training data. The green box indicates the out-probe ℬout_probe. Bob uses models from splits 1 to 3 as the training set, 4 as the validation set, and 5 as the test set for his membership inference classifier.

For each group of data, Bob first trains a shadow MT model using the training set. He then uses this model to translate sentences in the in-probe and out-probe sets. Bob now has a list of (f, e, ê) from different shadow models, and he knows for each sample whether it was in or out of the training data for the MT model used to translate that sentence.
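The grouping of Bob's data into shadow-model splits can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `shadow_groups`, the fixed shuffle seed, and the pairing of consecutive sets (1 with 2, 3 with 4, and so on) are choices made here, since the text only says that Bob chooses 2 sets per group and then swaps them. For brevity the sketch also treats Bob's data as a single pool rather than sampling per subcorpus.

```python
import random


def shadow_groups(bob_data, n_sets=10, set_size=5000, seed=0):
    """Build in-probe / out-probe / training groups for shadow models."""
    rng = random.Random(seed)
    data = list(bob_data)
    rng.shuffle(data)

    # Carve out n_sets held-out sets; everything else is the common "rest".
    sets = [data[i * set_size:(i + 1) * set_size] for i in range(n_sets)]
    rest = data[n_sets * set_size:]

    groups = {}
    for j in range(0, n_sets, 2):
        first, second = sets[j], sets[j + 1]
        split_id = j // 2 + 1
        # "+" group: first set is in the training data, second set is held out.
        groups[f"{split_id}+"] = {
            "in_probe": first, "out_probe": second, "train": first + rest}
        # "-" group: the roles of the two sets are swapped.
        groups[f"{split_id}-"] = {
            "in_probe": second, "out_probe": first, "train": second + rest}
    return groups


# Toy run: 100 integer "sentences", 10 sets of 5 each.
toy_groups = shadow_groups(range(100), n_sets=10, set_size=5)
print(sorted(toy_groups))
```

Each group trains one shadow model on its `train` list, so every in-probe sentence has been seen by that model and every out-probe sentence has not; the other eight held-out sets are excluded from that group's training data.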

5.2 Bob MT Architecture

In Shokri et al. (2017) the attacker either knows the model architecture or the training algorithm of the target model, or has access to the same machine learning as a service API used to train the target model. In contrast, in our setting Bob does not know the details of Alice’s model, so the model architecture and training algorithm of the shadow models differ from those of Alice’s model described in Section 4.2.

Bob’s model is a four-layer Transformer, with no tied embeddings, a model and embedding size of 512, eight attention heads, 1,024 hidden states in the feed-forward layers, and a word-based batch size of 4,096. Adam (Kingma and Ba, 2015) is used for optimization. For regularization, label smoothing is applied with a parameter of 0.1. Bob has BPE subword models with a vocabulary size of 30,000 for each language. The models were trained until perplexity on newstest2016 (Bojar et al., 2016) had not improved for sixteen consecutive checkpoints, computed every 4,000 batches.

Note that for our setup, the sentence tokenization tool is the same for Alice and Bob. (Footnote 8: Since practices around these processes are fairly standard across the MT community, this assumption is not unreasonable.) However, the BPE units are different, as Bob does not have access to the vocabulary of Alice's data. The mean BLEU score of the ten shadow models on newstest2018 is 38.6±0.2 (compared to 42.6 for Alice).

5.3 Membership Inference Classifier

Bob extracts features from (f, e, ê) for a binary classifier. He uses modified 1- to 4-gram precisions and the sentence-level BLEU score as features. Bob’s intuition is that if an unusually large number of n-grams in ê match e, it could be a sign that the sample was in the training data and Alice memorized it. Bob calculates n-gram precision by counting the number of n-grams in the translation that appear in the reference sentence. To mitigate the effect of repeated n-grams in the translation, an n-gram in the reference is considered exhausted once a matching n-gram in the translation is found. He calculates the sentence-level BLEU score from the 1- to 4-gram precisions, with the smoothing method of Lin and Och (2004). In the experiments Bob also tried several other smoothing methods, but we did not see a large difference in the results. In a later investigation Bob also considered the MT model score as an extra feature.
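The feature extraction described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the whitespace tokenization, the function names, and the particular smoothing (adding one to the match and total counts for n > 1, a common reading of Lin and Och (2004)) are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, ref, n):
    """Clipped n-gram matches and total hypothesis n-grams.

    Clipping means each reference n-gram can be matched only as many
    times as it occurs in the reference ("exhausted" after that).
    """
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matches, sum(hyp_counts.values())

def sentence_bleu(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothing for n > 1 (assumed variant)."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        matches, total = modified_precision(hyp, ref, n)
        if n > 1:
            matches, total = matches + 1, total + 1
        if matches == 0 or total == 0:
            return 0.0
        log_sum += math.log(matches / total)
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_sum / max_n)

def features(hyp, ref):
    """Feature vector: 1- to 4-gram precisions followed by sentence BLEU."""
    precisions = []
    for n in range(1, 5):
        matches, total = modified_precision(hyp, ref, n)
        precisions.append(matches / total if total else 0.0)
    return precisions + [sentence_bleu(hyp, ref)]

vector = features("the cat sat on the mat".split(), "the cat is on the mat".split())
```

The resulting five-dimensional vector is what a classifier g() would consume. The MT model score mentioned above would be appended as a sixth value if Bob has access to it.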

For training the binary classifiers, Bob uses models from data splits 1 to 3 for training, 4 for validation, and 5 for his own internal testing. Note that the final evaluation of the attack is done by Carol, using the translations of 𝒜in_probe and 𝒜out_probe produced by Alice's MT model.

Bob tried different types of classifiers, namely Perceptron, Decision Tree, Gaussian Naïve Bayes, Nearest Neighbors, and Multi-layer Perceptron. For the Decision Tree, we used Gini impurity as the splitting criterion and set the maximum depth to 5. For the Nearest Neighbors algorithm, we set k to 5. For the Multi-layer Perceptron, we set the hidden layer size to 100, the activation function to ReLU, and the L2 regularization term α to 0.0001.

Pseudocode 1 summarizes the procedure to construct a membership inference classifier g() using Bob’s dataset all.

Data: all
Result: g()
Split all into multiple groups of (in_probe_i, out_probe_i, train_i)
for each group i do
    Train a shadow model M_i using train_i
    Translate in_probe_i and out_probe_i with M_i
end
Use in_probe_i, out_probe_i, and their translations to train g()
Pseudocode 1: Construction of a Membership Inference Classifier
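As a rough Python rendering of Pseudocode 1, the data collection step might look like the sketch below. The callables train_mt, translate, and featurize are hypothetical placeholders for Bob's MT toolkit and feature extractor. The toy lookup-table "model" is there only to make the example run and has nothing to do with a real NMT system.

```python
def build_attack_dataset(groups, train_mt, translate, featurize):
    """Collect (features, label) pairs from every shadow model.

    groups: iterable of (in_probe, out_probe, train), each a list of (f, e) pairs.
    Labels: 1 for in-probe samples, 0 for out-probe samples.
    """
    X, y = [], []
    for in_probe, out_probe, train in groups:
        model = train_mt(train)                  # one shadow model per group
        for label, probe in ((1, in_probe), (0, out_probe)):
            for f, e in probe:
                e_hat = translate(model, f)      # shadow model output
                X.append(featurize(f, e, e_hat))
                y.append(label)
    return X, y

# Toy stand-ins (assumptions): the "model" memorizes its training pairs.
toy_train_mt = lambda train: dict(train)
toy_translate = lambda model, f: model.get(f, "")
toy_featurize = lambda f, e, e_hat: [float(e == e_hat)]

toy_groups = [([("a", "A")], [("b", "B")], [("a", "A"), ("c", "C")])]
X, y = build_attack_dataset(toy_groups, toy_train_mt, toy_translate, toy_featurize)
```

The pairs in X and y from all groups are then pooled to fit g(). In the paper's setup, groups from splits 1 to 3 are used for training, 4 for validation, and 5 for testing.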

6 Attack Results

We now present a series of results based on the shadow model attack method described previously. In Section 6.1 we will observe that Bob has difficulty attacking Alice under our definition of membership inference. In Sections 6.2 and 6.3 we will see that Alice nevertheless does leak some private information under more nuanced conditions.

6.1 Main Result

Classifier   Alice   Bob:train   Bob:valid   Bob:test
P            50.0    50.0        50.0        50.0
DT           50.4    51.4        51.2        51.1
NB           50.4    51.2        51.1        51.0
NN           49.9    61.6        50.5        50.0
MLP          50.2    50.8        50.8        50.8
Table 2: Accuracy (%) of membership inference per classifier type: Perceptron (P), Decision Tree (DT), Naïve Bayes (NB), Nearest Neighbors (NN), and Multi-layer Perceptron (MLP). The Alice column shows the accuracy of the attack on the Alice probes 𝒜in_probe and 𝒜out_probe. The Bob columns show the accuracy on the classifiers’ train, validation, and test sets. Note that, following the evaluation protocol explained in Section 4.3, only Carol the evaluator can observe the accuracy of the attacks on the Alice model.


Figure 5: Confusion matrices of the attacks on Alice model, per classifier model.

Table 2 shows the accuracy of the membership inference classifiers. There are 5 different types of classifiers, as described in Section 5.3: Perceptron (P), Decision Tree (DT), Naïve Bayes (NB), Nearest Neighbors (NN), and Multi-layer Perceptron (MLP). The numbers in the Alice column show the attack accuracy on the Alice probes 𝒜in_probe and 𝒜out_probe; these are the main results. The numbers in the Bob columns show the results on the Bob classifiers’ train, validation, and test sets, as described in Section 5.3.

The accuracy of the attacks on the Alice model is close to 50%, meaning that the attack is not successful and the binary classification is almost the same as a random choice. The accuracy is also around 50% for “Bob:test”, meaning that Bob has difficulty attacking his own simulated probes as well, so the poor performance on 𝒜in_probe and 𝒜out_probe is not due to potential mismatches between Alice’s model and Bob’s model.

The accuracy is around 50% for “Bob:train” as well (except for NN), which shows that overfitting by the classifier g() is not the reason for the unsuccessful attack. The accuracy of NN in the in-sample case is higher because the input itself is one of the datapoints stored in the model, and it always becomes the nearest neighbor. When the k value is increased, the accuracy on in-sample data decreases. The result suggests that the current features do not provide enough information to distinguish in-probe from out-probe sentences.
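The in-sample behavior of the nearest-neighbor classifier can be reproduced on pure noise. The sketch below is a toy illustration, not the paper's experiment: the features and labels are random (so there is nothing to learn), and the helper names are hypothetical.

```python
import random

def knn_predict(train_X, train_y, x, k):
    """Majority label among the k nearest training points (squared Euclidean)."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = sum(train_y[i] for i in order[:k])
    return 1 if 2 * votes > k else 0

def train_accuracy(train_X, train_y, k):
    """Accuracy when the classifier is queried with its own training points."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(train_X, train_y))
    return hits / len(train_X)

rng = random.Random(0)
X = [[rng.random() for _ in range(5)] for _ in range(200)]
y = [rng.randint(0, 1) for _ in range(200)]   # labels carry no signal

acc_k1 = train_accuracy(X, y, k=1)   # each point is its own nearest neighbor
acc_k5 = train_accuracy(X, y, k=5)   # the point's own vote is diluted
```

With k = 1 every training point retrieves itself at distance zero, so in-sample accuracy is perfect even though the labels are random. With a larger k the point's own label is only one vote among several, which is consistent with the drop described above.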

Figure 5 shows the confusion matrices of the attack on the Alice probes. We can see that the P (Perceptron) classifier always predicts out and hence gets 50% accuracy, as the in-probe and out-probe sets are the same size. For the other classifiers the predictions are more balanced.

Classifier   Alice         Bob:train     Bob:valid     Bob:test
P            49.7 (-0.3)   49.2 (-0.8)   49.3 (-0.7)   49.4 (-0.6)
DT           50.4 (+0.0)   51.5 (+0.1)   51.1 (-0.1)   51.2 (+0.1)
NB           50.1 (-0.3)   50.2 (-1.0)   50.1 (-1.0)   50.2 (-0.8)
NN           50.2 (+0.3)   67.1 (+5.5)   50.2 (-0.3)   50.0 (+0.0)
MLP          50.4 (+0.2)   51.2 (+0.4)   51.2 (+0.4)   51.1 (+0.3)
Table 3: Membership inference accuracy (%) when the MT model score is added as an extra feature for the classifier. Values in parentheses are differences from Table 2.

Table 3 shows the result when MT model score is added as an extra feature for classification. The result indicates that this extra information does not help the attack. In summary, these results suggest that Bob is not able to reveal membership information at the sentence/sample level. This result is in contrast to previous work on membership inference in “classification” problems, which demonstrated high accuracies with Bob’s shadow model attack.

6.2 Out-of-Domain Subcorpora

Classifier   ParaCrawl   CommonCrawl   Europarl   News   Rapid   EMEA    Koran   Subtitles   TED
P            50.0        50.0          50.0       50.0   50.0    100.0   100.0   100.0       100.0
DT           50.3        51.1          49.7       50.7   50.0    67.2    94.1    80.2        67.1
NB           50.1        51.2          49.9       50.6   50.2    69.5    96.1    81.7        70.5
NN           49.4        50.7          50.3       49.7   49.2    43.3    52.6    48.7        49.9
MLP          49.6        50.8          49.9       50.3   50.7    73.6    97.9    84.8        85.0
Table 4: Membership inference accuracy (%) per subcorpus. The right 4 columns are results for out-of-domain subcorpora. Note that ParaCrawl is out-of-domain for Bob and his classifier, whereas it is in-domain for Alice and her MT model.


Figure 6: Distribution of sentence-level BLEU per subcorpus for 𝒜in_probe (blue boxes), 𝒜out_probe (green, left five boxes), and 𝒜ood (green, right four boxes).

Carol prepared out-of-domain (OOD) subcorpora, 𝒜ood, that are totally separate from 𝒜train and all. The membership inference accuracy for each subcorpus is shown in Table 4. We can see that the accuracy for the OOD subcorpora is much higher than that for the original in-domain subcorpora. For example, the accuracy with the Decision Tree was 50.3% and 51.1% for ParaCrawl and CommonCrawl (in-domain), whereas it was 67.2% and 94.1% for EMEA and Koran (out-of-domain). This suggests that for OOD data Bob has a better chance of inferring membership.

Figure 6 shows the distribution of sentence-level BLEU scores per subcorpus. The BLEU scores tend to be lower for the OOD subcorpora, and the classifier may exploit this information to distinguish membership better. None of the samples from the OOD data are in the training data of the MT model, so we can expect the output quality of the model to be worse than when the input is from an in-domain subcorpus.
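Because every OOD sample has the true label out, per-subcorpus accuracy on OOD data rewards any classifier that leans toward out. The toy sketch below (hypothetical function name and made-up records, not the paper's data) shows how an always-out predictor, like the Perceptron in Section 6.1, scores 50% on a balanced in-domain probe but 100% on an all-out OOD probe.

```python
from collections import defaultdict

def accuracy_by_subcorpus(records):
    """records: iterable of (subcorpus, true_label, predicted_label)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for subcorpus, truth, predicted in records:
        totals[subcorpus] += 1
        hits[subcorpus] += int(truth == predicted)
    return {name: hits[name] / totals[name] for name in totals}

# Made-up labels: a balanced in-domain probe and an all-"out" OOD probe.
samples = [("Europarl", "in"), ("Europarl", "in"),
           ("Europarl", "out"), ("Europarl", "out"),
           ("Koran", "out"), ("Koran", "out"),
           ("Koran", "out"), ("Koran", "out")]

# A degenerate classifier that always predicts "out".
records = [(name, truth, "out") for name, truth in samples]
scores = accuracy_by_subcorpus(records)
```

This is why the 100.0 entries for P in Table 4 do not indicate a working attack, whereas the DT, NB, and MLP numbers reflect predictions that actually vary with the features.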

Overall, these results suggest that Bob’s accuracy depends on the specific type of probe being tested. If there is a wide distribution of domains for the problem at hand, there is a higher chance that Bob may be able to reveal membership information.

6.3 Out-of-Vocabulary Words

Classifier   OOV in src   OOV in ref   OOV in both
P            100.0        100.0        100.0
DT           73.9         74.1         68.0
NB           77.4         77.0         70.3
NN           49.9         49.2         49.3
MLP          89.0         85.8         80.4
Table 5: Membership inference accuracy (%) on sentences containing out-of-vocabulary (OOV) words.

We also focused on samples that contain words that never appear in the training data, i.e., out-of-vocabulary (OOV) words. For this analysis, we consider only vocabulary that does not exist in the training data of Bob’s shadow models, rather than Alice’s, since Bob does not have access to her vocabulary.

For Bob’s shadow models, 7.4%, 3.2%, and 1.9% of samples in the probe sets had one or more OOV words in the source, the reference, or both sentences, respectively. Table 5 shows the membership inference accuracy on the OOV subsets. Similar to the results with OOD subcorpora in Section 6.2, the accuracy is higher than that on the entire probe sets. We can expect that sentences with OOV words are translated poorly compared to those without OOV words, and the classifier can exploit this difference to distinguish membership.
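Selecting the OOV subsets amounts to a vocabulary lookup against the shadow model's training data. The sketch below assumes whitespace-tokenized sentences and word-level vocabularies; the function names are hypothetical.

```python
def build_vocab(sentences):
    """Set of all tokens seen in a list of tokenized sentences."""
    return {token for sentence in sentences for token in sentence}

def oov_flags(src_tokens, ref_tokens, src_vocab, ref_vocab):
    """Flag whether a probe sample has OOV words in source, reference, or both."""
    src_oov = any(token not in src_vocab for token in src_tokens)
    ref_oov = any(token not in ref_vocab for token in ref_tokens)
    return {"src": src_oov, "ref": ref_oov, "both": src_oov and ref_oov}

# Toy training data for one shadow model (made-up sentences).
train_src = [["das", "ist", "gut"], ["das", "haus"]]
train_ref = [["this", "is", "good"], ["the", "house"]]
src_vocab, ref_vocab = build_vocab(train_src), build_vocab(train_ref)

flags = oov_flags(["das", "auto"], ["the", "house"], src_vocab, ref_vocab)
```

Here the probe is flagged only on the source side, since "auto" is unseen while every reference token appears in the training references.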

In practical scenarios, Bob will not know the vocabulary of the training data Alice used. Thus, so far, Alice does not need to worry about leaking information through probes that contain OOV words. However, the high accuracy of Bob in Table 5 suggests that if there is an OOV-related signal that Bob can exploit, he may be able to attack Alice’s privacy easily.

7 Discussion

The results in Section 6 show that Alice is generally safe and that it is difficult for Bob to infer membership. In Shokri et al. (2017), membership inference attacks were successful for standard classification problems; for sequence generation, however, the attack is much harder. One reason may be that the input and output spaces of the latter are far larger and more complex, which makes it difficult to determine the quality of the model output or how confident the model is. The attacks in the former setting were successful because they exploit the difference in the model's output distribution when the input is in or out of the training set. For sequence generation, it is not trivial to quantify the distribution or quality of the output. In our experiments we used a single output, with its BLEU score and n-gram precisions as features; these do not provide enough information to make membership distinguishable.

Our results also show that when we consider out-of-domain (OOD) or out-of-vocabulary (OOV) data, Bob may have a better chance of making the attack successful. We can expect that such data makes it harder for the MT model to produce a high-quality translation, and that this behavior gives the classifier a better chance of inferring membership correctly.



Figure 7: How the number of samples in a probe set affects the overall attack accuracy. The numbers in the graphs are from the Decision Tree classifier; the convergence trends are similar for the other classifiers as well.

It is difficult for Bob to determine sentence-level membership. However, if we loosen the attack definition and consider subcorpus-level membership, he may reach a more confident conclusion than from the classification of a single sentence. If Bob uses multiple samples at once to determine membership, the reliability of the decision may increase. For example, Bob may decide that if the majority of the samples in a probe set are classified as in, the subcorpus is in the training data. Figure 7 shows how the accuracy changes as the number of samples in a probe set varies. We can see that the accuracy stabilizes with fewer than a thousand samples.
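To see why pooling samples can help, consider an idealized version of the majority-vote rule. The sketch below assumes that each sentence-level decision is correct independently with the same probability p, which is a strong simplification that real probe sets need not satisfy. The value p = 0.51 is a hypothetical example, and the function name is illustrative. It computes the binomial tail probability that more than half of n votes are correct; it does not reproduce Figure 7.

```python
import math

def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent votes are correct.

    Each vote is correct with probability p (0 < p < 1); use odd n to avoid ties.
    Computed in log space so that large n does not overflow.
    """
    log_p, log_q = math.log(p), math.log(1.0 - p)
    total = 0.0
    for k in range(n // 2 + 1, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * log_p + (n - k) * log_q)
        total += math.exp(log_pmf)
    return total

# Hypothetical per-sentence accuracy slightly above chance.
for n in (1, 11, 101, 1001):
    print(n, round(majority_vote_accuracy(0.51, n), 3))
```

Under these assumptions the pooled decision improves with n whenever p exceeds 0.5, while at exactly p = 0.5 pooling gives no benefit.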

The current attack approach is a simple one, and we can consider more complex strategies, for example, using the API multiple times per sentence pair. Given a sentence, Bob can manipulate it, for example by dropping or adding words, and translate each of the manipulated sentences to observe how the output of the Alice model changes.
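One simple form of such manipulation is to drop one word at a time and query the API with each variant. The sketch below is only an illustration of that idea: the translate callable is a hypothetical stand-in for Alice's API, and how the collected outputs would be turned into classifier features is left open, as it is in the text.

```python
def drop_one_word_variants(sentence):
    """All sentences obtained by deleting exactly one whitespace token."""
    tokens = sentence.split()
    return [" ".join(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]

def probe_with_variants(sentence, translate):
    """Translate the original sentence and each variant.

    translate: callable mapping a source string to an output string
    (a placeholder for black-box access to the target model).
    Returns (original_output, [(variant, variant_output), ...]).
    """
    original_output = translate(sentence)
    variant_outputs = [(v, translate(v)) for v in drop_one_word_variants(sentence)]
    return original_output, variant_outputs

# Toy usage: an upper-casing function stands in for the MT API.
base, outputs = probe_with_variants("a small example", str.upper)
```

Each probe sentence now costs one API call per token in addition to the original query, which is a practical consideration for this kind of attack.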

In the current results Alice seems to be safe, so we did not have an adversarial arms race situation. However, if in the future Bob has a better chance of inferring membership, Alice can consider protection methods to mitigate the attack. For example, Alice may subsample the datasets when assembling the training data, or regularize the model to reduce the influence of individual samples. On the other hand, we can also consider how Bob could make the attack more successful, for example by adding “watermark sentences” with distinguishable characteristics that influence the Alice model and make membership inference easier.

In our experiment, the performance of the Alice and Bob MT models turned out to be similar (in terms of BLEU score), and we expect the difference in translation quality to be small. However, in practical scenarios Bob is not guaranteed to be able to create shadow models of the same standard, nor to verify how well they perform compared to the Alice model. If the difference between the Alice and Bob models is large, the attack can be unsuccessful even if the classifier performs well on evaluation with Bob’s data.

In addition to the differences in data and model architecture between Alice and Bob, a difference in the available computational resources may make it difficult for Bob to achieve successful results. Bob may not have computational resources comparable to Alice's (e.g., GPUs to train a very large neural network). This may cause Bob’s models to perform worse than the Alice model. Even when Bob has enough resources, time constraints may prevent him from training the models long enough. Bob may also need to train a number of models for the attack, which requires even more resources than Alice, who needs just a single model.

8 Conclusion

We formalized the problem of membership inference attacks on sequence generation tasks, and used Machine Translation as an example to investigate the feasibility of a privacy attack. Our results show that, unlike the attacks on standard classification problems shown in Shokri et al. (2017), the attack is much harder in the sequence generation setting. For standard classification, the attacker exploits the difference in the output distribution of class labels returned by the target model API. For sequence generation, on the other hand, it is not possible to provide such information because the output space of the sequence is exponential in the output length, and we expect that the lack of such information makes the attack more difficult. We did also consider the model score for the attack; however, this extra information did not improve the result. Thus, one of our conclusions is that currently Bob is not able to perform sentence-level membership inference attacks with an accuracy rate above chance.

However, this does not mean that Alice has no risk of leaking private information. Our analyses show that Bob’s accuracy on out-of-domain and out-of-vocabulary data is above chance, suggesting that attacks may be feasible in conditions where unseen words and sentences cause the model to behave slightly differently.

Our attack approach was a simple one, using shadow models to mimic the target model. We can attempt more complex attacks, for example, by modifying and translating the same sentence multiple times and observing how the translation changes. As future work, we also plan to use some of the metrics proposed by Carlini et al. (2018) as features for Bob; they show how recurrent models might unintentionally memorize rare or unique sequences in the training data, and propose a method to measure such effects.

We used Machine Translation as an example; however, the formulation is applicable to other kinds of sequence generation models, such as text summarization, video captioning, and speech synthesis, and our findings from the experiments may be useful for such cases as well.


References

  • Bojar et al. (2017) Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214, Copenhagen, Denmark. Association for Computational Linguistics.
  • Bojar et al. (2016) Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, pages 131–198, Berlin, Germany. Association for Computational Linguistics.
  • Bojar et al. (2018) Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303, Brussels, Belgium. Association for Computational Linguistics.
  • Carlini et al. (2018) Nicholas Carlini, Chang Liu, Jernej Kos, Úlfar Erlingsson, and Dawn Xiaodong Song. 2018. The secret sharer: Measuring unintended neural network memorization & extracting secrets. CoRR, abs/1802.08232.
  • Duh (2018) Kevin Duh. 2018. The multitarget TED talks task.
  • Dwork (2008) Cynthia Dwork. 2008. Differential privacy: A survey of results. In Theory and Applications of Models of Computation, pages 1–19, Berlin, Heidelberg. Springer Berlin Heidelberg.
  • Hayes et al. (2017) Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. 2017. LOGAN: evaluating privacy leakage of generative models using generative adversarial networks. CoRR, abs/1705.07663.
  • Hieber et al. (2017) Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, and Matt Post. 2017. Sockeye: A toolkit for neural machine translation. CoRR, abs/1712.05690.
  • Junczys-Dowmunt (2018) Marcin Junczys-Dowmunt. 2018. Dual conditional cross-entropy filtering of noisy parallel corpora. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 888–895, Brussels, Belgium. Association for Computational Linguistics.
  • Khayrallah and Koehn (2018) Huda Khayrallah and Philipp Koehn. 2018. On the impact of various types of noise on neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 74–83, Melbourne, Australia. Association for Computational Linguistics.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
  • Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.
  • Koehn and Knowles (2017) Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation.
  • Lin and Och (2004) Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Main Volume, pages 605–612, Barcelona, Spain.
  • Long et al. (2018) Yunhui Long, Vincent Bindschaedler, Lei Wang, Diyue Bu, Xiaofeng Wang, Haixu Tang, Carl A. Gunter, and Kai Chen. 2018. Understanding membership inferences on well-generalized learning models. CoRR, abs/1802.04889.
  • Machanavajjhala et al. (2017) Ashwin Machanavajjhala, Xi He, and Michael Hay. 2017. Differential privacy in the wild: A tutorial on current practices & open challenges. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, pages 1727–1730, New York, NY, USA. ACM.
  • Nasr et al. (2018) Milad Nasr, Reza Shokri, and Amir Houmansadr. 2018. Machine learning with membership privacy using adversarial regularization. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS ’18, pages 634–646, New York, NY, USA. ACM.
  • Nasr et al. (2019) Milad Nasr, Reza Shokri, and Amir Houmansadr. 2019. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In 2019 IEEE Symposium on Security and Privacy (SP).
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
  • Pyrgelis et al. (2018) Apostolos Pyrgelis, Carmela Troncoso, and Emiliano De Cristofaro. 2018. Knock knock, who’s there? membership inference on aggregate location data. CoRR, abs/1708.06145.
  • Rahman et al. (2018) Md Atiqur Rahman, Tanzila Rahman, Robert Laganiere, Noman Mohammed, and Yang Wang. 2018. Membership inference attack against differentially private deep learning model. 11:61–79.
  • Salem et al. (2018) Ahmed Salem, Yonghui Zhang, Mathias Humbert, Mario Fritz, and Michael Backes. 2018. Ml-leaks: Model and data independent membership inference attacks and defenses on machine learning models. CoRR, abs/1806.01246.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
  • Shokri et al. (2017) R. Shokri, M. Stronati, C. Song, and V. Shmatikov. 2017. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18.
  • Tiedemann (2012) Jörg Tiedemann. 2012. Parallel data, tools and interfaces in opus. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA).
  • Truex et al. (2018) Stacey Truex, Ling Liu, Mehmet Emre Gursoy, Lei Yu, and Wenqi Wei. 2018. Towards demystifying membership inference attacks. CoRR, abs/1807.09173.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  • Veale et al. (2018) Michael Veale, Reuben Binns, and Lilian Edwards. 2018. Algorithms that remember: model inversion attacks and data protection law. 376.
  • Yeom et al. (2017) Samuel Yeom, Matt Fredrikson, and Somesh Jha. 2017. The unintended consequences of overfitting: Training data inference attacks. CoRR, abs/1709.01604.