Abstract
We present a mechanism to compute a sketch (succinct summary) of how acomplex modular deep network processes its inputs. The sketch summarizesessential information about the inputs and outputs of the network and can beused to quickly identify key components and summary statistics of the inputs.Furthermore, the sketch is recursive and can be unrolled to identifysubcomponents of these components and so forth, capturing a potentiallycomplicated DAG structure. These sketches erase gracefully; even if we erase afraction of the sketch at random, the remainder still retains the `highweight'information present in the original sketch. The sketches can also be organizedin a repository to implicitly form a `knowledge graph'; it is possible toquickly retrieve sketches in the repository that are related to a sketch ofinterest; arranged in this fashion, the sketches can also be used to learnemerging concepts by looking for new clusters in sketch space. Finally, in thescenario where we want to learn a ground truth deep network, we show thataugmenting input/output pairs with these sketches can theoretically make iteasier to do so.
Quick Read (beta)
Recursive Sketches for Modular Deep Learning
Abstract
We present a mechanism to compute a sketch (succinct summary) of how a complex modular deep network processes its inputs. The sketch summarizes essential information about the inputs and outputs of the network and can be used to quickly identify key components and summary statistics of the inputs. Furthermore, the sketch is recursive and can be unrolled to identify subcomponents of these components and so forth, capturing a potentially complicated DAG structure. These sketches erase gracefully; even if we erase a fraction of the sketch at random, the remainder still retains the “highweight” information present in the original sketch. The sketches can also be organized in a repository to implicitly form a “knowledge graph”; it is possible to quickly retrieve sketches in the repository that are related to a sketch of interest; arranged in this fashion, the sketches can also be used to learn emerging concepts by looking for new clusters in sketch space. Finally, in the scenario where we want to learn a ground truth deep network, we show that augmenting input/output pairs with these sketches can theoretically make it easier to do so.
calc \usetikzlibrarypositioning \usetikzlibraryshapes.geometric
1 Introduction
Machine learning has leveraged our understanding of how the brain functions to devise better algorithms. Much of classical machine learning focuses on how to correctly compute a function; we utilize the available data to make more accurate predictions. More recently, lines of work have considered other important objectives as well: we might like our algorithms to be small, efficient, and robust. This work aims to further explore one such subquestion: can we design a system on top of neural nets that efficiently stores information?
Our motivating example is the following everyday situation. Imagine stepping into a room and briefly viewing the objects within. Modern machine learning is excellent at answering immediate questions about this scene: “Is there a cat? How big is said cat?” Now, suppose we view this room every day over the course of a year. Humans can reminisce about the times they saw the room: “How often did the room contain a cat? Was it usually morning or night when we saw the room?”; can we design systems that are also capable of efficiently answering such memorybased questions?
Our proposed solution works by leveraging an existing (already trained) machine learning model to understand individual inputs. For the sake of clarity of presentation, this base machine learning model will be a modular deep network.^{1}^{1} 1 Of course, it is possible to cast many models as deep modular networks by appropriately separating them into modules. We then augment this model with sketches of its computation. We show how these sketches can be used to efficiently answer memorybased questions, despite the fact that they take up much less memory than storing the entire original computation.
A modular deep network consists of several independent neural networks (modules) which only communicate via one’s output serving as another’s input. Figure 1 presents a cartoon depiction of a modular network for our example. Modular networks have both a biological basis (Azam, 2000) and evolutionary justification (Clune et al., 2013). They have inspired several practical architectures such as neural module networks (Andreas et al., 2016; Hu et al., 2017), capsule neural networks (Hinton et al., 2000, 2011; Sabour et al., 2017), and PathNet (Fernando et al., 2017) and have connections to suggested biological models of intelligence such as hierarchical temporal memory (Hawkins & Blakeslee, 2007). We choose them as a useful abstraction to avoid discussing specific network architectures.
What do these modules represent in the context of our room task? Since we are searching for objects in the room, we think of each module as attempting to identify a particular type of object, from the low level edge to the high level cat. For the reader familiar with convolutional neural networks, it may help to think of each module as a filter or kernel. We denote the event where a module produces output as an object, the produced output vector as an attribute vector, and all objects produced by a module as a class. In fact, our sketching mechanism will detail how to sketch each object, and these sketches will be recursive, taking into account the sketches corresponding to the inputs of the module.
Armed with this view of modular networks for image processing, we now entertain some possible sketch properties. One basic use case is that from the output sketch, we should be able to say something about the attribute vectors of the objects that went into it. This encompasses a question like “How big was the cat?” Note that we need to assume that our base network is capable of answering such questions, and the primary difficulty stems from our desire to maintain this capability despite only keeping a small sketch. Obviously, we should not be able to answer such questions as precisely as we could with the original input. Hence a reasonable attribute recovery goal is to be able to approximately recover an object’s attribute vector from its sketch. Concretely, we should be able to recover cat attributes from our cat sketch.
Since we plan to make our sketches recursive, we would actually like to recover the cat attributes from the final output sketch. Can we recover the cat’s sketch from final sketch? Again, we have the same tugofwar with space and should expect some loss from incorporating information from many sketches into one. Our recursive sketch property is that we can recover (a functional approximation of) a sketch from a later sketch which incorporated it.
Zooming out to the entire scene, one fundamental question is “Have I been in this room before?” In other words, comparing sketches should give some information about how similar the modules and inputs that generated them. In the language of sketches, our desired sketchtosketch similarity property is that two completely unrelated sketches will be dissimilar while two sketches that share common modules and similar objects will be similar.
Humans also have the impressive ability to recall approximate counting information pertaining to previous encounters (Brannon & Roitman, 2003). The summary statistics property states that from a sketch, we can approximately recover the number of objects produced by a particular module, as well as their mean attributes.
Finally, we would like the nuance of being able to forget information gradually, remembering more important facts for longer. The graceful erasure property expresses this idea: if (a known) portion of our sketch is erased, then the previous properties continue to hold, but with additional noise depending on the amount erased.
We present a sketching mechanism which simultaneously satisfies all these desiderata: attribute recovery, recursive sketch, sketchtosketch similarity, summary statistics, and graceful erasure.
1.1 Potential Applications
We justify our proposed properties by describing several potential applications for a sketching mechanism satisfying them.
Sketch Repository. The most obvious use case suggested by our motivating example is to maintain a repository of sketches produced by the (augmented) network. By the recursive sketch and attribute recovery properties, we can query our repository for sketches with objects similar to a particular object. From the sketchtosketch similarity property, we can think of the sketches as forming a knowledge graph; we could cluster them to look for patterns in the data or use localitysensitive hashing to quickly fish out related sketches. We can combine these overall sketches once more and use the summary statistics property to get rough statistics of how common certain types of objects are, or what a typical object of a class looks like. Our repository may follow a generalization of a “least recently used” policy, taken advantage of the graceful erasure property to only partially forget old sketches.
Since our sketches are recursively defined, such a repository is not limited to only keeping overall sketches of inputs; we can also store lowerlevel object sketches. When we perform object attribute recovery using the attribute recovery property, we can treat the result as a fingerprint to search for a more accurate sketch of that object in our repository.
Learning New Modules. This is an extension of the sketch repository application above. Rather than assuming we start with a pretrained modular network, we start with a barebones modular network and want to grow it. We feed it inputs and examine the resulting sketches for emerging clusters. When we find such a cluster, we use its points to train a corresponding (reversible) module. In fact, this clustering may be done in a hierarchical fashion to get finer classes. For example, we may develop a module corresponding to “cat” based on a coarse cluster of sketches which involve cats. We may get subclusters based on cat species or even finer subclusters based on individual cats repeatedly appearing.
The modules produced by this procedure can capture important primitives. For example, the summary statistics property implies that a singlelayer module could count the number of objects produced by another module. If the modules available are complex enough to capture curves, then this process could potentially allow us to identify different types of motion.
Interpretability. Since our sketches store information regarding how the network processed an input, they can help explain how that network came to its final decision. The sketchtosketch similarity property tells us when the network thinks two rooms are similar, but what exactly did the network find similar? Using the recursive sketch property, we can drill down into two top level sketches, looking for the pairs of objects that the network found similar.
Model Distillation. Model distillation (Buciluǎ et al., 2006; Hinton et al., 2015) is a popular technique for improving machine learning models. The use case of model distillation is as follows. Imagine that we train a model on the original data set, but it is unsatisfactory in some regard (e.g. too large or ensemble). In model distillation, we take this first model and use its features to train a second model. These stepping stones allows the second model to be smaller and less complex. The textbook example of feature choice is to use class probabilities (e.g. the object is $90\%$ A and $10\%$ B) versus the exact class found in the original input (e.g. the object is of class A). We can think of this as an intermediate option between the two extremes of (i) providing just the original data set again and (ii) providing all activations in the first model. Our sketches are an alternative option. To showcase the utility of our sketches in the context of learning from a teacher network, we prove that (under certain assumptions) augmenting the teacher network with our sketches allows us to learn the network. Note that in general, it is not known how the student can learn solely from many input/output pairs.
1.2 Techniques
The building block of our sketching mechanism is applying a random matrix to a vector. We apply such transformations to object attribute vectors to produce sketches, and then recursively to merge sketches. Consequently, most of our analysis revolves around proving and utilizing properties of random matrices. In the case where we know the matrices used in the sketch, then we can use an approximate version of isometry to argue about how noise propagates through our sketches. On the other hand, if we do not know the matrices involved, then we use sparse recovery/dictionary learning techniques to recover them. Our recovery procedure is at partially inspired by several works on sparse coding and dictionary learning (Spielman et al., 2012; Arora et al., 2014c), (Arora et al., 2014a) and (Arora et al., 2015) as well as their known theoretical connections to neural networks (Arora et al., 2014b; Papyan et al., 2018). We have more requirements on our procedure than previous literature, so we customdesign a distribution of blockrandom matrices reminiscent of the sparse JohnsonLindenstrauss transform. Proving procedure correctness is done via probabilistic inequalities and analysis tools, including the Khintchine inequality and the HansonWright inequality.
One could consider an alternate approach based on the serialization of structured data (for example, protocol buffers). Compared to this approach, ours provides more finegrained control over the accuracy versus space tradeoff and aligns more closely with machine learning models (it operates on real vectors).
1.3 Related Work
There have been several previous attempts at adding a notion of memory to neural networks. There is a line of work which considers augmenting recurrent neural networks with external memories (Graves et al., 2014; Sukhbaatar et al., 2015; Joulin & Mikolov, 2015; Grefenstette et al., 2015; Zaremba & Sutskever, 2015; Danihelka et al., 2016; Graves et al., 2016). In this line of work, the onus is on the neural network to learn how to write information to and read information from the memory. In contrast to this line of work, our approach does not attempt to use machine learning to manage memory, but rather demonstrates that proper use of random matrices suffices. Our results can hence be taken as theoretical evidence that this learning task is relatively easy.
Vector Symbolic Architectures (VSAs) (Smolensky, 1990; Gayler, 2004; Levy & Gayler, 2008) are a class of memory models which use vectors to represent both objects and the relationships between them. There is some conceptual overlap between our sketching mechanism and VSAs. For example, in Gayler’s MAP (Multiply, Add, Permute) scheme (Gayler, 2004), vector addition is used to represent superposition and permutation is used to encode quotations. This is comparable to how we recursively encode objects; we use addition to combine multiple input objects after applying random matrices to them. One key difference in our model is that the object attribute vectors we want to recover are provided by some pretrained model and we make no assumptions on their distribution. In particular, their possible correlation makes it necessary to first apply a random transformation before combining them. In problems where data is generated by simple programs, several papers (Wu et al., 2017; Yi et al., 2018; Ellis et al., 2018; Andreas et al., 2016; Oh et al., 2017) attempt to infer programs from the generated training data possibly in a modular fashion.
Our result on the learnability of a ground truth network is related to the recent line of work on learning smalldepth neural networks (Du et al., 2017b, a; Li & Yuan, 2017; Zhong et al., 2017a; Zhang et al., 2017; Zhong et al., 2017b). Recent research on neural networks has considered evolving neural network architectures for image classification (e.g., (Real et al., 2018)), which is conceptually related to our suggested Learning New Models application.
1.4 Organization
We review some basic definitions and assumptions in Section 2. An overview of our final sketching mechanism is given in Section 3 (Theorems 1, 2, 3). Section 4 gives a sequence of sketch prototypes leading up to our final sketching mechanism. In Section 5, we explain how to deduce the random parameters of our sketching mechanism from multiple sketches via dictionary learning. Further discussion and detailed proofs are provided in the appendices.
2 Preliminaries
In this section, we cover some preliminaries that we use throughout the remainder of the paper. Throughout the paper, we denote by $[n]$ the set $\{1,\mathrm{\dots},n\}$. We condense large matrix products with the following notation.
$\stackrel{\curvearrowright}{{\displaystyle \prod _{i=1}^{n}}}{R}_{i}$  $\u2254{R}_{1}{R}_{2}\mathrm{\cdots}{R}_{n}$  
$\stackrel{\curvearrowleft}{{\displaystyle \prod _{i=1}^{n}}}{R}_{i}$  $\u2254{R}_{n}\mathrm{\cdots}{R}_{2}{R}_{1}$ 
2.1 Modular Deep Networks for Object Detection
For this paper, a modular deep network consists of a collection of modules $\{M\}$. Each module is responsible for detecting a particular class of object, e.g. a cat module may detect cats. These modules use the outputs of other modules, e.g. the cat module may look at outputs of the edge detection module rather than raw input pixels. When the network processes a single input, we can depict which objects feed into which other objects as a (weighted) communication graph. This graph may be datadependent. For example, in the original Neural Module Networks of Andreas et al. (Andreas et al., 2016), the communication graph is derived by taking questions, feeding them through a parser, and then converting parse trees into structured queries. Note that this pipeline is disjoint from the modular network itself, which operates on images. In this paper, we will not make assumptions about or try to learn the process producing these communication graphs. When a module does appear in the communication graph, we say that it detected an object $\theta $ and refer to its output as ${x}_{\theta}$, the attribute vector of object $\theta $. We assume that these attribute vectors ${x}_{\theta}$ are nonnegative and normalized. As a consequence of this view of modular networks, a communication graph may have far more objects than modules. We let our input size parameter $N$ denote twice the maximum between the number of modules and the number of objects in any communication graph.
We assume we have access to this communication graph, and we use ${\theta}_{1},\mathrm{\dots},{\theta}_{k}$ to denote the $k$ input objects to object $\theta $. We assume we have a notion of how important each input object was; for each input ${\theta}_{i}$, there is a nonnegative weight ${w}_{i}$ such that ${\sum}_{i=1}^{k}{w}_{i}=1$. It is possible naively choose ${w}_{i}=1/k$, with the understanding that weights play a role in the guarantees of our sketching mechanism (it does a better job of storing information about highweight inputs).
We handle the networklevel input and output as follows. The input data can be used by modules, but it itself is not produced by a module and has no associated sketch. We assume the network has a special output module, which appears exactly once in the communication graph as its sink. It produces the output (pseudo)object, which has associated sketches like other objects but does not count as an object when discussing sketching properties (e.g. two sketches are not similar just because they must have output pseudoobjects). This output module and pseudoobject only matter for our overall sketch insofar as they group together highlevel objects.
2.2 Additional Notation for Sketching
We will (recursively) define a sketch for every object, but we consider our sketch for the output object of the network to be the overall sketch. This is the sketch that we want to capture how the network processed a particular input. One consequence is that if an object does not have a path to the output module in the communication graph, it will be omitted from the overall sketch.
Our theoretical results will primarily focus on a single path from an object $\theta $ to the output object. We define the “effective weight” of an object, denoted ${w}_{\theta}$, to be the product of weights along this path. For example, if an edge was 10% responsible for a cat and the cat was 50% of the output, then this edge has an effective weight of 5% (for this path through the cat object). We define the “depth” of an object, denoted $0pt$, to be the number of objects along this path. In the preceding example, the output object has a depth of one, the cat object has a depth of two, and the edge object has a depth of three. We make the technical assumption that all objects produced by a module must be at the same depth. As a consequence, modules $M$ also have a depth, denoted $0pt[M]$. To continue the example, the output module has depth one, the cat module has depth two, and the edge module has depth three. Furthermore, this implies that each object and module can be assigned a unique depth (i.e. all paths from any object produced by a module to the output object have the same number of objects).
Our recursive object sketches are all ${d}_{\text{sketch}}$ dimensional vectors. We assume that ${d}_{\text{sketch}}$ is at least the size of any object attribute vector (i.e. the output of any module). In general, all of our vectors are ${d}_{\text{sketch}}$dimensional; we assume that shorter object attribute vectors are zeropadded (and normalized). For example, imagine that the cat object $\theta $ is output with the threedimensional attribute vector $(0.6,0.0,0.8)$, but ${d}_{\text{sketch}}$ is five. Then we consider its attribute vector to be ${x}_{\theta}=(0.6,0.0,0.8,0.0,0.0)$.
Symbol  Meaning 

$b$  a bit in $\{\pm 1\}$ 
$d$  dimension size 
$\mathcal{D}$  distribution over random matrices 
$f$  a coin flip in $\{\pm 1\}$ 
$h$  depth (oneindexed) 
$H$  $3\cdot \text{maximum depth}$ 
$I$  identity matrix 
$k$  #inputs 
$\mathrm{\ell}$  $\mathrm{\ell}$th moment of a random variable 
$M$  module 
$N$  $3\cdot \mathrm{max}(\text{\#modules},\text{\#objects})$ 
$q$  block nonzero probability 
$R$  random matrix 
$s$  sketch 
$S$  #samples 
$w$  importance weight 
$x$  attribute vector / unknown dictionary learning vector 
$y$  dictionary learning sample vector 
$z$  dictionary learning noise 
$\delta $  error bound, isometry and desynchronization 
$\u03f5$  error bound, dictionary learning 
$\theta $  object 
3 Overview of Sketching Mechanism
In this section, we present our sketching mechanism, which attempts to summarize how a modular deep network understood an input. The mechanism is presented at a high level; see Section 4 for how we arrived at this mechanism and Appendix D for proofs of the results stated in this section.
3.1 Desiderata
As discussed in the introduction, we want our sketching mechanism to satisfy several properties, listed below.
Attribute Recovery. Object attribute vectors can be approximately recovered from the overall sketch, with additive error that decreases with the effective weight of the object.
SketchtoSketch Similarity. With high probability, two completely unrelated sketches (involving disjoint sets of modules) will have a small inner product. With high probability, two sketches that share common modules and similar objects will have a large inner product (increasing with the effective weights of the objects).
Summary Statistics. If a single module detects several objects, then we can approximately recover summary statistics about them, such as the number of such objects or their mean attribute vector, with additive error that decreases with the effective weight of the objects.
Graceful Erasure. Erasing everything but ${d}_{\text{sketch}}^{\prime}$prefix of the overall sketch yields an overall sketch with the same properties (albeit with larger error).
3.2 Random Matrices
The workhorse of our recursive sketching mechanism is random (square) matrices. We apply these to transform input sketches before incorporating them into the sketch of the current object. We will defer the exact specification of how to generate our random matrices to Section 5, but while reading this section it may be convenient for the reader to think of our random matrices as uniform random orthonormal matrices. Our actual distribution choice is similar with respect to the properties needed for the results in this section. One notable parameter for our family of random matrices is $\delta \ge 0$, which expresses how well they allow us to estimate particular quantities with high probability. For both our random matrices and uniform random orthonormal matrices, $\delta $ is $\stackrel{~}{O}(1/\sqrt{{d}_{\text{sketch}}})$. We prove various results about the properties of our random matrices in Appendix B.
We will use $R$ with a subscript to denote such a random matrix. Each subscript indicates a fresh draw of a matrix from our distribution, so every ${R}_{1}$ is always equal to every other ${R}_{1}$ but ${R}_{1}$ and ${R}_{\text{cat},0}$ are independent. Throughout this section, we will assume that we have access to these matrices when we are operating on a sketch to retrieve information. In Section 5, we show how to recover these matrices given enough samples.
3.3 The Sketching Mechanism
The basic building block of our sketching mechanism is the tuple sketch, which we denote ${s}_{\text{tuple}}$. As the name might suggest, its purpose is to combine $k$ sketches ${s}_{1},\mathrm{\dots},{s}_{k}$ with respective weights ${w}_{1},\mathrm{\dots},{w}_{k}\ge 0$ into a single sketch (for proofs, we will assume that these weights sum to at most one). In the case where the $k$ sketches are input sketches, these weights will be the importance weights from our network. Otherwise, they will all be $1/k$. Naturally, sketches with higher weight can be recovered more precisely. The tuple sketch is formally defined as follows. If their values are obvious from context, we will omit ${w}_{1},\mathrm{\dots},{w}_{k}$ from the arguments to the tuple sketch. The tuple sketch is computed as follows.^{2}^{2} 2 Rather than taking a convex combination of $\left(\frac{I+{R}_{i}}{2}\right){s}_{i}$, one might instead sample a few of them. Doing so would have repercussions on the results in this section, swapping many bounds from high probability to conditioning on their objects getting through. However, it also makes it easier to do the dictionary learning in Section 5.
$${s}_{\text{tuple}}({s}_{1},\mathrm{\dots},{s}_{k},{w}_{1},\mathrm{\dots},{w}_{k})\u2254\sum _{i=1}^{k}{w}_{i}\left(\frac{I+{R}_{i}}{2}\right){s}_{i}$$ 
Note that $I$ is the identity matrix, and we will define a tuple sketch of zero things to be the zero vector.
Each object $\theta $ is represented by a sketch, which naturally we will refer to as an object sketch. We denote such a sketch as ${s}_{\text{object}}$. We want the sketch of object $\theta $ to incorporate information about the attributes of $\theta $ itself as well as information about the inputs that produced it. Hence we also define two subsketches for object $\theta $. The attribute subsketch, denoted ${s}_{\text{attr}}$, is a sketch representation of object $\theta $’s attributes. The input subsketch, denoted ${s}_{\text{input}}$, is a sketch representation of the inputs that produced object $\theta $. These three sketches are computed as follows.
${s}_{\text{object}}(\theta )$  $\u2254\left(\frac{I+{R}_{M(\theta ),0}}{2}\right){s}_{\text{tuple}}({s}_{\text{attr}}(\theta ),{s}_{\text{input}}(\theta ),\frac{1}{2},\frac{1}{2})$  
${s}_{\text{attr}}(\theta )$  $\u2254\frac{1}{2}{R}_{M(\theta ),1}{x}_{\theta}+\frac{1}{2}{R}_{M(\theta ),2}{e}_{1}$  
${s}_{\text{input}}(\theta )$  $\u2254{s}_{\text{tuple}}({s}_{\text{object}}({\theta}_{1}),\mathrm{\dots},{s}_{\text{object}}({\theta}_{k}),{w}_{1},\mathrm{\dots},{w}_{k})$ 
Note that ${e}_{1}$ in the attribute subsketch is the first standard basis vector; we will use it for frequency estimation as part of our Summary Statistics property. Additionally, ${\theta}_{1},\mathrm{\dots},{\theta}_{k}$ in the input sketch are the input objects for $\theta $ and ${w}_{1},\mathrm{\dots},{w}_{k}$ are their importance weights from the network.
The overall sketch is just the output psuedoobject’s input subsketch. It is worth noting that we do not choose to use its object sketch.
$${s}_{\text{overall}}\u2254{s}_{\text{input}}(\text{output pseudoobject})$$ 
We want to use this to (noisily) retrieve the information that originally went into these sketches. We are now ready to present our results for this sketching mechanism. Note that we provide the technical machinery for proofs of the following results in Appendix A and we provide the proofs themselves in Appendix D. Our techniques involve using an approximate version of isometry and reasoning about the repeated application of matrices which satisfy our approximate isometry.
Our first result concerns Attribute Recovery (example use case: roughly describe the cat that was in this room).
Theorem 1.
[Simplification of Theorem 9] Our sketch has the simplified Attribute Recovery property, which is as follows. Consider an object ${\theta}^{\mathrm{\star}}$ at constant depth $h\mathit{}\mathrm{(}{\theta}^{\mathrm{\star}}\mathrm{)}$ with effective weight ${w}_{{\theta}^{\mathrm{\star}}}$.

(i)
Suppose no other objects in the overall sketch are also produced by $M({\theta}^{\star})$. We can produce a vector estimate of the attribute vector ${x}_{{\theta}^{\star}}$, which with high probability has at most $O(\delta /{w}_{{\theta}^{\star}})$ ${\mathrm{\ell}}_{\mathrm{\infty}}$error.

(ii)
Suppose we know the sequence of input indices to get to ${\theta}^{\star}$ in the sketch. Then even if other objects in the overall sketch are produced by $M({\theta}^{\star})$, we can still produce a vector estimate of the attribute vector ${x}_{{\theta}^{\star}}$, which with high probability has at most $O(\delta /{w}_{{\theta}^{\star}})$ ${\mathrm{\ell}}_{\mathrm{\infty}}$error.
As a reminder, ${\mathrm{\ell}}_{\mathrm{\infty}}$ error is just the maximum error over all coordinates. Also, note that if we are trying to recover a quantized attribute vector, then we may be able to recover our attribute vector exactly when the quantization is larger than our additive error. This next result concerns SketchtoSketch Similarity (example use case: judge how similar two rooms are). We would like to stress that the output psuedoobject does not count as an object for the purposes of (i) or (ii) in the next result.
Theorem 2.
[Simplification of Theorem 10] Our sketch has the SketchtoSketch Similarity property, which is as follows. Suppose we have two overall sketches $s$ and $\overline{s}$.

(i)
If the two sketches share no modules, then with high probability they have at most $O(\delta )$ dotproduct.

(ii)
If $s$ has an object ${\theta}^{\star}$ of weight ${w}_{{\theta}^{\star}}$ and $\overline{s}$ has an object ${\overline{\theta}}^{\star}$ of weight ${\overline{w}}_{{\theta}^{\star}}$, both objects are produced by the same module, and both objects are at constant depth $h({\theta}^{\star})$ then with high probability they have at least $\mathrm{\Omega}({w}_{{\theta}^{\star}}{\overline{w}}_{{\overline{\theta}}^{\star}})O(\delta )$ dotproduct.

(iii)
If the two sketches are identical except that the attributes of any object $\theta $ differ by at most $\u03f5$ in ${\mathrm{\ell}}_{2}$ distance, then with probability one the two sketches are at most $\u03f5/2$ from each other in ${\mathrm{\ell}}_{2}$ distance.
Although our SketchtoSketch Similarity result is framed in terms of two overall sketches, the recursive nature of our sketches makes it not too hard to see how it also applies to any pair of sketches computed along the way. This is depicted in Figure 3.
This next result concerns Summary Statistics (example use case: how many walls are in this room).
Theorem 3.
[Simplification of Theorem 11] Our sketch has the simplified Summary Statistics property, which is as follows. Consider a module ${M}^{\mathrm{\star}}$ which lies at constant depth $h\mathit{}\mathrm{(}{M}^{\mathrm{\star}}\mathrm{)}$, and suppose all the objects produced by ${M}^{\mathrm{\star}}$ has the same effective weight ${w}^{\mathrm{\star}}$.

(i)
Frequency Recovery. We can produce an estimate of the number of objects produced by ${M}^{\star}$. With high probability, our estimate has an additive error of at most $O(\delta /{w}^{\star})$.

(ii)
Summed Attribute Recovery. We can produce a vector estimate of the summed attribute vectors of the objects produced by ${M}^{\star}$. With high probability, our estimate has at most $O(\delta /{w}^{\star})$ ${\mathrm{\ell}}_{\mathrm{\infty}}$error.
Again, note that quantization may allow for exact recovery. In this case, the number of objects is an integer and hence can be recovered exactly when the additive error is less than $1/2$. If we can recover the exact number of objects, we can also estimate the mean attribute vector.
4 The Road to a Sketch
In this section, we present a series of sketch prototypes which gradually grow more complex, capturing more of our desiderata while also considering more complex network possibilities. The final prototype is our actual sketch.
We begin by making the following simplifying assumptions: (i) each object is produced by a unique module, (ii) our network is of depth one (i.e. all objects are direct inputs to the output pseudoobject), and (iii) we can generate random orthonormal matrices $R$ to use in our sketches. In this section, we eventually relax all three assumptions to arrive at our final sketch, which was presented in Section 3. Orthonormal matrices are convenient for us due to the following two properties.
Isometry: If $x$ is a unit vector and $R$ is an orthonormal matrix, then $\left{\left\leftRx\right\right}_{2}^{2}{\left\leftx\right\right}_{2}^{2}\right=0$. One nice consequence of this fact is that ${R}^{T}$ is the inverse of $R$.
Desynchronization: If $x$ and $y$ are unit vectors and then $R$ is drawn to be a random orthonormal matrix, then $\left\u27e8Rx,y\u27e9\right\le O(\sqrt{\mathrm{log}N}/\sqrt{{d}_{\text{sketch}}})$ with probability at least $11/{N}^{\mathrm{\Omega}(1)}$ (i.e. with high probability).
When we later use block sparse random matrices, these properties are noisier and we need to carefully control the magnitude of the resulting noise. Roughly speaking, the absolute values in both definitions will be bounded by a parameter $\delta $ (the same $\delta $ as in Section 3). In Appendix A, we establish the technical machinery to control this noise.
Proof Ideas. Appendix C contains proof sketches of the claims in this section. At an even higher level, the key idea is as follows. Our sketch properties revolve around taking products of our sketches and random matrices. These products are made up of terms which in turn contain matrix products of the form ${R}_{2}^{T}{R}_{1}$. If these two random matrices are the same, then this is the identity matrix by Isometry. If they are independent, then this is a uniform random orthonormal matrix and by desynchronization everything involved in the product will turn into bounded noise. The sharp distinction between these two cases means that $Rx$ encodes the information of $x$ in a way that only applying ${R}^{T}$ will retrieve.
Prototype A: A Basic Sketch. Here is our first attempt at a sketch, which we will refer to as prototype A. We use A superscripts to distinguish the definitions for this prototype. The fundamental building block of all our sketches is the tuple sketch, which combines $k$ (weighted) sketches into a single sketch. Its input is $k$ sketches ${s}_{1},\mathrm{\dots},{s}_{k}$ along with their nonnegative weights ${w}_{1},\mathrm{\dots},{w}_{k}\ge 0$ which sum to one. For now, we just take the convex combination of the sketches:
$${s}_{\text{tuple}}^{A}({s}_{1},\mathrm{\dots},{s}_{k},{w}_{1},\mathrm{\dots},{w}_{k})\u2254\sum _{i=1}^{k}{w}_{i}{s}_{i}$$ 
Next, we describe how to produce the sketch of an object $\theta $. Due to assumption (ii), we don’t need to incorporate the sketches of input objects. We only need to take into account the object attribute vector. We will keep a random matrix ${R}_{M}$ for every module $M$ and apply it when computing object sketches:
$${s}_{\text{object}}^{A}(\theta )\u2254{R}_{M(\theta )}{x}_{\theta}$$ 
Finally, we will specially handle the sketch of our top level object so that it can incorporate its direct inputs (we will not bother with its attributes for now):
$${s}_{\text{overall}}^{A}\u2254{s}_{\text{tuple}}^{A}({s}_{\text{object}}^{A}({\theta}_{1}),\mathrm{\dots},{s}_{\text{object}}^{A}({\theta}_{k}))$$ 
where ${\theta}_{1},\mathrm{\dots},{\theta}_{k}$ are the direct inputs to the output pseudoobject and ${w}_{1},\mathrm{\dots},{w}_{k}$ are their importance weights from the network.
Our first result is that if we have the overall sketch ${s}_{\text{overall}}^{A}$, we can (approximately) recover the attribute vector for object $\theta $ by computing ${R}_{M(\theta )}^{T}{s}_{\text{overall}}^{A}$.
Claim 1.
Prototype A has the Attribute Recovery property: for any object ${\theta}^{\mathrm{\star}}$, with high probability (over the random matrices), its attribute vector ${x}_{{\theta}^{\mathrm{\star}}}$ can be recovered up to at most $O\mathit{}\mathrm{(}\mathrm{log}\mathit{}N\mathrm{/}\mathrm{(}{w}_{{\theta}^{\mathrm{\star}}}\mathit{}\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}\mathrm{)}\mathrm{)}$ ${\mathrm{\ell}}_{\mathrm{\infty}}$error.
Our next result is that overall sketches which have similar direct input objects will be similar:
Claim 2.
Prototype A has the SketchtoSketch Similarity property: (i) if two sketches share no modules, then with high probability they have at most $O\mathit{}\mathrm{(}\sqrt{\mathrm{log}\mathit{}N}\mathrm{/}\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}\mathrm{)}$ dotproduct, (ii) if two sketches share an object ${\theta}^{\mathrm{\star}}$, which has ${w}_{{\theta}^{\mathrm{\star}}}$ in the one sketch and ${\overline{w}}_{{\theta}^{\mathrm{\star}}}$ in the other, then with high probability they have at least ${w}_{{\theta}^{\mathrm{\star}}}\mathit{}{\overline{w}}_{{\theta}^{\mathrm{\star}}}\mathrm{}O\mathit{}\mathrm{(}\sqrt{\mathrm{log}\mathit{}N}\mathrm{/}\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}\mathrm{)}$ dotproduct, (iii) if two sketches have objects produced by identical modules and each pair of objects produced by the same module differ by at most $\u03f5$ in ${\mathrm{\ell}}_{\mathrm{2}}$ distance and are given equal weights, then they are guaranteed to be at most $\u03f5$ from each other in ${\mathrm{\ell}}_{\mathrm{2}}$ distance.
Prototype B: Handling Module Multiplicity. We will now relax assumption (i); a single module may appear multiple times in the communication graph, producing multiple objects. This assumption previously prevented the Summary Statistics property from making sense, but it is now up for grabs. Observe that with this assumption removed, computing ${R}_{M}^{T}{s}_{\text{overall}}^{A}$ now sums the attribute vectors of all objects produced by module $M$. As a consequence, we will now need a more nuanced version of our Attribute Recovery property, but we can leverage this fact for the Summary Statistics property. In order to estimate object frequencies, we will (in spirit) augment our object attribute vectors with a entry which is always one. In reality, we draw another random matrix for the module and use its first column.
$${s}_{\text{object}}^{B}(\theta )\u2254\frac{1}{2}{R}_{M(\theta ),1}{x}_{\theta}+\frac{1}{2}{R}_{M(\theta ),2}{e}_{1}$$ 
where ${e}_{1}$ is the first standard basis vector in ${d}_{\text{sketch}}$ dimensions.
We still need to fix attribute recovery; how can we recover the attributes of an object that shares a module with another object? We could apply another random matrix to every input in a tuple sketch, but then we would not be able to get summed statistics. The solution is to both apply another random matrix and apply the identity mapping. We will informally refer to $\left(\frac{I+R}{2}\right)$ as the transparent version of matrix $R$. The point is that when applied to a vector $x$, it both lets through information about $x$ and sets things up so that ${R}^{T}$ will be able to retrieve the same information about $x$. Our new tuple sketch is:
$${s}_{\text{tuple}}^{B}({s}_{1},\mathrm{\dots},{s}_{k},{w}_{1},\mathrm{\dots},{w}_{k})\u2254\sum _{i=1}^{k}{w}_{i}\left(\frac{I+{R}_{i}}{2}\right){s}_{i}$$ 
Our special overall sketch is computed as before.
$${s}_{\text{overall}}^{B}\u2254{s}_{\text{tuple}}^{B}({s}_{\text{object}}^{B}({\theta}_{1}),\mathrm{\dots},{s}_{\text{object}}^{B}({\theta}_{k}))$$ 
Here is our more nuanced attribute recovery claim.
Claim 3.
Prototype B has the Attribute Recovery property, which is as follows. Consider an object ${\theta}^{\mathrm{\star}}$ with weight ${w}_{{\theta}^{\mathrm{\star}}}$. Then (i) if no other objects are produced by the same module in the sketch, with high probability we can recover ${w}_{\theta}$ up to $O\mathit{}\mathrm{(}\frac{\sqrt{\mathrm{log}\mathit{}N}}{{w}_{{\theta}^{\mathrm{\star}}}\mathit{}\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}}\mathrm{)}$ ${\mathrm{\ell}}_{\mathrm{\infty}}$error and (ii) even if other objects are output by the same module, if we know ${\theta}^{\mathrm{\star}}$’s index in the overall sketch, with high probability we can recover ${w}_{\theta}$ up to $O\mathit{}\mathrm{(}\frac{\sqrt{\mathrm{log}\mathit{}N}}{{w}_{{\theta}^{\mathrm{\star}}}\mathit{}\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}}\mathrm{)}$ ${\mathrm{\ell}}_{\mathrm{\infty}}$error.
Sketches are still similar, and due to the ${e}_{1}$ we inserted into the object sketch, two objects produced by the same module now always have a base amount of similarity.
Claim 4.
Prototype B has the SketchtoSketch Similarity property, just as in Claim 2, except in case (ii) the dotproduct is at least $\frac{\mathrm{1}}{\mathrm{4}}\mathit{}{w}_{{\theta}^{\mathrm{\star}}}\mathit{}{\overline{w}}_{{\theta}^{\mathrm{\star}}}\mathrm{}O\mathit{}\mathrm{(}\sqrt{\mathrm{log}\mathit{}N}\mathrm{/}\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}\mathrm{)}$, in case (iii) the sketches are at most $\u03f5\mathrm{/}\mathrm{2}$ from each other in ${\mathrm{\ell}}_{\mathrm{2}}$ distance, and we have an additional case (iv) if one sketch has an object ${\theta}^{\mathrm{\star}}$ of weight ${w}_{{\theta}^{\mathrm{\star}}}$ and the other sketch has an object ${\overline{\theta}}^{\mathrm{\star}}$ of weight ${\overline{w}}_{{\overline{\theta}}^{\mathrm{\star}}}$ and both objects are produced by the same module, then with high probability they have at least $\frac{\mathrm{1}}{\mathrm{16}}\mathit{}{w}_{{\theta}^{\mathrm{\star}}}\mathit{}{\overline{w}}_{{\theta}^{\mathrm{\star}}}\mathrm{}O\mathit{}\mathrm{(}\sqrt{\mathrm{log}\mathit{}N}\mathrm{/}\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}\mathrm{)}$.
Finally, we can recover some summary statistics.
Claim 5.
Prototype B has the Summary Statistics property, which is as follows. Consider a module ${M}^{\mathrm{\star}}$, and suppose all objects produced by ${M}^{\mathrm{\star}}$ have the same weight in the overall sketch. Then (i) we can estimate the number of objects produced by ${M}^{\mathrm{\star}}$ up to an additive $O\mathit{}\mathrm{(}\frac{\sqrt{\mathrm{log}\mathit{}N}}{{w}^{\mathrm{\star}}\mathit{}\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}}\mathrm{)}$ and (ii) we can estimate the summed attribute vector of objects produced by ${M}^{\mathrm{\star}}$ up to $O\mathit{}\mathrm{(}\frac{\sqrt{\mathrm{log}\mathit{}N}}{{w}^{\mathrm{\star}}\mathit{}\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}}\mathrm{)}$ ${\mathrm{\ell}}_{\mathrm{\infty}}$error.
Last Steps: Handling Deep Networks. We conclude by describing how to get from prototype B to the sketch we presented in Section 3. We have two assumptions left to relax: (ii) and (iii). We gloss over how to relax assumption (iii) here, since the push back comes from the requirements of dictionary learning in Section 5, and the exact choice of random matrices is not so illuminating for this set of results. Instead, we focus on how to relax assumption (ii) and make our sketches recursive. As a first point, we’ll be keeping the tuple sketch from prototype B:
$${s}_{\text{tuple}}({s}_{1},\mathrm{\dots},{s}_{k},{w}_{1},\mathrm{\dots},{w}_{k})\u2254\sum _{i=1}^{k}{w}_{i}\left(\frac{I+{R}_{i}}{2}\right){s}_{i}\phantom{\rule{veryverythickmathspace}{0ex}}(={s}_{\text{tuple}}^{B}({s}_{1},\mathrm{\dots},{s}_{k},{w}_{1},\mathrm{\dots},{w}_{k}))$$ 
The main difference is that we now want object sketches to additionally capture information about their input objects. We define two important subsketches for each object, one to capture its attributes and one to capture its inputs. The attribute subsketch of an object what was the object sketch in prototype B:
$${s}_{\text{attr}}(\theta )\u2254\frac{1}{2}{R}_{M(\theta ),1}{x}_{\theta}+\frac{1}{2}{R}_{M(\theta ),2}{e}_{1}\phantom{\rule{veryverythickmathspace}{0ex}}(={s}_{\text{object}}^{B}(\theta ))$$ 
The input subsketch is just a tuple sketch:
$${s}_{\text{input}}(\theta )\u2254{s}_{\text{tuple}}({s}_{\text{object}}({\theta}_{1}),\mathrm{\dots},{s}_{\text{object}}({\theta}_{k}),{w}_{1},\mathrm{\dots},{w}_{k})$$ 
The object sketch just tuple sketches together the two subsketches and applies a transparent matrix. The idea is that we might both want to query over the input object’s module, or both the current object’s module and the input object’s module (e.g. summary statistics about all edges versus just edges that wound up in a cat). The transparent matrix means the information gets encoded both ways.
$${s}_{\text{object}}(\theta )\u2254\left(\frac{I+{R}_{M(\theta ),0}}{2}\right){s}_{\text{tuple}}({s}_{\text{attr}}(\theta ),{s}_{\text{input}}(\theta ),\frac{1}{2},\frac{1}{2})$$ 
One nice consequence of all this is that we can now stop specialcasing the overall sketch, since we have defined the notion of input subsketches. Note that we do not want the overall sketch to be the output pseudoobject’s object sketch, because that would incorporate a ${R}_{\text{output},2}{e}_{1}$, which in turn gives all sketches a base amount of dotproduct similarity that breaks the current SketchtoSketch Similarity property.
$${s}_{\text{overall}}\u2254{s}_{\text{input}}(\text{output pseudoobject})$$ 
5 Learning Modular Deep Networks via Dictionary Learning
In this section, we demonstrate that our sketches carry enough information to learn the network used to produce them. Specifically, we develop a method for training a new network based on (input, output, sketch) triples obtained from a teacher modular deep network. Our method is powered by a novel dictionary learning procedure. Loosely speaking, dictionary learning tries to solve the following problem. There is an unknown dictionary matrix $R$, whose columns are typically referred to as atoms. There is also a sequence of unknown sparse vectors ${x}^{(k)}$; we only observe how they combine the atoms, i.e., $\{{y}^{(k)}=R{x}^{(k)}\}$. We want to use these observations ${y}^{(k)}$ to recover the dictionary matrix $R$ and the original sparse vectors ${x}^{(k)}$.
While dictionary learning has been wellstudied in the literature from both an applied and a theoretical standpoint, our setup differs from known theoretical results in several key aspects. The main complication is that since we want to apply our dictionary learning procedure recursively, our error in recovering the unknown vectors ${x}^{(k)}$ will become noise on the observations ${y}^{(k)}$ in the very next step. Note that classical dictionary learning can only recover the unknown vectors ${x}^{(k)}$ up to permutation and coordinatewise sign. To do better, we will carefully engineer our distribution of dictionary matrices $R$ to allow us to infer the permutation (between the columns of a matrix) and signs, which is needed to recurse. Additionally, we want to allow very general unknown vectors ${x}^{(k)}$. Rather than assuming sparsity, we instead make the weaker assumption that they have bounded ${\mathrm{\ell}}_{2}$ norm. We also allow for distributions of ${x}^{(k)}$ which make it impossible to recover all columns of $R$; we simply recover a subset of essential columns.
With this in mind, our desired dictionary learning result is Theorem 4. We plan to apply this theorem with $N$ as thrice the maximum between the number of modules and the number of objects in any communication graph, $\underset{~}{S}$ as the number of samples necessary to learn a module, and $H$ as three times the depth of our modular network. Additionally, we think of ${\u03f5}_{1}$ as the tolerable input ${\mathrm{\ell}}_{\mathrm{\infty}}$ error while ${\u03f5}_{H}$ is the desired output ${\mathrm{\ell}}_{\mathrm{\infty}}$ error after recursing $H$ times.
Theorem 4.
[Recursable Dictionary Learning] There exists a family of distributions $\mathrm{\{}\mathrm{D}\mathit{}\mathrm{(}b\mathrm{,}q\mathrm{,}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{)}\mathrm{\}}$ which produce ${d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{\times}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}$ matrices satisfying the following. For any positive integer $N$, positive constant $H$, positive real ${\u03f5}_{H}$, base number of samples $S\mathrm{\le}\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}\mathit{}\mathrm{(}N\mathrm{)}$, block size $b\mathrm{\ge}\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}\mathit{}\mathrm{(}\mathrm{log}\mathit{}N\mathrm{,}\mathrm{log}\mathit{}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{,}\mathrm{1}\mathrm{/}{\u03f5}_{H}\mathrm{)}$, nonzero block probability $q\mathrm{\ge}\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}\mathit{}\mathrm{(}\mathrm{log}\mathit{}N\mathrm{,}\mathrm{log}\mathit{}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{,}\mathrm{1}\mathrm{/}{\u03f5}_{H}\mathrm{)}\mathrm{/}\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}$, and dimension ${d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{\ge}\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}\mathit{}\mathrm{(}\mathrm{1}\mathrm{/}{\u03f5}_{H}\mathrm{,}\mathrm{log}\mathit{}N\mathrm{,}\mathrm{log}\mathit{}S\mathrm{)}$, there exists a sequence of ${\mathrm{\ell}}_{\mathrm{\infty}}$ errors $$ with ${\u03f5}_{\mathrm{1}}\mathrm{\ge}\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}\mathit{}\mathrm{(}{\u03f5}_{H}\mathrm{)}$, such that the following is true. For any $h\mathrm{\in}\mathrm{[}H\mathrm{}\mathrm{1}\mathrm{]}$, let ${S}_{h}\mathrm{=}S\mathit{}{N}^{h\mathrm{}\mathrm{1}}$. For any unknown vectors ${x}^{\mathrm{(}\mathrm{1}\mathrm{)}}\mathrm{,}\mathrm{\dots}\mathrm{,}{x}^{\mathrm{(}{S}_{h}\mathrm{)}}\mathrm{\in}{\mathrm{R}}^{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{\times}N}$ with ${\mathrm{\ell}}_{\mathrm{2}}$norm at most $O\mathit{}\mathrm{(}\mathrm{1}\mathrm{)}$, if we draw ${R}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{R}_{N}\mathrm{\sim}\mathrm{D}\mathit{}\mathrm{(}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{)}$ and receive ${S}_{h}$ noisy samples ${y}^{\mathrm{(}k\mathrm{)}}\mathrm{\u2254}\mathrm{[}{R}_{\mathrm{1}}\mathit{}{R}_{\mathrm{2}}\mathit{}\mathrm{\cdots}\mathit{}{R}_{N}\mathrm{]}\mathit{}{x}^{\mathrm{(}k\mathrm{)}}\mathrm{+}{z}_{\mathrm{1}}^{\mathrm{(}k\mathrm{)}}\mathrm{+}{z}_{\mathrm{\infty}}^{\mathrm{(}k\mathrm{)}}$ where each ${z}_{\mathrm{1}}^{\mathrm{(}k\mathrm{)}}\mathrm{\in}{\mathrm{R}}^{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}$ is noise with ${\mathrm{\left}\mathrm{\left}{z}_{\mathrm{1}}^{\mathrm{(}k\mathrm{)}}\mathrm{\right}\mathrm{\right}}_{\mathrm{1}}\mathrm{\le}O\mathit{}\mathrm{(}\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}\mathrm{)}$ (independent of our random matrices) and each ${z}_{\mathrm{\infty}}^{\mathrm{(}k\mathrm{)}}\mathrm{\in}{\mathrm{R}}^{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}$ is noise with ${\mathrm{\left}\mathrm{\left}{z}_{\mathrm{\infty}}^{\mathrm{(}k\mathrm{)}}\mathrm{\right}\mathrm{\right}}_{\mathrm{\infty}}\mathrm{\le}{\u03f5}_{h}$ (also independent of our random matrices), then there is an algorithm which takes as input $h$, ${y}^{\mathrm{(}\mathrm{1}\mathrm{)}}$, …, ${y}^{\mathrm{(}{S}_{h}\mathrm{)}}$, runs in time $\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}\mathit{}\mathrm{(}{S}_{h}\mathrm{,}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{)}$, and with high probability outputs ${\widehat{R}}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\widehat{R}}_{N}\mathrm{,}{\widehat{x}}^{\mathrm{(}\mathrm{1}\mathrm{)}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\widehat{x}}^{\mathrm{(}{S}_{h}\mathrm{)}}$ satisfying the following for some permutation $\pi $ of $\mathrm{[}N\mathrm{]}$:

•
for every $i\in [N]$ and $j\in [{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}]$, if there exists ${k}^{\star}\in [{S}_{h}]$ such that $\left{x}_{(i1){d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}+j}^{({k}^{\star})}\right\ge {\u03f5}_{h+1}$ then the ${j}^{th}$ column of ${\widehat{R}}_{\pi (i)}$ is $0.2{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}$close in Hamming distance from the ${j}^{th}$ column of ${R}_{i}$.

•
for every $k\in [{S}_{h}],i\in [N],j\in [{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}]$, $\left{x}_{(\pi (i)1){d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}+j}^{(k)}{x}_{(i1){d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}+j}^{(k)}\right\le {\u03f5}_{h+1}$.
5.1 Recursable Dictionary Learning Implies Network Learnability
We want to use Theorem 4 to learn a teacher modular deep network, but we need to first address an important issue. So far, we have not specified how the deep modular network decides upon its inputdependent communication graph. As a result, the derivation of the communication graph cannot be learned (it’s possibly an uncomputable function!). When the communication graph is a fixed tree (always the same arrangement of objects but with different attributes), we can learn it. Note that any fixed communication graph can be expanded into a tree; doing so is equivalent to not reusing computation. Regardless of the communication graph, we can learn the input/output behavior of each module regardless of whether the communication graph is fixed.
Theorem 5.
If the teacher deep modular network has constant depth, then any module ${M}^{\mathrm{\star}}$ which satisfies the following two properties:

•
(Robust Module Learnability) The module is learnable from $\left(\alpha =\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}(N)\right)$ input/output training pairs which have been corrupted by ${\mathrm{\ell}}_{\mathrm{\infty}}$ error at most a constant $\u03f5>0$.

•
(Sufficient Weight) In a $\left(\beta =\frac{1}{\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}(N)}\right)$fraction of the inputs, the module produces an object and all of the input objects to that object have effective weight at least $w$.
can, with high probability, be learned from $\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}\mathit{}\mathrm{(}N\mathrm{)}$ overall sketches of dimension $\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}\mathit{}\mathrm{(}\mathrm{1}\mathrm{/}w\mathrm{,}\mathrm{1}\mathrm{/}\u03f5\mathrm{)}\mathit{}{\mathrm{log}}^{\mathrm{2}}\mathit{}N$.
Suppose we additionally know that the communication graph is a fixed tree. We can identify the subgraph of objects which each have effective weight $w$ in a $\mathrm{\left(}\beta \mathrm{=}\frac{\mathrm{1}}{\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}\mathit{}\mathrm{(}N\mathrm{)}}\mathrm{\right)}$fraction of the inputs.
5.2 Recursable Dictionary Learning: Proof Outline
The main idea of our recursable dictionary learning algorithm is the following. The algorithm proceeds by iteratively examining each of the ${d}_{\text{sketch}}/b$ blocks of each of the ${S}_{h}$ received samples. For the $\mathrm{\ell}$th block of the $i$th sample, denoted by ${y}^{(i)}[(\mathrm{\ell}1)b+1:\mathrm{\ell}b+1]$, it tries to determine whether it is dominated by the contribution of a single $({\sigma}_{s},{\sigma}_{c},{\sigma}_{m})$ (and is not composed of the superposition of more than one of these). To do so, it first normalizes this block by its ${\mathrm{\ell}}_{1}$norm and checks if the result is close to a point on the Boolean hypercube in ${\mathrm{\ell}}_{\mathrm{\infty}}$ distance. If so, it rounds (the ${\mathrm{\ell}}_{1}$normalized version of) this block to the hypercube; we here denote the resulting rounded vector by ${\overline{y}}^{(i)}[(\mathrm{\ell}1)b+1:\mathrm{\ell}b+1]$. We then count the number of blocks ${\mathrm{\ell}}^{\prime}\in [{d}_{\text{sketch}}/b]$ whose rounding ${\overline{y}}^{(i)}[({\mathrm{\ell}}^{\prime}1)b+1:{\mathrm{\ell}}^{\prime}b+1]$ is close in Hamming distance to ${\overline{y}}^{(i)}[(\mathrm{\ell}1)b+1:\mathrm{\ell}b+1]$. If there are at least $0.8q{d}_{\text{sketch}}$ of these, we add the rounded block ${\overline{y}}^{(i)}[(\mathrm{\ell}1)b+1:\mathrm{\ell}b+1]$ to our collection (if it’s not close to an already added vector). We also cluster the added blocks in terms of their matrix signatures ${\sigma}_{m}$ in order to associate each of them to one of the matrices ${R}_{1},\mathrm{\dots},{R}_{N}$. Using the matrix signatures as well as the column signatures $\{{\sigma}_{c}\}$ allows us to recover all the essential columns of the original matrices. The absolute values of the $x$ coordinates can also be inferred by adding the ${\mathrm{\ell}}_{1}$norm of a block when adding its rounding to the collection. The signs of the $x$ coordinates can also be inferred by first inferring the true sign of the block and comparing it to the sign of the vector whose rounding was added to the collection.
5.3 The BlockSparse Distribution $\mathcal{D}$
We are now ready to define our distribution $\mathcal{D}$ on random (square) matrices $R$. It has the following parameters:

•
The block size $b\in {\mathbb{Z}}^{+}$ which must be a multiple of $3$ and at least $3\mathrm{max}(\lceil {\mathrm{log}}_{2}N\rceil ,\lceil {\mathrm{log}}_{2}{d}_{\text{sketch}}\rceil +3)$.

•
The block sampling probability $q\in [0,1]$; this is the probability that a block is nonzero.

•
The number of rows/columns ${d}_{\text{sketch}}\in {\mathbb{Z}}^{+}$. It must be a multiple of $b$, as each column is made out of blocks.
Each column of our matrix is split into ${d}_{\text{sketch}}/b$ blocks of size $b$. Each block contains three subblocks: a random string, the matrix signature, and the column signature. To specify the column signature, we define an encoding procedure $\mathrm{Enc}$ which maps two bits ${b}_{m},{b}_{s}$ and the column index into a $(b/3)$length sequence. $\mathrm{Enc}:{\{\pm 1\}}^{2}\times [{d}_{\text{sketch}}]\to {\{\pm \frac{1}{\sqrt{{d}_{\text{sketch}}q}}\}}^{b/3}$, which is presented as Algorithm 4. These two bits encode the parity of the other two subblocks, which will aid our dictionary learning procedure in recovery of the correct signs. The sampling algorithm which describes our distribution is presented as Algorithm ‣ 5.3.
[h]
\[email protected]@algorithmic[1]
\STATEInput: Column index $j\in [{d}_{\text{sketch}}]$, two bits ${b}_{m},{b}_{s}\in \{\pm 1\}$, .
\STATEOutput: A vector ${\sigma}_{c}\in {\{\pm \frac{1}{\sqrt{{d}_{\text{sketch}}q}}\}}^{b/3}$.
\[email protected]
\STATESet ${\sigma}_{c}$ to be the (zeroone) binary representation of $j1$ (a $\lceil {\mathrm{log}}_{2}{d}_{\text{sketch}}\rceil $dimensional vector).
\STATEReplace each zero in ${\sigma}_{c}$ with $1$ and each one in $\sigma $ with $+1$.
\STATEPrepend ${\sigma}_{c}$ with a $+1$, making it a $\left(\lceil {\mathrm{log}}_{2}{d}_{\text{sketch}}\rceil +1\right)$dimensional vector.
\STATEAppend ${\sigma}_{c}$ with $({b}_{m},{b}_{s})$, making it a $\left(\lceil {\mathrm{log}}_{2}{d}_{\text{sketch}}\rceil +3\right)$dimensional vector.
\STATEAppend ${\sigma}_{c}$ with enough $+1$ entries to make it a $(b/3)$dimensional vector. Note that $b\ge 3\left(\lceil {\mathrm{log}}_{2}{d}_{\text{sketch}}\rceil +3\right)$.
\STATEDivide each entry of ${\sigma}_{c}$ by $\sqrt{{d}_{\text{sketch}}q}$.
\RETURN${\sigma}_{c}$.
{algorithm}[H]
\[email protected]@algorithmic[1]
\STATEInput: Block size $b\in {\mathbb{Z}}^{+}$, block sampling probability $q\in [0,1]$, number of rows/columns ${d}_{\text{sketch}}$.
\STATEOutput: A matrix $R\in {\mathbb{R}}^{{d}_{\text{sketch}}\times {d}_{\text{sketch}}}$.
\[email protected]
\STATEInitialize $R$ to be the allzeros ${\mathbb{R}}^{{d}_{\text{sketch}}\times {d}_{\text{sketch}}}$ matrix.
\STATESample a “matrix signature” vector ${\sigma}_{m}$ from the uniform distribution over ${\{\pm \frac{1}{\sqrt{{d}_{\text{sketch}}q}}\}}^{b/3}$.
\FORcolumn $j\in [{d}_{\text{sketch}}]$
\STATESample a “random string” vector ${\sigma}_{s,j}$ from the uniform distribution over ${\{\pm \frac{1}{\sqrt{{d}_{\text{sketch}}q}}\}}^{b/3}$.
\FORblock $i\in [{d}_{\text{sketch}}/b]$
\STATESample three coin flips ${f}_{m,i,j},{f}_{s,i,j},{f}_{c,i,j}$ as independent Rademacher random variables (i.e. uniform over $\{\pm 1\}$).
\STATECompute two bits ${f}_{m,i,j}^{\prime}\u2254{f}_{m,i,j}\cdot \text{sign}({\sigma}_{m}[1])$ and ${f}_{s,i,j}^{\prime}\u2254{f}_{s,i,j}s\cdot \text{sign}({\sigma}_{s,j}[1])$.
\STATECompute a “column signature” vector ${\sigma}_{c,i,j}\u2254\mathrm{Enc}(j,{f}_{m,i,j}^{\prime},{f}_{s,i,j}^{\prime})$.
\STATESample ${\eta}_{i,j}$ to be a (zeroone) Bernoulli random variable which is one with probability $q$.
\IF${\eta}_{i,j}=1$
\STATESet $R[b(i1)+1:bi+1,j]$ to be ${f}_{s,i,j}\cdot {\sigma}_{s,j}$ concatenated with ${f}_{c,i,j}\cdot {\sigma}_{c,i,j}$ concatenated with ${f}_{m,i,j}\cdot {\sigma}_{m}$.
\ENDIF\ENDFOR\ENDFOR\RETURN$R$.
6 Conclusion
In this paper, we presented a sketching mechanism which captures several natural properties. Two potential areas for future work are our proposed use cases for this mechanism: adding a sketch repository to an existing modular network and using it to find new concepts, and using our sketches to learn a teacher network.References
 Andreas et al. (2016) Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39–48, 2016.
 Arora et al. (2014a) Arora, S., Bhaskara, A., Ge, R., and Ma, T. Provable bounds for learning some deep representations. In International Conference on Machine Learning, pp. 584–592, 2014a.
 Arora et al. (2014b) Arora, S., Bhaskara, A., Ge, R., and Ma, T. Provable bounds for learning some deep representations. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 584–592, Bejing, China, 22–24 Jun 2014b. PMLR. URL http://proceedings.mlr.press/v32/arora14.html.
 Arora et al. (2014c) Arora, S., Ge, R., and Moitra, A. New algorithms for learning incoherent and overcomplete dictionaries. In Conference on Learning Theory, pp. 779–806, 2014c.
 Arora et al. (2015) Arora, S., Ge, R., Ma, T., and Moitra, A. Simple, efficient, and neural algorithms for sparse coding. In Proceedings of The 28th Conference on Learning Theory, pp. 113–149, 2015.
 Azam (2000) Azam, F. Biologically inspired modular neural networks. PhD thesis, Virginia Tech, 2000.
 Brannon & Roitman (2003) Brannon, E. M. and Roitman, J. D. Nonverbal representations of time and number in animals and human infants. 2003.
 Buciluǎ et al. (2006) Buciluǎ, C., Caruana, R., and NiculescuMizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541. ACM, 2006.
 Clune et al. (2013) Clune, J., Mouret, J.B., and Lipson, H. The evolutionary origins of modularity. Proc. R. Soc. B, 280(1755):20122863, 2013.
 Danihelka et al. (2016) Danihelka, I., Wayne, G., Uria, B., Kalchbrenner, N., and Graves, A. Associative long shortterm memory. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, pp. 1986–1994. JMLR.org, 2016.
 Du et al. (2017a) Du, S. S., Lee, J. D., and Tian, Y. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017a.
 Du et al. (2017b) Du, S. S., Lee, J. D., Tian, Y., Poczos, B., and Singh, A. Gradient descent learns onehiddenlayer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017b.
 Ellis et al. (2018) Ellis, K., Ritchie, D., SolarLezama, A., and Tenenbaum, J. Learning to infer graphics programs from handdrawn images. In Advances in Neural Information Processing Systems, pp. 6059–6068, 2018.
 Fernando et al. (2017) Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A. A., Pritzel, A., and Wierstra, D. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
 Gayler (2004) Gayler, R. W. Vector symbolic architectures answer jackendoff’s challenges for cognitive neuroscience. arXiv preprint cs/0412059, 2004.
 Graves et al. (2014) Graves, A., Wayne, G., and Danihelka, I. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
 Graves et al. (2016) Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., GrabskaBarwińska, A., Colmenarejo, S. G., Grefenstette, E., Ramalho, T., Agapiou, J., et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471, 2016.
 Grefenstette et al. (2015) Grefenstette, E., Hermann, K. M., Suleyman, M., and Blunsom, P. Learning to transduce with unbounded memory. In Advances in neural information processing systems, pp. 1828–1836, 2015.
 Hanson & Wright (1971) Hanson, D. L. and Wright, F. T. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083, 1971.
 Hawkins & Blakeslee (2007) Hawkins, J. and Blakeslee, S. On intelligence: How a new understanding of the brain will lead to the creation of truly intelligent machines. Macmillan, 2007.
 Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 Hinton et al. (2000) Hinton, G. E., Ghahramani, Z., and Teh, Y. W. Learning to parse images. In Advances in neural information processing systems, pp. 463–469, 2000.
 Hinton et al. (2011) Hinton, G. E., Krizhevsky, A., and Wang, S. D. Transforming autoencoders. In International Conference on Artificial Neural Networks, pp. 44–51. Springer, 2011.
 Hoeffding (1963) Hoeffding, W. Probability inequalities for sums of bounded random variables. J. American Statist. Assoc., 58:13 – 30, 03 1963. doi: 10.2307/2282952.
 Hu et al. (2017) Hu, R., Andreas, J., Rohrbach, M., Darrell, T., and Saenko, K. Learning to reason: Endtoend module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 804–813, 2017.
 Joulin & Mikolov (2015) Joulin, A. and Mikolov, T. Inferring algorithmic patterns with stackaugmented recurrent nets. In Advances in neural information processing systems, pp. 190–198, 2015.
 Kane & Nelson (2014) Kane, D. M. and Nelson, J. Sparser johnsonlindenstrauss transforms. J. ACM, 61(1):4:1–4:23, January 2014. ISSN 00045411. doi: 10.1145/2559902. URL http://doi.acm.org/10.1145/2559902.
 Levy & Gayler (2008) Levy, S. D. and Gayler, R. Vector symbolic architectures: A new building material for artificial general intelligence. In Proceedings of the 2008 Conference on Artificial General Intelligence 2008: Proceedings of the First AGI Conference, pp. 414–418. IOS Press, 2008.
 Li & Yuan (2017) Li, Y. and Yuan, Y. Convergence analysis of twolayer neural networks with ReLU activation. arXiv preprint arXiv:1705.09886, 2017.
 Oh et al. (2017) Oh, J., Singh, S., Lee, H., and Kohli, P. Zeroshot task generalization with multitask deep reinforcement learning. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 2661–2670. JMLR. org, 2017.
 Papyan et al. (2018) Papyan, V., Romano, Y., Sulam, J., and Elad, M. Theoretical foundations of deep learning via sparse representations: A multilayer sparse model and its connection to convolutional neural networks. IEEE Signal Processing Magazine, 35(4):72–89, 2018.
 Real et al. (2018) Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
 Sabour et al. (2017) Sabour, S., Frosst, N., and Hinton, G. E. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866, 2017.
 Smolensky (1990) Smolensky, P. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial intelligence, 46(12):159–216, 1990.
 Spielman et al. (2012) Spielman, D. A., Wang, H., and Wright, J. Exact recovery of sparselyused dictionaries. In Conference on Learning Theory, pp. 37–1, 2012.
 Sukhbaatar et al. (2015) Sukhbaatar, S., Weston, J., Fergus, R., et al. Endtoend memory networks. In Advances in neural information processing systems, pp. 2440–2448, 2015.
 Wu et al. (2017) Wu, J., Tenenbaum, J. B., and Kohli, P. Neural scene derendering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 Yi et al. (2018) Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., and Tenenbaum, J. B. NeuralSymbolic VQA: Disentangling Reasoning from Vision and Language Understanding. In Advances in Neural Information Processing Systems (NIPS), 2018.
 Zaremba & Sutskever (2015) Zaremba, W. and Sutskever, I. Reinforcement learning neural turing machinesrevised. arXiv preprint arXiv:1505.00521, 2015.
 Zhang et al. (2017) Zhang, Q., Panigrahy, R., Sachdeva, S., and Rahimi, A. Electronproton dynamics in deep learning. arXiv preprint arXiv:1702.00458, 2017.
 Zhong et al. (2017a) Zhong, K., Song, Z., and Dhillon, I. S. Learning nonoverlapping convolutional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440, 2017a.
 Zhong et al. (2017b) Zhong, K., Song, Z., Jain, P., Bartlett, P. L., and Dhillon, I. S. Recovery guarantees for onehiddenlayer neural networks. arXiv preprint arXiv:1706.03175, 2017b.
Appendix A Technical Machinery for Random Matrices
In this appendix, we present several technical lemmas which will be useful for analyzing the random matrices that we use in our sketches.A.1 Notation
Definition 1.
Suppose we have two fixed vectors $x\mathrm{,}y\mathrm{\in}{\mathrm{R}}^{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}$ and random matrices ${R}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{R}_{k}$ drawn independently from distributions ${\mathrm{D}}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\mathrm{D}}_{k}$. We have a random variable $Z$ which is a function of $x$, $y$, and ${R}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{R}_{k}$. $Z$ is a $\delta $nice noise variable with respect to $D$ if the following two properties hold: 1. (Centered Property) ${\mathbb{E}}_{{R}_{1}\sim {\mathcal{D}}_{1},\mathrm{\dots}{R}_{k}\sim {\mathcal{D}}_{k}}[Z]=0$. 2. ($\delta $bounded Property) With high probability (over ${R}_{1}\sim {\mathcal{D}}_{1},\mathrm{\dots}{R}_{k}\sim {\mathcal{D}}_{k}$), $Z\le \delta \cdot {\left\leftx\right\right}_{2}\cdot {\left\lefty\right\right}_{2}$.Definition 2.
We say that a distribution $\mathrm{D}$ has the ${\alpha}_{D}$Leaky ${\delta}_{D}$Desynchronization Property if the following is true. If $R\mathrm{\sim}\mathrm{D}$, then the random variable $Z\mathrm{=}\mathrm{\u27e8}R\mathit{}x\mathrm{,}y\mathrm{\u27e9}\mathrm{}{\alpha}_{D}\mathit{}\mathrm{\u27e8}x\mathrm{,}y\mathrm{\u27e9}$ is a ${\delta}_{D}$nice noise variable. If ${\alpha}_{D}\mathrm{=}\mathrm{0}$ then we simply refer to this as the ${\delta}_{D}$Desynchronization Property.Definition 3.
We say that a distribution $\mathrm{D}$ has the ${\alpha}_{I}$scaled ${\delta}_{I}$Isometry Property if the following is true. If $R\mathrm{\sim}\mathrm{D}$ and $x\mathrm{=}y$, then the random variable $Z\mathrm{=}\mathrm{\u27e8}R\mathit{}x\mathrm{,}R\mathit{}x\mathrm{\u27e9}\mathrm{}{\alpha}_{I}\mathit{}\mathrm{\u27e8}x\mathrm{,}x\mathrm{\u27e9}$ is a ${\delta}_{I}$nice noise variable. If ${\alpha}_{I}\mathrm{=}\mathrm{1}$ then we simply refer to this as the ${\delta}_{I}$Isometry Property.Lemma 1.
If $\mathrm{D}$ satisfies the ${\alpha}_{I}$scaled ${\delta}_{I}$Isometry Property, then with high probability: $${\left\leftRx\right\right}_{2}^{2}\le ({\alpha}_{I}+{\delta}_{I}){\left\leftx\right\right}_{2}^{2}$$Proof.
Consider the random variable: $$Z=\u27e8Rx,Rx\u27e9{\alpha}_{I}\u27e8x,x\u27e9$$ Since $\mathcal{D}$ satisfies the ${\alpha}_{I}$scaled ${\delta}_{I}$Isometry Property, $Z$ is a ${\delta}_{I}$nice random variable. Hence with high probability: $\u27e8Rx,Rx\u27e9{\alpha}_{I}\u27e8x,x\u27e9$ $\le {\delta}_{I}{\left\leftx\right\right}_{2}^{2}$ $\u27e8Rx,Rx\u27e9$ $\le ({\alpha}_{I}+{\delta}_{I}){\left\leftx\right\right}_{2}^{2}$ This completes the proof. ∎Lemma 2.
If $\mathrm{D}$ satisfies the ${\alpha}_{I}$scaled ${\delta}_{I}$Isometry Property, then the random variable $Z\mathrm{=}\mathrm{\u27e8}R\mathit{}x\mathrm{,}R\mathit{}y\mathrm{\u27e9}\mathrm{}{\alpha}_{I}\mathit{}\mathrm{\u27e8}x\mathrm{,}y\mathrm{\u27e9}$ is a $\mathrm{(}\mathrm{3}\mathit{}{\delta}_{I}\mathrm{)}$nice noise variable.Proof.
This proof revolves around the scalar fact that ${(a+b)}^{2}={a}^{2}+2ab+{b}^{2}$, so from squaring we can get to products. This claim is trivially true if $x$ or $y$ is the zero vector. Hence without loss of generality, we can scale $x$ and $y$ to unit vectors, since being a nice noise variable is scale invariant. We invoke the ${\alpha}_{I}$scaled ${\delta}_{I}$bounded Isometry Property three times, on ${x}_{1}=x+y$, ${x}_{2}=x$, and ${x}_{3}=y$. This implies that the following three noise variables are ${\delta}_{I}$nice: ${Z}_{1}$ $=\u27e8R(x+y),R(x+y)\u27e9{\alpha}_{I}\u27e8x+y,x+y\u27e9$ ${Z}_{2}$ $=\u27e8Rx,Rx\u27e9{\alpha}_{I}\u27e8x,x\u27e9$ ${Z}_{3}$ $=\u27e8Ry,Ry\u27e9{\alpha}_{I}\u27e8y,y\u27e9$ We now relate our four noise variables of interest. ${Z}_{1}$ $=\left[\u27e8Rx,Rx\u27e9+2\u27e8Rx,Ry\u27e9+\u27e8Ry,Ry\u27e9\right]$ ${\alpha}_{I}\left[\u27e8x,x\u27e9+2\u27e8x,y\u27e9+\u27e8y,y\u27e9\right]$ $={Z}_{2}+2Z+{Z}_{3}$ $Z$ $={\displaystyle \frac{1}{2}}{Z}_{1}{\displaystyle \frac{1}{2}}{Z}_{2}{\displaystyle \frac{1}{2}}{Z}_{3}$ Hence, $Z$ is centered. We can bound it by analyzing the ${\mathrm{\ell}}_{2}$ lengths of the vectors that we applied isometry to. With high probability: ${Z}_{1}$ $\le {\delta}_{I}{\left\leftx+y\right\right}_{2}^{2}$ $\le {\delta}_{I}{\left({\left\leftx\right\right}_{2}+{\left\lefty\right\right}_{2}\right)}^{2}$ $\le 4{\delta}_{I}$ ${Z}_{2}$ $\le {\delta}_{I}{\left\leftx\right\right}_{2}^{2}$ $\le {\delta}_{I}$ ${Z}_{3}$ $\le {\delta}_{I}{\left\lefty\right\right}_{2}^{2}$ $\le {\delta}_{I}$ Hence $Z$ is also $(3{\delta}_{I})$bounded, making it a $(3{\delta}_{I})$noise variable, completing the proof. ∎A.2 Transparent Random Matrices
For our sketch, we sometimes want to both rotate a vector and pass it through asis. For these cases, the distribution $\mathrm{\text{Transparent}}\left(\mathcal{D}\right)$ is useful and defined as follows. We draw $R$ from $D$ and return $\overline{R}=\left(\frac{I+R}{2}\right)$. This subsection explains how the desynchronization and isometry of $\mathcal{D}$ carry over to $\mathrm{\text{Transparent}}\left(\mathcal{D}\right)$. This technical lemma shows how to get desynchronization.Lemma 3.
If $\mathrm{D}$ satisfies the ${\delta}_{D}$Desynchronization Property, then $\mathrm{\text{Transparent}}\mathit{}\mathrm{\left(}\mathrm{D}\mathrm{\right)}$ satisfies the $\mathrm{(}\mathrm{1}\mathrm{/}\mathrm{2}\mathrm{)}$leaky $\mathrm{(}{\delta}_{D}\mathrm{/}\mathrm{2}\mathrm{)}$Desynchronization Property.Proof.
We want to show that if $\overline{R}$ is drawn from $\mathrm{\text{Transparent}}\left(\mathcal{D}\right)$, then the following random variable $Z$ is a $({\delta}_{D}/2)$nice noise variable. $Z$ $=\u27e8\overline{R}x,y\u27e9{\displaystyle \frac{1}{2}}\u27e8x,y\u27e9$ $={x}^{T}\left({\displaystyle \frac{I+{R}^{T}}{2}}\right)y{\displaystyle \frac{1}{2}}\u27e8x,y\u27e9$ $={\displaystyle \frac{1}{2}}{x}^{T}y+{\displaystyle \frac{1}{2}}{x}^{T}{R}^{T}y{\displaystyle \frac{1}{2}}\u27e8x,y\u27e9$ $={\displaystyle \frac{1}{2}}\underset{\text{Let this be}{Z}^{\prime}\text{.}}{\underset{\u23df}{\left[{x}^{T}{R}^{T}y\right]}}$ Since $\mathcal{D}$ satisfies the ${\delta}_{D}$Desynchronization Property, we know ${Z}^{\prime}$ is a ${\delta}_{D}$nice noise variable. Our variable $Z$ is just a scaled version of ${Z}^{\prime}$, making it a $({\delta}_{D}/2)$nice noise variable. This completes the proof. ∎Lemma 4.
If $\mathrm{D}$ satisfies the ${\delta}_{D}$Desynchronization Property and the ${\delta}_{I}$Isometry Property, then $\mathrm{\text{Transparent}}\mathit{}\mathrm{\left(}\mathrm{D}\mathrm{\right)}$ satisfies the $\frac{\mathrm{1}}{\mathrm{2}}$scaled $\mathrm{(}\frac{\mathrm{1}}{\mathrm{2}}\mathit{}{\delta}_{D}\mathrm{+}\frac{\mathrm{1}}{\mathrm{4}}\mathit{}{\delta}_{I}\mathrm{)}$Isometry Property.Proof.
We want to show that if $\overline{R}$ is drawn from $\mathrm{\text{Transparent}}\left(\mathcal{D}\right)$, and $x=y$, then the following random variable $Z$ is a $\left({\delta}_{D}+\frac{1}{2}{\delta}_{I}\right)$nice noise variable: $Z$ $=\u27e8\overline{R}x,\overline{R}x\u27e9{\displaystyle \frac{1}{2}}\u27e8x,x\u27e9$ $={x}^{T}\left({\displaystyle \frac{I+{R}^{T}}{2}}\right)\left({\displaystyle \frac{I+R}{2}}\right)x{\displaystyle \frac{1}{2}}{x}^{T}x$ $={\displaystyle \frac{1}{4}}\left({x}^{T}x+{x}^{T}Rx+{x}^{T}{R}^{T}x+{x}^{T}{R}^{T}Rx\right){\displaystyle \frac{1}{2}}{x}^{T}x$ $={\displaystyle \frac{1}{2}}\underset{\text{Let this be}{Z}_{1}\text{.}}{\underset{\u23df}{\u27e8Rx,x\u27e9}}+{\displaystyle \frac{1}{4}}\underset{\text{Let this be}{Z}_{2}\text{.}}{\underset{\u23df}{\left(\u27e8Rx,Rx\u27e9\u27e8x,x\u27e9\right)}}$ Since $\mathcal{D}$ satisfies the ${\delta}_{D}$Desynchronization Property, we know ${Z}_{1}$ is a ${\delta}_{D}$nice noise variable. Since $\mathcal{D}$ satisfies the ${\delta}_{I}$Isometry Property, we know ${Z}_{2}$ is a ${\delta}_{I}$nice noise variable. This makes our variable $Z$ a $(\frac{1}{2}{\delta}_{D}+\frac{1}{4}{\delta}_{I})$nice noise variable, as desired. ∎A.3 Products of Random Matrices
For our sketch, we wind up multiplying matrices. In this subsection, we try to understand the properties of the resulting product. The distribution $\mathrm{\text{Product}}({\mathcal{D}}_{1},{\mathcal{D}}_{2})$ is as follows. We independently draw ${R}_{1}\sim {\mathcal{D}}_{1},{R}_{2}\sim {\mathcal{D}}_{2}$ and then return: $$\stackrel{~}{R}={R}_{2}{R}_{1}$$ . This technical lemma says that the resulting product is desynchronizing.Lemma 5.
If ${\mathrm{D}}_{\mathrm{1}}$ satisfies the ${\delta}_{D\mathrm{,}\mathrm{1}}$Desynchronization Property, ${\mathrm{D}}_{\mathrm{1}}$ satisfies the ${\delta}_{I\mathrm{,}\mathrm{1}}$Isometry Property, and ${\mathrm{D}}_{\mathrm{2}}$ satisfies the ${\delta}_{D\mathrm{,}\mathrm{2}}$leaky ${\alpha}_{D\mathrm{,}\mathrm{2}}$Desynchronization Property, then $\mathrm{\text{Product}}\mathit{}\mathrm{(}{\mathrm{D}}_{\mathrm{1}}\mathrm{,}{\mathrm{D}}_{\mathrm{2}}\mathrm{)}$ satisfies the $\mathrm{\left[}{\delta}_{D\mathrm{,}\mathrm{1}}\mathrm{+}{\delta}_{D\mathrm{,}\mathrm{2}}\mathit{}\sqrt{\mathrm{1}\mathrm{+}{\delta}_{I\mathrm{,}\mathrm{1}}}\mathrm{\right]}$Desynchronization Property.Proof.
We want to show that if $\stackrel{~}{R}$ is drawn from $\mathrm{\text{Product}}({\mathcal{D}}_{1},{\mathcal{D}}_{2})$, then the following random variable $Z$ is a $\left[{\delta}_{D,1}+{\delta}_{D,2}\sqrt{1+{\delta}_{I,1}}\right]$nice noise variable. $Z$ $=\u27e8\stackrel{~}{R}x,y\u27e9\u27e8x,y\u27e9$ $=\u27e8{R}_{2}{R}_{1}x,y\u27e9\u27e8x,y\u27e9$ We now define some useful (noise) variables. ${Z}_{1}$ $\u2254\u27e8{R}_{1}x,y\u27e9\u27e8x,y\u27e9$ ${Z}_{2}$ $\u2254\u27e8{R}_{2}{R}_{1}x,y\u27e9\u27e8{R}_{1}x,y\u27e9$ $Z$ $={Z}_{1}+{Z}_{2}$ From our assumptions about ${\mathcal{D}}_{1}$ and ${\mathcal{D}}_{2}$, we know that ${Z}_{1}$ is a ${\delta}_{D,1}$nice noise variable and ${Z}_{2}$ is a ${\delta}_{D,2}$nice noise variable. By Lemma 1 we know that with high probability, ${\left\left{R}_{1}x\right\right}_{2}^{2}\le 1+{\delta}_{I,1}$. Hence, with high probability: $\left{Z}_{1}\right$ $\le {\delta}_{D,1}{\left\leftx\right\right}_{2}{\left\lefty\right\right}_{2}$ $\left{Z}_{2}\right$ $\le {\delta}_{D,2}\sqrt{1+\delta I,1}{\left\leftx\right\right}_{2}{\left\lefty\right\right}_{2}$ $\leftZ\right$ $\le \left{Z}_{1}\right+\left{Z}_{2}\right$ $\le \left[{\delta}_{D,1}+{\delta}_{D,2}\sqrt{1+{\delta}_{I,1}}\right]{\left\leftx\right\right}_{2}{\left\lefty\right\right}_{2}$ Hence $Z$ is a $\left[{\delta}_{D,1}+{\delta}_{D,2}\sqrt{1+{\delta}_{I,1}}\right]$nice noise variable, completing the proof. ∎Lemma 6.
If ${\mathrm{D}}_{\mathrm{1}}$ satisfies the ${\alpha}_{I\mathrm{,}\mathrm{1}}$scaled ${\delta}_{I\mathrm{,}\mathrm{1}}$Isometry Property and ${\mathrm{D}}_{\mathrm{2}}$ satisfies the ${\alpha}_{I\mathrm{,}\mathrm{2}}$scaled ${\delta}_{I\mathrm{,}\mathrm{2}}$Isometry Property, then $\mathrm{\text{Product}}\mathit{}\mathrm{(}{\mathrm{D}}_{\mathrm{1}}\mathrm{,}{\mathrm{D}}_{\mathrm{2}}\mathrm{)}$ satisfies the $\mathrm{\left(}{\alpha}_{I\mathrm{,}\mathrm{1}}\mathit{}{\alpha}_{I\mathrm{,}\mathrm{2}}\mathrm{\right)}$scaled $\mathrm{\left[}{\delta}_{I\mathrm{,}\mathrm{2}}\mathit{}\mathrm{(}\mathrm{1}\mathrm{+}{\delta}_{I\mathrm{,}\mathrm{1}}\mathrm{)}\mathrm{+}{\alpha}_{I\mathrm{,}\mathrm{2}}\mathit{}{\delta}_{I\mathrm{,}\mathrm{1}}\mathrm{\right]}$bounded Isometry Property.Proof.
We want to show that if $\stackrel{~}{R}$ is drawn from $\mathrm{\text{Product}}({\mathcal{D}}_{1},{\mathcal{D}}_{2})$, and $x=y$, then the following random variable $Z$ is a $\left[{\delta}_{I,2}(1+{\delta}_{I,1})+{\alpha}_{I,2}{\delta}_{I,1}\right]$nice noise variable. $Z$ $\u2254\u27e8\stackrel{~}{R}x,\stackrel{~}{R}x\u27e9{\alpha}_{I,1}{\alpha}_{I,2}\u27e8x,x\u27e9$ $={x}^{T}{\stackrel{~}{R}}^{T}\stackrel{~}{R}x{\alpha}_{I,1}{\alpha}_{I,2}{x}^{T}x$ $={x}^{T}{R}_{1}^{T}{R}_{2}^{T}{R}_{2}{R}_{1}x{\alpha}_{I,1}{\alpha}_{I,2}{x}^{T}x$ We now define some useful (noise) variables. ${x}_{0}$ $\u2254x$ ${x}_{1}$ $\u2254{R}_{1}{x}_{0}$ ${x}_{2}$ $\u2254{R}_{2}{x}_{1}$ ${Z}_{1}$ $\u2254\u27e8{R}_{1}{x}_{0},{R}_{1}{x}_{0}\u27e9{\alpha}_{I,1}\u27e8{x}_{0},{x}_{0}\u27e9$ $={\left\left{x}_{1}\right\right}_{2}^{2}{\alpha}_{I,1}{\left\left{x}_{0}\right\right}_{2}^{2}$ ${Z}_{2}$ $\u2254\u27e8{R}_{2}{x}_{1},{R}_{2}{X}_{1}\u27e9{\alpha}_{I,2}\u27e8{x}_{1},{x}_{1}\u27e9$ $={\left\left{x}_{2}\right\right}_{2}^{2}{\alpha}_{I,2}{\left\left{x}_{1}\right\right}_{2}^{2}$ $Z$ is just a weighted sum of our random variables: ${Z}_{2}+{\alpha}_{I,2}{Z}_{1}$ $={\left\left{x}_{2}\right\right}_{2}^{2}{\alpha}_{I,2}{\left\left{x}_{1}\right\right}_{2}^{2}$ $+{\alpha}_{I,2}{\left\left{x}_{1}\right\right}_{2}^{2}{\alpha}_{I,1}{\alpha}_{I,2}{\left\left{x}_{0}\right\right}_{2}^{2}$ $={\left\left{x}_{2}\right\right}_{2}^{2}{\alpha}_{I,1}{\alpha}_{I,2}{\left\left{x}_{0}\right\right}_{2}^{2}$ $=Z$ Since ${\mathcal{D}}_{1}$ satisfies the ${\alpha}_{I,1}$scaled ${\delta}_{I,1}$Isometry Property we know that ${Z}_{1}$ is a ${\delta}_{I,1}$nice noise variable. Similarly, we know ${Z}_{2}$ is a ${\delta}_{I,2}$nice noise variable. As a result, $Z$ is centered; it remains to bound it. By Lemma 1, we know that with high probability ${\left\left{x}_{1}\right\right}_{2}^{2}\le 1+{\delta}_{I,1}$. Hence with high probability: $\left{Z}_{1}\right$ $\le {\delta}_{I,1}{\left\left{x}_{0}\right\right}_{2}^{2}$ $\left{Z}_{2}\right$ $\le {\delta}_{I,2}(1+{\delta}_{I,1}){\left\left{x}_{0}\right\right}_{2}^{2}$ $\leftZ\right$ $\le \left{Z}_{2}\right+{\alpha}_{I,2}\left{Z}_{1}\right$ $\le {\delta}_{I,2}(1+{\delta}_{I,1})+{\alpha}_{I,2}{\delta}_{I,1}$ Hence $Z$ is a $\left[{\delta}_{I,2}(1+{\delta}_{I,1})+{\alpha}_{I,2}{\delta}_{I,1}\right]$nice noise variable, completing the proof. ∎A.4 Applying Different Matrices
For the analysis of our sketch, we sometimes compare objects that have been sketched using different matrices. The following technical lemma is a two matrix variant of desynchronization for this situation.Lemma 7.
Suppose that we have a distribution $\mathrm{D}$ which satisfies the ${\alpha}_{I}$scaled ${\delta}_{I}$Isometry Property and the ${\alpha}_{D}$leaky ${\delta}_{D}$Desynchronization Property. Suppose we independently draw ${R}_{\mathrm{1}}\mathrm{\sim}\mathrm{D}$ and ${R}_{\mathrm{2}}\mathrm{\sim}\mathrm{D}$. Then the random variable $$Z=\u27e8{R}_{1}x,{R}_{2}y\u27e9{\alpha}_{D}^{2}\u27e8x,y\u27e9$$ is a $\mathrm{(}{\delta}_{D}\mathit{}\sqrt{{\alpha}_{I}\mathrm{+}{\delta}_{I}}\mathrm{+}{\alpha}_{D}\mathit{}{\delta}_{D}\mathrm{)}$nice noise variable.Proof.
We define useful (noise) variables ${Z}_{1}$ and ${Z}_{2}$. ${x}^{\prime}$ $\u2254{R}_{1}x$ ${Z}_{1}$ $\u2254\u27e8{x}^{\prime},{R}_{2}y\u27e9{\alpha}_{D}\u27e8{x}^{\prime},y\u27e9$ ${Z}_{2}$ $\u2254\u27e8{R}_{1}x,y\u27e9{\alpha}_{D}\u27e8x,y\u27e9$ $Z$ $={Z}_{1}+{\alpha}_{D}{Z}_{2}$ Since $\mathcal{D}$ satisfies the ${\alpha}_{D}$leaky ${\delta}_{D}$Desynchronization Property, we know that ${Z}_{1}$ and ${Z}_{2}$ are ${\delta}_{D}$nice noise variables. Since $D$ satisfies the ${\alpha}_{I}$scaled ${\delta}_{I}$Isometry Property, we can apply Lemma 1 to show that with high probability, ${\left\left{R}_{1}x\right\right}_{2}^{2}\le ({\alpha}_{I}+{\delta}_{I})$. Hence with high probability: $\left{Z}_{1}\right$ $\le {\delta}_{D}\sqrt{{\alpha}_{I}+{\delta}_{I}}$ $\left{Z}_{2}\right$ $\le {\delta}_{D}$ $\leftZ\right$ $\le \left{Z}_{1}\right+{\alpha}_{D}\left{Z}_{2}\right$ $\le {\delta}_{D}\sqrt{{\alpha}_{I}+{\delta}_{I}}+{\alpha}_{D}{\delta}_{D}$ Hence $Z$ is a $({\delta}_{D}\sqrt{{\alpha}_{I}+{\delta}_{I}}+{\alpha}_{D}{\delta}_{D})$nice noise variable, as desired. This completes the proof. ∎Appendix B Properties of the BlockRandom Distribution
In this section, we prove that our blockrandom distribution over random matrices satisfies the isometry and desynchronization properties required by our other results. In particular, our goal is to prove the following two theorems:Theorem 6.
Suppose that $q$ is $\mathrm{\Omega}\mathit{}\mathrm{\left(}\frac{\sqrt{b\mathit{}\mathrm{log}\mathit{}N}}{\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}}\mathrm{\right)}$ and ${d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}$ is $\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}\mathit{}\mathrm{(}N\mathrm{)}$. Then there exists a $\delta \mathrm{=}O\mathit{}\mathrm{\left(}\frac{\sqrt{b\mathit{}\mathrm{log}\mathit{}N}}{\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}}\mathrm{\right)}$ such that the blockrandom distribution $\mathrm{D}\mathit{}\mathrm{(}b\mathrm{,}q\mathrm{,}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{)}$ satisfies the $\delta $Isometry Property.Theorem 7.
Suppose that $q$ is $\mathrm{\Omega}\mathit{}\mathrm{\left(}\frac{\sqrt{b\mathit{}\mathrm{log}\mathit{}N}}{\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}}\mathrm{\right)}$. Then there exists a $\delta \mathrm{=}O\mathit{}\mathrm{\left(}\frac{\sqrt{b\mathit{}\mathrm{log}\mathit{}N}}{\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}}\mathrm{\right)}$ such that the blockrandom distribution $\mathrm{D}\mathit{}\mathrm{(}b\mathrm{,}q\mathrm{,}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{)}$ satisfies the $O\mathit{}\mathrm{(}\frac{\mathrm{log}\mathit{}N}{\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}}\mathrm{)}$Desynchronization Property.Corollary 1.
Suppose that $b\mathrm{=}\mathrm{3}\mathit{}\mathrm{max}\mathit{}\mathrm{(}\mathrm{\lceil}{\mathrm{log}}_{\mathrm{2}}\mathit{}N\mathrm{\rceil}\mathrm{,}\mathrm{\lceil}{\mathrm{log}}_{\mathrm{2}}\mathit{}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{\rceil}\mathrm{+}\mathrm{3}\mathrm{)}$, $q\mathrm{=}\mathrm{\lceil}\frac{\sqrt{\mathrm{(}\mathrm{log}\mathit{}N\mathrm{+}\mathrm{log}\mathit{}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{)}\mathit{}\mathrm{log}\mathit{}N}}{\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}}\mathrm{\rceil}$, and ${d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}$ is $\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}\mathit{}\mathrm{(}N\mathrm{)}$. Then the blockrandom distribution $\mathrm{D}\mathit{}\mathrm{(}b\mathrm{,}q\mathrm{,}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{)}$ satisfies the $\stackrel{\mathrm{~}}{O}\mathit{}\mathrm{\left(}\frac{\mathrm{1}}{\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}}\mathrm{\right)}$Isometry Property and the $\stackrel{\mathrm{~}}{O}\mathit{}\mathrm{\left(}\frac{\mathrm{1}}{\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}}\mathrm{\right)}$Desynchronization Property.B.1 Isometry
In this subsection, we will prove Theorem 6. Our proof essentially follows the first analysis presented by Kane and Nelson for their sparse JohnsonLindenstrauss Transform (Kane & Nelson, 2014). Our desired result differs from theirs in several ways, which complicates our version of the proof: • Their random matrices do not have blocklevel patterns like ours do. We handle this by using the small size of our blocks. • Their random matrices have a fixed number of nonzeros per column, but the nonzeros within one of our columns are determined by independent Bernoulli random variables: ${\eta}_{i,j}$. We handle this by applying a standard concentration bound to the number of nonzeros. • Some of our subblock patterns depend on the $\pm 1$ coin flips for other subblocks, ruining independence. We handle this by breaking our matrix into three submatrices to be analyzed separately; see Figure 5. For the sake of completeness, we reproduce the entire proof here. The highlevel proof plan is to use HansonWright inequality (Hanson & Wright, 1971) to bound the ${\mathrm{\ell}}^{th}$ moment of the noise, and then use a Markov’s inequality to convert the result into a high probability bound on the absolute value of the noise. Suppose we draw a random matrix $R\in {\mathbb{R}}^{{d}_{\text{sketch}}\times {d}_{\text{sketch}}}$ from our distribution $\mathcal{D}(b,q,{d}_{\text{sketch}})$. We split $R$ into three submatrices: ${R}_{s}\in {\mathbb{R}}^{{d}_{\text{sketch}}/3\times {d}_{\text{sketch}}}$ containing the random string subblocks, ${R}_{m}\in {\mathbb{R}}^{{d}_{\text{sketch}}/3\times {d}_{\text{sketch}}}$ containing the matrix signature subblocks, and ${R}_{c}\in {\mathbb{R}}^{{d}_{\text{sketch}}/3\times {d}_{\text{sketch}}}$ containing the column signature subblocks. $\left[\begin{array}{ccccccccccc}\hfill {\colorbox[rgb]{0.8,1,0.8}{$f$}}_{\colorbox[rgb]{0.8,1,0.8}{$s$}\colorbox[rgb]{0.8,1,0.8}{$,$}\colorbox[rgb]{0.8,1,0.8}{$1$}\colorbox[rgb]{0.8,1,0.8}{$,$}\colorbox[rgb]{0.8,1,0.8}{$1$}}& \cdot {\sigma}_{s,1}\hfill & \hfill {f}_{s,1,2}& \cdot {\sigma}_{s,2}\hfill & \hfill {f}_{s,1,3}& \cdot {\sigma}_{s,3}\hfill & & \overrightarrow{0}\hfill & \hfill \mathrm{\cdots}\hfill & \hfill {f}_{s,1,{d}_{\text{sketch}}}& \cdot {\sigma}_{s,{d}_{\text{sketch}}}\hfill \\ \hfill {f}_{c,1,1}& \cdot {\sigma}_{c,1,1}\hfill & \hfill {f}_{c,1,2}& \cdot {\sigma}_{c,1,2}\hfill & \hfill {f}_{c,1,3}& \cdot {\sigma}_{c,1,3}\hfill & & \overrightarrow{0}\hfill & \hfill \mathrm{\cdots}\hfill & \hfill {f}_{c,1,{d}_{\text{sketch}}}& \cdot {\sigma}_{c,1,{d}_{\text{sketch}}}\hfill \\ \hfill {f}_{m,1,1}& \cdot {\sigma}_{m}\hfill & \hfill {f}_{m,1,2}& \cdot {\sigma}_{m}\hfill & \hfill {f}_{m,1,3}& \cdot {\sigma}_{m}\hfill & & \overrightarrow{0}\hfill & \hfill \mathrm{\cdots}\hfill & \hfill {f}_{m,1,{d}_{\text{sketch}}}& \cdot {\sigma}_{m}\hfill \\ & \overrightarrow{0}\hfill & \hfill {f}_{s,2,2}& \cdot {\sigma}_{s,2}\hfill & & \overrightarrow{0}\hfill & & \overrightarrow{0}\hfill & \hfill \mathrm{\cdots}\hfill & & \overrightarrow{0}\hfill \\ & \overrightarrow{0}\hfill & \hfill {f}_{c,2,2}& \cdot {\sigma}_{c,2,2}\hfill & & \overrightarrow{0}\hfill & & \overrightarrow{0}\hfill & \hfill \mathrm{\cdots}\hfill & & \overrightarrow{0}\hfill \\ & \overrightarrow{0}\hfill & \hfill {f}_{m,2,2}& \cdot {\sigma}_{m}\hfill & & \overrightarrow{0}\hfill & & \overrightarrow{0}\hfill & \hfill \mathrm{\cdots}\hfill & & \overrightarrow{0}\hfill \\ \hfill {\colorbox[rgb]{0.8,1,0.8}{$f$}}_{\colorbox[rgb]{0.8,1,0.8}{$s$}\colorbox[rgb]{0.8,1,0.8}{$,$}\colorbox[rgb]{0.8,1,0.8}{$3$}\colorbox[rgb]{0.8,1,0.8}{$,$}\colorbox[rgb]{0.8,1,0.8}{$1$}}& \cdot {\sigma}_{s,1}\hfill & \hfill {f}_{s,3,2}& \cdot {\sigma}_{s,2}\hfill & \hfill {f}_{s,3,3}& \cdot {\sigma}_{s,3}\hfill & \hfill {f}_{s,3,4}& \cdot {\sigma}_{s,4}\hfill & \hfill \mathrm{\cdots}\hfill & & \overrightarrow{0}\hfill \\ \hfill {f}_{c,3,1}& \cdot {\sigma}_{c,3,1}\hfill & \hfill {f}_{c,3,2}& \cdot {\sigma}_{c,3,2}\hfill & \hfill {f}_{c,3,3}& \cdot {\sigma}_{c,3,3}\hfill & \hfill {f}_{c,3,4}& \cdot {\sigma}_{c,3,4}\hfill & \hfill \mathrm{\cdots}\hfill & & \overrightarrow{0}\hfill \\ \hfill {f}_{m,3,1}& \cdot {\sigma}_{m}\hfill & \hfill {f}_{m,3,2}& \cdot {\sigma}_{m}\hfill & \hfill {f}_{m,3,3}& \cdot {\sigma}_{m}\hfill & \hfill {f}_{m,3,4}& \cdot {\sigma}_{m}\hfill & \hfill \mathrm{\cdots}\hfill & & \overrightarrow{0}\hfill \\ \hfill {\colorbox[rgb]{0.8,1,0.8}{$f$}}_{\colorbox[rgb]{0.8,1,0.8}{$s$}\colorbox[rgb]{0.8,1,0.8}{$,$}\colorbox[rgb]{0.8,1,0.8}{$4$}\colorbox[rgb]{0.8,1,0.8}{$,$}\colorbox[rgb]{0.8,1,0.8}{$1$}}& \cdot {\sigma}_{s,1}\hfill & \hfill {f}_{s,4,2}& \cdot {\sigma}_{s,2}\hfill & \hfill {f}_{s,4,3}& \cdot {\sigma}_{s,3}\hfill & & \overrightarrow{0}\hfill & \hfill \mathrm{\cdots}\hfill & \hfill {f}_{s,4,{d}_{\text{sketch}}}& \cdot {\sigma}_{s,{d}_{\text{sketch}}}\hfill \\ \hfill {f}_{c,4,1}& \cdot {\sigma}_{c,4,1}\hfill & \hfill {f}_{c,4,2}& \cdot {\sigma}_{c,4,2}\hfill & \hfill {f}_{c,4,3}& \cdot {\sigma}_{c,4,3}\hfill & & \overrightarrow{0}\hfill & \hfill \mathrm{\cdots}\hfill & \hfill {f}_{c,4,{d}_{\text{sketch}}}& \cdot {\sigma}_{c,4,{d}_{\text{sketch}}}\hfill \\ \hfill {f}_{m,4,1}& \cdot {\sigma}_{m}\hfill & \hfill {f}_{m,4,2}& \cdot {\sigma}_{m}\hfill & \hfill {f}_{m,4,3}& \cdot {\sigma}_{m}\hfill & & \overrightarrow{0}\hfill & \hfill \mathrm{\cdots}\hfill & \hfill {f}_{m,4,{d}_{\text{sketch}}}& \cdot {\sigma}_{m}\hfill \\ & \mathrm{\vdots}\hfill & & \mathrm{\vdots}\hfill & & \mathrm{\vdots}\hfill & & \mathrm{\vdots}\hfill & \hfill \mathrm{\ddots}\hfill & & \mathrm{\vdots}\hfill \\ & \overrightarrow{0}\hfill & & \overrightarrow{0}\hfill & \hfill {f}_{s,\mathrm{\u25c7},3}& \cdot {\sigma}_{s,3}\hfill & \hfill {f}_{s,\mathrm{\u25c7},4}& \cdot {\sigma}_{s,4}\hfill & \hfill \mathrm{\cdots}\hfill & & \overrightarrow{0}\hfill \\ & \overrightarrow{0}\hfill & & \overrightarrow{0}\hfill & \hfill {f}_{c,\mathrm{\u25c7},3}& \cdot {\sigma}_{c,\mathrm{\u25c7},3}\hfill & \hfill {f}_{c,\mathrm{\u25c7},4}& \cdot {\sigma}_{c,\mathrm{\u25c7},4}\hfill & \hfill \mathrm{\cdots}\hfill & & \overrightarrow{0}\hfill \\ & \overrightarrow{0}\hfill & & \overrightarrow{0}\hfill & \hfill {f}_{m,\mathrm{\u25c7},3}& \cdot {\sigma}_{m}\hfill & \hfill {f}_{m,\mathrm{\u25c7},4}& \cdot {\sigma}_{m}\hfill & \hfill \mathrm{\cdots}\hfill & & \overrightarrow{0}\hfill \end{array}\right]$ We fix $\delta =O\left(\frac{\mathrm{log}N}{\sqrt{{d}_{\text{sketch}}}}\right)$. Consider any one of these submatrices, ${R}_{\star}$. We wish to show that for any unit vector $x\in {\mathbb{R}}^{{d}_{\text{sketch}}}$, with high probability over $R\sim \mathcal{D}(b,q,{d}_{\text{sketch}})$: $${\left\left{R}_{\star}x\right\right}_{2}^{2}\in [\frac{1}{3}(1\delta ),\frac{1}{3}(1+\delta )]$$ Since this would imply that with high probability (due to unionbounding over the three submatrices): ${\left\leftRx\right\right}_{2}^{2}$ $={\displaystyle \sum _{\star \in \{s,c,m\}}}{\left\left{R}_{\star}x\right\right}_{2}^{2}$ $\in [1\delta ,1+\delta ]$ We will use the variant of the HansonWright inequality presented by Kane and Nelson (Kane & Nelson, 2014).Theorem 8 (HansonWright Inequality (Hanson & Wright, 1971)).
Let $z\mathrm{=}\mathrm{(}{z}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{z}_{n}\mathrm{)}$ be a vector of i.i.d. Rademacher $\mathrm{\pm}\mathrm{1}$ random variables. For any symmetric $B\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}n}$ and $\mathrm{\ell}\mathrm{\ge}\mathrm{2}$, $$\mathbb{E}{{z}^{T}Bz\mathrm{trace}(B)}^{\mathrm{\ell}}\le {C}^{\mathrm{\ell}}\cdot \mathrm{max}{\{\sqrt{\mathrm{\ell}}\cdot {\left\leftB\right\right}_{F},\mathrm{\ell}\cdot {\left\leftB\right\right}_{2}\}}^{\mathrm{\ell}}$$ for some universal constant $C\mathrm{>}\mathrm{0}$ independent of $B\mathrm{,}n\mathrm{,}\mathrm{\ell}$.B.2 Desynchronization
In this subsection, we will prove Theorem 7. Suppose we draw a random matrix $R\in {\mathbb{R}}^{{d}_{\text{sketch}}\times {d}_{\text{sketch}}}$ from our distribution $\mathcal{D}(b,q,{d}_{\text{sketch}})$, and we have two unit vectors $x,y\in {\mathbb{R}}^{{d}_{\text{sketch}}}$. For convenience, we will index $x$ using $[{d}_{\text{sketch}}]$ and $y$ using $\{s,c,m\}\times [\mathrm{\u25c7}]\times [b/3]$. As in the previous section, we split $R$ into ${R}_{s},{R}_{c},{R}_{m}$ (see Figure 5) so that our random variables will be independent of the subblock patterns. For one submatrix: $${y}^{T}{R}_{\star}x=\sum _{i=1}^{\mathrm{\u25c7}}\sum _{j=1}^{{d}_{\text{sketch}}}{\eta}_{i,j}{f}_{\star ,i,j}\u27e8{\sigma}_{\star ,i,j},{y}_{\star ,i}\u27e9{x}_{j}$$ We view this as a sum of $\mathrm{\u25c7}{d}_{\text{sketch}}$ random variables, which we will apply Bernstein’s Inequality to bound. Note that: $\left{\eta}_{i,j}{f}_{\star ,i,j}\u27e8{\sigma}_{\star ,i,j},{y}_{\star ,i}\u27e9{x}_{j}\right$ $\le {\displaystyle \frac{b}{3}}{\displaystyle \frac{1}{{d}_{\text{sketch}}q}}$ $={\displaystyle \frac{1}{3\mathrm{\u25c7}q}}$ $\sum _{i=1}^{\mathrm{\u25c7}}}{\displaystyle \sum _{j=1}^{{d}_{\text{sketch}}}}\mathbb{E}\left[{\eta}_{i,j}^{2}{f}_{\star ,i,j}^{2}{\u27e8{\sigma}_{\star ,i,j},{y}_{\star ,i}\u27e9}^{2}{x}_{j}^{2}\right]$ $=q{\displaystyle \sum _{i=1}^{\mathrm{\u25c7}}}{\displaystyle \sum _{j=1}^{{d}_{\text{sketch}}}}{\u27e8{\sigma}_{\star ,i,j},{y}_{\star ,i}\u27e9}^{2}{x}_{j}^{2}$ $\le q{\displaystyle \sum _{i=1}^{\mathrm{\u25c7}}}{\displaystyle \sum _{j=1}^{{d}_{\text{sketch}}}}{\left\left{\sigma}_{\star ,i,j}\right\right}_{2}^{2}{\left\left{y}_{\star ,i}\right\right}_{2}^{2}{x}_{j}^{2}$ $=q{\displaystyle \sum _{i=1}^{\mathrm{\u25c7}}}{\displaystyle \sum _{j=1}^{{d}_{\text{sketch}}}}{\displaystyle \frac{b}{3}}{\displaystyle \frac{1}{{d}_{\text{sketch}}q}}{\left\left{y}_{\star ,i}\right\right}_{2}^{2}{x}_{j}^{2}$ $={\displaystyle \frac{1}{3\mathrm{\u25c7}}}$ Combining these with Bernstein’s Inequality yields: $$\mathrm{Pr}\left[{y}^{T}{R}_{\star}x>\frac{\delta}{3}\right]\le \mathrm{exp}\left(\frac{{\delta}^{2}/18}{\frac{1}{3\mathrm{\u25c7}}+\frac{1}{9\mathrm{\u25c7}q}\frac{\delta}{3}}\right)$$ We note that by assumption $q=\mathrm{\Omega}\left(\frac{\sqrt{\mathrm{log}N}}{\sqrt{\mathrm{\u25c7}}}\right)$ and hence we can also choose a sufficiently large $\delta =O\left(\frac{\sqrt{\mathrm{log}N}}{\sqrt{\mathrm{\u25c7}}}\right)$ to make this probability polynomially small in $N$. This completes the proof.Appendix C Proof Sketches for Sketch Prototypes
In this appendix, we present proof sketches for our sketch prototypes.C.1 Proof Sketches for Prototype A
Recall Claim 1, which we will now sketch a proof of. See 1Proof Sketch.
We multiply our overall sketch with ${R}_{M({\theta}^{\star})}^{T}={R}_{M({\theta}^{\star})}^{1}$. This product is ${\sum}_{\theta}{w}_{\theta}{R}_{M({\theta}^{\star})}^{1}{R}_{M(\theta )}{x}_{\theta}$. Our signal term occurs when $\theta ={\theta}^{\star}$ and is ${w}_{{\theta}^{\star}}{x}_{{\theta}^{\star}}$. We think of the other terms as noise; we know from our orthonormal matrix properties that with high probability they each perturb any specific coordinate by at most $O({w}_{\theta}\sqrt{\mathrm{log}N}/\sqrt{{d}_{\text{sketch}}})$. Taken together, no coordinate is perturbed by more than $O(\sqrt{\mathrm{log}N}/\sqrt{{d}_{\text{sketch}}})$. Dividing both our signal term and noise by ${w}_{{\theta}^{\star}}$, we get the desired claim. ∎Proof Sketch.
Consider two overall sketches $s={\sum}_{\theta}{w}_{\theta}{R}_{M(\theta )}{x}_{\theta}$ and $\overline{s}={\sum}_{\overline{\theta}}{\overline{w}}_{\overline{\theta}}{R}_{M(\overline{\theta})}{x}_{\overline{\theta}}$. Their dotproduct can be written as ${\sum}_{\theta}{\sum}_{\overline{\theta}}{w}_{\theta}{\overline{w}}_{\overline{\theta}}{x}_{\theta}^{T}{R}_{M(\theta )}^{T}{R}_{M(\overline{\theta})}{x}_{\overline{\theta}}$. First, let us suppose we are in case (i) and they share no modules. Since orthonormal matrices are desynchronizing, each term has magnitude at most $O({w}_{\theta}{\overline{w}}_{\overline{\theta}}\sqrt{\mathrm{log}N}/\sqrt{{d}_{\text{sketch}}})$. Since the sum of each set of weights is one, the total magnitude is at most $O(\sqrt{\mathrm{log}N}/\sqrt{{d}_{\text{sketch}}})$. This completes the proof of case (i). Next, let us suppose we are in case (ii) and they share an object ${\theta}^{\star}$. Since orthonormal matrices are isometric, the term $\theta =\overline{\theta}={\theta}^{\star}$ has a value of ${w}_{{\theta}^{\star}}{\overline{w}}_{{\theta}^{\star}}$. We can analyze the remaining terms as in case (i); they have total magnitude at most $O(\sqrt{\mathrm{log}N}/\sqrt{{d}_{\text{sketch}}})$. The worstcase is that the noise is negative, which yields the claim for case (ii). We need to analyze case (iii) a little differently. Since our objects are paired up between the sketches, we refer to the objects as ${\theta}_{1},\mathrm{\dots},{\theta}_{k}$ (in the first sketch) and ${\overline{\theta}}_{1},\mathrm{\dots},{\overline{\theta}}_{k}$ (in the second sketch. Because they are close in ${\mathrm{\ell}}_{2}$ distance, we can write ${x}_{{\overline{\theta}}_{i}}={x}_{{\theta}_{i}}+{\u03f5}_{i}$, where each ${\u03f5}_{i}$ is a noise vector with ${\mathrm{\ell}}_{2}$ length at most $\u03f5$. We can write our second sketch as $\overline{s}={\sum}_{i}{w}_{{\theta}_{i}}{R}_{M({\theta}_{i})}\left({x}_{{\theta}_{i}}+{\u03f5}_{i}\right)$ (by assumption, the paired objects had the same weight and were produced by the same module). But then $\overline{s}=s+{\sum}_{i}{w}_{{\theta}_{i}}{R}_{M({\theta}_{i})}{\u03f5}_{i}$! Since orthonormal matrices are isometric and the weights sum to one, $\overline{s}$ is at most $\u03f5$ away from $s$ in ${\mathrm{\ell}}_{2}$ norm, as desired. ∎C.2 Proof Sketches for Prototype B
Recall Claim 3, which we will now sketch a proof of. See 3Proof Sketch.
We begin with the easier case (i). For case (i), we attempt to recover ${x}_{{\theta}^{\star}}$ via multiplying the sketch by ${R}_{M({\theta}^{\star}),1}^{T}$. The signal component of our sketch has the form $\left(\frac{I+{R}_{i}}{2}\right)\frac{1}{2}{R}_{M({\theta}^{\star}),1}^{T}{x}_{{\theta}^{\star}}$. The ${R}_{i}/2$ part can be considered noise, and the $I/2$ part gives us a quarter of the signal we had when analyzing sketch A. The noise calculation is the same as before because ${R}_{M({\theta}^{\star})}$ does not occur anywhere else, so we get the same additive error (up to a constant hidden in the bigoh notation). For the harder case (ii), we multiply the sketch by ${R}_{M({\theta}^{\star})}^{T}{R}_{i}^{T}$ instead. This recovers the ${R}_{i}/2$ part instead of the $I/2$ part, but everything else is the same as the previous case. ∎Proof Sketch.
Case (i) is the same as before; the desynchronizing property of orthonormal matrices means that taking the dotproduct of module sketches, even when transformed by $I/2$ or ${R}_{i}/2$, results in at most $O(\sqrt{\mathrm{log}N}/\sqrt{{d}_{\text{sketch}}})$ noise. Regarding case (ii), our new signal term is weaker by a factor of four. Notice that for both sketches, the object sketch ${s}_{\text{object}}({\theta}^{\star})$ is still identical. We lose our factor of four when considering that a $\left(\frac{I+{R}_{i}}{2}\right)$ has been added into the tuple sketch; the $I/2$ lets through the information we care about, and compounded over two sketches results in a factor four loss. If the object were located at the same index in both sketches, we would only lose a factor of two. Regarding case (iii), the new module sketch is only halfbased on attributes; the other half agrees if the objects have the same module. As a result, changing the attributes has half the effect as before. Case (iv) is similar to case (i), except only the ${e}_{1}$ component of the module sketch is similar, resulting in an additional factor two loss from each sketch, for a final factor of sixteen. ∎Proof Sketch.
It is more intuitive to begin with case (ii). Consider what would happen if we performed case (i) of Attribute Recovery as in Claim 3. That claim assumed that there were no other objects produced by the same module, but if there were (and they had the same weight ${w}^{\star}$), then their attribute vectors would be added together. This exactly yields the current case (ii). Next, for case (i), we simply observe that the object sketch treats ${x}_{\theta}$ the same way as it treats ${e}_{1}$, so we could use the same idea to recover the sum, over all objects with the same module, of ${e}_{1}$. That is exactly the number of such modules times ${e}_{1}$, so looking at the first coordinate gives us the desired estimate for the current case (i). ∎Appendix D Proofs for Final Sketch
In this appendix, we present proofs for the properties of our final sketch. Recall that our final sketch is defined as follows. ${s}_{\text{tuple}}({s}_{1},\mathrm{\dots},{s}_{k})$ $\u2254{\displaystyle \sum _{i=1}^{k}}{w}_{i}\left({\displaystyle \frac{I+{R}_{i}}{2}}\right){s}_{i}$ ${s}_{\text{object}}(\theta )$ $\u2254\left(\frac{I+{R}_{M(\theta ),0}}{2}\right){s}_{\text{tuple}}({s}_{\text{attr}}(\theta ),{s}_{\text{input}}(\theta ),\frac{1}{2},\frac{1}{2})$ ${s}_{\text{attr}}(\theta )$ $\u2254\frac{1}{2}{R}_{M(\theta ),1}{x}_{\theta}+\frac{1}{2}{R}_{M(\theta ),2}{e}_{1}$ ${s}_{\text{input}}(\theta )$ $\u2254{s}_{\text{tuple}}({s}_{\text{object}}({\theta}_{1}),\mathrm{\dots},{s}_{\text{object}}({\theta}_{k}),{w}_{1},\mathrm{\dots},{w}_{k})$ ${s}_{\text{overall}}$ $\u2254{s}_{\text{input}}(\text{output pseudoobject})$ If we unroll these definitions, then the overall sketch has the form: $${s}_{\text{overall}}=\sum _{\theta}{w}_{\theta}\stackrel{\curvearrowright}{\prod _{j=1}^{3h(\theta )}}\left[\left(\frac{I+{R}_{\theta ,j}}{2}\right)\right]{s}_{\text{object}}(\theta )$$ where $h(\theta )$ is the depth of object $\theta $ and $\left(\frac{I+{R}_{\theta ,j}}{2}\right)$ is the ${j}^{th}$ matrix when walking down the sketch tree from the overall sketch to the $\theta $ object sketch (i.e. alternating between input index, module matrix, and a one or two index into module versus inputs). Technical Assumptions. For this section, we need the following technical modification to the sketch. The ${R}_{i}$ matrices for the tuple operators are redrawn at every depth level of the sketch, where the depth starts at one at the overall sketch and increases every time we take a tuple sketch. Additionally, we assume that we can guarantee $\delta \le \frac{1}{4H}$ by appropriately increasing the sketch dimension (in the typical case where $H$ is constant, this just increases the sketch dimension by a constant); this section simply accepts a $\delta $ guarantee derived in Appendix B and does not tinker with sketch parameters.D.1 Proofs for Attribute Recovery
Theorem 9.
Our final sketch has the Attribute Recovery property, which is as follows. Consider an object ${\theta}^{\mathrm{\star}}$ which lies at depth $h\mathit{}\mathrm{(}{\theta}^{\mathrm{\star}}\mathrm{)}$ in the overall sketch and with effective weight ${w}_{{\theta}^{\mathrm{\star}}}$. (i) Suppose no other objects in the overall sketch are also produced by $M({\theta}^{\star})$. We can produce a vector estimate of the attribute vector ${x}_{{\theta}^{\star}}$, which with high probability has an additive error of at most $O({2}^{3h({\theta}^{\star})}h({\theta}^{\star})\delta /{w}_{{\theta}^{\star}})$ in each coordinate. (ii) Suppose we know the sequence of input indices to get to ${\theta}^{\star}$ in the sketch. Then even if other objects in the overall sketch are produced by $M({\theta}^{\star})$, we can still produce a vector estimate of the attribute vector ${x}_{{\theta}^{\star}}$, which with high probability has an additive error of at most $O({2}^{3h({\theta}^{\star})}h({\theta}^{\star})\delta /{w}_{{\theta}^{\star}})$ in each coordinate.Proof.
We begin by proving the easier case (i). Suppose that our overall sketch is $s$. Let $\beta ={2}^{3h({\theta}^{\star})+1}/{w}_{{\theta}^{\star}}$, and imagine that we attempt to recover the ${j}^{th}$ coordinate of ${x}_{{\theta}^{\star}}$ by computing $\beta {e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}s$, which has the form: $$\sum _{\theta}\beta {w}_{\theta}{e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}\stackrel{\curvearrowright}{\prod _{j=1}^{3h(\theta )}}\left[\left(\frac{I+{R}_{\theta ,j}}{2}\right)\right]{s}_{\text{object}}(\theta )$$ When $\theta ={\theta}^{\star}$, we get the term: $${2}^{3h({\theta}^{\star})+1}{e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}\stackrel{\curvearrowright}{\prod _{j=1}^{3h({\theta}^{\star})}}\left[\left(\frac{I+{R}_{{\theta}^{\star},j}}{2}\right)\right]{s}_{\text{object}}({\theta}^{\star})$$ If we expand out each transparent matrix, we would get ${2}^{3h({\theta}^{\star})}$ subterms. Let’s focus on the subterm where we choose $I/2$ every time: $$2{e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}\left[\frac{1}{2}{R}_{M({\theta}^{\star}),1}{x}_{{\theta}^{\star}}+\frac{1}{2}{R}_{M({\theta}^{\star}),2}{e}_{1}\right]$$ By Lemma 2, we know that with high probability (note ${e}_{j}$ and ${x}_{{\theta}^{\star}}$ are both unit vectors): $$\left2{e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}\times \frac{1}{2}{R}_{M({\theta}^{\star}),1}{x}_{{\theta}^{\star}}{e}_{j}^{T}{x}_{{\theta}^{\star}}\right\le 3\delta =O(\delta )$$ This half of the subterm is (approximately) the ${j}^{th}$ coordinate of ${x}_{{\theta}^{\star}}$, up to an additive error of $O(\delta )$. We control the other half using Lemma 7: $$\left2{e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}\times \frac{1}{2}{R}_{M({\theta}^{\star}),2}{e}_{1}\right\le \delta \sqrt{1+\delta}=O(\delta )$$ It remains to bound the additive error from the other subterms and the other terms. The other subterms have the form: $$2{e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}\stackrel{~}{R}{s}_{\text{object}}({\theta}^{\star})$$ where $\stackrel{~}{R}$ is a product of one to $3h({\theta}^{\star})$ $R$ matrices. By Lemma 1, we know that with high probability ${s}_{\text{object}}({\theta}^{\star})$ and ${R}_{M({\theta}^{\star}),1}{e}_{j}$ have squared ${\mathrm{\ell}}_{2}$ norm at most $(1+\delta )$. We analyze $\stackrel{~}{R}$ by repeatedly applying Lemma 5: we know that the distribution of one matrix satisfies $\delta $Desynchronization and $\delta $Isometry. Let $\gamma =\sqrt{1+\delta}$. Then the product distribution of two matrices satisfies $\left[(1+\gamma )\delta \right]$Desynchronization, the product distribution of three matrices satisfies $\left[(1+\gamma +{\gamma}^{2})\delta \right]$Desynchronization, and the product distribution of $3h({\theta}^{\star})$ matrices satisfies $\left[\frac{{\gamma}^{3h({\theta}^{\star})}1}{\gamma 1}\delta \right]$Desynchronization. The last desynchronization is the largest, therefore no matter how many matrices are being multiplied, with high probability the absolute value of any other subterm is bounded by: $2{\displaystyle \frac{{\gamma}^{3h({\theta}^{\star})}1}{\gamma 1}}\delta (1+\delta )$ $\le 2{\displaystyle \frac{{\gamma}^{6h({\theta}^{\star})}1}{{\gamma}^{2}1}}\delta (1+\delta )$ $\le 2{\displaystyle \frac{{(1+\delta )}^{3h({\theta}^{\star})}1}{\delta}}\delta (1+\delta )$ $\le 2(1+O(h({\theta}^{\star})\delta )1)(1+\delta )$ $\le O(h({\theta}^{\star})\delta )$ Note that we used the fact that ${(1+x)}^{n}\le 1+O(xn)$ if $xn\le \frac{1}{2}$. Hence, the absolute value of the sum of the other subterms is bounded by $O({2}^{3h({\theta}^{\star})}h({\theta}^{\star})\delta )$. We now bound the additive error from the other terms, which have the form: $$\beta {w}_{\theta}{e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}\stackrel{\curvearrowright}{\prod _{j=1}^{3h(\theta )}}\left[\left(\frac{I+{R}_{\theta ,j}}{2}\right)\right]{s}_{\text{object}}(\theta )$$ The plan is to use the desynchronization of ${R}_{M({\theta}^{\star}),1}$. To do so, we will need to bound the ${\mathrm{\ell}}_{2}$ length of ${s}_{\text{object}}(\theta )$ as it gets multiplied by this sequence of matrices. By Lemma 4 that the distribution of these matrices is $\frac{1}{2}$scaled $(\frac{3}{4}\delta )$Isometric. By repeatedly applying Lemma 1 we know that the squared ${\mathrm{\ell}}_{2}$ length is upper bounded by ${(\frac{1}{2}+\frac{3}{4}\delta )}^{3h(\theta )}(1+\delta )$. By our assumptions, this is upperbounded by a constant: ${({\displaystyle \frac{1}{2}}+{\displaystyle \frac{3}{4}}\delta )}^{3h(\theta )}(1+\delta )$ $\le {(1+\delta )}^{3h(\theta )+1}$ $\le {(1+1/(2H))}^{2H}$ $\le e$ Now, by the desynchronization of ${R}_{M({\theta}^{\star}),1}$, we know that with high probability the absolute value of the term indexed by $\theta $ is at most $O(\beta {w}_{\theta}\delta )$. Hence with high probability the absolute value of all other terms is at most $O(\beta \delta )$. Plugging in our choice of $\beta $, it is at most $O({2}^{3h({\theta}^{\star})+1}\delta /{w}_{{\theta}^{\star}})$. Hence we have shown that with high probability, we can recover the ${j}^{th}$ coordinate of ${x}_{{\theta}^{\star}}$ with the desired additive error. This completes the proof of case (i). We now prove the harder case (ii). Again, we let $\beta ={2}^{3h({\theta}^{\star})+1}/{w}_{{\theta}^{\star}}$, and imagine that we attempt to recover the ${j}^{th}$ coordinate of ${x}_{{\theta}^{\star}}$ by computing $\beta {e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}\stackrel{\curvearrowleft}{{\prod}_{j=1}^{3h({\theta}^{\star})}}\left[{R}_{{\theta}^{\star},j}^{T}\right]s$, which has the form: $\sum _{\theta}}\beta {w}_{\theta}{e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}\stackrel{\curvearrowleft}{{\displaystyle \prod _{j=1}^{3h({\theta}^{\star})}}}\left[{R}_{{\theta}^{\star},j}^{T}\right]$ $\mathrm{\hspace{1em}\hspace{1em}}\stackrel{\curvearrowright}{{\displaystyle \prod _{j=1}^{3h(\theta )}}}\left[\left({\displaystyle \frac{I+{R}_{\theta ,j}}{2}}\right)\right]{s}_{\text{object}}(\theta )$ It is worth noting that we did not compute the alternative product $\beta {e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}\stackrel{\curvearrowleft}{{\prod}_{j=1}^{3h({\theta}^{\star})}}\left[\left(\frac{I+{R}_{{\theta}^{\star},j}^{T}}{2}\right)\right]s$ because the transparent matrices in the alternative would allow through information about other objects produced by the same module. When $\theta ={\theta}^{\star}$, we get the term: ${2}^{3h({\theta}^{\star})+1}{e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}\stackrel{\curvearrowleft}{{\displaystyle \prod _{j=1}^{3h({\theta}^{\star})}}}\left[{R}_{{\theta}^{\star},j}^{T}\right]$ $\mathrm{\hspace{1em}\hspace{1em}}\stackrel{\curvearrowright}{{\displaystyle \prod _{j=1}^{3h({\theta}^{\star})}}}\left[\left({\displaystyle \frac{I+{R}_{{\theta}^{\star},j}}{2}}\right)\right]{s}_{\text{object}}({\theta}^{\star})$ If we expand out each transparent matrix, we would get ${2}^{3h({\theta}^{\star})}$ subterms. Let’s focus on the subterm where we choose $R/2$ every time: $2{e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}{\stackrel{~}{R}}^{T}\stackrel{~}{R}\left[{\displaystyle \frac{1}{2}}{R}_{M({\theta}^{\star}),1}{x}_{{\theta}^{\star}}+{\displaystyle \frac{1}{2}}{R}_{M({\theta}^{\star}),2}{e}_{1}\right]$ where $\stackrel{~}{R}=\stackrel{\curvearrowright}{{\prod}_{j=1}^{3h({\theta}^{\star})}}[{R}_{{\theta}^{\star},j}]$. As before, we expect to get some signal from the first half $\frac{1}{2}{R}_{M({\theta}^{\star}),1}{x}_{{\theta}^{\star}}$ and some noise from the second half $\frac{1}{2}{\mathbb{R}}_{M({\theta}^{\star}),2}{e}_{1}$. Let us begin with the first half. The matrices involved match exactly, so it is a matter of repeatedly applying Lemma 6. Note that all matrices involved are $\delta $Isometric. Since at each step we multiply the isometry parameter by $(1+\delta )$ and then add $\delta $, the isometry parameter after combining $3h({\theta}^{\star})$ times is: $\sum _{i=0}^{3h({\theta}^{\star})}}\delta {(1+\delta )}^{i$ $=\delta {\displaystyle \frac{{(1+\delta )}^{3h({\theta}^{\star})+1}1}{1+\delta 1}}$ $={(1+\delta )}^{3h({\theta}^{\star})+1}1$ $\le 1+O(h({\theta}^{\star})\delta )1$ $\le O(h({\theta}^{\star})\delta )$ Note that we used the fact that ${(1+x)}^{n}\le 1+O(xn)$ if $xn\le \frac{1}{2}$ along with our assumption that $$. Hence the product of all $(3h({\theta}^{\star})+1)$ matrices is $O(h({\theta}^{\star})\delta )$Isometric. By Lemma 2 this implies that with high probability: $\left\u27e8\stackrel{~}{R}{R}_{M({\theta}^{\star}),1}{e}_{j},\stackrel{~}{R}{R}_{M({\theta}^{\star}),1}{x}_{{\theta}^{\star}}\u27e9\u27e8{e}_{j},{x}_{{\theta}^{\star}}\u27e9\right\le O(h({\theta}^{\star})\delta )$ We now control the noise half. By Lemma 7, with high probability: $$\left\u27e8{R}_{M({\theta}^{\star}),1}{e}_{j},{R}_{M({\theta}^{\star}),2}{e}_{1}\u27e9\right\le \delta \sqrt{1+\delta}=O(\delta )$$ We have already computed that the remaining $3h({\theta}^{\star})$ matrices is $O(H\delta )$Isometric and hence approximately preserves this. Lemma 2 says that with high probability: $\u27e8\stackrel{~}{R}{R}_{M({\theta}^{\star}),1}{e}_{j},\stackrel{~}{R}{R}_{M({\theta}^{\star}),2}{e}_{1}\u27e9$ $\mathrm{\hspace{1em}\hspace{1em}}\u27e8{R}_{M({\theta}^{\star}),1}{e}_{j},{R}_{M({\theta}^{\star}),2}{e}_{1}\u27e9\le O(H\delta )$ By the triangle inequality, we now know that with high probability, the subterm where we choose $R/2$ every time is at most $O(H\delta )$ additive error away from the ${j}^{th}$ coordinate of ${x}_{{\theta}^{\star}}$. We now consider the other subterms. These other subterms are of the form: $$2{e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}{\stackrel{~}{R}}^{T}{\stackrel{~}{R}}^{\prime}\left[\frac{1}{2}{R}_{M({\theta}^{\star}),1}{x}_{{\theta}^{\star}}+\frac{1}{2}{R}_{M({\theta}^{\star}),2}{e}_{1}\right]$$ where $\stackrel{~}{R}$ is still $\stackrel{\curvearrowright}{{\prod}_{j=1}^{3h({\theta}^{\star})}}[{R}_{{\theta}^{\star},j}]$ but ${\stackrel{~}{R}}^{\prime}$ is now the product of a subsequence of those terms; we write it as $\stackrel{\curvearrowright}{{\prod}_{j=1}^{3h({\theta}^{\star})}}[{S}_{{\theta}^{\star},j}]$ where ${S}_{{\theta}^{\star},j}$ is either ${R}_{{\theta}^{\star},j}$ or the identity matrix. Next, we define ${x}_{J}\u2254\stackrel{\curvearrowright}{{\prod}_{j=J}^{3h({\theta}^{\star})}}[{R}_{{\theta}^{\star},j}]{R}_{M({\theta}^{\star}),1}{e}_{j}$ and ${x}_{J}^{\prime}\u2254\stackrel{\curvearrowright}{{\prod}_{j=J}^{3h({\theta}^{\star})}}[{S}_{{\theta}^{\star},j}]{s}_{\text{object}}(\theta )$. Additionally, when $J=3h({\theta}^{\star})+1$, ${x}_{J}={R}_{M({\theta}^{\star}),1}{e}_{j}$ and ${x}_{J}^{\prime}={s}_{\text{object}}(\theta )$. Our plan is to induct backwards from $J=3h({\theta}^{\star})+1$, controlling the dot product of the two. By Lemma 1, we know that all ${x}_{J}$ and ${x}_{J}^{\prime}$ have squared ${\mathrm{\ell}}_{2}$ length at most ${(1+\delta )}^{3h({\theta}^{\star})+1}$. Under our assumptions, this is less than $e$ and is hence a constant. Lemma 7 and Lemma 2 already imply that for our base case of $J=3h({\theta}^{\star})+1$, the dot product of the two is at most $O(\delta )$ away from the ${j}^{th}$ coordinate of ${x}_{{\theta}^{\star}}$. How does the dot product change from $J$ to $J1$? We either multiply ${x}_{J}$ by a random matrix, or both ${x}_{J}$ and ${x}_{J}^{\prime}$ by the same random matrix. In the first case, the $\delta $Desynchronization of our matrices tells us that the new dotproduct will be zero times the old dotproduct plus $O(\delta )$ error. In the second case, the $\delta $Isometry of our matrices tells us that the new dotproduct will be the old dotproduct plus $O(\delta )$ error. Since the first case must occur at least once, we know that the final dot product when $J=1$ consists of no signal and $O(h({\theta}^{\star})\delta )$ noise. Hence, summing up our ${2}^{3h({\theta}^{\star})}1$ other subterms, we arrive at $O({2}^{3h({\theta}^{\star})}h({\theta}^{\star})\delta )$ noise. It remains to consider the other terms. Unfortunately, we will need to analyze our other terms by breaking them into subterms as well, since otherwise we would need to consider the situation where $R$ is applied to one vector and $\left(\frac{I+R}{2}\right)$ to the other. The other terms have the form: $\beta {w}_{\theta}{e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}\stackrel{\curvearrowleft}{{\displaystyle \prod _{j=1}^{3h({\theta}^{\star})}}}\left[{R}_{{\theta}^{\star},j}^{T}\right]$ $\mathrm{\hspace{1em}\hspace{1em}}\stackrel{\curvearrowright}{{\displaystyle \prod _{j=1}^{3h(\theta )}}}\left[\left({\displaystyle \frac{I+{R}_{\theta ,j}}{2}}\right)\right]{s}_{\text{object}}(\theta )$ We break our terms into subterms, but only for the first $\mathrm{3}\mathit{}h\mathit{}\mathrm{(}{\theta}^{\mathrm{\star}}\mathrm{)}$ choices. Our analysis focuses on handling the case where $\theta $ is at a greater depth than ${\theta}^{\star}$; we handle the other case by inserting identity matrices and splitting them into $I/2+I/2$. We wind up with ${2}^{3h({\theta}^{\star})}$ subterms. Such a subterm has the form: $$(2{w}_{\theta}/{w}_{{\theta}^{\star}}){e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}{\stackrel{~}{R}}^{T}{\stackrel{~}{R}}^{\prime}{\stackrel{~}{R}}^{\prime \prime}{s}_{\text{object}}(\theta )$$ where $\stackrel{~}{R}$ $\u2254\stackrel{\curvearrowright}{{\displaystyle \prod _{j=1}^{3h({\theta}^{\star})}}}[{R}_{{\theta}^{\star},j}]$ ${\stackrel{~}{R}}^{\prime}$ $\u2254\stackrel{\curvearrowright}{{\displaystyle \prod _{j=1}^{3h({\theta}^{\star})}}}[{S}_{\theta ,j}]$ ${\stackrel{~}{R}}^{\prime \prime}$ $\u2254\stackrel{\curvearrowright}{{\displaystyle \prod _{j=3h({\theta}^{\star})+1}^{3h(\theta )}}}\left[{\displaystyle \frac{I+{R}_{\theta ,j}}{2}}\right]$ and ${S}_{\theta ,j}$ is either ${R}_{\theta ,j}$ or the identity matrix. We define: ${x}_{J}$ $\u2254\stackrel{\curvearrowright}{{\displaystyle \prod _{j=J}^{3h({\theta}^{\star})}}}[{R}_{{\theta}^{\star},j}]{R}_{M({\theta}^{\star}),1}{e}_{j}$ ${x}_{J}^{\prime}$ $\u2254\stackrel{\curvearrowright}{{\displaystyle \prod _{j=J}^{3h({\theta}^{\star})}}}[{S}_{\theta ,j}]{\stackrel{~}{R}}^{\prime \prime}{s}_{\text{object}}(\theta )$ When $J=3h({\theta}^{\star})+1$, we define ${x}_{J}\u2254{R}_{M({\theta}^{\star}),1}{e}_{j}$ and ${x}_{J}^{\prime}={\stackrel{~}{R}}^{\prime \prime}{s}_{\text{object}}(\theta )$. As usual, Lemma 1 guarantees that all ${x}_{J}$ and ${x}_{J}^{\prime}$ have constant ${\mathrm{\ell}}_{2}$ length. For our base induction case $J=\left[\mathrm{max}(3h({\theta}^{\star}),3h(\theta ))+1\right]$, we would like to know the dot product of ${R}_{M({\theta}^{\star}),1}{e}_{j}$ with ${\stackrel{~}{R}}^{\prime \prime}{s}_{\text{object}}(\theta )$. We first analyze the dot product of ${R}_{M({\theta}^{\star}),1}$ with ${s}_{\text{object}}(\theta )$. Keep in mind that it may be the case that $M(\theta )=M({\theta}^{\star})$. In the same module case, we know by Lemma 2 and Lemma 7 that we get a signal of the ${j}^{th}$ component of ${x}_{\theta}$ along with $O(\delta )$ noise. In the different module case, Lemma 7 is enough to tell us that there is no signal and $O(\delta )$ noise. As a reminder, the goal is to show that this signal does not get through, since we only want the ${j}^{th}$ component of ${x}_{{\theta}^{\star}}$, not any other ${x}_{\theta}$. We know by Lemma 3 that our transparent matrices are $\frac{1}{2}$leaky $O(\delta )$Desynchronizing. Hence after applying all the transparent matrices, we know that at most we have our original signal (the ${j}^{th}$ component of ${x}_{\theta}$ or nothing) and $O(\delta )$ noise (the noise telescopes due to the $\frac{1}{2}$ leaking/scaling). This is the base case of our backwards induction, $J=3h({\theta}^{\star})+1$. We now consider what happens to the dot product of ${x}_{J}$ and ${x}_{J}^{\prime}$ as we move from $J$ to $J1$. We may either (a) multiply both by different random matrices, (b) multiply both by the same random matrix, or (c) multiply only one by a random matrix. Furthermore, we know that the sequences ${R}_{{\theta}^{\star},\cdot}$ and ${R}_{\theta ,\cdot}$ differ at some point in the first $3h({\theta}^{\star})$ matrices, so we must end up in case (a) or case (c) at least once. For case (a), we know by Lemma 7 that the current dotproduct will be multiplied by zero and $O(\delta )$ noise will be added. For case (b), we know by Lemma 2 that the current dotproduct will be multiplied by one and $O(\delta )$ noise will be added. For case (c), we know by $d$Desynchronization that the current dotproduct will be multiplied by zero and $O(\delta )$ noise will be added. Since case (a) or (c) must occur at least once, we know that we get no signal and at most $O(h({\theta}^{\star})\delta )$ noise. Since each of our ${2}^{3h({\theta}^{\star})}$ subterms has a multiplier of $(2{w}_{\theta}/{w}_{{\theta}^{\star}})$ on top of the dot product of ${x}_{1}$ and ${x}_{1}^{\prime}$, each term has noise bounded in magnitude (still with high probability) by $O({w}_{\theta}{2}^{3h({\theta}^{\star})}h({\theta}^{\star})\delta /{w}_{{\theta}^{\star}})$. Summing over all objects $\theta $, we get our final desired claim: with high probability, the total additive noise is bounded by $O({2}^{3h({\theta}^{\star})}h({\theta}^{\star})\delta /{w}_{{\theta}^{\star}})$. This completes the proof. ∎D.2 Proofs for SketchtoSketch Similarity
Theorem 10.
Our final sketch has the SketchtoSketch Similarity property, which is as follows. Suppose we have two overall sketches $s$ and $\overline{s}$. (i) If the two sketches share no modules, then with high probability they have at most $O(\delta )$ dotproduct. (ii) If $s$ has an object ${\theta}^{\star}$ of weight ${w}_{{\theta}^{\star}}$ and $\overline{s}$ has an object ${\overline{\theta}}^{\star}$ of weight ${\overline{w}}_{{\theta}^{\star}}$ and both objects are produced by the same module, then with high probability they have at least $\mathrm{\Omega}({2}^{6h({\theta}^{\star})}{w}_{{\theta}^{\star}}{\overline{w}}_{{\overline{\theta}}^{\star}})O(\delta )$ dotproduct. (iii) If the two sketches are identical except that the attributes of any object $\theta $ differ by at most $\u03f5$ in ${\mathrm{\ell}}_{2}$ distance, then with probability one the two sketches are at most $\u03f5/2$ from each other in ${\mathrm{\ell}}_{2}$ distance.Proof.
We prove cases (i) and (ii) together. Suppose that our overall sketches are $s,\overline{s}$. Then the dotproduct of our sketches has the form: $\sum _{\theta}}{\displaystyle \sum _{\overline{\theta}}}{w}_{\theta}{\overline{w}}_{\overline{\theta}$ ${s}_{\text{object}}{(\theta )}^{T}\stackrel{\curvearrowleft}{{\displaystyle \prod _{j=1}^{3h(\theta )}}}\left[\left({\displaystyle \frac{I+{R}_{\theta ,j}^{T}}{2}}\right)\right]$ $\stackrel{\curvearrowright}{{\displaystyle \prod _{j=1}^{3h(\overline{\theta})}}}\left[\left({\displaystyle \frac{I+{R}_{\overline{\theta},j}}{2}}\right)\right]{s}_{\text{object}}(\overline{\theta})$ Our proof plan is to analyze a single term of this doublesummation at a time. Fix $\theta $ and $\overline{\theta}$. We analyze this term with our toolkit of technical lemmas. The plan is to first understand $\u27e8{s}_{\text{object}}(\theta ),{s}_{\text{object}}(\overline{\theta})\u27e9$. There are two possibilities for this dotproduct, depending on our two objects $\theta $ and $\overline{\theta}$. If these objects come from different modules, then we will want a small dotproduct. If these objects come from the same module, then we will want a significant dotproduct. DotProduct Under Different Modules. Since the matrices are different, we know by Lemma 7 that with high probability, all of the following are true: $\left\u27e8{R}_{M(\theta ),1}{x}_{\theta},{R}_{M(\overline{\theta}),1}{x}_{\overline{\theta}}\u27e9\right\le \delta \sqrt{1+\delta}$ $\le O(\delta )$ $\left\u27e8{R}_{M(\theta ),1}{x}_{\theta},{R}_{M(\overline{\theta}),2}{e}_{1}\u27e9\right\le \delta \sqrt{1+\delta}$ $\le O(\delta )$ $\left\u27e8{R}_{M(\theta ),2}{e}_{1},{R}_{M(\overline{\theta}),1}{x}_{\overline{\theta}}\u27e9\right\le \delta \sqrt{1+\delta}$ $\le O(\delta )$ $\left\u27e8{R}_{M(\theta ),2}{e}_{1},{R}_{M(\overline{\theta}),2}{e}_{1}\u27e9\right\le \delta \sqrt{1+\delta}$ $\le O(\delta )$ Combining these statements and using the triangle inequality, we know that with high probability, $$\left\u27e8{s}_{\text{object}}(\theta ),{s}_{\text{object}}(\overline{\theta})\u27e9\right\le O(\delta )$$ DotProduct Under Same Module. Since the matrices are the same, we know by Lemma 2 that with high probability, both of the following are true: $\left\u27e8{R}_{M(\theta ),1}{x}_{\theta},{R}_{M(\overline{\theta}),1}{x}_{\overline{\theta}}\u27e9\u27e8{x}_{\theta},{x}_{\overline{\theta}}\u27e9\right\le 3\delta $ $\le O(\delta )$ $\left\u27e8{R}_{M(\theta ),2}{e}_{1},{R}_{M(\overline{\theta}),2}{e}_{1}\u27e9\u27e8{e}_{1},{e}_{1}\u27e9\right\le 3\delta $ $\le O(\delta )$ Also, we still know by Lemma 7 that with high probability, both of the following are true: $\left\u27e8{R}_{M(\theta ),1}{x}_{\theta},{R}_{M(\overline{\theta}),2}{e}_{1}\u27e9\right\le \delta \sqrt{1+\delta}$ $\le O(\delta )$ $\left\u27e8{R}_{M(\theta ),2}{e}_{1},{R}_{M(\overline{\theta}),1}{x}_{\overline{\theta}}\u27e9\right\le \delta \sqrt{1+\delta}$ $\le O(\delta )$ Combining these statements with the triangle inequality, we know that with high probability, $$\left\u27e8{s}_{\text{object}}(\theta ),{s}_{\text{object}}(\overline{\theta})\u27e9\frac{1}{4}\u27e8{x}_{\theta},{x}_{\overline{\theta}}\u27e9\frac{1}{4}\u27e8{e}_{1},{e}_{1}\u27e9\right\le O(\delta )$$ Since attribute vectors are nonnegative, we know that: $$\u27e8{s}_{\text{object}}(\theta ),{s}_{\text{object}}(\overline{\theta})\u27e9\ge \frac{1}{4}O(\delta )$$ We have successfully established our desired facts about the base dotproduct and are now ready to consider what happens when we begin applying matrices of the form $\left(\frac{I+R}{2}\right)$ to both vectors in this dot product. We begin by controlling the ${\mathrm{\ell}}_{2}$ norm of the various vectors produced by gradually applying these matrices. By Lemma 1, we know that every vector of the form $${x}_{J}=\stackrel{\curvearrowright}{\prod _{j=J}^{3h(\theta )}}\left[\left(\frac{I+{R}_{\theta ,j}}{2}\right)\right]{s}_{\text{object}}(\theta )$$ where $J\in \{1,2,\mathrm{\dots},3h(\theta )\}$ or of the form $${\overline{x}}_{J}=\stackrel{\curvearrowright}{\prod _{j=J}^{3h(\overline{\theta})}}\left[\left(\frac{I+{R}_{\overline{\theta},j}}{2}\right)\right]{s}_{\text{object}}(\overline{\theta})$$ where $J\in \{1,2,\mathrm{\dots},3h(\overline{\theta})\}$ has squared ${\mathrm{\ell}}_{2}$ norm at most ${(\frac{1}{2}+\frac{3}{4}\delta )}^{\mathrm{max}(3h(\theta ),3h(\overline{\theta}))}(1+\delta )$. By our assumptions, this is upperbounded by a constant: ${({\displaystyle \frac{1}{2}}+{\displaystyle \frac{3}{4}}\delta )}^{\mathrm{max}(3h(\theta ),3h(\overline{\theta}))}(1+\delta )$ $\le {(1+\delta )}^{\mathrm{max}(3h(\theta ),3h(\overline{\theta}))+1}$ $\le {(1+1/(2H))}^{2H}$ $\le e$ We have shown that our unit vectors maintain constant ${\mathrm{\ell}}_{2}$ norm throughout this process. This allows us to safely apply our desynchronization and isometry properties to any ${x}_{J}$ or ${\overline{x}}_{J}$. From here, we plan to induct in order of decreasing depth. For homogeneity of notation, we will define ${x}_{J}$ to be ${s}_{\text{object}}(\theta )$ when $J>3h(\theta )$ and ${\overline{x}}_{J}$ to be ${s}_{\text{object}}(\overline{\theta})$ when $J>3h(\overline{\theta})$. We induct backwards, beginning with $J=\mathrm{max}(3h(\theta ),3h(\overline{\theta}))+1$ and ending with $J=1$. In our base case, we know that these are two possibilities for $\u27e8{x}_{J},{\overline{x}}_{J}\u27e9$. If we are in the different module case, then the magnitude of this quantity is $O(\delta )$. If we are in the same module case, then this quantity is at least $\frac{1}{4}O(\delta )$. To get from $J$ to $J1$, there are three possibilities. We either (a) multiply both ${x}_{J}$ and ${\overline{x}}_{J}$ by the same matrix, (b) multiply ${x}_{J}$ and ${\overline{x}}_{J}$ by different matrices, or (c) multiply only one of them by a matrix. Note that case (c) happens when one object is at a different depth than the other object, so our analysis first peels off the additional matrices to make them the same depth, then proceeds. This is an important point, because tuple matrices are only shared across the same depth. In any one of these lettered cases, we know from Lemma 3 and Lemma 4 that the distribution of the matrix(ces) is $\frac{1}{2}$leaky $O(\delta )$Desynchronizing and $\frac{1}{2}$scaled $O(\delta )$Isometric. In case (a), we apply Lemma 2 to show that with high probability, the new dot product is $\frac{1}{2}$ times the old dot product plus $O(\delta )$ additive noise. In case (b), we apply Lemma 7 to show that with high probability, the new dot product is $\frac{1}{4}$ times the old dot product plus $O(\delta )$ additive noise. In case (c), we use the fact that our distribution is $\frac{1}{2}$leaky $O(\delta )$Desynchronizing to show that the new dot product is $\frac{1}{2}$ times the old dot product, plus $O(\delta )$ additive noise. Hence, if the objects were produced by different modules, then we begin with $O(\delta )$ noise and at every step we reduce by at least a factor of $\frac{1}{2}$ before adding $O(\delta )$ noise. The total noise telescopes into $O(\delta )$ total. If the objects were produced by the same module, then we begin with at least $\frac{1}{4}$ signal and $O(\delta )$ noise. The noise sums as before into $O(\delta )$ total, and the signal falls by at most $\frac{1}{4}$ per step, resulting in a final signal of $\mathrm{\Omega}({2}^{6h(\theta )})$. However, our original term under consideration was actually a scaled version of what we just analyzed: ${w}_{\theta}{\overline{w}}_{\overline{\theta}}{x}_{1}^{T}{\overline{x}}_{1}$. Hence in the different module case, there is $O({w}_{\theta}{\overline{w}}_{\overline{\theta}}\delta )$ noise, and in the same module case there is $\mathrm{\Omega}({2}^{6h(\theta )}{w}_{\theta}{\overline{w}}_{\overline{\theta}})$ signal and $O({w}_{\theta}{\overline{w}}_{\overline{\theta}}\delta )$ noise. Hence for case (i) which we were originally trying to prove, since the effective weights ${w}_{\theta}$ sum to one and the effective weights ${\overline{w}}_{\overline{\theta}}$ sum to one, we wind up, with high probability, at most $O(\delta )$ total noise, as desired. For case (ii) which we were originally trying to prove, we get at least $\mathrm{\Omega}({2}^{6h({\theta}^{\star})}{w}_{{\theta}^{\star}}{\overline{w}}_{\overline{{\theta}^{\star}}})O(\delta )$ dot product. We conclude by proving case (iii). Suppose that in sketch $\overline{s}$, the attributes of $\theta $ are ${x}_{\theta}+{\u03f5}_{\theta}$, where ${\left\left{\u03f5}_{\theta}\right\right}_{2}\le \u03f5$ for all objects $\theta $. Let’s write the difference between the two sketches. $\overline{s}s$ $={\displaystyle \sum _{\theta}}{w}_{\theta}\stackrel{\curvearrowright}{{\displaystyle \prod _{j=1}^{3h(\theta )}}}\left[\left({\displaystyle \frac{I+{R}_{\theta ,j}}{2}}\right)\right]{\displaystyle \frac{1}{2}}{R}_{M(\theta ),1}{\u03f5}_{\theta}$ Applying Lemma 4, we know that $\left(\frac{I+R}{2}\right)$ is $\frac{1}{2}$scaled $\frac{3}{4}\delta $Isometric. Repeatedly applying Lemma 2, we can conclude that with high probability: ${\left\left\stackrel{\curvearrowright}{{\displaystyle \prod _{j=1}^{3h(\theta )}}}\left[\left({\displaystyle \frac{I+{R}_{\theta ,j}}{2}}\right)\right]{R}_{M(\theta ),1}{\displaystyle \frac{1}{4}}{\u03f5}_{\theta}\right\right}_{2}^{2}$ $\le {({\displaystyle \frac{1}{2}}+{\displaystyle \frac{3}{4}}\delta )}^{3h(\theta )}(1+\delta ){\displaystyle \frac{1}{4}}{\u03f5}^{2}$ $\le {({\displaystyle \frac{1}{2}}+{\displaystyle \frac{3}{8}})}^{3h(\theta )+1}{\displaystyle \frac{1}{4}}{\u03f5}^{2}$ $\le {\displaystyle \frac{1}{4}}{\u03f5}^{2}$ Taking the square root of the above and applying the triangle inequality yields the following. ${\left\left\overline{s}s\right\right}_{2}$ $\le {\displaystyle \sum _{\theta}}{w}_{\theta}{\displaystyle \frac{1}{2}}\u03f5$ $\le {\displaystyle \frac{1}{2}}\u03f5$ This completes the proof of case (iii). ∎D.3 Proofs for Summary Statistics
Theorem 11.
Our final sketch has the Summary Statistics property, which is as follows. Consider a module ${M}^{\mathrm{\star}}$ which lies at depth $h\mathit{}\mathrm{(}{M}^{\mathrm{\star}}\mathrm{)}$, and suppose all the objects produced by ${M}^{\mathrm{\star}}$ has the same effective weight ${w}^{\mathrm{\star}}$. (i) Frequency Recovery. We can produce an estimate of the number of objects produced by ${M}^{\star}$. With high probability, our estimate has an additive error of at most $O({2}^{3h({M}^{\star})}\delta /{w}^{\star})$. (ii) Summed Attribute Recovery. We can produce a vector estimate of the summed attribute vectors of the objects produced by ${M}^{\star}$. With high probability, our estimate has additive error of at most $O({2}^{3h({M}^{\star})}\delta /{w}^{\star})$ in each coordinate.Proof.
Let the overall sketch be $s$. We begin by proving case (i). As we did in case (ii) of Attribute Recovery, let $\beta ={2}^{3h({M}^{\star})+1}/{w}^{\star}$. We compute $\beta {e}_{1}^{T}{R}_{{M}^{\star},1}^{T}s$, which has the form: $$\sum _{\theta}\beta {w}_{\theta}{e}_{1}^{T}{R}_{{M}^{\star},2}^{T}\stackrel{\curvearrowright}{\prod _{j=1}^{3h(\theta )}}\left[\left(\frac{I+{R}_{\theta ,j}}{2}\right)\right]{s}_{\text{object}}(\theta )$$ There are two types of terms in this sum: those objects produced by ${M}^{\star}$, and those that are not. Consider some object ${\theta}^{\star}$ produced by ${M}^{\star}$. When $\theta ={\theta}^{\star}$, we get the term: $${2}^{3h({\theta}^{\star})+1}{e}_{1}^{T}{R}_{{M}^{\star},2}^{T}\stackrel{\curvearrowright}{\prod _{j=1}^{3h({M}^{\star})}}\left[\left(\frac{I+{R}_{{\theta}^{\star},j}}{2}\right)\right]{s}_{\text{object}}({\theta}^{\star})$$ If we expand out each transparent matrix, we would get ${2}^{3h({M}^{\star})}$ subterms. Again, focus on the subterm where we choose $I/2$ every time: $$2{e}_{1}^{T}{R}_{{M}^{\star},2}^{T}\left[\frac{1}{2}{R}_{{M}^{\star},1}{x}_{{\theta}^{\star}}+\frac{1}{2}{R}_{{M}^{\star},2}{e}_{1}\right]$$ By the $\delta $Isometry property of our distribution, we know that with high probability (note ${e}_{1}$ is a unit vector): $$\left2{e}_{1}^{T}{R}_{{M}^{\star},2}^{T}\times \frac{1}{2}{R}_{{M}^{\star},2}{e}_{1}{e}_{1}^{T}{e}_{1}\right\le \delta $$ This half of the subterm (up to additive error $O(\delta )$) contributes one to our estimate, representing one copy of an object produced by ${M}^{\star}$. We control the other half of the subterm using Lemma 7: $$\left2{e}_{1}^{T}{R}_{{M}^{\star},2}^{T}\times \frac{1}{2}{R}_{{M}^{\star},1}{x}_{{\theta}^{\star}}\right\le \delta \sqrt{1+\delta}=O(\delta )$$ Next, we bound the additive error from the other subterms of such a term ${\theta}^{\star}$. These other subterms have the form: $$2{e}_{1}^{T}{R}_{{M}^{\star},2}^{T}\stackrel{~}{R}{s}_{\text{object}}({\theta}^{\star})$$ where $\stackrel{~}{R}$ is a product of one to $3h({M}^{\star})$ $R$ matrices. As before, we apply Lemma 1 so that with high probability ${s}_{\text{object}}({\theta}^{\star})$ and ${R}_{{M}^{\star},2}{e}_{1}$ have squared ${\mathrm{\ell}}_{2}$ norm at most $(1+\delta )$, and then conclude using Lemma 5 that the distribution of $\stackrel{~}{R}$ satisfies $O(\delta )$Desynchronization (this was the worst case). Hence with high probability the absolute value of any other subterm was bounded by $O(\delta )$, and the absolute value of the sum of the other subterms for ${\theta}^{\star}$ is bounded by $O({2}^{3h({M}^{\star})}\delta )$. There are at most $1/{w}^{\star}$ such objects, so the total error from these terms is $O({2}^{3h({M}^{\star})}\delta /{w}^{\star})$, as desired. It remains to show that the total error from the other terms is also $O({2}^{3h({M}^{\star})}\delta /{w}^{\star})$. These other terms have the form: $$\beta {w}_{\theta}{e}_{1}^{T}{R}_{{M}^{\star},2}^{T}\stackrel{\curvearrowright}{\prod _{j=1}^{3h(\theta )}}\left[\left(\frac{I+{R}_{\theta ,j}}{2}\right)\right]{s}_{\text{object}}(\theta )$$ We prove this using the $\delta $Desynchronization of ${R}_{{M}^{\star},2}$. To do so, we will need to bound the ${\mathrm{\ell}}_{2}$ length of ${s}_{\text{object}}(\theta )$ as it gets multiplied by this sequence of matrices. As before, we use Lemma 4 to know that the distribution of these matrices is $\frac{1}{2}$scaled $(\frac{3}{4}\delta )$Isometric, and then repeatedly apply Lemma 6 to conclude that the squared ${\mathrm{\ell}}_{2}$ length is upper bounded by a constant. Hence the $\delta $Desynchronization of ${R}_{{M}^{\star},2}$ implies that with high probability the absolute value of the term indexed by $\theta $ is at most $O(\beta {w}_{\theta}\delta )$. Hence the absolute value of all other terms is at most $O(\beta \delta )$. Plugging in our choice of $\beta $, it is at most $O({2}^{3h({M}^{\star})}\delta /{w}^{\star})$, as desired. We have finished proving case (i); with high probability, we can recover the number of objects produced by module ${M}^{\star}$, with the desired additive error. We now prove case (ii). We do exactly what we did in case (i), except that instead of computing $\beta {e}_{1}^{T}{R}_{{M}^{\star},2}^{T}s$ we compute $\beta {e}_{j}^{T}{R}_{{M}^{\star},1}^{T}$. Instead of operating on the first coordinate of the ${e}_{1}$ in our object sketch, we now operate on the ${j}^{th}$ coordinate of the ${x}_{\theta}$ in our object sketch, and hence we can recover the ${j}^{th}$ coordinate of the sum of the attribute vectors of the appropriate objects to the same error as in case (i). ∎D.4 Generalizing for Object Signature Recovery
In this subsection, we explain how to also satisfy the Object Signature Recovery property through a small modification to our sketch. The key idea is that each object is given a “magic number” based on its attributes: ${m}_{\theta}$ is a random $(\mathrm{log}N)$sparse unit vector whose entries are in $\{0,\frac{1}{\sqrt{\mathrm{log}N}}\}$. We revise the sketch of object $\theta $ to be $\frac{1}{3}{R}_{M(\theta ),1}{x}_{\theta}+\frac{1}{3}{R}_{M(\theta ),2}{e}_{1}+\frac{1}{3}{R}_{M(\theta ),3}{m}_{\theta}$. The process of recovering an object signature is the same as recovering its attribute vector, except that the signature is quantized and therefore as long as the additive error is less than $\frac{1}{2\sqrt{\mathrm{log}N}}$, we can recover it exactly.D.5 Generalizing for Graceful Erasure
In this subsection, we explain how to generalize our previous proofs for the Graceful Erasure property. We will need to better understand the behavior of our matrices when only a prefix of the result is considered. We can extend our definitions as follows:Definition 4.
We say that a distribution $D$ over random ${d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{\times}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}$ matrices satisfies the ${d}_{D\mathit{}E}$prefix ${\alpha}_{D\mathit{}E}$leaky ${\delta}_{D\mathit{}E}$Desynchronization Property if the following is true. If $R\mathrm{\sim}D$, then the random variable $Z\mathrm{=}{\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{{d}_{D\mathit{}E}}{\mathrm{(}R\mathit{}x\mathrm{)}}_{i}\mathit{}{\mathrm{(}y\mathrm{)}}_{i}\mathrm{}{\alpha}_{D\mathit{}E}\mathit{}\mathrm{\u27e8}x\mathrm{,}y\mathrm{\u27e9}$ is a ${\delta}_{D\mathit{}E}$nice noise variable. If ${\alpha}_{D\mathit{}E}\mathrm{=}\mathrm{0}$ then we simply refer to this as the ${d}_{D\mathit{}E}$prefix ${\delta}_{D\mathit{}E}$Desynchronization Property.Definition 5.
We say that a distribution $D$ over random ${d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{\times}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}$ matrices satisfies the ${d}_{I\mathit{}E}$prefix ${\alpha}_{I\mathit{}E}$leaky ${\delta}_{I\mathit{}E}$Isometry Property if the following is true. If $R\mathrm{\sim}D$ and $x\mathrm{=}y$, then the random variable $Z\mathrm{=}{\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{{d}_{I\mathit{}E}}{\mathrm{(}R\mathit{}x\mathrm{)}}_{i}^{\mathrm{2}}\mathrm{}{\alpha}_{I\mathit{}E}\mathit{}\mathrm{\u27e8}x\mathrm{,}x\mathrm{\u27e9}$ is a ${\delta}_{I\mathit{}E}$nice noise variable. If ${\alpha}_{I\mathit{}E}\mathrm{=}\mathrm{1}$ then we simply refer to this as the ${d}_{I\mathit{}E}$prefix ${\delta}_{I\mathit{}E}$Isometry Property.Lemma 8.
If $D$ satisfies the ${d}_{I\mathit{}E}$prefix ${\alpha}_{I\mathit{}E}$leaky ${\delta}_{I\mathit{}E}$Isometry Property, then the random variable $Z\mathrm{=}{\mathrm{\sum}}_{i\mathrm{=}\mathrm{1}}^{{d}_{I\mathit{}E}}{\mathrm{(}R\mathit{}x\mathrm{)}}_{i}\mathit{}{\mathrm{(}R\mathit{}y\mathrm{)}}_{i}\mathrm{}{\alpha}_{I\mathit{}E}\mathit{}\mathrm{\u27e8}x\mathrm{,}y\mathrm{\u27e9}$ is a $\mathrm{(}\mathrm{3}\mathit{}{\delta}_{I\mathit{}E}\mathrm{)}$nice noise variable.Proof.
The proof is the same idea as before. The claim is trivially true if $x$ or $y$ is zero, and then without loss of generality we scale them to unit vectors. We invoke the ${d}_{IE}$prefix ${\alpha}_{IE}$leaky ${\delta}_{IE}$Isometry Property three times, on ${x}_{1}=x+y$, ${x}_{2}=x$, and ${x}_{3}=y$. This implies that the following three noise variables are ${d}_{IE}$nice: ${Z}_{1}$ $={\displaystyle \sum _{i=1}^{{d}_{IE}}}{(R(x+y))}_{i}^{2}{\alpha}_{IE}\u27e8x+y,x+y\u27e9$ ${Z}_{2}$ $={\displaystyle \sum _{i=1}^{{d}_{IE}}}{(R(x))}_{i}^{2}{\alpha}_{I}\u27e8x,x\u27e9$ ${Z}_{3}$ $={\displaystyle \sum _{i=1}^{{d}_{IE}}}{(R(y))}_{i}^{2}{\alpha}_{I}\u27e8y,y\u27e9$ By construction, $Z=\frac{1}{2}{Z}_{1}\frac{1}{2}{Z}_{2}\frac{1}{2}{Z}_{3}$, so $Z$ is centered. We bound it by noting that ${\left\leftx+y\right\right}_{2}^{2}\le 4$. We know that with high probability, $\left{Z}_{1}\right\le 4{\delta}_{IE}$, $\left{Z}_{2}\right\le {\delta}_{IE}$, and $\left{Z}_{3}\right\le {\delta}_{IE}$. By the triangle inequality, $Z$ is $(3{\delta}_{IE})$bounded, making it a $(3{\delta}_{IE})$noise variable and completing the proof. ∎Claim 6.
Suppose we have an overall sketch $s$ and we keep only a ${d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}^{\mathrm{\prime}}\mathrm{\le}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}$ prefix of it, ${s}^{\mathrm{\prime}}$. Consider an object ${\theta}^{\mathrm{\star}}$ which lay at depth $h\mathit{}\mathrm{(}{\theta}^{\mathrm{\star}}\mathrm{)}$ and had effective weight ${w}_{{\theta}^{\mathrm{\star}}}$ in the original sketch $s$. Suppose no other objects in $s$ are also produced by $M\mathit{}\mathrm{(}{\theta}^{\mathrm{\star}}\mathrm{)}$. We can produce a vector estimate of the attribute vector ${x}_{{\theta}^{\mathrm{\star}}}$ from ${s}^{\mathrm{\prime}}$, which with high probability has an additive error of at most $O\mathit{}\mathrm{(}{\mathrm{2}}^{\mathrm{3}\mathit{}h\mathit{}\mathrm{(}{\theta}^{\mathrm{\star}}\mathrm{)}}\mathit{}{\delta}_{E}\mathrm{/}{w}_{{\theta}^{\mathrm{\star}}}\mathrm{)}$ in each coordinate.Proof Sketch.
We have an overall sketch $s$ and a version ${s}^{\prime}$ where the last $({d}_{\text{sketch}}{d}_{\text{sketch}}^{\prime})$ coordinates have been replaced with zeros. We compute $\beta {e}_{j}^{T}{R}_{M({\theta}^{\star}),1}^{T}{s}^{\prime}$. From our proof of case (i) of Theorem 9, there are three things to analyze: the primary signal subterm, the other subterms, and the other terms. The primary signal subterm is now the following. $$\sum _{i=1}^{{d}_{\text{sketch}}^{\prime}}\left[{\left(\frac{1}{2}{R}_{M({\theta}^{\star}),1}{x}_{{\theta}^{\star}}+\frac{1}{2}{R}_{M({\theta}^{\star}),2}{e}_{1}\right)}_{i}{\left(2{R}_{M({\theta}^{\star}),1}{e}_{j}\right)}_{i}\right]$$ Using ${d}_{\text{sketch}}^{\prime}$prefix isometry, we can show that the first half is a (scaleddown) of ${e}_{j}^{T}{x}_{{\theta}^{\star}}$ plus (scaleddown) small noise. Using a generalization of Lemma 7, we can show that the second half is (scaleddown) small noise. When analyzing the other subterms, we use the ${d}_{\text{sketch}}^{\prime}$prefix desynchronization of the final matrix in $\stackrel{~}{R}$ to show the the subterm is just (scaleddown) small noise. Finally, when analyzing the other terms, we use the ${d}_{\text{sketch}}^{\prime}$prefix desynchronization of ${R}_{M({\theta}^{\star}),1}$ to show that the term is just (scaleddown) small noise. ∎Appendix E Proofs for Dictionary Learning
E.1 Proof of Network Learnability
In this subsection, we prove Theorem 5. Recall Theorem 5: See 5Proof.
Our plan is to use Theorem 4 to unwind our sketching. Precisely, we invoke it with $N$ as thrice the maximum between the number of modules and the number of objects in any communication graph, $\underset{~}{S}$ as $(\alpha /\beta )\mathrm{log}N=\text{poly}(N)$, $H$ as three times the depth of our modular network, and ${\u03f5}_{H}$ as $\mathrm{min}(\u03f5/4,1/4){2}^{H}w$. This yields a sketch dimension ${d}_{\text{sketch}}\le \text{poly}(1/{\u03f5}_{H})\mathrm{log}N\mathrm{log}\text{poly}(N)$ which is $\text{poly}(1/w,1/\u03f5){\mathrm{log}}^{2}N$ as claimed. Additionally, it yields a base number of samples $S=\text{poly}(N)$ and a sequence of ${\mathrm{\ell}}_{\mathrm{\infty}}$ errors ${\u03f5}_{1}\le {\u03f5}_{2}\le \mathrm{\cdots}\le {\u03f5}_{H1}\le {\u03f5}_{H}$. These are guarantees on an algorithm $A$ which accepts an $h\in [3\cdot \text{network depth}]$ as well as ${S}_{h}=S{N}^{h1}$ samples with up to ${\u03f5}_{h}$ ${\mathrm{\ell}}_{\mathrm{\infty}}$ error and $O(\sqrt{{d}_{\text{sketch}}})$ ${\mathrm{\ell}}_{1}$ error. This base number of samples is why we will require $\text{poly}(N)$ overall sketches. By construction, overall sketches are just input subsketches of the output psuedoobject. We know that no communication graph contains more than $N/3$ objects by the definition of $N$, so we can write all the overall sketches in the form: $${s}_{\text{overall}}=\sum _{i=1}^{N}{x}_{i}+{R}_{i}{x}_{i}$$ where ${x}_{i}$ is $({w}_{i}/2){s}_{\text{object}}({\theta}_{i})$ if there actually was an ${i}^{th}$ input and the zero vector otherwise. We want to treat ${\sum}_{i}{x}_{i}$ as ${\mathrm{\ell}}_{1}$ noise, but how much do they contribute? We can only tolerate them contributing at most $O(\sqrt{{d}_{\text{sketch}}})$ ${\mathrm{\ell}}_{1}$ norm. In fact, we claim that any sketch vector has at most $O(1)$ ${\mathrm{\ell}}_{2}$ norm, which would imply the desired statement by CauchySchwartz and noticing that we are taking a convex combination of sketch vectors. We prove an ${\mathrm{\ell}}_{2}$ bound on sketch vectors as in Appendix D, but roughly the idea is that our error parameter $\delta $, which we know to be $O(\frac{\mathrm{log}N}{\sqrt{{d}_{\text{sketch}}}})$, needs to be less than the constant $\frac{1}{8H}$ (recall that our network depth is constant). Since our sketch dimension already scaled with ${\mathrm{log}}^{2}N$, this is true up to making our sketch dimension a constant factor larger. We actually have no additional ${\mathrm{\ell}}_{\mathrm{\infty}}$ noise, which is under the ${\u03f5}_{1}$ requirement. Hence, we can safely apply $A$ and retrieve $({w}_{i}/2)$ times our object sketches (and some zero vectors), up to ${\u03f5}_{2}$ ${\mathrm{\ell}}_{\mathrm{\infty}}$ noise. Note that we now have a total of $\underset{~}{S}N$ vectors, since the procedure turns one vector into $N$ vectors. If any of our retrieved vectors have less than ${\u03f5}_{2}$ ${\mathrm{\ell}}_{\mathrm{\infty}}$ norm, we declare them garbage and any further vectors that dictionary learning recovers from them as garbage. When learning modules, we will not use any garbage vectors. Now, we have several approximate, scaled object sketches and several garbage vectors. We note that the scaled object sketches, by definition, are of the form: $$w\cdot {s}_{\text{object}}(\theta )=w\left(\frac{I+{R}_{M(\theta ),0}}{2}\right)x\mathit{\hspace{1em}\hspace{1em}}\left(+\sum _{M\ne M(\theta )}{R}_{M,0}\overrightarrow{0}\right)$$ where $w$ is some weight from the previous step and $x$ is $w$ times a tuple sketch. Undoing the first matrix off an object sketch is exactly like the previous step; we handle the transparent matrix by noting that all sketches have constant ${\mathrm{\ell}}_{2}$ norm with high probability and hence it’s a matter of ${\mathrm{\ell}}_{1}$ noise for our algorithm. Our ${\u03f5}_{2}$ ${\mathrm{\ell}}_{\mathrm{\infty}}$ noise turns into ${\u03f5}_{3}$ ${\mathrm{\ell}}_{\mathrm{\infty}}$ noise. The garbage vectors can be thought of as ${\mathrm{\ell}}_{\mathrm{\infty}}$ noise plus the zero vector, which also decomposes nicely: $$\overrightarrow{0}=\sum _{M}{R}_{M,0}\overrightarrow{0}$$ so the dictionary learning procedure can safely handle them. There are at most $N/3$ matrices involved (we can imagine the problem as padded with zero matrices and additional zero vectors). We recover $\underset{~}{S}{N}^{2}$ vectors, which are attribute subsketches, input subsketches, and garbage vectors. The process of undoing attribute subsketches and input subsketches is why we need $N$ to be thrice the maximum between the number of modules and the number of objects in any communication graph. For attribute subsketches, there are up to two matrices for every module at this depth. For input subsketches, there is possibly a tuple matrix for every object at this depth. Still, there are at most $N$ matrices involved, so we can recover our $\underset{~}{S}{N}^{3}$ vectors, which are object attribute vectors, ${e}_{1}$’s, object sketches, and garbage vectors. Unlike previous steps, where we had one type of vector (and garbage vectors), we now have three types of vectors that we want to distinguish; we can only recurse on object sketches because we cannot guarantee that object attribute vectors or ${e}_{1}$ can be written as the correct matrixvector product form. We can get some helpful information by looking at which vectors produced which vectors in our dictionary learning procedure. By construction, we know that one object sketch gets turned into an attribute subsketch and an input subsketch. The former gets turned into an attribute vector and ${e}_{1}$ while the latter gets turned into a series of object sketches. For example, if the input subsketch produces at least three nongarbage vectors, then we know which subsketch was which (the attribute subsketch only produces up to two nongarbage vectors with high probability). Of course, this idea won’t help us distinguish in general; an object might only have two input objects! The better test is to see which subsketch produces a vector that looks more like ${e}_{1}$. To begin with, it’s worth noting that the two subsketches have the same scaling: the effective weight of object $\theta $, ${w}_{\theta}$, times ${2}^{2h(\theta )+2}$. Now, let’s consider how small the first coordinate of the (scaled) ${e}_{1}$ vector can be. It picks up an additional $1/2$ scaling factor, so the vector we recover is ${w}_{\theta}{2}^{2h(\theta )+1}{e}_{1}$ up to ${\u03f5}_{3h(\theta )2}$ additional ${\mathrm{\ell}}_{\mathrm{\infty}}$ error. We know that $$, so the first coordinate is at least: $${w}_{\theta}{2}^{2h(\theta )+1}w{2}^{H}$$ Now, let’s consider how large the first (or any) coordinate of any object sketch could be. Due to Theorem 7, we know that any (at most unit in length) vector has a first coordinate of at most $O(\frac{\mathrm{log}N}{\sqrt{{d}_{\text{sketch}}}})$ after one of our block random matrices is applied to it. Furthermore, applying the various transparent matrices $\left(\frac{I+R}{2}\right)$ throughout our sketch increases this by at most a constant factor. The object sketch is just a convex combination of such vectors, so it also has at most $O(\frac{\mathrm{log}N}{\sqrt{{d}_{\text{sketch}}}})$ in its first coordinate. Since there is up to $w{2}^{H}$ additional ${\mathrm{\ell}}_{\mathrm{\infty}}$ error in the recovery process and this vector is scaled too, we know that the recovered object sketch vector has first coordinate at most: $${w}_{\theta}{2}^{2h(\theta )+2}O(\frac{\mathrm{log}N}{\sqrt{{d}_{\text{sketch}}}})+w{2}^{H}$$ We require that $$. As already discussed, since our sketch dimension already scaled with ${\mathrm{log}}^{2}N$, this amounts to making the sketch dimension at most a constant factor larger. Now, our recovered object sketch vector has first coordinate at most: $${w}_{\theta}{2}^{2h(\theta )+1}/4+w{2}^{H}$$ Whether our two cases are distinguishable still depends on how $\theta $’s effective weight ${w}_{\theta}$ relates to the goal weight $w$. We could clearly distinguish the two if ${w}_{\theta}{2}^{2h(\theta )+1}$ was at least $4w{2}^{H}$, since then the first case would yield a value of at least $3w{2}^{H}$ and the second case a value of at most $2w{2}^{H}$. Hence our test is the following: if some vector produced by the object sketch has a first coordinate of at least $3w{2}^{H}$, then take the one with the largest first coordinate and decide that it was produced by the object attribute subsketch. If no vector produced by the object sketch has a first coordinate of at least $3w{2}^{H}$, then swap them all for zero vectors and declare them garbage. Notice that if a recovered object sketch vector manages to have a first coordinate value of at least $3w{2}^{H}$, then we know that ${w}_{\theta}{2}^{2h(\theta )+1}$ is at least $8w{2}^{H}$ and hence the scaled ${e}_{1}$ still beats all the recovered object sketch vectors. On the other hand, all the vectors are declared garbage, then we know that $$ and hence $$ (i.e. we don’t need to recover this object or any of its children). Hence our procedure works and we can continue to safely recurse. The result of this procedure is that for each module (we can tell which vectors belong to which modules due to shared final matrices) we recover unordered pairs of scaled (attribute vector, ${e}_{1}$), at least every time the module produces an object with at least $w$ weight (possibly more). We’ll consider the one with larger first coordinate to be ${e}_{1}$, and divide the other vector by the first coordinate of ${e}_{1}$ to unscale it. Note that both vectors have been scaled by ${w}_{\theta}{2}^{2h(\theta )+1}$ and then face at most ${\u03f5}_{H}$ additional ${\mathrm{\ell}}_{\mathrm{\infty}}$ noise. note that we chose a $$ and no vector could have passed our test unless ${w}_{\theta}{2}^{2h(\theta )+1}\ge 2w{2}^{H}$, i.e. $w/{w}_{\theta}\le {2}^{H2h(\theta )}$. Combining these facts, we know that a single instance of the ${\mathrm{\ell}}_{\mathrm{\infty}}$ noise messes up any (unscaled) coordinate by at most: ${2}^{2h(\theta )1}{\u03f5}_{H}/{w}_{\theta}$ $\le {2}^{2h(\theta )1}(\u03f5/4){2}^{H}w/{w}_{\theta}$ $\le {2}^{2h(\theta )1}(\u03f5/4){2}^{H}{2}^{H2h(\theta )}$ $=(\u03f5/8)$ There are two cases: (i) we consider the correct vector to be ${e}_{1}$ and (ii) we consider the object attribute vector to be ${e}_{1}$. In case (i), our object attribute vector is scaled by a number which is within a multiplicative $(1\pm \u03f5/8)$ of correct. Additionally, its noise makes it off by an additive $\u03f5/8$ in ${\mathrm{\ell}}_{\mathrm{\infty}}$ distance. Since we capped $\u03f5$ to be at most one, the multiplicative difference is at most $(9/8)/(1\u03f5/8)9/8\le \frac{9}{56}\u03f5$, for a total difference of $\frac{2}{7}\u03f5$ ${\mathrm{\ell}}_{\mathrm{\infty}}$ distance. In case (ii), the original attribute vector had to have first coordinate at least $1(\u03f5/4)$. Since it has an ${\mathrm{\ell}}_{1}$ norm of at most one, we know that it is at most $\u03f5/4$ in ${\mathrm{\ell}}_{\mathrm{\infty}}$ distance from ${e}_{1}$. We imagine that the noised version of ${e}_{1}$ is it; by triangle inequality this choice is at most $\frac{3}{8}\u03f5$ from it. We again pick up at most a $\frac{9}{56}\u03f5$ error from incorrect scaling, for a total of at most $\frac{15}{28}\u03f5$ ${\mathrm{\ell}}_{\mathrm{\infty}}$ error. We conclude the proof of the first half of our theorem by noting that since we started with $(\alpha /\beta )\mathrm{log}N$ samples, with high probability we will have at least $\alpha $ samples where we recover an input/output pair for ${M}^{\star}$ within the robust module learnability conditions. Regarding fixed communication graphs, we can use this same algorithm to detect the entire path to an object with high probability (i.e. what modules produce each node along the path) if the object has effective weight $w$ along this path. This means we can identify the entire subgraph of objects which have effective weight $w$ “often enough”. ∎E.2 Proof of Recursable Dictionary Learning
In this subsection, we prove Theorem 4, which we restate next. See 4 We point out that the ${\mathrm{\ell}}_{\mathrm{\infty}}$ bound on the recovery error in Theorem 4 implies that the $x$ vectors can be recovered exactly if they are quantized and the recovery error ${\u03f5}_{h+1}$ is small enough.Notation.
We next provide some notation that will be used in the proof of Theorem 4. Let $H$ be a given positive constant and let $S\le \text{poly}(N)$ be a base number of samples. Fix $h\in [H1]$ and let ${S}_{h}=S{N}^{h1}$. • For each sample index $i\in [{S}_{h}]$, we divide the ${d}_{\text{sketch}}$dimensional real vector ${y}^{(i)}$ into ${d}_{\text{sketch}}/b$ contiguous blocks each of length $b$, with the $\mathrm{\ell}$th block being ${y}^{(i)}[(\mathrm{\ell}1)b+1:\mathrm{\ell}b+1]$ for each $\mathrm{\ell}\in [{d}_{\text{sketch}}/b]$. • For any two points $x,y\in {\mathbb{R}}^{n}$, recall that their ${\mathrm{\ell}}_{\mathrm{\infty}}$distance is defined as $${d}_{\mathrm{\infty}}(x,y):={\parallel yx\parallel}_{\mathrm{\infty}}=\underset{i\in [n]}{\mathrm{max}}{y}_{i}{x}_{i}.$$ Moreover, we define the symmetric ${\mathrm{\ell}}_{\mathrm{\infty}}$distance between $x$ and $y$ as $\mathrm{min}({d}_{\mathrm{\infty}}(x,y),{d}_{\mathrm{\infty}}(x,y))$. • For any vector $y\in {\mathbb{R}}^{n}$ and any finite subset $L$ of ${\mathbb{R}}^{n}$, we denote the ${\mathrm{\ell}}_{\mathrm{\infty}}$distance of $y$ from $L$ by $${d}_{\mathrm{\infty}}(y,L):=\underset{x\in L}{\mathrm{min}}{\parallel yx\parallel}_{\mathrm{\infty}}.$$ If this distance is at least $\tau $, then we say that $y$ is $\tau $far in ${\mathrm{\ell}}_{\mathrm{\infty}}$distance from the set $L$. • For any $y,{y}^{\prime}\in {\mathbb{R}}^{n}$, we define their Hamming distance as $\mathrm{\Delta}(y,{y}^{\prime}):=\{j\in [n]:{y}_{j}\ne {y}_{j}^{\prime}\}$ and the symmetric Hamming distance as $\overline{\mathrm{\Delta}}(y,{y}^{\prime})=\mathrm{min}(\mathrm{\Delta}(y,{y}^{\prime}),\mathrm{\Delta}(y,{y}^{\prime}))$.Algorithm.
Analysis.
Theorem 4 follows from Theorem 12 below which proves the guarantees on the operation of Algorithm ‣ 5.3.Theorem 12 (Guarantees of Algorithm ‣ 5.3).
Let $b$, $q$ and ${d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}$ satisfy $b\mathrm{\ge}\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}\mathit{}\mathrm{(}\mathrm{log}\mathit{}N\mathrm{,}\mathrm{log}\mathit{}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{,}\mathrm{1}\mathrm{/}{\u03f5}_{H}\mathrm{)}$, $q\mathrm{\ge}\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}\mathit{}\mathrm{(}\mathrm{log}\mathit{}N\mathrm{,}\mathrm{log}\mathit{}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{,}\mathrm{1}\mathrm{/}{\u03f5}_{H}\mathrm{)}\mathrm{/}\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}$, and ${d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{\ge}\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}\mathit{}\mathrm{(}\mathrm{1}\mathrm{/}{\u03f5}_{H}\mathrm{,}\mathrm{log}\mathit{}N\mathrm{,}\mathrm{log}\mathit{}S\mathrm{)}$. For any unknown vectors ${x}^{\mathrm{(}\mathrm{1}\mathrm{)}}\mathrm{,}\mathrm{\dots}\mathrm{,}{x}^{\mathrm{(}{S}_{h}\mathrm{)}}\mathrm{\in}{\mathrm{R}}^{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{\cdot}N}$ with ${\mathrm{\ell}}_{\mathrm{2}}$norm at most $O\mathit{}\mathrm{(}\mathrm{1}\mathrm{)}$, if we draw ${R}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{R}_{N}\mathrm{\sim}\mathrm{D}\mathit{}\mathrm{(}b\mathrm{,}q\mathrm{,}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{)}$ and receive ${S}_{h}$ noisy samples ${y}^{\mathrm{(}k\mathrm{)}}\mathrm{\u2254}\mathrm{[}{R}_{\mathrm{1}}\mathit{}{R}_{\mathrm{2}}\mathit{}\mathrm{\cdots}\mathit{}{R}_{N}\mathrm{]}\mathit{}{x}^{\mathrm{(}k\mathrm{)}}\mathrm{+}{z}_{\mathrm{1}}^{\mathrm{(}k\mathrm{)}}\mathrm{+}{z}_{\mathrm{\infty}}^{\mathrm{(}k\mathrm{)}}$ where each ${z}_{\mathrm{1}}^{\mathrm{(}k\mathrm{)}}\mathrm{\in}{\mathrm{R}}^{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}$ is noise with ${\mathrm{\left}\mathrm{\left}{z}_{\mathrm{1}}^{\mathrm{(}k\mathrm{)}}\mathrm{\right}\mathrm{\right}}_{\mathrm{1}}\mathrm{\le}O\mathit{}\mathrm{(}\sqrt{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}\mathrm{)}$ (independent of our random matrices) and each ${z}_{\mathrm{\infty}}^{\mathrm{(}k\mathrm{)}}\mathrm{\in}{\mathrm{R}}^{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}$ is noise with ${\mathrm{\left}\mathrm{\left}{z}_{\mathrm{\infty}}^{\mathrm{(}k\mathrm{)}}\mathrm{\right}\mathrm{\right}}_{\mathrm{\infty}}\mathrm{\le}{\u03f5}_{h}$ (also independent of our random matrices), then Algorithm ‣ 5.3, taking as inputs $h$, ${y}^{\mathrm{(}\mathrm{1}\mathrm{)}}$, $\dots $, ${y}^{\mathrm{(}{S}_{h}\mathrm{)}}$, and thresholds ${\tau}_{\mathrm{1}}\mathrm{=}\mathrm{\Theta}\mathit{}\mathrm{\left(}\frac{{\u03f5}_{h\mathrm{+}\mathrm{1}}}{\sqrt{q\mathrm{\cdot}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}}\mathrm{\right)}$ and ${\tau}_{\mathrm{2}}\mathrm{=}\mathrm{\Theta}\mathit{}\mathrm{(}{\u03f5}_{h\mathrm{+}\mathrm{1}}^{\mathrm{2}}\mathrm{)}$, runs in time $\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}}\mathit{}\mathrm{(}{S}_{h}\mathrm{,}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{)}$ and outputs matrices ${\widehat{R}}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\widehat{R}}_{N}\mathrm{\in}{\mathrm{R}}^{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{\times}{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}}$ and vectors ${\widehat{x}}^{\mathrm{(}\mathrm{1}\mathrm{)}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\widehat{x}}^{\mathrm{(}{S}_{h}\mathrm{)}}\mathrm{\in}{\mathrm{R}}^{{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}\mathrm{\cdot}N}$ that, with high probability, satisfy the following for some permutation $\pi $ of $\mathrm{[}N\mathrm{]}$: • Matrix Recovery: For every column $(i1){d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}+j$ of the matrix $[{R}_{1}{R}_{2}\mathrm{\cdots}{R}_{N}]$ (where $i\in [N]$ and $j\in [{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}]$) for which there exists a sample index ${k}^{\star}\in [{S}_{h}]$ such that $\left{x}_{(i1){d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}+j}^{({k}^{\star})}\right\ge {\u03f5}_{h+1}$, the $j$th column of ${\widehat{R}}_{\pi (i)}$ is $0.2{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}$close in Hamming distance to the $j$th column of ${R}_{i}$. Moreover, all the other columns of ${\widehat{R}}_{\pi (i)}$ are zero. • Input Recovery: For each $i\in [N]$ and $j\in [{d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}]$, it is guaranteed that ${\widehat{x}}_{(\pi (i)1){d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}+j}^{(k)}{x}_{(i1){d}_{\text{\mathit{s}\mathit{k}\mathit{e}\mathit{t}\mathit{c}\u210e}}+j}^{(k)}\le {\u03f5}_{h+1}$.\STATEOutput: A pair $(j,f)$ where $j$ is an index in $[{d}_{\text{sketch}}]$ and $f$ is a sign.
\[email protected] \FOReach sample index $k\in [{S}_{h}]$ \FOReach block index $\mathrm{\ell}\in [{d}_{\text{sketch}}/b]$ \STATELet ${w}_{k,\mathrm{\ell}}$ be the product of $\sqrt{q/b}$ and the ${\mathrm{\ell}}_{1}$norm of the $\mathrm{\ell}$th block of ${y}^{(k)}$. \IF${w}_{k,\mathrm{\ell}}$ is not zero \STATELet ${z}_{k,\mathrm{\ell}}$ be the coordinatewise normalization of the $\mathrm{\ell}$th block of ${y}^{(k)}$ by ${w}_{k,\mathrm{\ell}}$. \STATELet ${z}_{k,\mathrm{\ell},s}$, ${z}_{k,\mathrm{\ell},c}$ and ${z}_{k,\mathrm{\ell},m}$ be the first, middle and last $b/3$ coordinates of ${z}_{k,\mathrm{\ell}}$ respectively. \ENDIF\ENDFOR\ENDFOR\STATEInitialize $n$ to zero, ${\widehat{R}}_{1},\mathrm{\dots},{\widehat{R}}_{N}$ to the allzeros matrices and ${\widehat{x}}^{(1)},\mathrm{\dots},{\widehat{x}}^{({S}_{h})}$ to the allzeros vectors. \FOReach $i\in [N]$ \STATELet ${\widehat{\sigma}}_{i,m}$ be a vector in ${\{\pm \frac{1}{\sqrt{q{d}_{\text{sketch}}}}\}}^{b/3}$ that is initialized arbitrarily. \ENDFOR\FOReach sample index $k\in [{S}_{h}]$ \FOReach block index $\mathrm{\ell}\in [{d}_{\text{sketch}}/b]$ \IF${w}_{k,\mathrm{\ell}}$ is zero \STATEcontinue \ENDIF\IFthe $\mathrm{\ell}$th block of ${y}^{(k)}$ has a coordinate whose absolute value is smaller than ${\tau}_{1}$ or larger than $\frac{2}{\sqrt{q{d}_{\text{sketch}}}}$ \STATEcontinue \ENDIF\IF${z}_{k,\mathrm{\ell},s}$ is ${\tau}_{2}$far in ${\mathrm{\ell}}_{\mathrm{\infty}}$distance from the hypercube ${\{\pm \frac{1}{\sqrt{q{d}_{\text{sketch}}}}\}}^{b/3}$ \STATEcontinue \ENDIF\STATELet ${S}_{k,\mathrm{\ell}}$ be the set of all block indices ${\mathrm{\ell}}^{\prime}\in [{d}_{\text{sketch}}/b]$ for which ${w}_{k,{\mathrm{\ell}}^{\prime}}$ is nonzero, all the coordinates of the ${\mathrm{\ell}}^{\prime}$th block of ${y}^{(k)}$ are between ${\tau}_{1}$ and $\frac{2}{\sqrt{q{d}_{\text{sketch}}}}$ and ${z}_{k,{\mathrm{\ell}}^{\prime},s}$ is $2{\tau}_{2}$close in symmetric ${\mathrm{\ell}}_{\mathrm{\infty}}$distance to ${z}_{k,\mathrm{\ell},s}$. \IFthe cardinality of ${S}_{k,\mathrm{\ell}}$ is less than ${(0.9)}^{3}q{d}_{\text{sketch}}$ \STATEcontinue \ENDIF\STATELet ${\overline{z}}_{k,\mathrm{\ell}}$ be the coordinatewise rounding of ${z}_{k,\mathrm{\ell}}$ to the hypercube ${\{\pm \frac{1}{\sqrt{q{d}_{\text{sketch}}}}\}}^{b}$. \STATELet ${\overline{z}}_{k,\mathrm{\ell},s}$, ${\overline{z}}_{k,\mathrm{\ell},c}$, ${\overline{z}}_{k,\mathrm{\ell},m}$ be the most common first, middle and last $\frac{b}{3}$ coordinates of $\{{\overline{z}}_{k,{\mathrm{\ell}}^{\prime}}:{l}^{\prime}\in {S}_{k,\mathrm{\ell}}\}$ respectively. \IFthere is $i\in [n]$ such that ${\widehat{\sigma}}_{i,m}$ is $0.01b$close in symmetric Hamming distance to ${\overline{z}}_{k,\mathrm{\ell},m}$ \STATESet ${i}^{*}$ to $i$. \ELSE\STATEIncrement $n$ by $1$ and set ${i}^{*}$ to the new value of $n$. \STATESet ${\widehat{\sigma}}_{{i}^{*},m}$ to the matrix signature ${\overline{z}}_{k,\mathrm{\ell},m}$. \ENDIF \STATE Let $(j,f)=\mathrm{Dec}({\overline{z}}_{k,\mathrm{\ell},c})$ be the decoded column index and sign respectively. \IFthe sign of the first coordinate of ${\overline{z}}_{k,\mathrm{\ell},m}$ is the opposite of $f$ \STATENegate ${w}_{k,\mathrm{\ell}}$ and each coordinate of ${\overline{z}}_{k,\mathrm{\ell}}$. \ENDIF\IF${\widehat{R}}_{{i}^{*}}[:,j]$ is the allzeros vector \FOR${\mathrm{\ell}}^{\prime}\in {S}_{k,\mathrm{\ell}}$ \STATESet ${\widehat{R}}_{{i}^{*}}[({\mathrm{\ell}}^{\prime}1)b+1:{\mathrm{\ell}}^{\prime}b+1,j]$ to ${\overline{z}}_{k,\mathrm{\ell}}$. \ENDFOR\ENDIF\STATESet ${x}_{({i}^{*}1){d}_{\text{sketch}}+j}^{(k)}$ to ${w}_{k,\mathrm{\ell}}$. \ENDFOR\ENDFOR\RETURN${\widehat{R}}_{1},\mathrm{\dots},{\widehat{R}}_{N}$ and ${\widehat{x}}^{(1)},\mathrm{\dots},{\widehat{x}}^{({S}_{h})}$. Theorem 12 directly follows from Lemmas 9 and 10 below, applied in sequence. We point out that for simplicity of exposure, we chose large constants (which we did not attempt to optimize) in Lemmas 9 and 10.