Abstract
Integrating logical reasoning within deep learning architectures has been amajor goal of modern AI systems. In this paper, we propose a new directiontoward this goal by introducing a differentiable (smoothed) maximumsatisfiability (MAXSAT) solver that can be integrated into the loop of largerdeep learning systems. Our (approximate) solver is based upon a fast coordinatedescent approach to solving the semidefinite program (SDP) associated with theMAXSAT problem. We show how to analytically differentiate through the solutionto this SDP and efficiently solve the associated backward pass. We demonstratethat by integrating this solver into endtoend learning systems, we can learnthe logical structure of challenging problems in a minimally supervisedfashion. In particular, we show that we can learn the parity function usingsinglebit supervision (a traditionally hard task for deep networks) and learnhow to play 9x9 Sudoku solely from examples. We also solve a "visual Sudok"problem that maps images of Sudoku puzzles to their associated logicalsolutions by combining our MAXSAT solver with a traditional convolutionalarchitecture. Our approach thus shows promise in integrating logical structureswithin deep learning.
Quick Read (beta)
SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver
SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver
Abstract
Integrating logical reasoning within deep learning architectures has been a major goal of modern AI systems. In this paper, we propose a new direction toward this goal by introducing a differentiable (smoothed) maximum satisfiability (MAXSAT) solver that can be integrated into the loop of larger deep learning systems. Our (approximate) solver is based upon a fast coordinate descent approach to solving the semidefinite program (SDP) associated with the MAXSAT problem. We show how to analytically differentiate through the solution to this SDP and efficiently solve the associated backward pass. We demonstrate that by integrating this solver into endtoend learning systems, we can learn the logical structure of challenging problems in a minimally supervised fashion. In particular, we show that we can learn the parity function using singlebit supervision (a traditionally hard task for deep networks) and learn how to play $9\times 9$ Sudoku solely from examples. We also solve a “visual Sudoku” problem that maps images of Sudoku puzzles to their associated logical solutions by combining our MAXSAT solver with a traditional convolutional architecture. Our approach thus shows promise in integrating logical structures within deep learning.
1 Introduction
Although modern deep learning has produced groundbreaking improvements in a variety of domains, stateoftheart methods still struggle to capture “hard” and “global” constraints arising from discrete logical relationships. Motivated by this deficiency, there has been a great deal of recent interest in integrating logical or symbolic reasoning into neural network architectures (Palm et al., 2017; Yang et al., 2017; Cingillioglu & Russo, 2018; Evans & Grefenstette, 2018). However, with few exceptions, previous work primarily focuses on integrating preexisting relationships into a larger differentiable system via tunable continuous parameters, not on discovering the discrete relationships that produce a set of observations in a truly endtoend fashion. As an illustrative example, consider the popular logicbased puzzle game Sudoku, in which a player must fill in a $9\times 9$ partiallyfilled grid of numbers to satisfy specific constraints. If the rules of Sudoku (i.e. the relationships between problem variables) are not given, then it may be desirable to jointly learn the rules of the game and learn how to solve Sudoku puzzles in an endtoend manner.
We consider the problem of learning logical structure specifically as expressed by satisfiability problems – concretely, problems that are wellmodeled as instances of SAT or MAXSAT (the optimization analogue of SAT). This is a rich class of domains encompassing much of symbolic AI, which has traditionally been difficult to incorporate into neural network architectures since neural networks rely on continuous and differentiable parameterizations. Our key contribution is to develop and derive a differentiable smoothed MAXSAT solver that can be embedded within more complex deep architectures, and show that this solver enables effective endtoend learning of logical relationships from examples (without hardcoding of these relationships). More specifically, we build upon recent work in fast block coordinate descent methods for solving SDPs (Wang et al., 2017) to build a differentiable solver for the smoothed SDP relaxation of MAXSAT. We provide an efficient mechanism to differentiate through the optimal solution of this SDP by using a similar block coordinate descent solver as used in the forward pass. Our module is amenable to GPU acceleration, greatly improving training scalability.
Using this framework, we are able to solve several problems that, despite their simplicity, prove essentially impossible for traditional deep learning methods and existing logical learning methods to reliably learn without any prior knowledge. In particular, we show that we can learn the parity function, known to be challenging for deep classifiers (ShalevShwartz et al., 2017), with only single bit supervision. We also show that we can learn to play $9\times 9$ Sudoku, a problem that is challenging for modern neural network architectures (Palm et al., 2017). We demonstrate that our module quickly recovers the constraints that describe a feasible Sudoku solution, learning to correctly solve 98.3% of puzzles at test time without any handcoded knowledge of the problem structure. Finally, we show that we can embed this differentiable solver into larger architectures, solving a “visual Sudoku” problem where the input is an image of a Sudoku puzzle rather than a binary representation. We show that, in a fully endtoend setting, our method is able to integrate classical convolutional networks (for digit recognition) with the differentiable MAXSAT solver (to learn the logical portion). Taken together, this presents a substantial advance toward a major goal of modern AI: integrating logical reasoning into deep learning architectures.
2 Related work
Recently, the deep learning community has given increasing attention to the concept of embedding complex, “nontraditional” layers within deep networks in order to train systems endtoend. Major examples have included logical reasoning modules and optimization layers. Our work combines research in these two areas by exploiting optimizationbased relaxations of logical reasoning structures, namely an SDP relaxation of MAXSAT. We explore each of these relevant areas of research in more detail below.
Logical reasoning in deep networks.
Our work is closely related to recent interest in integrating logical reasoning into deep learning architectures (Garcez et al., 2015). Most previous systems have focused on creating differentiable modules from an existing set of known relationships, so that a deep network can learn the parameters of these relationships (Dai et al., 2018; Manhaeve et al., 2018; Sourek et al., 2018; Xu et al., 2018; Hu et al., 2016; Yang et al., 2017; Selsam et al., 2018). For example, Palm et al. (2017) introduce a network that carries out relational reasoning using handcoded information about which variables are allowed to interact, and test this network on $9\times 9$ Sudoku. Similarly, Evans & Grefenstette (2018) integrate inductive logic programming into neural networks by constructing differentiable SATbased representations for specific “rule templates.” While these networks are seeded with prior information about the relationships between variables, our approach learns these relationships and their associated parameters endtoend. While other recent work has also tried to jointly learn rules and parameters, the problem classes captured by these architectures have been limited. For instance, Cingillioglu & Russo (2018) train a neural network to apply a specific class of logic programs, namely the binary classification problem of whether a given set of propositions entails a specific conclusion. While this approach does not rely on prior handcoded structure, our method applies to a broader class of domains, encompassing any problem reducible to MAXSAT.
Differentiable optimization layers.
Our work also fits within a line of research leveraging optimization as a layer in neural networks. For instance, previous work has introduced differentiable modules for quadratic programs (Amos & Kolter, 2017; Donti et al., 2017), submodular optimization problems (Djolonga & Krause, 2017; Tschiatschek et al., 2018; Wilder et al., 2018), and equilibrium computation in zerosum games (Ling et al., 2018). To our knowledge, ours is the first work to use differentiable SDP relaxations to capture relationships between discrete variables.
MAXSAT SDP relaxations.
We build on a long line of research exploring SDP relaxations as a tool for solving MAXSAT and related problems. Classical work shows that such relaxations produce strong approximation guarantees for MAXCUT and MAX2SAT (Goemans & Williamson, 1995), and are empirically tighter than standard linear programming relaxations (Gomes et al., 2006). More recent work, e.g. Wang et al. (2017); Wang & Kolter (2019), has developed lowrank SDP solvers for general MAXSAT problems. We extend the work of Wang et al. (2017) to create a differentiable optimizationbased MAXSAT solver that can be employed in the loop of deep learning.
3 A differentiable satisfiability solver
The MAXSAT problem is the optimization analogue of the wellknown satisfiability (SAT) problem, in which the goal is to maximize the number of clauses satisfied. We present a differentiable, smoothed approximate MAXSAT solver that can be integrated into modern deep network architectures. This solver uses a fast coordinate descent approach to solving an SDP relaxation of MAXSAT. We describe our MAXSAT SDP relaxation as well as the forward pass of our MAXSAT deep network layer (which employs this relaxation). We then show how to analytically differentiate through the MAXSAT SDP and efficiently solve the associated backward pass.
3.1 Solving an SDP formulation of satisfiability
Consider a MAXSAT instance with $n$ variables and $m$ clauses. Let $\stackrel{~}{v}\in {\{1,1\}}^{n}$ denote binary assignments of the problem variables, where ${\stackrel{~}{v}}_{i}$ is the truth value of variable $i\in \{1,\mathrm{\dots},n\}$, and define ${\stackrel{~}{s}}_{i}\in {\{1,0,1\}}^{m}$ for $i\in \{1,\mathrm{\dots},n\}$, where ${\stackrel{~}{s}}_{ij}$ denotes the sign of ${\stackrel{~}{v}}_{i}$ in clause $j\in \{1,\mathrm{\dots},m\}$. We then write the MAXSAT problem as
$$\underset{\stackrel{~}{v}\in {\{1,1\}}^{n}}{maximize}\sum _{j=1}^{m}\underset{i=1}{\overset{n}{\bigvee}}\mathrm{\U0001d7cf}\{{\stackrel{~}{s}}_{ij}{\stackrel{~}{v}}_{i}>0\}.$$  (1) 
As derived in Goemans & Williamson (1995); Wang & Kolter (2019), to form a semidefinite relaxation of (1), we first relax the discrete variables ${\stackrel{~}{v}}_{i}$ into associated continuous variables ${v}_{i}\in {\mathbb{R}}^{k},\parallel {v}_{i}\parallel =1$ with respect to some “truth direction” ${v}_{\top}\in {\mathbb{R}}^{k},\parallel {v}_{\top}\parallel =1$. Specifically, we relate the continuous ${v}_{i}$ to the discrete ${\stackrel{~}{v}}_{i}$ probabilistically via $P({\stackrel{~}{v}}_{i}=1)={\mathrm{cos}}^{1}({\mathrm{v}}_{\mathrm{i}}^{\mathrm{T}}{\mathrm{v}}_{\top})/\pi $ based on randomized rounding (Goemans & Williamson (1995); see Section 3.2.4). We additionally define a coefficient vector ${\stackrel{~}{s}}_{\top}={\{1\}}^{m}$ associated with ${v}_{\top}$. Our SDP relaxation of MAXSAT is then
$\underset{V\in {\mathbb{R}}^{k\times (n+1)}}{minimize}$  $\u27e8{S}^{T}S,{V}^{T}V\u27e9,$  (2)  
subject to  $\parallel {v}_{i}\parallel =1,i=\top ,1,\mathrm{\dots},n$ 
where $V\equiv \left[\begin{array}{cccc}\hfill {v}_{\top}\hfill & \hfill {v}_{1}\hfill & \hfill \mathrm{\dots}\hfill & \hfill {v}_{n}\hfill \end{array}\right]\in {\mathbb{R}}^{k\times (n+1)}$, and $S\equiv \left[\begin{array}{cccc}\hfill {\stackrel{~}{s}}_{\top}\hfill & \hfill {\stackrel{~}{s}}_{1}\hfill & \hfill \mathrm{\dots}\hfill & \hfill {\stackrel{~}{s}}_{n}\hfill \end{array}\right]\mathrm{diag}(1/\sqrt{4{\stackrel{~}{\mathrm{s}}}_{\mathrm{j}}})\in {\mathbb{R}}^{\mathrm{m}\times (\mathrm{n}+1)}$. We note that this problem is a lowrank (but nonconvex) formulation of MINUNSAT, which is equivalent to MAXSAT. This formulation can be rewritten as an SDP, and has been shown to recover the optimal SDP solution given $k>\sqrt{2n}$ (Barvinok, 1995; Pataki, 1998).
Despite its nonconvexity, problem (2) can then be solved optimally via coordinate descent for all $i=\top ,1,\mathrm{\dots},n$. In particular, the objective terms that depend on ${v}_{i}$ are given by ${v}_{i}^{T}{\sum}_{j=0}^{n}{s}_{i}^{T}{s}_{j}{v}_{j}$, where ${s}_{i}$ is the $i$th column vector of $S$. Minimizing this quantity over ${v}_{i}$ subject to the constraint that $\parallel {v}_{i}\parallel =1$ yields the coordinate descent update
$${v}_{i}={g}_{i}/\parallel {g}_{i}\parallel ,\text{where}{g}_{i}=V{S}^{T}{s}_{i}{\parallel {s}_{i}\parallel}^{2}{v}_{i}.$$  (3) 
These updates provably converge to the globally optimal fixed point of the SDP (2) (Wang et al., 2017). A more detailed derivation of this update can be found in Appendix A.
3.2 SATNet: Satisfiability solving as a layer
Using our MAXSAT SDP relaxation and associated coordinate descent updates, we create a deep network layer for satisfiability solving (SATNet). Define $\mathcal{I}\subset \{1,\mathrm{\dots},n\}$ to be the indices of MAXSAT variables with known assignments, and let $\mathcal{O}\equiv \{1,\mathrm{\dots},n\}\setminus \mathcal{I}$ correspond to the indices of variables with unknown assignments. Our layer admits probabilistic or binary inputs ${z}_{\iota}\in [0,1],\iota \in \mathcal{I}$, and then outputs the assignments of unknown variables ${z}_{o}\in [0,1],o\in \mathcal{O}$ which are similarly probabilistic or (optionally, at test time) binary. We let ${Z}_{\mathcal{I}}\in {[0,1]}^{\mathcal{I}}$ and ${Z}_{\mathcal{O}}\in {[0,1]}^{\mathcal{O}}$ refer to all input and output assignments, respectively.
The outputs ${Z}_{\mathcal{O}}$ are generated from inputs ${Z}_{\mathcal{I}}$ via the SDP (2), and the weights of our layer correspond to the SDP’s lowrank coefficient matrix $S$. This forward pass procedure is pictured in Figure 1. We describe the steps of layer initialization and the forward pass in Algorithm 1, and in more detail below.
3.2.1 Layer initialization
When initializing SATNet, the user must specify a maximum number of clauses $m$ that this layer can represent. It is often desirable to set $m$ to be low; in particular, lowrank structure can prevent overfitting and thus improve generalization.
Given this lowrank structure, a user may wish to somewhat increase the layer’s representational ability via auxiliary variables. The highlevel intuition here follows from the conjunctive normal form (CNF) representation of boolean satisfaction problems; adding additional variables to a problem can dramatically reduce the number of CNF clauses needed to describe that problem, as these variables play a role akin to register memory that is useful for inference.
3.2.2 Step 1: Relaxing layer inputs
Our layer first relaxes its inputs ${Z}_{\mathcal{I}}$ into continuous vectors for use in the SDP formulation (2). That is, we relax each layer input ${z}_{\iota},\iota \in \mathcal{I}$ to an associated random unit vector ${v}_{\iota}\in {\mathbb{R}}^{k}$ so that
$${v}_{\iota}^{T}{v}_{\top}=\mathrm{cos}(\pi {z}_{\iota}).$$  (4) 
(This equation is derived from the probabilistic relationship described in Section 3.1 between discrete variables and their continuous relaxations.) Constraint (4) can be satisfied by
$${v}_{\iota}=\mathrm{cos}(\pi {z}_{\iota}){v}_{\top}+\mathrm{sin}(\pi {z}_{\iota})({I}_{k}{v}_{\top}{v}_{\top}^{T}){v}_{\iota}^{\text{rand}},$$  (5) 
where ${v}_{\iota}^{\text{rand}}$ is a random unit vector. For simplicity, we use the notation ${V}_{\mathcal{I}}\in {\mathbb{R}}^{k\times \mathcal{I}}$ (i.e. the $\mathcal{I}$indexed column subset of $V$) to collectively refer to all relaxed layer inputs derived via Equation (5).
3.2.3 Step 2: Generating continuous relaxations of outputs via SDP
Given the continuous input relaxations ${V}_{\mathcal{I}}$, our layer employs the coordinate descent updates (3) to compute values for continuous output relaxations ${v}_{o},o\in \mathcal{O}$ (which we collectively refer to as ${V}_{\mathcal{O}}\in {\mathbb{R}}^{k\times \mathcal{O}}$). Notably, coordinate descent updates are only computed for output variables, i.e. are not computed for variables whose assignments are given as input to the layer.
Our coordinate descent algorithm for the forward pass is detailed in Algorithm 2. This algorithm maintains the term $\mathrm{\Omega}=V{S}^{T}$ needed to compute ${g}_{o}$, and then modifies it via a rankone update during each inner iteration. Accordingly, the periteration runtime is $O(nmk)$ (and in practice, only a small number of iterations is required for convergence).
3.2.4 Step 3: Generating discrete or probabilistic outputs
Given the relaxed outputs ${V}_{\mathcal{O}}$ from coordinate descent, our layer converts these outputs to discrete or probabilistic variable assignments ${Z}_{\mathcal{O}}$ via either thresholding or randomized rounding (which we describe here).
The main idea of randomized rounding is that for every ${v}_{o},o\in \mathcal{O}$, we can take a random hyperplane $r$ from the unit sphere and assign
$${\stackrel{~}{v}}_{o}=\{\begin{array}{cc}1\hfill & \text{if sign}({v}_{o}^{T}r)=\text{sign}({v}_{\top}^{T}r)\hfill \\ 1\hfill & \text{otherwise}\hfill \end{array},o\in \mathcal{O},$$  (6) 
where ${\stackrel{~}{v}}_{o}$ is the boolean output for ${v}_{o}$. Intuitively, this scheme sets ${\stackrel{~}{v}}_{o}$ to “true” if and only if ${v}_{o}$ and the truth vector ${v}_{\top}$ are on the same side of the random hyperplane $r$. Given the correct weights $S$, this randomized rounding procedure assures an optimal expected approximation ratio for certain NPhard problems (Goemans & Williamson, 1995).
During training, we do not explicitly perform randomized rounding. We instead note that the probability that ${v}_{o}$ and ${v}_{\top}$ are on the same side of any given $r$ is
$$P({\stackrel{~}{v}}_{o})={\mathrm{cos}}^{1}({v}_{o}^{T}{v}_{\top})/\pi ,$$  (7) 
and thus set ${z}_{o}=P({\stackrel{~}{v}}_{o})$ to equal this probability.
During testing, we can either output probabilistic outputs in the same fashion, or output discrete assignments via thresholding or randomized rounding. If using randomized rounding, we round multiple times, and then set ${z}_{o}$ to be the boolean solution maximizing the MAXSAT objective in Equation (1). Prior work has observed that such repeated rounding improves approximation ratios in practice, especially for MAXSAT problems (Wang & Kolter, 2019).
3.3 Computing the backward pass
We now derive backpropagation updates through our SATNet layer to enable its integration into a neural network. That is, given the gradients $\partial \mathrm{\ell}/\partial {\mathrm{Z}}_{\mathcal{O}}$ of the network loss $\mathrm{\ell}$ with respect to the layer outputs, we must compute the gradients $\partial \mathrm{\ell}/\partial {\mathrm{Z}}_{\mathcal{I}}$ with respect to layer inputs and $\partial \mathrm{\ell}/\partial \mathrm{S}$ with respect to layer weights. As it would be inefficient in terms of time and memory to explicitly unroll the forwardpass computations and store intermediate Jacobians, we instead derive analytical expressions to compute the desired gradients directly, employing an efficient coordinate descent algorithm. The procedure for computing these gradients is summarized in Algorithm 1 and derived below.
3.3.1 From probabilistic outputs to their continuous relaxations
Given $\partial \mathrm{\ell}/\partial {\mathrm{Z}}_{\mathcal{O}}$ (with respect to the layer outputs), we first derive an expression for $\partial \mathrm{\ell}/\partial {\mathrm{V}}_{\mathcal{O}}$ (with respect to the output relaxations) by pushing gradients through the probability assignment mechanism described in Section 3.2.4. That is, for each $o\in \mathcal{O}$,
$$\frac{\partial \mathrm{\ell}}{\partial {v}_{o}}=\left(\frac{\partial \mathrm{\ell}}{\partial {z}_{o}}\right)\left(\frac{\partial {z}_{o}}{\partial {v}_{o}}\right)=\left(\frac{\partial \mathrm{\ell}}{\partial {z}_{o}}\right)\frac{1}{\pi \mathrm{sin}(\pi {z}_{o})}{v}_{\top},$$  (8) 
where we obtain $\partial {\mathrm{z}}_{\mathrm{o}}/\partial {\mathrm{v}}_{\mathrm{o}}$ by differentiating through Equation (7) (or, more readily, by implicitly differentiating through its rearrangement $\mathrm{cos}(\pi {z}_{o})={v}_{\top}^{T}{v}_{o}$).
3.3.2 Backpropagation through the SDP
Given the analytical form for $\partial \mathrm{\ell}/\partial {\mathrm{V}}_{\mathcal{O}}$ (with respect to the output relaxations), we next seek to derive $\partial \mathrm{\ell}/\partial {\mathrm{V}}_{\mathcal{I}}$ (with respect to the input relaxations) and $\partial \mathrm{\ell}/\partial \mathrm{S}$ (with respect to the layer weights) by pushing gradients through our SDP solution procedure (Section 3.2.3). We describe the analytical form for the resultant gradients in Theorem 1.
Theorem 1.
Define ${P}_{o}\mathrm{\equiv}{I}_{k}\mathrm{}{v}_{o}\mathit{}{v}_{o}^{T}$ for each $o\mathrm{\in}\mathrm{O}$. Then, define $U\mathrm{\in}{\mathrm{R}}^{k\mathrm{\times}n}$, where the columns ${U}_{\mathrm{I}}\mathrm{=}\mathrm{0}$ and the columns ${U}_{\mathrm{O}}$ are given by
$$\mathrm{vec}({U}_{\mathcal{O}})={\left(P((C+D)\otimes {I}_{k})P\right)}^{\u2020}\mathrm{vec}\left(\frac{\partial \mathrm{\ell}}{\partial {V}_{\mathcal{O}}}\right),$$  (9) 
where $P\mathrm{\equiv}\mathrm{diag}\mathit{}\mathrm{(}{P}_{o}\mathrm{)}$, where $C\mathrm{\equiv}{S}_{\mathrm{O}}^{T}\mathit{}{S}_{\mathrm{O}}\mathrm{}\mathrm{diag}\mathit{}\mathrm{(}{\mathrm{\parallel}{s}_{o}\mathrm{\parallel}}^{\mathrm{2}}\mathrm{)}$, and where $D\mathrm{\equiv}\mathrm{diag}\mathit{}\mathrm{(}\mathrm{\parallel}{g}_{o}\mathrm{\parallel}\mathrm{)}$. Then, the gradient of the network loss $\mathrm{\ell}$ with respect to the relaxed layer inputs is
$$\frac{\partial \mathrm{\ell}}{\partial {V}_{\mathcal{I}}}=\left(\sum _{o\in \mathcal{O}}{u}_{o}{s}_{o}^{T}\right){S}_{\mathcal{I}},$$  (10) 
where ${S}_{\mathrm{I}}$ is the $\mathrm{I}$indexed column subset of $S$, and the gradient with respect to the layer weights is
$$\frac{\partial \mathrm{\ell}}{\partial S}={\left(\sum _{o\in \mathcal{O}}{u}_{o}{s}_{o}^{T}\right)}^{T}V(S{V}^{T})U.$$  (11) 
We defer the derivation of Theorem 1 to Appendix B. Although this derivation is somewhat involved, the concept at a high level is quite simple: we differentiate the solution of the SDP problem (Section 3.1) with respect to the problem’s parameters and input, which requires computing the (relatively large) matrixvector solve given in Equation (9).
To solve Equation (9), we use a coordinate descent approach that closely mirrors the coordinate descent procedure employed in the forward pass, and which has similar fast convergence properties. This procedure, described in Algorithm 3, enables us to compute the desired gradients without needing to maintain intermediate Jacobians explicitly. Mirroring the forward pass, we use rankone updates to maintain and modify the term $\mathrm{\Psi}=U{S}^{T}$ needed to compute $\U0001d5bd{g}_{o}$, which again enables our algorithm to run in $O(nmk)$ time. We defer the derivation of Algorithm 3 to Appendix D.
3.3.3 From relaxed to original inputs
As a final step, we must use the gradient $\partial \mathrm{\ell}/\partial {\mathrm{V}}_{\mathcal{I}}$ (with respect to the input relaxations) to derive the gradient $\partial \mathrm{\ell}/\partial {\mathrm{Z}}_{\mathcal{I}}$ (with respect to the actual inputs) by pushing gradients through the input relaxation procedure described in Section 3.2.2. For each $\iota \in \mathcal{I}$, we see that
$\frac{\partial \mathrm{\ell}}{\partial {z}_{\iota}}$  $={\displaystyle \frac{\partial \mathrm{\ell}}{\partial {z}_{\iota}^{\star}}}+{\left({\displaystyle \frac{\partial \mathrm{\ell}}{\partial {v}_{\iota}}}\right)}^{T}{\displaystyle \frac{\partial {v}_{\iota}}{\partial {z}_{\iota}}}$  (12)  
$={\displaystyle \frac{\partial \mathrm{\ell}}{\partial {z}_{\iota}^{\star}}}{\left({\displaystyle \frac{\partial {v}_{\iota}}{\partial {z}_{\iota}}}\right)}^{T}\left({\displaystyle \sum _{o\in \mathcal{O}}}{u}_{o}{s}_{o}^{T}\right){s}_{\iota}$ 
where
$$\frac{\partial {v}_{\iota}}{\partial {z}_{\iota}}=\pi \left(\mathrm{sin}(\pi {z}_{\iota}){v}_{\top}+\mathrm{cos}(\pi {z}_{\iota})({I}_{k}{v}_{\top}{v}_{\top}^{T}){v}_{\iota}^{\text{rand}}\right),$$  (13) 
and where $\partial \mathrm{\ell}/\partial {\mathrm{z}}_{\iota}^{\star}$ captures any direct dependence of $\mathrm{\ell}$ on ${z}_{\iota}^{\star}$ (as opposed to dependence through ${v}_{\iota}$). Here, the expression for $\partial \mathrm{\ell}/\partial {\mathrm{v}}_{\iota}$ comes from Equation (10), and we obtain $\partial {\mathrm{v}}_{\iota}/\partial {\mathrm{z}}_{\iota}$ by differentiating Equation (5).
3.4 An efficient GPU implementation
The coordinate descent updates in Algorithms 2 and 3 dominate the computational costs of the forward and backward passes, respectively. We thus present an efficient, parallel GPU implementation of these algorithms to speed up training and inference. During the inner loop of coordinate descent, our implementation parallelizes the computation of all ${g}_{o}$ ($\U0001d5bd{g}_{o}$) terms by parallelizing the computation of $\mathrm{\Omega}$ ($\mathrm{\Psi}$), as well as of all rankone updates of $\mathrm{\Omega}$ ($\mathrm{\Psi}$). This underscores the benefit of using a lowrank SDP formulation in our MAXSAT layer, as traditional fullrank coordinate descent cannot be efficiently parallelized. We find in our preliminary benchmarks that our GPU CUDAC implementation is up to $1830$x faster than the corresponding OpenMP implementation run on Xeon CPUs. Source code for our implementation is available at https://github.com/locuslab/SATNet.
4 Experiments
We test our MAXSAT layer approach in three domains that are traditionally difficult for neural networks: learning the parity function with singlebit supervision, learning $9\times 9$ Sudoku solely from examples, and solving a “visual Sudoku” problem that generates the logical Sudoku solution given an input image of a Sudoku puzzle. We find that in all cases, we are able to perform substantially better on these tasks than previous deep learningbased approaches.
4.1 Learning parity (chained XOR)
This experiment tests SATNet’s ability to differentiate through many successive SAT problems by learning to compute the parity function. The parity of a bit string is defined as one if there is an odd number of ones in the sequence and zero otherwise. The task is to map input sequences to their parity, given a dataset of example sequence/parity pairs. Learning parity functions from such singlebit supervision is known to pose difficulties for conventional deep learning approaches (ShalevShwartz et al., 2017). However, parity is simply a logic function – namely, a sequence of XOR operations applied successively to the input sequence.
Hence, for a sequence of length $L$, we construct our model to contain a sequence of $L1$ SATNet layers with tied weights (similar to a recurrent network). The first layer receives the first two binary values as input, and layer $d$ receives value $d$ along with the rounded output of layer $d1$. If each layer learns to compute the XOR function, the combined system will correctly compute parity. However, this requires the model to coordinate a long series of SAT problems without any intermediate supervision.
Figure 2 shows that our model accomplishes this task for input sequences of length $L=20$ and $L=40$. For each sequence length, we generate a dataset of 10K random examples (9K training and 1K testing). We train our model using crossentropy loss and the Adam optimizer (Kingma & Ba, 2015) with a learning rate of ${10}^{1}$. We compare to an LSTM sequence classifier, which uses 100 hidden units and a learning rate of ${10}^{3}$ (we tried varying the architecture and learning rate but did not observe any improvement). In each case, our model quickly learns the target function, with error on the heldout set converging to zero within 20 epochs. In contrast, the LSTM is unable to learn an appropriate representation, with only minor improvement over the course of 100 training epochs; across both input lengths, it achieves a testing error rate of at best 0.476 (where a random guess achieves value 0.5).
4.2 Sudoku (original and permuted)



In this experiment, we test SATNet’s ability to infer and recover constraints simply from bit supervision (i.e. without any hardcoded specification of how bits are related). We demonstrate this property via Sudoku. In Sudoku, given a (typically) $9\times 9$ partiallyfilled grid of numbers, a player must fill in the remaining empty grid cells such that each row, each column, and each of nine $3\times 3$ subgrids contains exactly one of each number from 1 through 9. While this constraint satisfaction problem is computationally easy to solve once the rules of the game are specified, actually learning the rules of the game, i.e. the hard constraints of the puzzle, has proved challenging for traditional neural network architectures. In particular, Sudoku problems are often solved computationally via tree search, and while tree search cannot be easily performed by neural networks, it is easily expressible using SAT and MAXSAT problems.
We construct a SATNet model for this task that takes as input a logical (bit) representation of the initial Sudoku board along with a mask representing which bits must be learned (i.e. all bits in empty Sudoku cells). This input is vectorized, which means that our SATNet model cannot exploit the locality structure of the input Sudoku grid when learning to solve puzzles. Given this input, the SATNet layer then outputs a bit representation of the Sudoku board with guesses for the unknown bits. Our model architecture consists of a single SATNet layer with 300 auxiliary variables and low rank structure $m=600$, and we train it to minimize a digitwise negative log likelihood objective (optimized via Adam with a $2\times {10}^{3}$ learning rate).
We compare our model to a convolutional neural network baseline modeled on that of Park (2016), which interprets the bit inputs as 9 input image channels (one for each square in the board) and uses a sequence of 10 convolutional layers (each with 512 3$\times $3 filters) to output the solution. The ConvNet makes explicit use of locality in the input representation since it treats the nine cells within each square as a single image. We also compare to a version of the ConvNet which receives a binary mask indicating which bits need to be learned (ConvNetMask). The mask is input as a set of additional image channels in the same format as the board. We trained both architectures using mean squared error (MSE) loss (which gave better results than negative log likelihood for this architecture). The loss was optimized using Adam (learning rate ${10}^{4}$). We additionally tried to train an OptNet (Amos & Kolter, 2017) model for comparison, but this model made little progress even after a few days of training. (We compare our method to OptNet on a simpler $4\times 4$ version of the Sudoku problem in Appendix E.)
Our results for the traditional $9\times 9$ Sudoku problem (over 9K training examples and 1K test examples) are shown in Table 1. (Convergence plots for this experiment are shown in Appendix F.) Our model is able to learn the constraints of the Sudoku problem, achieving high accuracy early in the training process (95.0% test accuracy in 22 epochs/37 minutes on a GTX 1080 Ti GPU), and demonstrating 98.3% boardwise test accuracy after 100 training epochs (172 minutes). On the other hand, the ConvNet baseline does poorly. It learns to correctly solve 72.6% of puzzles in the training set but fails altogether to generalize: accuracy on the heldout set reaches at most 0.04%. The ConvNetMask baseline, which receives a binary mask denoting which entries must be completed, performs only somewhat better, correctly solving 15.1% of puzzles in the heldout set. We note that our test accuracy is qualitatively similar to the results obtained in Palm et al. (2017), but that our network is able to learn the structure of Sudoku without explicitly encoding the relationships between variables.
To underscore that our architecture truly learns the rules of the game, as opposed to overfitting to locality or other structure in the inputs, we test our SATNet architecture on permuted Sudoku boards, i.e. boards for which we apply a fixed permutation of the underlying bit representation (and adjust the corresponding input masks and labels accordingly). This removes any locality structure, and the resulting Sudoku boards do not have clear visual analogues that can be solved by humans. However, the relationships between bits are unchanged (modulo the permutation) and should therefore be discoverable by architectures that can truly learn the underlying logical structure. Table 1 shows results for this problem in comparison to the convolutional neural network baselines. Our architecture is again able to learn the rules of the (permuted) game, demonstrating the same 98.3% boardwise test accuracy as in the original game. In contrast, the convolutional neural network baselines perform even more poorly than in the original game (achieving 0% test accuracy even with the binary mask as input), as there is little locality structure to exploit. Overall, these results demonstrate that SATNet can truly learn the logical relationships between discrete variables.
4.3 Visual Sudoku
In this experiment, we demonstrate that SATNet can be integrated into larger deep network architectures for endtoend training. Specifically, we solve the visual Sudoku problem: that is, given an image representation of a Sudoku board (as opposed to a onehot encoding or other logical representation) constructed with MNIST digits, our network must output a logical solution to the associated Sudoku problem. An example input is shown in Figure 3. This problem cannot traditionally be represented well by neural network architectures, as it requires the ability to combine multiple neural network layers without hardcoding the logical structure of the problem into intermediate logical layers.
Our architecture for this problem uses a convolutional neural network connected to a SATNet layer. Specifically, we apply a convolutional layer for digit classification (which uses the LeNet architecture (LeCun et al., 1998)) to each cell of the Sudoku input. Each cellwise probabilistic output of this convolutional layer is then fed as logical input to the SATNet layer, along with an input mask (as in Section 4.2). This SATNet layer employs the same architecture and training parameters as described in the previous section. The whole model is trained endtoend to minimize crossentropy loss, and is optimized via Adam with learning rates $2\times {10}^{3}$ for the SATNet layer and ${10}^{5}$ for the convolutional layer.
We compare our approach against a convolutional neural network which combines two sets of convolutional layers. First, the visual inputs are passed through the same convolutional layer as in our SATNet model, which outputs a probabilistic bit representation. Next, this representation is passed through the convolutional architecture that we compared to for the original Sudoku problem, which outputs a solution. We use the same training approach as above.
Table 1 summarizes our experimental results (over 9K training examples and 1K test examples); additional plots are shown in Appendix F. We contextualize these results against the theoretical “best” testing accuracy of 74.7%, which accounts for the Sudoku digit classification accuracy of our specific convolutional architecture; that is, assuming boards with 36.2 out of 81 filled cells on average (as in our test set) and an MNIST model with 99.2% test accuracy (LeCun et al., 1998), we would expect a perfect Sudoku solver to output the correct solution 74.7% ($={0.992}^{36.2}$) of the time. In 100 epochs, our model learns to correctly solve 63.2% of boards at test time, reaching 85% of this theoretical “best.” Hence, our approach demonstrates strong performance in solving visual Sudoku boards endtoend. On the other hand, the baseline convolutional networks make only minuscule improvements to the training loss over the course of 100 epochs, and fail altogether to improve outofsample performance. Accordingly, our SATNet architecture enables endtoend learning of the “rules of the game” directly from pictorial inputs in a way that was not possible with previous architectures.
5 Conclusion
In this paper, we have presented a lowrank differentiable MAXSAT layer that can be integrated into neural network architectures. This layer employs block coordinate descent methods to efficiently compute the forward and backward passes, and is amenable to GPU acceleration. We show that our SATNet architecture can be successfully used to learn logical structures, namely the parity function and the rules of $9\times 9$ Sudoku. We also show, via a visual Sudoku task, that our layer can be integrated into larger deep network architectures for endtoend training. Our layer thus shows promise in allowing deep networks to learn logical structure without hardcoding of the relationships between variables.
More broadly, we believe that this work fills a notable gap in the regime spanning deep learning and logical reasoning. While many “differentiable logical reasoning” systems have been proposed, most of them still require fairly handspecified logical rules and groundings, and thus are somewhat limited in their ability to operate in a truly endtoend fashion. Our hope is that by wrapping a powerful yet generic primitive such as MAXSAT solving within a differentiable framework, our solver can enable “implicit” logical reasoning to occur where needed within larger frameworks, even if the precise structure of the domain is unknown and must be learned from data. In other words, we believe that SATNet provides a step towards integrating symbolic reasoning and deep learning, a longstanding goal in artificial intelligence.
Acknowledgments
PoWei Wang is supported by a grant from the Bosch Center for AI; Priya Donti is supported by the Department of Energy’s Computational Science Graduate Fellowship under grant number DEFG0297ER25308; and Bryan Wilder is supported by the National Science Foundation Graduate Research Fellowship.
References
 Amos & Kolter (2017) Amos, B. and Kolter, J. Z. Optnet: Differentiable optimization as a layer in neural networks. arXiv preprint arXiv:1703.00443, 2017.
 Barvinok (1995) Barvinok, A. I. Problems of distance geometry and convex properties of quadratic maps. Discrete & Computational Geometry, 13(2):189–202, 1995.
 Cingillioglu & Russo (2018) Cingillioglu, N. and Russo, A. Deeplogic: Endtoend logical reasoning. arXiv preprint arXiv:1805.07433, 2018.
 Dai et al. (2018) Dai, W.Z., Xu, Q.L., Yu, Y., and Zhou, Z.H. Tunneling neural perception and logic reasoning through abductive learning. arXiv preprint arXiv:1802.01173, 2018.
 Djolonga & Krause (2017) Djolonga, J. and Krause, A. Differentiable learning of submodular models. In Advances in Neural Information Processing Systems, pp. 1013–1023, 2017.
 Donti et al. (2017) Donti, P. L., Amos, B., and Kolter, J. Z. Taskbased endtoend model learning in stochastic optimization. arXiv preprint arXiv:1703.04529, 2017.
 Evans & Grefenstette (2018) Evans, R. and Grefenstette, E. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64, 2018.
 Garcez et al. (2015) Garcez, A., Besold, T. R., De Raedt, L., Földiak, P., Hitzler, P., Icard, T., Kühnberger, K.U., Lamb, L. C., Miikkulainen, R., and Silver, D. L. Neuralsymbolic learning and reasoning: contributions and challenges. In Proceedings of the AAAI Spring Symposium on Knowledge Representation and Reasoning: Integrating Symbolic and Neural Approaches, Stanford, 2015.
 Goemans & Williamson (1995) Goemans, M. X. and Williamson, D. P. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.
 Gomes et al. (2006) Gomes, C. P., van Hoeve, W.J., and Leahu, L. The power of semidefinite programming relaxations for maxsat. In International Conference on Integration of Artificial Intelligence (AI) and Operations Research (OR) Techniques in Constraint Programming, pp. 104–118. Springer, 2006.
 Hu et al. (2016) Hu, Z., Ma, X., Liu, Z., Hovy, E., and Xing, E. Harnessing deep neural networks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pp. 2410–2420, 2016.
 Kingma & Ba (2015) Kingma, D. P. and Ba, J. L. Adam: Amethod for stochastic optimization. In International Conference on Learning Representations, 2015.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Ling et al. (2018) Ling, C. K., Fang, F., and Kolter, J. Z. What game are we playing? endtoend learning in normal and extensive form games. In Proceedings of the TwentySeventh International Joint Conference on Artificial Intelligence, pp. 396–402, 2018. doi: 10.24963/ijcai.2018/55.
 Manhaeve et al. (2018) Manhaeve, R., Dumancic, S., Kimmig, A., Demeester, T., and De Raedt, L. Deepproblog: Neural probabilistic logic programming. In Advances in Neural Information Processing Systems, pp. 3749–3759, 2018.
 Palm et al. (2017) Palm, R. B., Paquet, U., and Winther, O. Recurrent relational networks. arXiv preprint arXiv:1711.08028, 2017.
 Park (2016) Park, K. Can neural networks crack sudoku?, 2016. URL https://github.com/Kyubyong/sudoku.
 Pataki (1998) Pataki, G. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of operations research, 23(2):339–358, 1998.
 Selsam et al. (2018) Selsam, D., Lamm, M., Bunz, B., Liang, P., de Moura, L., and Dill, D. L. Learning a sat solver from singlebit supervision. arXiv preprint arXiv:1802.03685, 2018.
 ShalevShwartz et al. (2017) ShalevShwartz, S., Shamir, O., and Shammah, S. Failures of gradientbased deep learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 3067–3075, 2017.
 Sourek et al. (2018) Sourek, G., Aschenbrenner, V., Zelezny, F., Schockaert, S., and Kuzelka, O. Lifted relational neural networks: Efficient learning of latent relational structures. Journal of Artificial Intelligence Research, 62:69–100, 2018.
 Tschiatschek et al. (2018) Tschiatschek, S., Sahin, A., and Krause, A. Differentiable submodular maximization. arXiv preprint arXiv:1803.01785, 2018.
 Wang & Kolter (2019) Wang, P.W. and Kolter, J. Z. Lowrank semidefinite programming for the max2sat problem. In AAAI Conference on Artificial Intelligence, 2019.
 Wang et al. (2017) Wang, P.W., Chang, W.C., and Kolter, J. Z. The mixing method: coordinate descent for lowrank semidefinite programming. arXiv preprint arXiv:1706.00476, 2017.
 Wilder et al. (2018) Wilder, B., Dilkina, B., and Tambe, M. Melding the datadecisions pipeline: Decisionfocused learning for combinatorial optimization. In AAAI Conference on Artificial Intelligence, 2018.
 Xu et al. (2018) Xu, J., Zhang, Z., Friedman, T., Liang, Y., and den Broeck, G. V. A semantic loss function for deep learning with symbolic knowledge. In Proceedings of the 35th International Conference on Machine Learning, pp. 5498–5507, 2018. URL http://proceedings.mlr.press/v80/xu18h.html.
 Yang et al. (2017) Yang, F., Yang, Z., and Cohen, W. W. Differentiable learning of logical rules for knowledge base reasoning. In Advances in Neural Information Processing Systems, pp. 2319–2328, 2017.
Appendix A Derivation of the forward pass coordinate descent update
Our MAXSAT SDP relaxation (described in Section 3.1) is given by
$\underset{V\in {\mathbb{R}}^{k\times (n+1)}}{minimize}$  $\u27e8{S}^{T}S,{V}^{T}V\u27e9,$  (A.1)  
subject to  $\parallel {v}_{i}\parallel =1,i=0,\mathrm{\dots},n,$ 
where $S\in {\mathbb{R}}^{m\times (n+1)}$ and ${v}_{i}$ is the $i$th column vector of $V$.
We rewrite the objective of (A.1) as $\u27e8{S}^{T}S,{V}^{T}V\u27e9\equiv \mathrm{tr}({({S}^{T}S)}^{T}({V}^{T}V))=\mathrm{tr}({V}^{T}V{S}^{T}S)$ by noting that ${S}^{T}S$ is symmetric and by cycling matrices within the trace. We then observe that the objective terms that depend on any given ${v}_{i}$ are given by
$${v}_{i}^{T}\sum _{j=0}^{n}{s}_{j}^{T}{s}_{i}{v}_{j}={v}_{i}^{T}\sum _{\begin{array}{c}j=0\\ (j\ne i)\end{array}}^{n}{s}_{j}^{T}{s}_{i}{v}_{j}+{v}_{i}^{T}{s}_{i}^{T}{s}_{i}{v}_{i},$$  (A.2) 
where ${s}_{i}$ is the $i$th column vector of $S$. Observe ${v}_{i}^{T}{v}_{i}$ in the last term cancels to $1$, and the remaining coefficient
$${g}_{i}\equiv \sum _{\begin{array}{c}j=0\\ (j\ne i)\end{array}}^{n}{s}_{j}^{T}{s}_{i}{v}_{j}=V{S}^{T}{s}_{i}{\parallel {s}_{i}\parallel}^{2}{v}_{i}$$  (A.3) 
is constant with respect to ${v}_{i}$. Thus, (A.2) can be simply rewritten as
$${v}_{i}^{T}{g}_{i}+{s}_{i}^{T}{s}_{i}.$$  (A.4) 
Minimizing this expression over ${v}_{i}$ with respect to the constraint $\parallel {v}_{i}\parallel =1$ yields the block coordinate descent update
$${v}_{i}={g}_{i}/\parallel {g}_{i}\parallel .$$  (A.5) 
Appendix B Details on backpropagation through the MAXSAT SDP
Given the result $\partial \mathrm{\ell}/\partial {\mathrm{V}}_{\mathcal{O}}$, we next seek to compute $\partial \mathrm{\ell}/\partial {\mathrm{V}}_{\mathcal{I}}$ and $\partial \mathrm{\ell}/\partial \mathrm{S}$ by pushing gradients through the SDP solution procedure described in Section 3.1. We do this by taking the total differential through our coordinate descent updates (3) for each output $o\in \mathcal{O}$ at the optimal fixedpoint solution to which these updates converge.
Computing the total differential.
Computing the total differential of the updates (3) and rearranging, we see that for every $o\in \mathcal{O}$,
$\left(\parallel {g}_{o}\parallel {I}_{k}{\parallel {s}_{o}\parallel}^{2}{P}_{o}\right)\U0001d5bd{v}_{o}+{P}_{o}{\displaystyle \sum _{j\in \mathcal{O}}}{s}_{o}^{T}{s}_{j}\U0001d5bd{v}_{j}={P}_{o}{\xi}_{o},$  (B.1)  
where  
${\xi}_{o}\equiv \left({\displaystyle \sum _{j\in {\mathcal{I}}^{\prime}}}{s}_{o}^{T}{s}_{j}\U0001d5bd{v}_{j}+V\U0001d5bd{S}^{T}{s}_{o}+V{S}^{T}\U0001d5bd{s}_{o}2\U0001d5bd{s}_{o}^{T}{s}_{o}{v}_{o}\right),$  (B.2) 
and where ${P}_{o}\equiv {I}_{k}{v}_{o}{v}_{o}^{T},o\in \mathcal{O}$ and ${\mathcal{I}}^{\prime}\equiv \{\top \}\cup \mathcal{I}$.
Rewriting as a linear system.
Rewriting Equation B.1 over all $o\in \mathcal{O}$ as a linear system, we obtain
$\left(\mathrm{diag}(\parallel {g}_{o}\parallel )\otimes {I}_{k}+PC\otimes {I}_{k}\right)\mathrm{vec}(\U0001d5bd{V}_{\mathcal{O}})=P\mathrm{vec}({\xi}_{o})$  (B.3)  
$\Rightarrow \mathrm{vec}(\U0001d5bd{V}_{\mathcal{O}})={\left(P(\left(\mathrm{diag}(\parallel {g}_{o}\parallel )+C\right)\otimes {I}_{k})P\right)}^{\u2020}\mathrm{vec}({\xi}_{o}),$ 
where $C={S}_{\mathcal{O}}^{T}{S}_{\mathcal{O}}\mathrm{diag}({\parallel {s}_{o}\parallel}^{2})$, $P=\mathrm{diag}({P}_{o})$, and the second step follows from the lemma presented in Appendix C.
We then see that by the chain rule, the gradients $\partial \mathrm{\ell}/\partial {\mathrm{V}}_{\mathcal{I}}$ and $\partial \mathrm{\ell}/\partial \mathrm{S}$ are given by the left matrixvector product
${\left({\displaystyle \frac{\partial \mathrm{\ell}}{\partial \mathrm{vec}\left({V}_{\mathcal{O}}\right)}}\right)}^{T}\mathrm{vec}\left(\U0001d5bd{V}_{\mathcal{O}}\right)$  (B.4)  
$=$  ${\left({\displaystyle \frac{\partial \mathrm{\ell}}{\partial \mathrm{vec}\left({V}_{\mathcal{O}}\right)}}\right)}^{T}{\left(P\left(\left(\mathrm{diag}\left(\parallel {g}_{o}\parallel \right)+C\right)\otimes {I}_{k}\right)P\right)}^{\u2020}\mathrm{vec}\left({\xi}_{o}\right)$ 
where the second equality comes from plugging in the result of (B.3).
Now, define $U\in {\mathbb{R}}^{k\times n}$, where the columns ${U}_{\mathcal{I}}=0$ and the columns ${U}_{\mathcal{O}}$ are given by
$$\mathrm{vec}({U}_{\mathcal{O}})={\left(P(\left(\mathrm{diag}(\parallel {g}_{o}\parallel )+C\right)\otimes {I}_{k})P\right)}^{\u2020}\mathrm{vec}\left(\frac{\partial \mathrm{\ell}}{\partial \mathrm{vec}({V}_{\mathcal{O}})}\right).$$  (B.5) 
Then, we see that (B.4) can be written as
$${\left(\frac{\partial \mathrm{\ell}}{\partial \mathrm{vec}({V}_{\mathcal{O}})}\right)}^{T}\mathrm{vec}(\U0001d5bd{V}_{\mathcal{O}})=\mathrm{vec}{({U}_{\mathcal{O}})}^{T}\mathrm{vec}({\xi}_{o}),$$  (B.6) 
which is the implicit linear form for our gradients.
Computing desired gradients from implicit linear form.
Once we have obtained ${U}_{\mathcal{O}}$ (via coordinate descent), we can explicitly compute the desired gradients $\partial \mathrm{\ell}/\partial {\mathrm{V}}_{\mathcal{I}}$ and $\partial \mathrm{\ell}/\partial \mathrm{S}$ from the implicit form (B.6). For instance, to compute the gradient $\partial \mathrm{\ell}/\partial {\mathrm{v}}_{\iota}$ for some $\iota \in \mathcal{I}$, we would set $\U0001d5bd{v}_{\iota}=1$ and all other gradients to zero in Equation (B.6) (where these gradients are captured within the terms ${\xi}_{o}$).
Explicitly, we compute each $\partial \mathrm{\ell}/\partial {\mathrm{v}}_{\iota \mathrm{j}}$ by setting $\U0001d5bd{v}_{\iota j}=1$ and all other gradients to zero, i.e.
$\frac{\partial \mathrm{\ell}}{\partial {v}_{\iota j}}$  $=\mathrm{vec}{({U}_{\mathcal{O}})}^{T}\mathrm{vec}({\xi}_{o})={\displaystyle \sum _{o\in \mathcal{O}}}{u}_{o}^{T}{e}_{j}{s}_{\iota}^{T}{s}_{o}$  (B.7)  
$={e}_{j}^{T}\left({\displaystyle \sum _{o\in \mathcal{O}}}{u}_{o}{s}_{o}^{T}\right){s}_{\iota}.$ 
Similarly, we compute each $\partial \mathrm{\ell}/\partial {\mathrm{S}}_{\mathrm{i},\mathrm{j}}$ by setting $\U0001d5bd{S}_{i,j}=1$ and all other gradients to zero, i.e.
$\frac{\partial \mathrm{\ell}}{\partial {S}_{i,j}}$  $={\displaystyle \sum _{o\in \mathcal{O}}}{u}_{o}^{T}{\xi}_{o}$  (B.8)  
$={\displaystyle \sum _{o\in \mathcal{O}}}{u}_{o}^{T}{v}_{i}{s}_{oj}{u}_{i}^{T}{(V{S}^{T})}_{j}+{u}_{i}^{T}({s}_{ij}{P}_{i}{v}_{i})$  
$={v}_{i}^{T}({\displaystyle \sum _{o\in \mathcal{O}}}{u}_{o}{s}_{oj}){u}_{i}^{T}{(V{S}^{T})}_{j}.$ 
In matrix form, these gradients are
$\frac{\partial \mathrm{\ell}}{\partial {V}_{\mathcal{I}}}}=\left({\displaystyle \sum _{o\in \mathcal{O}}}{u}_{o}{s}_{o}^{T}\right){S}_{\mathcal{I}},$  (B.9)  
$\frac{\partial \mathrm{\ell}}{\partial S}}={\left({\displaystyle \sum _{o\in \mathcal{O}}}{u}_{o}{s}_{o}^{T}\right)}^{T}V(S{V}^{T})U,$  (B.10) 
where ${u}_{i}$ is the $i$th column of $U$, and where ${S}_{\mathcal{I}}$ denotes the $\mathcal{I}$indexed column subset of $S$.
Appendix C Proof of pseudoinverse computations
We prove the following lemma, used to derive the implicit total differential for $\mathrm{vec}(\U0001d5bd{V}_{\mathcal{O}})$.
Lemma C.1.
The quantity
$$\mathrm{vec}(\U0001d5bd{V}_{\mathcal{O}})={\left(P\left((D+C)\otimes {I}_{k}\right)P\right)}^{\u2020}\mathrm{vec}({\xi}_{o})$$  (C.1) 
is the solution of the linear system
$$(D\otimes {I}_{k}+PC\otimes {I}_{k})\mathrm{vec}(\U0001d5bd{V}_{\mathcal{O}})=P\mathrm{vec}({\xi}_{o}),$$  (C.2) 
where $P\mathrm{=}\mathrm{diag}\mathit{}\mathrm{(}{I}_{k}\mathrm{}{v}_{o}\mathit{}{v}_{o}^{T}\mathrm{)}$, $C\mathrm{=}{S}_{\mathrm{O}}^{T}\mathit{}{S}_{\mathrm{O}}\mathrm{}\mathrm{diag}\mathit{}\mathrm{(}{\mathrm{\parallel}{s}_{o}\mathrm{\parallel}}^{\mathrm{2}}\mathrm{)}$, $D\mathrm{=}\mathrm{diag}\mathit{}\mathrm{(}\mathrm{\parallel}{g}_{i}\mathrm{\parallel}\mathrm{)}$, and ${\xi}_{o}$ is as defined in Equation (B.2).
Proof.
Examining the equation with respect to $\U0001d5bd{v}_{i}$ gives
$$\parallel {g}_{i}\parallel \U0001d5bd{v}_{i}+{P}_{i}\left(\sum _{j}{c}_{ij}\U0001d5bd{v}_{j}{\xi}_{j}\right)=0,$$  (C.3) 
which implies that for all $i$, $\U0001d5bd{v}_{i}={P}_{i}{y}_{i}$ for some ${y}_{i}$. Substituting ${y}_{i}$ into the equality gives
$(D\otimes {I}_{k}+PC\otimes {I}_{k})P\mathrm{vec}({y}_{i})$  (C.4)  
$=$  $P((D+C)\otimes {I}_{k})P\mathrm{vec}({y}_{i})=P\mathrm{vec}({\xi}_{o}).$  (C.5) 
Note that the last equation comes form $D\otimes {I}_{k}P=D\otimes {I}_{k}PP=P(D\otimes {I}_{k})P$ due to the block diagonal structure of the projection $P$. Thus, by the properties of projectors and the pseudoinverse,
$\mathrm{vec}(Y)$  $={(P((D+C)\otimes {I}_{k})P)}^{\u2020}P\mathrm{vec}({\xi}_{o})$  (C.6)  
$={(P((D+C)\otimes {I}_{k})P)}^{\u2020}\mathrm{vec}({\xi}_{o}).$  (C.7) 
Note that the first equation comes from the idempotence property of $P$ (that is, $PP=P$). Substituting $\mathrm{vec}(\U0001d5bd{V}_{\mathcal{O}})=P\mathrm{vec}(Y)$ back gives the solution of $\U0001d5bd{V}_{\mathcal{O}}$. ∎
Appendix D Derivation of the backward pass coordinate descent algorithm
Consider solving for ${U}_{\mathcal{O}}$ as mentioned in Equation (B.5):
$$\left(P(\left(\mathrm{diag}(\parallel {g}_{o}\parallel )+C\right)\otimes {I}_{k})P\right)\mathrm{vec}({U}_{\mathcal{O}})=\mathrm{vec}\left(\frac{\partial \mathrm{\ell}}{\partial \mathrm{vec}({V}_{\mathcal{O}})}\right),$$ 
where $C={S}_{\mathcal{O}}^{T}{S}_{\mathcal{O}}\mathrm{diag}({\parallel {s}_{o}\parallel}^{2})$. The linear system can be computed using block coordinate descent. Specifically, observe this linear system with respect to only the ${u}_{o}$ variable. Since we start from ${U}_{\mathcal{O}}=0$, we can assume that $P\mathrm{vec}({U}_{o})=\mathrm{vec}({U}_{o})$. This yields
$$\parallel {g}_{o}\parallel {P}_{o}{u}_{o}+{P}_{o}\left({U}_{\mathcal{O}}{S}_{\mathcal{O}}^{T}{s}_{o}{\parallel {s}_{o}\parallel}^{2}{u}_{o}\right)={P}_{o}\left(\frac{\partial \mathrm{\ell}}{\partial {v}_{o}}\right).$$  (D.1) 
Let $\mathrm{\Psi}=({U}_{\mathcal{O}}){S}_{\mathcal{O}}^{T}$. Then we have
$$\parallel {g}_{o}\parallel {P}_{o}{u}_{o}={P}_{o}(\mathrm{\Psi}{s}_{o}{\parallel {s}_{o}\parallel}^{2}{u}_{o}\partial \mathrm{\ell}/\partial {v}_{o}).$$  (D.2) 
Define $\U0001d5bd{g}_{i}$ to be the terms contained in parentheses in the righthand side of the above equation. Note that $\U0001d5bd{g}_{i}$ does not depend on the variable ${u}_{o}$. Thus, we have the closedform feasible solution
$${u}_{o}={P}_{o}\U0001d5bd{g}_{o}/\parallel {g}_{o}\parallel .$$  (D.3) 
After updating ${u}_{o}$, we can maintain the term $\mathrm{\Psi}$ by replacing the old ${u}_{o}^{\text{prev}}$ with the new ${u}_{o}$. This yields the rank 1 update
$$\mathrm{\Psi}:=\mathrm{\Psi}+({u}_{o}{u}_{o}^{\text{prev}}){s}_{o}^{T}.$$  (D.4) 
The above procedure is summarized in Algorithm 3. Further, we can verify that the assumption $P\mathrm{vec}({U}_{\mathcal{O}})=\mathrm{vec}({U}_{\mathcal{O}})$ still holds after each update by the projection ${P}_{o}$.
Appendix E Results for the $4\times 4$ Sudoku problem
We compare the performance of our SATNet architecture on a $4\times 4$ reduced version of the Sudoku puzzle against OptNet (Amos & Kolter, 2017) and a convolutional neural network architecture. These results (over 9K training and 1K testing examples) are shown in Figure E.1. We note that our architecture converges quickly – in just two epochs – to 100% boardwise test accuracy.
OptNet takes slightly longer to converge to similar performance, in terms of both time and epochs. In particular, we see that OptNet takes 34 epochs to converge (as opposed to 1 epoch for SATNet). Further, in our preliminary benchmarks, OptNet required 12 minutes to run 20 epochs on a GTX 1080 Ti GPU, whereas SATNet took only 2 minutes to run the same number of epochs. In other words, we see that SATNet requires fewer epochs to converge and takes less time per epoch than OptNet.
Both our SATNet architecture and OptNet outperform the traditional convolutional neural network in this setting, as the ConvNet somewhat overfits to the training set and therefore does not generalize as well to the test set (achieving 93% accuracy). The ConvNetMask, which additionally receives a binary input mask, performs much better (99% test accuracy) but does not achieve perfect performance as in the case of OptNet and SATNet.
Appendix F Convergence plots for $9\times 9$ Sudoku experiments
Convergence plots for our $9\times 9$ Sudoku experiments (original and permuted) are shown in Figure F.1. SATNet performs nearly identically in both the original and permuted settings, generalizing well to the test set at every epoch without overfitting to the training set. The ConvNet and ConvNetMask, on the other hand, do not generalize well. In the original setting, both architectures overfit to the training set, showing littletono improvement in generalization performance over the course of training. In the permuted setting, both ConvNet and ConvNetMask make little progress even on the training set, as they are not able to rely on spatial locality of inputs.
Convergence plots for the visual Sudoku experiments are shown in Figure F.2. Here, we see that SATNet generalizes well in terms of loss throughout the training process, and generalizes somewhat well in terms of wholeboard accuracy. The difference in generalization performance between the logical and visual Sudoku settings can be attributed to the generalization performance of the MNIST classifier trained endtoend with our SATNet layer. The ConvNetMask architecture overfits to the training set, and the ConvNet architecture makes littletono progress even on the training set.