Similarity of Neural Network Representations Revisited
Abstract
Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. We examine methods for comparing neural network representations based on canonical correlation analysis (CCA). We show that CCA belongs to a family of statistics for measuring multivariate similarity, but that neither CCA nor any other statistic that is invariant to invertible linear transformation can measure meaningful similarities between representations of higher dimension than the number of data points. We introduce a similarity index that measures the relationship between representational similarity matrices and does not suffer from this limitation. This similarity index is equivalent to centered kernel alignment (CKA) and is also closely connected to CCA. Unlike CCA, CKA can reliably identify correspondences between representations in networks trained from different initializations.
1 Introduction
Across a wide range of machine learning tasks, deep neural networks enable learning powerful feature representations automatically from data. Despite impressive empirical advances of deep neural networks in solving various tasks, the problem of understanding and characterizing the neural network representations learned from data remains relatively underexplored. Previous work (e.g. Advani & Saxe (2017); Amari et al. (2018); Saxe et al. (2014)) has made progress in understanding the theoretical dynamics of the neural network training process. These studies are insightful, but fundamentally limited, because they ignore the complex interaction between the training dynamics and structured data. A window into the network’s representation can provide more information about the interaction between machine learning algorithms and data than the value of the loss function alone.
This paper investigates the problem of measuring similarities between deep neural network representations. An effective method for measuring representational similarity could help answer many interesting questions, including: (1) Do deep neural networks with the same architecture trained from different random initializations learn similar representations? (2) Can we establish correspondences between layers of different network architectures? (3) How similar are the representations learned using the same network architecture from different datasets?
We build upon previous studies investigating similarity between the representations of neural networks (Laakso & Cottrell, 2000; Li et al., 2015; Raghu et al., 2017; Morcos et al., 2018; Wang et al., 2018). We are also inspired by the extensive neuroscience literature that uses representational similarity analysis (Kriegeskorte et al., 2008a; Edelman, 1998) to compare representations across brain areas (Haxby et al., 2001; Freiwald & Tsao, 2010), individuals (Connolly et al., 2012), species (Kriegeskorte et al., 2008b), and behaviors (Elsayed et al., 2016), as well as between brains and neural networks (Yamins et al., 2014; Khaligh-Razavi & Kriegeskorte, 2014; Sussillo et al., 2015).
Our key contributions are summarized as follows:

• We discuss the invariance properties of similarity indexes and their implications for measuring similarity of neural network representations.
• We show that CKA is able to determine the correspondence between the hidden layers of neural networks trained from different random initializations and with different widths, scenarios where previously proposed similarity indexes fail.
• We verify that wider networks learn more similar representations, and show that the similarity of early layers saturates at fewer channels than later layers. We demonstrate that early layers, but not later layers, learn similar representations on different datasets.
Problem Statement
Let $X\in {\mathbb{R}}^{n\times {p}_{1}}$ denote a matrix of activations of ${p}_{1}$ neurons for $n$ examples, and $Y\in {\mathbb{R}}^{n\times {p}_{2}}$ denote a matrix of activations of ${p}_{2}$ neurons for the same $n$ examples. We assume that these matrices have been preprocessed to center the columns. Without loss of generality we assume that ${p}_{1}\le {p}_{2}$. We are concerned with the design and analysis of a scalar similarity index $s(X,Y)$ that can be used to compare representations within and across neural networks, in order to help visualize and understand the effect of different factors of variation in deep learning.
2 What Should Similarity Be Invariant To?
This section discusses the invariance properties of similarity indexes and their implications for measuring similarity of neural network representations. We argue that both intuitive notions of similarity and the dynamics of neural network training call for a similarity index that is invariant to orthogonal transformation and isotropic scaling, but not invertible linear transformation.
2.1 Invariance to Invertible Linear Transformation
A similarity index is invariant to invertible linear transformation if $s(X,Y)=s(XA,YB)$ for any full rank $A$ and $B$. If activations $X$ are followed by a fully-connected layer $f(X)=\sigma (XW+\beta )$, then transforming the activations by a full rank matrix $A$ as $X^{\prime}=XA$ and transforming the weights by the inverse $A^{-1}$ as $W^{\prime}=A^{-1}W$ preserves the output of $f(X)$. This transformation does not appear to change how the network operates, so intuitively, one might prefer a similarity index that is invariant to invertible linear transformation, as argued by Raghu et al. (2017).
However, a limitation of invariance to invertible linear transformation is that any invariant similarity index gives the same result for any representation of width greater than or equal to the dataset size, i.e. ${p}_{2}\ge n$. We provide a simple proof in Appendix A.
Theorem 2.1.
Let $X$ and $Y$ be $n\times p$ matrices. Suppose $s$ is invariant to invertible linear transformation in the first argument, i.e. $s(X,Z)=s(XA,Z)$ for arbitrary $Z$ and any $A$ with $\text{rank}(A)=p$. If $\text{rank}(X)=\text{rank}(Y)=n$, then $s(X,Z)=s(Y,Z)$.
There is thus a practical problem with invariance to invertible linear transformation: Some neural networks, especially convolutional networks, have more neurons in some layers than there are examples in the training dataset (Springenberg et al., 2015; Lee et al., 2018; Zagoruyko & Komodakis, 2016). It is somewhat unnatural that a similarity index could require more examples than were used for training.
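The saturation implied by Theorem 2.1 can be observed numerically. The sketch below is an illustration under assumed sizes and seed, not an experiment from the paper: it draws two independent random representations that are wider than the number of examples and evaluates the mean squared CCA correlation, which is invariant to invertible linear transformation. Linear CKA (Section 3) is computed for contrast. Because the columns are centered, each matrix has rank at most $n-1$, so the two column spaces coincide and $R^2_{\text{CCA}}$ equals 1 even though the representations are unrelated.

```python
import numpy as np

def center(M):
    """Subtract the column means, as assumed in the problem statement."""
    return M - M.mean(axis=0, keepdims=True)

def orthonormal_basis(M, tol=1e-10):
    """Orthonormal basis for the column space of M, obtained from the SVD."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, s > tol * s.max()]

def mean_squared_cca(X, Y):
    """R^2_CCA = ||Q_Y^T Q_X||_F^2 divided by the smaller subspace dimension."""
    Qx, Qy = orthonormal_basis(X), orthonormal_basis(Y)
    return np.linalg.norm(Qy.T @ Qx) ** 2 / min(Qx.shape[1], Qy.shape[1])

def linear_cka(X, Y):
    """Linear CKA = ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F) for centered inputs."""
    return (np.linalg.norm(Y.T @ X) ** 2
            / (np.linalg.norm(X.T @ X) * np.linalg.norm(Y.T @ Y)))

rng = np.random.default_rng(0)            # arbitrary seed
n, p = 20, 50                             # assumed sizes: more neurons than examples
X = center(rng.standard_normal((n, p)))
Y = center(rng.standard_normal((n, p)))   # drawn independently of X

r2_cca = mean_squared_cca(X, Y)
cka = linear_cka(X, Y)
print(f"R^2_CCA = {r2_cca:.3f}, linear CKA = {cka:.3f}")
```

In this setting the CCA-based index carries no information about the data, whereas linear CKA stays below 1 and continues to depend on the actual activations.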
A deeper issue is that neural network training is not invariant to arbitrary invertible linear transformation of inputs or activations. Even in the linear case, gradient descent converges first along the eigenvectors corresponding to the largest eigenvalues of the input covariance matrix (LeCun et al., 1991), and in cases of overparameterization or early stopping, the solution reached depends on the scale of the input. Similar results hold for gradient descent training of neural networks in the infinite width limit (Jacot et al., 2018). The sensitivity of neural network training to linear transformation is further demonstrated by the popularity of batch normalization (Ioffe & Szegedy, 2015).
Invariance to invertible linear transformation implies that the scale of directions in activation space is irrelevant. Empirically, however, scale information is both consistent across networks and useful across tasks. Neural networks trained from different random initializations develop representations with similar large principal components, as shown in Figure 1. Consequently, Euclidean distances between examples, which depend primarily upon large principal components, are similar across networks. These distances are meaningful, as demonstrated by the success of perceptual loss and style transfer (Gatys et al., 2016; Johnson et al., 2016; Dumoulin et al., 2017). A similarity index that is invariant to invertible linear transformation ignores this aspect of the representation, and assigns the same score to networks that match only in large principal components or networks that match only in small principal components.
2.2 Invariance to Orthogonal Transformation
Rather than requiring invariance to any invertible linear transformation, one could require a weaker condition: invariance to orthogonal transformation, i.e. $s(X,Y)=s(XU,YV)$ for full-rank orthonormal matrices $U$ and $V$ such that $U^{\text{T}}U=I$ and $V^{\text{T}}V=I$.
Indexes invariant to orthogonal transformations do not share the limitations of indexes invariant to invertible linear transformation. When $p_{2}>n$, indexes invariant to orthogonal transformation remain well-defined. Moreover, orthogonal transformations preserve scalar products and Euclidean distances between examples.
Invariance to orthogonal transformation seems desirable for neural networks trained by gradient descent. Invariance to orthogonal transformation implies invariance to permutation, which is needed to accommodate symmetries of neural networks (Chen et al., 1993; Orhan & Pitkow, 2018). In the linear case, orthogonal transformation of the input does not affect the dynamics of gradient descent training (LeCun et al., 1991), and for neural networks initialized with rotationally symmetric weight distributions, e.g. i.i.d. Gaussian weight initialization, training with fixed orthogonal transformations of activations yields the same distribution of training trajectories as untransformed activations, whereas an arbitrary linear transformation would not.
Given a similarity index $s(\cdot ,\cdot )$ that is invariant to orthogonal transformation, one can construct a similarity index $s^{\prime}(\cdot ,\cdot )$ that is invariant to any invertible linear transformation by first orthonormalizing the columns of $X$ and $Y$, and then applying $s(\cdot ,\cdot )$. Given thin QR decompositions $X=Q_{X}R_{X}$ and $Y=Q_{Y}R_{Y}$, one can construct a similarity index $s^{\prime}(X,Y)=s(Q_{X},Q_{Y})$, where $s^{\prime}(\cdot ,\cdot )$ is invariant to invertible linear transformation because orthonormal bases with the same span are related to each other by orthonormal transformation (see Appendix B).
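The orthonormalization construction can be sketched in a few lines. The example below is an illustration under assumptions: it uses linear CKA (Section 3) as the orthogonally invariant index $s$, and the helper name `orthonormalized_index`, the matrix sizes, and the random transforms `A` and `B` are hypothetical. It checks that $s^{\prime}$ is unchanged by invertible linear transformations of either input, while $s$ itself is not.

```python
import numpy as np

def linear_cka(X, Y):
    """An orthogonally invariant index s: linear CKA for centered inputs."""
    return (np.linalg.norm(Y.T @ X) ** 2
            / (np.linalg.norm(X.T @ X) * np.linalg.norm(Y.T @ Y)))

def orthonormalized_index(s, X, Y):
    """s'(X, Y) = s(Q_X, Q_Y), with Q from thin QR decompositions."""
    Qx, _ = np.linalg.qr(X)   # default mode="reduced" returns the thin factor
    Qy, _ = np.linalg.qr(Y)
    return s(Qx, Qy)

rng = np.random.default_rng(0)            # arbitrary seed
n, p1, p2 = 200, 5, 8                     # assumed sizes with p1 <= p2 < n
X = rng.standard_normal((n, p1))
Y = rng.standard_normal((n, p2))
X -= X.mean(axis=0)                       # center the columns
Y -= Y.mean(axis=0)

A = rng.standard_normal((p1, p1))         # random matrices are almost surely invertible
B = rng.standard_normal((p2, p2))

invariant_before = orthonormalized_index(linear_cka, X, Y)
invariant_after = orthonormalized_index(linear_cka, X @ A, Y @ B)
plain_before = linear_cka(X, Y)
plain_after = linear_cka(X @ A, Y @ B)

print(invariant_before, invariant_after)  # agree up to rounding error
print(plain_before, plain_after)          # generally differ
```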
2.3 Invariance to Isotropic Scaling
We expect similarity indexes to be invariant to isotropic scaling, i.e. $s(X,Y)=s(\alpha X,\beta Y)$ for any $\alpha ,\beta \in {\mathbb{R}}^{+}$. That said, a similarity index that is invariant to both orthogonal transformation and non-isotropic scaling, i.e. rescaling of individual features, is invariant to any invertible linear transformation. This follows from the existence of the singular value decomposition of the transformation matrix. Generally, we are interested in similarity indexes that are invariant to isotropic but not necessarily non-isotropic scaling.
3 Comparing Similarity Structures
Our key insight is that instead of comparing multivariate features of an example in the two representations (e.g. via regression), one can first measure the similarity between every pair of examples in each representation separately, and then compare the similarity structures. In neuroscience, such matrices representing the similarities between examples are called representational similarity matrices (Kriegeskorte et al., 2008a). We show below that, if we use an inner product to measure similarity, the similarity between representational similarity matrices reduces to another intuitive notion of pairwise feature similarity.
Dot Product-Based Similarity.
A simple formula relates dot products between examples to dot products between features:
$\langle \text{vec}(XX^{\text{T}}),\text{vec}(YY^{\text{T}})\rangle =\text{tr}(XX^{\text{T}}YY^{\text{T}})=\|Y^{\text{T}}X\|_{\text{F}}^{2}.$  (1)
The elements of $XX^{\text{T}}$ and $YY^{\text{T}}$ are dot products between the representations of the $i^{\text{th}}$ and $j^{\text{th}}$ examples, and indicate the similarity between these examples according to the respective networks. The left-hand side of (1) thus measures the similarity between the inter-example similarity structures. The right-hand side yields the same result by measuring the similarity between features from $X$ and $Y$, by summing the squared dot products between every pair.
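Equation 1 is easy to confirm numerically. The following sketch uses arbitrary random matrices (the sizes and seed are assumptions chosen for illustration) and evaluates the three expressions: the inner product of the vectorized example-by-example Gram matrices, the trace form, and the squared Frobenius norm of the feature-by-feature cross-product.

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed
n, p1, p2 = 50, 4, 7                 # assumed sizes
X = rng.standard_normal((n, p1))
Y = rng.standard_normal((n, p2))

gram_x = X @ X.T                     # n x n dot products between examples in X
gram_y = Y @ Y.T                     # n x n dot products between examples in Y

lhs = np.dot(gram_x.ravel(), gram_y.ravel())   # <vec(XX^T), vec(YY^T)>
middle = np.trace(gram_x @ gram_y)             # tr(XX^T YY^T)
rhs = np.linalg.norm(Y.T @ X) ** 2             # ||Y^T X||_F^2, a p2 x p1 matrix

print(lhs, middle, rhs)              # the three values agree up to rounding
```

The right-hand side only needs the $p_{2}\times p_{1}$ matrix $Y^{\text{T}}X$, which is the cheaper form when the number of examples exceeds the number of features.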
Hilbert-Schmidt Independence Criterion.
Equation 1 implies that, for centered $X$ and $Y$:
$\frac{1}{(n-1)^{2}}\text{tr}(XX^{\text{T}}YY^{\text{T}})=\|\text{cov}(X^{\text{T}},Y^{\text{T}})\|_{\text{F}}^{2}.$  (2)
The Hilbert-Schmidt Independence Criterion (Gretton et al., 2005) generalizes Equations 1 and 2 to inner products from reproducing kernel Hilbert spaces, where the squared Frobenius norm of the cross-covariance matrix becomes the squared Hilbert-Schmidt norm of the cross-covariance operator. Let $K_{ij}=k(\mathbf{x}_{i},\mathbf{x}_{j})$ and $L_{ij}=l(\mathbf{y}_{i},\mathbf{y}_{j})$ where $k$ and $l$ are two kernels. The empirical estimator of HSIC is:
$\text{HSIC}(K,L)=\frac{1}{(n-1)^{2}}\text{tr}(KHLH),$  (3)
where $H$ is the centering matrix $H_{n}=I_{n}-\frac{1}{n}\mathbf{1}\mathbf{1}^{\text{T}}$. For linear kernels $k(\mathbf{x},\mathbf{y})=l(\mathbf{x},\mathbf{y})=\mathbf{x}^{\text{T}}\mathbf{y}$, HSIC yields (2).
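A minimal sketch of the empirical estimator in Equation 3 follows, assuming dense $n\times n$ kernel matrices that fit in memory. The function names, sizes, and seed are illustrative choices. The example also checks the linear-kernel case against Equation 2, using uncentered inputs so that the centering matrix $H$ does real work.

```python
import numpy as np

def centering_matrix(n):
    """H_n = I_n - (1/n) 1 1^T."""
    return np.eye(n) - np.ones((n, n)) / n

def hsic(K, L):
    """Empirical HSIC estimator tr(KHLH) / (n - 1)^2 for n x n kernel matrices."""
    n = K.shape[0]
    H = centering_matrix(n)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)               # arbitrary seed
n, p1, p2 = 40, 3, 5                         # assumed sizes
X = rng.standard_normal((n, p1)) + 2.0       # deliberately not centered
Y = rng.standard_normal((n, p2)) - 1.0

# Linear kernels: K = X X^T, L = Y Y^T.
hsic_linear = hsic(X @ X.T, Y @ Y.T)

# Equation 2: squared Frobenius norm of the cross-covariance matrix.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)
cross_cov = Xc.T @ Yc / (n - 1)
cov_norm_sq = np.linalg.norm(cross_cov) ** 2

print(hsic_linear, cov_norm_sq)              # equal up to rounding
```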
Gretton et al. (2005) originally proposed HSIC as a test statistic for determining whether two sets of variables are independent. They prove that the empirical estimator converges to the population value at a rate of $1/\sqrt{n}$, and Song et al. (2007) provide an unbiased estimator. When $k$ and $l$ are universal kernels, HSIC = 0 implies independence, but HSIC is not an estimator of mutual information. HSIC is equivalent to maximum mean discrepancy between the joint distribution and the product of the marginal distributions, and HSIC with a specific kernel family is equivalent to distance covariance (Sejdinovic et al., 2013).
Centered Kernel Alignment.
HSIC is not invariant to isotropic scaling, but it can be made invariant through normalization. This normalized index is known as centered kernel alignment (Cortes et al., 2012; Cristianini et al., 2002):
$\text{CKA}(K,L)={\displaystyle \frac{\text{HSIC}(K,L)}{\sqrt{\text{HSIC}(K,K)\text{HSIC}(L,L)}}}.$  (4) 
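The normalization in Equation 4 can be sketched as below. This is an illustrative implementation with linear kernels, not the authors' code, and the data and the scale factors 3.0 and 0.5 are arbitrary assumptions. It shows that rescaling the inputs changes HSIC by the factor $(\alpha \beta )^{2}$ but leaves CKA unchanged, and that CKA with a linear kernel is also unchanged by an orthogonal transformation of the features.

```python
import numpy as np

def hsic(K, L):
    """Empirical HSIC estimator tr(KHLH) / (n - 1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def cka(K, L):
    """Centered kernel alignment: HSIC normalized by the self-similarities."""
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))

rng = np.random.default_rng(0)                      # arbitrary seed
n, p1, p2 = 60, 5, 8                                # assumed sizes
X = rng.standard_normal((n, p1))
# Y shares three directions with X, plus independent noise.
Y = X[:, :3] @ rng.standard_normal((3, p2)) + rng.standard_normal((n, p2))

K, L = X @ X.T, Y @ Y.T
base = cka(K, L)

# Isotropic scaling: X -> 3 X, Y -> 0.5 Y.
Xs, Ys = 3.0 * X, 0.5 * Y
scaled = cka(Xs @ Xs.T, Ys @ Ys.T)
hsic_ratio = hsic(Xs @ Xs.T, Ys @ Ys.T) / hsic(K, L)   # (3.0 * 0.5)^2 = 2.25

# Orthogonal transformation of the features of X.
U, _ = np.linalg.qr(rng.standard_normal((p1, p1)))
Xr = X @ U
rotated = cka(Xr @ Xr.T, L)

print(base, scaled, rotated, hsic_ratio)
```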
Kernel Selection.
Below, we report results of CKA with a linear kernel and the RBF kernel $k(\mathbf{x}_{i},\mathbf{x}_{j})=\exp (-\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}/(2\sigma ^{2}))$. For the RBF kernel, there are several possible strategies for selecting the bandwidth $\sigma $, which controls the extent to which similarity of small distances is emphasized over large distances. We set $\sigma $ as a fraction of the median distance between examples. In practice, we find that RBF and linear kernels give similar results across most experiments, so we use linear CKA unless otherwise specified. Our framework extends to any valid kernel, including kernels equivalent to neural networks (Lee et al., 2018; Jacot et al., 2018; Garriga-Alonso et al., 2019; Novak et al., 2019).
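One way to realize the bandwidth rule is sketched here. The details are assumptions: the median is taken over the Euclidean distances between distinct examples (not squared distances), and `fraction` is a hypothetical parameter name for the multiplier. Because $\sigma $ scales with the data, the resulting kernel matrix does not change when the representation is rescaled isotropically.

```python
import numpy as np

def rbf_kernel(X, fraction=0.4):
    """RBF kernel matrix with sigma = fraction * median pairwise distance.

    Assumption: the median is computed over Euclidean distances between
    distinct examples; other conventions are possible.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)            # guard against rounding below zero
    n = X.shape[0]
    off_diagonal = ~np.eye(n, dtype=bool)
    sigma = fraction * np.median(np.sqrt(sq_dists[off_diagonal]))
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)                      # arbitrary seed
X = rng.standard_normal((30, 6))                    # assumed sizes

K = rbf_kernel(X, fraction=0.4)
K_scaled = rbf_kernel(3.0 * X, fraction=0.4)        # same kernel after rescaling

print(np.abs(K - K_scaled).max())                   # close to zero
```

The matrices produced this way can be passed as $K$ and $L$ to the HSIC and CKA estimators of Equations 3 and 4. A smaller `fraction` makes the kernel more local, emphasizing small distances.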
4 Related Similarity Indexes
Table 1: Formulae and invariance properties of the similarity indexes used in experiments.

| Similarity Index | Formula | Invariant to Invertible Linear Transform | Invariant to Orthogonal Transform | Invariant to Isotropic Scaling |
| --- | --- | --- | --- | --- |
| Linear Reg. ($R_{\text{LR}}^{2}$) | $\lVert Q_{Y}^{\text{T}}X\rVert_{\text{F}}^{2}/\lVert X\rVert_{\text{F}}^{2}$ | $Y$ only | ✓ | ✓ |
| CCA ($R_{\text{CCA}}^{2}$) | $\lVert Q_{Y}^{\text{T}}Q_{X}\rVert_{\text{F}}^{2}/p_{1}$ | ✓ | ✓ | ✓ |
| CCA ($\overline{\rho}_{\text{CCA}}$) | $\lVert Q_{Y}^{\text{T}}Q_{X}\rVert_{*}/p_{1}$ | ✓ | ✓ | ✓ |
| SVCCA ($R_{\text{SVCCA}}^{2}$) | $\lVert (U_{Y}T_{Y})^{\text{T}}U_{X}T_{X}\rVert_{\text{F}}^{2}/\min (\lVert T_{X}\rVert_{\text{F}}^{2},\lVert T_{Y}\rVert_{\text{F}}^{2})$ | If same subspace kept | ✓ | ✓ |
| SVCCA ($\overline{\rho}_{\text{SVCCA}}$) | $\lVert (U_{Y}T_{Y})^{\text{T}}U_{X}T_{X}\rVert_{*}/\min (\lVert T_{X}\rVert_{\text{F}}^{2},\lVert T_{Y}\rVert_{\text{F}}^{2})$ | If same subspace kept | ✓ | ✓ |
| PWCCA | $\sum_{i=1}^{p_{1}}\alpha_{i}\rho_{i}/\lVert \alpha \rVert_{1}$, $\alpha_{i}=\sum_{j}\lvert \langle \mathbf{h}_{i},\mathbf{x}_{j}\rangle \rvert$ | ✗ | ✗ | ✓ |
| Linear HSIC | $\lVert Y^{\text{T}}X\rVert_{\text{F}}^{2}/(n-1)^{2}$ | ✗ | ✓ | ✗ |
| Linear CKA | $\lVert Y^{\text{T}}X\rVert_{\text{F}}^{2}/(\lVert X^{\text{T}}X\rVert_{\text{F}}\lVert Y^{\text{T}}Y\rVert_{\text{F}})$ | ✗ | ✓ | ✓ |
| RBF CKA | $\text{tr}(KHLH)/\sqrt{\text{tr}(KHKH)\text{tr}(LHLH)}$ | ✗ | ✓ | ✓${}^{*}$ |

${}^{*}$Invariance of RBF CKA to isotropic scaling depends on the procedure used to select the RBF kernel bandwidth parameter. In our experiments, we selected the bandwidth as a fraction of the median distance, which ensures that the similarity index is invariant to isotropic scaling.
In this section, we briefly review linear regression, canonical correlation, and other related methods in the context of measuring similarity between neural network representations. We let $Q_{X}$ and $Q_{Y}$ represent any orthonormal bases for the columns of $X$ and $Y$, i.e. $Q_{X}=X(X^{\text{T}}X)^{-1/2}$, $Q_{Y}=Y(Y^{\text{T}}Y)^{-1/2}$ or orthogonal transformations thereof. Table 1 summarizes the formulae and invariance properties of the indexes used in experiments. For a comprehensive general review of linear indexes for measuring multivariate similarity, see Ramsay et al. (1984).
Linear Regression.
A simple way to relate neural network representations is via linear regression. One can fit every feature in $X$ as a linear combination of features from $Y$. A suitable summary statistic is the total fraction of variance explained by the fit:
$R_{\text{LR}}^{2}=1-\frac{\min_{B}\|X-YB\|_{\text{F}}^{2}}{\|X\|_{\text{F}}^{2}}=\frac{\|Q_{Y}^{\text{T}}X\|_{\text{F}}^{2}}{\|X\|_{\text{F}}^{2}}.$  (5)
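The closed form in Equation 5 can be checked against an explicit least-squares fit. The sketch below regresses each feature of $X$ on the design matrix $Y$, the direction that matches the closed form $\|Q_{Y}^{\text{T}}X\|_{\text{F}}^{2}/\|X\|_{\text{F}}^{2}$, on synthetic centered data. The data-generating choices and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed
n, p1, p2 = 100, 4, 6                           # assumed sizes
Y = rng.standard_normal((n, p2))
# X is partly predictable from Y, plus independent noise.
X = Y @ rng.standard_normal((p2, p1)) + 2.0 * rng.standard_normal((n, p1))
X -= X.mean(axis=0)                             # center the columns
Y -= Y.mean(axis=0)

# Explicit fit: B minimizes ||X - Y B||_F^2.
B, *_ = np.linalg.lstsq(Y, X, rcond=None)
residual = np.linalg.norm(X - Y @ B) ** 2
r2_fit = 1.0 - residual / np.linalg.norm(X) ** 2

# Closed form using an orthonormal basis Q_Y for the columns of Y.
Qy, _ = np.linalg.qr(Y)
r2_closed = np.linalg.norm(Qy.T @ X) ** 2 / np.linalg.norm(X) ** 2

print(r2_fit, r2_closed)                        # equal up to rounding
```

The index is asymmetric: swapping the roles of $X$ and $Y$ generally gives a different value, which is why Table 1 lists invariance to invertible linear transformation for $Y$ only.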
We are unaware of any application of linear regression to measuring similarity of neural network representations, although Romero et al. (2015) used a least squares loss between activations of two networks to encourage thin and deep “student” networks to learn functions similar to wide and shallow “teacher” networks.
Canonical Correlation Analysis (CCA).
Canonical correlation finds bases for two matrices such that, when the original matrices are projected onto these bases, the correlation is maximized. For $1\le i\le p_{1}$, the $i^{\text{th}}$ canonical correlation coefficient $\rho_{i}$ is given by:
$\rho_{i}=\max_{\mathbf{w}_{X}^{i},\mathbf{w}_{Y}^{i}}\text{corr}(X\mathbf{w}_{X}^{i},Y\mathbf{w}_{Y}^{i})$  (6)
$\text{subject to}\quad \forall_{j<i}\; X\mathbf{w}_{X}^{i}\perp X\mathbf{w}_{X}^{j},\quad \forall_{j<i}\; Y\mathbf{w}_{Y}^{i}\perp Y\mathbf{w}_{Y}^{j}.$
The vectors ${\mathbf{w}}_{X}^{i}\in {\mathbb{R}}^{{p}_{1}}$ and ${\mathbf{w}}_{Y}^{i}\in {\mathbb{R}}^{{p}_{2}}$ that maximize ${\rho}_{i}$ are the canonical weights, which transform the original data into canonical variables $X{\mathbf{w}}_{X}^{i}$ and $Y{\mathbf{w}}_{Y}^{i}$. The constraints in (6) enforce orthogonality of the canonical variables.
For the purpose of this work, we consider two summary statistics of the goodness of fit of CCA:
$R_{\text{CCA}}^{2}=\frac{\sum_{i=1}^{p_{1}}\rho_{i}^{2}}{p_{1}}=\frac{\|Q_{Y}^{\text{T}}Q_{X}\|_{\text{F}}^{2}}{p_{1}}$  (7)
$\overline{\rho}_{\text{CCA}}=\frac{\sum_{i=1}^{p_{1}}\rho_{i}}{p_{1}}=\frac{\|Q_{Y}^{\text{T}}Q_{X}\|_{*}}{p_{1}},$  (8)
where $\|\cdot \|_{*}$ denotes the nuclear norm. The mean squared CCA correlation $R_{\text{CCA}}^{2}$ is also known as Yanai’s GCD measure (Ramsay et al., 1984), and several statistical packages report the sum of the squared canonical correlations $p_{1}R_{\text{CCA}}^{2}=\sum_{i=1}^{p_{1}}\rho_{i}^{2}$ under the name Pillai’s trace (SAS Institute, 2015; StataCorp, 2015). The mean CCA correlation $\overline{\rho}_{\text{CCA}}$ was previously used to measure similarity between neural network representations in Raghu et al. (2017).
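Both summary statistics follow from the singular values of $Q_{Y}^{\text{T}}Q_{X}$, which are the canonical correlations. The sketch below assumes centered inputs with full column rank and $p_{1}\le p_{2}<n$. The function name and the synthetic data are illustrative.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations as singular values of Q_Y^T Q_X.

    Assumes centered inputs with full column rank.
    """
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    rho = np.linalg.svd(Qy.T @ Qx, compute_uv=False)
    return np.clip(rho, 0.0, 1.0), Qx, Qy    # clip rounding error outside [0, 1]

rng = np.random.default_rng(0)               # arbitrary seed
n, p1, p2 = 150, 4, 7                        # assumed sizes
X = rng.standard_normal((n, p1))
Y = np.hstack([X[:, :2] + 0.5 * rng.standard_normal((n, 2)),
               rng.standard_normal((n, p2 - 2))])
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

rho, Qx, Qy = canonical_correlations(X, Y)
r2_cca = np.mean(rho ** 2)                   # Equation 7, sum over p1 coefficients
rho_bar = np.mean(rho)                       # Equation 8

# The same quantities from the Frobenius and nuclear norms.
r2_from_norm = np.linalg.norm(Qy.T @ Qx, "fro") ** 2 / p1
rho_bar_from_norm = np.linalg.norm(Qy.T @ Qx, "nuc") / p1

print(r2_cca, r2_from_norm, rho_bar, rho_bar_from_norm)
```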
SVCCA.
CCA is sensitive to perturbation when the condition number of $X$ or $Y$ is large (Golub & Zha, 1995). To improve robustness, singular vector CCA (SVCCA) performs CCA on truncated singular value decompositions of $X$ and $Y$ (Raghu et al., 2017; Mroueh et al., 2015; Kuss & Graepel, 2003). As formulated in Raghu et al. (2017), SVCCA keeps enough principal components of the input matrices to explain a fixed proportion of the variance, and drops remaining components. Thus, it is invariant to invertible linear transformation only if the retained subspace does not change.
Projection-Weighted CCA.
Morcos et al. (2018) propose a different strategy to reduce the sensitivity of CCA to perturbation, which they term “projection-weighted canonical correlation” (PWCCA):
$\rho_{\text{PW}}=\frac{\sum_{i=1}^{c}\alpha_{i}\rho_{i}}{\sum_{i=1}^{c}\alpha_{i}},\qquad \alpha_{i}=\sum_{j}|\langle \mathbf{h}_{i},\mathbf{x}_{j}\rangle |,$  (9)
where ${\mathbf{x}}_{j}$ is the ${j}^{\text{th}}$ column of $X$, and ${\mathbf{h}}_{i}=X{\mathbf{w}}_{X}^{i}$ is the vector of canonical variables formed by projecting $X$ to the ${i}^{\text{th}}$ canonical coordinate frame. As we show in Appendix C.3, PWCCA is closely related to linear regression, since:
$R_{\text{LR}}^{2}=\frac{\sum_{i=1}^{c}\alpha_{i}^{\prime}\rho_{i}^{2}}{\sum_{i=1}^{c}\alpha_{i}^{\prime}},\qquad \alpha_{i}^{\prime}=\sum_{j}\langle \mathbf{h}_{i},\mathbf{x}_{j}\rangle ^{2}.$  (10)
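The relation between Equations 9 and 10 can be illustrated numerically. The sketch below makes several assumptions that are not stated in the text: the canonical variables $\mathbf{h}_{i}$ are normalized to unit length, $c=p_{1}$, the inputs are centered with full column rank, and the weights in Equation 9 use absolute values of the inner products. It computes $\rho_{\text{PW}}$ and confirms that the squared-weight version reproduces $R_{\text{LR}}^{2}$ for the regression of $X$ on $Y$.

```python
import numpy as np

rng = np.random.default_rng(0)                    # arbitrary seed
n, p1, p2 = 120, 4, 6                             # assumed sizes with p1 <= p2
Y = rng.standard_normal((n, p2))
X = Y[:, :p1] + rng.standard_normal((n, p1))      # X partly shared with Y
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

Qx, _ = np.linalg.qr(X)
Qy, _ = np.linalg.qr(Y)

# SVD of Q_X^T Q_Y: singular values are the canonical correlations rho_i,
# and H = Q_X U holds the unit-norm canonical variables h_i of X as columns.
U, rho, _ = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
H = Qx @ U

projections = H.T @ X                             # entry (i, j) is <h_i, x_j>

# Equation 9: projection-weighted CCA.
alpha = np.abs(projections).sum(axis=1)
rho_pw = np.sum(alpha * rho) / np.sum(alpha)

# Equation 10: squared weights and squared correlations give R^2_LR.
alpha_sq = (projections ** 2).sum(axis=1)
r2_weighted = np.sum(alpha_sq * rho ** 2) / np.sum(alpha_sq)
r2_lr = np.linalg.norm(Qy.T @ X) ** 2 / np.linalg.norm(X) ** 2

print(rho_pw, r2_weighted, r2_lr)
```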
Neuron Alignment Procedures.
Other work has studied alignment between individual neurons, rather than alignment between subspaces. Li et al. (2015) examined correlation between the neurons in different neural networks, and attempted to find a bipartite match or semi-match that maximizes the sum of the correlations between the neurons, and then to measure the average correlations. Wang et al. (2018) proposed to search for subsets of neurons $\tilde{X}\subset X$ and $\tilde{Y}\subset Y$ such that, to within some tolerance, every neuron in $\tilde{X}$ can be represented by a linear combination of neurons from $\tilde{Y}$ and vice versa. They found that the maximum matching subsets are very small for intermediate layers.
Mutual Information.
Among nonlinear measures, one candidate is mutual information, which is invariant not only to invertible linear transformation, but to any invertible transformation. Li et al. (2015) previously used mutual information to measure neuronal alignment. In the context of comparing representations, we believe mutual information is not useful. Given any pair of representations produced by deterministic functions of the same input, mutual information between either and the input must be at least as large as mutual information between the representations. Moreover, in fully invertible neural networks (Dinh et al., 2017; Jacobsen et al., 2018), the mutual information between any two layers is equal to the entropy of the input.
5 Linear CKA versus CCA and Regression
Linear CKA is closely related to CCA and linear regression. If $X$ and $Y$ are centered, then ${Q}_{X}$ and ${Q}_{Y}$ are also centered, so:
${R}_{\text{CCA}}^{2}=\text{CKA}({Q}_{X}{Q}_{X}^{\text{T}},{Q}_{Y}{Q}_{Y}^{\text{T}})\sqrt{{\displaystyle \frac{{p}_{2}}{{p}_{1}}}}.$  (11) 
When performing the linear regression fit of $X$ with design matrix $Y$, $R_{\text{LR}}^{2}=\|Q_{Y}^{\text{T}}X\|_{\text{F}}^{2}/\|X\|_{\text{F}}^{2}$, so, using $\|Q_{Y}^{\text{T}}Q_{Y}\|_{\text{F}}=\sqrt{p_{2}}$:
$R_{\text{LR}}^{2}=\text{CKA}(XX^{\text{T}},Q_{Y}Q_{Y}^{\text{T}})\frac{\sqrt{p_{2}}\,\|X^{\text{T}}X\|_{\text{F}}}{\|X\|_{\text{F}}^{2}}.$  (12)
When might we prefer linear CKA over CCA? One way to show the difference is to rewrite $X$ and $Y$ in terms of their singular value decompositions $X=U_{X}\Sigma_{X}V_{X}^{\text{T}}$, $Y=U_{Y}\Sigma_{Y}V_{Y}^{\text{T}}$. Let the $i^{\text{th}}$ eigenvector of $XX^{\text{T}}$ (left-singular vector of $X$) be indexed as $\mathbf{u}_{X}^{i}$. Then $R_{\text{CCA}}^{2}$ is:
$R_{\text{CCA}}^{2}=\|U_{Y}^{\text{T}}U_{X}\|_{\text{F}}^{2}/p_{1}=\sum_{i=1}^{p_{1}}\sum_{j=1}^{p_{2}}\langle \mathbf{u}_{X}^{i},\mathbf{u}_{Y}^{j}\rangle ^{2}/p_{1}.$  (13)
Let the $i^{\text{th}}$ eigenvalue of $XX^{\text{T}}$ (squared singular value of $X$) be indexed as $\lambda_{X}^{i}$. Linear CKA can be written as:
$\text{CKA}(XX^{\text{T}},YY^{\text{T}})=\frac{\|Y^{\text{T}}X\|_{\text{F}}^{2}}{\|X^{\text{T}}X\|_{\text{F}}\|Y^{\text{T}}Y\|_{\text{F}}}=\frac{\sum_{i=1}^{p_{1}}\sum_{j=1}^{p_{2}}\lambda_{X}^{i}\lambda_{Y}^{j}\langle \mathbf{u}_{X}^{i},\mathbf{u}_{Y}^{j}\rangle ^{2}}{\sqrt{\sum_{i=1}^{p_{1}}(\lambda_{X}^{i})^{2}}\sqrt{\sum_{j=1}^{p_{2}}(\lambda_{Y}^{j})^{2}}}.$  (14)
Linear CKA thus resembles CCA weighted by the eigenvalues of the corresponding eigenvectors, i.e. the amount of variance in $X$ or $Y$ that each explains. SVCCA (Raghu et al., 2017) and projection-weighted CCA (Morcos et al., 2018) were also motivated by the idea that eigenvectors that correspond to small eigenvalues are less important, but linear CKA incorporates this weighting symmetrically and can be computed without a matrix decomposition.
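The spectral forms in Equations 13 and 14 can be compared directly with the feature-space formulas. The following sketch assumes centered inputs with full column rank; the per-feature scales applied to $X$ are an arbitrary choice that gives the eigenvalues a wide spread, so that the weighted and unweighted sums differ visibly.

```python
import numpy as np

rng = np.random.default_rng(0)                       # arbitrary seed
n, p1, p2 = 100, 5, 8                                # assumed sizes
X = rng.standard_normal((n, p1)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
Y = np.hstack([X[:, :3] + rng.standard_normal((n, 3)),
               rng.standard_normal((n, p2 - 3))])
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

Ux, sx, _ = np.linalg.svd(X, full_matrices=False)    # columns of Ux are u_X^i
Uy, sy, _ = np.linalg.svd(Y, full_matrices=False)
lam_x, lam_y = sx ** 2, sy ** 2                      # eigenvalues of XX^T and YY^T

overlap = (Ux.T @ Uy) ** 2                           # entry (i, j) is <u_X^i, u_Y^j>^2

# Equation 13: every pair of eigenvectors is weighted equally.
r2_cca_spectral = overlap.sum() / p1

# Equation 14: pairs are weighted by the product of their eigenvalues.
cka_spectral = ((lam_x[:, None] * lam_y[None, :] * overlap).sum()
                / (np.sqrt(np.sum(lam_x ** 2)) * np.sqrt(np.sum(lam_y ** 2))))

# Feature-space forms, which need no decomposition for CKA.
cka_direct = (np.linalg.norm(Y.T @ X) ** 2
              / (np.linalg.norm(X.T @ X) * np.linalg.norm(Y.T @ Y)))
Qx, _ = np.linalg.qr(X)
Qy, _ = np.linalg.qr(Y)
r2_cca_direct = np.linalg.norm(Qy.T @ Qx) ** 2 / p1

print(r2_cca_spectral, r2_cca_direct, cka_spectral, cka_direct)
```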
Comparison of (13) and (14) immediately suggests the possibility of alternative weightings of scalar products between eigenvectors. Indeed, as we show in Appendix D.1, the similarity index induced by “canonical ridge” regularized CCA (Vinod, 1976), when appropriately normalized, interpolates between ${R}_{\text{CCA}}^{2}$, linear regression, and linear CKA.
6 Results
6.1 A Sanity Check for Similarity Indexes
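Table 2: Accuracy of identifying the architecturally corresponding layer by maximum similarity, for each similarity index.

| Index | Accuracy (%) |
| --- | --- |
| CCA ($\overline{\rho}$) | 1.4 |
| CCA ($R_{\text{CCA}}^{2}$) | 10.6 |
| SVCCA ($\overline{\rho}$) | 9.9 |
| SVCCA ($R_{\text{CCA}}^{2}$) | 15.1 |
| PWCCA | 11.1 |
| Linear Reg. | 45.4 |
| Linear HSIC | 22.2 |
| CKA (Linear) | 99.3 |
| CKA (RBF 0.2) | 80.6 |
| CKA (RBF 0.4) | 99.1 |
| CKA (RBF 0.8) | 99.3 |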
We propose a simple sanity check for similarity indexes: Given a pair of architecturally identical networks trained from different random initializations, for each layer in the first network, the most similar layer in the second network should be the architecturally corresponding layer. We train 10 networks and, for each layer of each network, we compute the accuracy with which we can find the corresponding layer in each of the other networks by maximum similarity. We then average the resulting accuracies. We compare CKA with CCA, SVCCA, PWCCA, and linear regression.
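The accuracy computation for one pair of networks can be sketched as follows. The sketch rests on assumptions: `toy_network` is a hypothetical stand-in that produces synthetic "layers" by rotating shared latent features and adding noise, linear CKA is used as the similarity index, and only a single pair is scored, where the procedure above averages over all pairs of 10 trained networks.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices (rows are examples)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    return (np.linalg.norm(Y.T @ X) ** 2
            / (np.linalg.norm(X.T @ X) * np.linalg.norm(Y.T @ Y)))

def layer_match_accuracy(similarity):
    """Fraction of layers in network A whose most similar layer in network B
    is the architecturally corresponding one (same index)."""
    best = np.argmax(similarity, axis=1)
    return float(np.mean(best == np.arange(similarity.shape[0])))

def toy_network(latents, rng, noise=0.1):
    """Hypothetical stand-in for a trained network: each layer is a random
    rotation of shared latent features plus independent noise."""
    layers = []
    for Z in latents:
        Q, _ = np.linalg.qr(rng.standard_normal((Z.shape[1], Z.shape[1])))
        layers.append(Z @ Q + noise * rng.standard_normal(Z.shape))
    return layers

rng = np.random.default_rng(0)                      # arbitrary seed
n, width, depth = 200, 10, 4                        # assumed sizes
latents = [rng.standard_normal((n, width)) for _ in range(depth)]
net_a = toy_network(latents, rng)
net_b = toy_network(latents, rng)

# similarity[i, j] compares layer i of network A with layer j of network B.
similarity = np.array([[linear_cka(a, b) for b in net_b] for a in net_a])
accuracy = layer_match_accuracy(similarity)
print(accuracy)
```

Swapping `linear_cka` for another index from Table 1 would give the corresponding entry of the comparison, under the same toy assumptions.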
We first investigate a simple VGG-like convolutional network based on All-CNN-C (Springenberg et al., 2015) (see Appendix E for architecture details). Figure 2 and Table 2 show that CKA passes our sanity check, but other methods perform substantially worse. For SVCCA, we experimented with a range of truncation thresholds, but no threshold revealed the layer structure (Appendix F.2); our results are consistent with those in Appendix E of Raghu et al. (2017).
We also investigate Transformer networks, where all layers are of equal width. In Appendix F.1, we show similarity between the 12 sublayers of the encoders of Transformer models (Vaswani et al., 2017) trained from different random initializations. All similarity indexes achieve non-trivial accuracy and thus pass the sanity check, although RBF CKA and $R_{\text{CCA}}^{2}$ performed slightly better than other methods. However, we found that there are differences in feature scale between representations of feed-forward network and self-attention sublayers that CCA does not capture because it is invariant to non-isotropic scaling.
6.2 Using CKA to Understand Network Architectures
CKA can reveal pathology in neural network representations. In Figure 3, we show CKA between layers of individual CNNs with different depths, where layers are repeated 2, 4, or 8 times. Doubling depth improved accuracy, but greater multipliers hurt accuracy. At 8x depth, CKA indicates that representations of more than half of the network are very similar to the last layer. We validated that these later layers do not refine the representation by training an $\ell ^{2}$-regularized logistic regression classifier on each layer of the network. Classification accuracy in shallower architectures progressively improves with depth, but for the 8x deeper network, accuracy plateaus less than halfway through the network. When applied to ResNets (He et al., 2016), CKA reveals no pathology (Figure 4). We instead observe a grid pattern that originates from the architecture: Post-residual activations are similar to other post-residual activations, but activations within blocks are not.
CKA is equally effective at revealing relationships between layers of different architectures. Figure 5 shows the relationship between different layers of networks with and without residual connections. CKA indicates that, as networks are made deeper, the new layers are effectively inserted in between the old layers. Other similarity indexes fail to reveal meaningful relationships between different architectures, as we show in Appendix F.5.
In Figure 6, we show CKA between networks with different layer widths. Like Morcos et al. (2018), we find that increasing layer width leads to more similar representations between networks. As width increases, CKA approaches 1; CKA of earlier layers saturates faster than later layers. Networks are generally more similar to other networks of the same width than they are to the widest network we trained.
6.3 Similar Representations Across Datasets
CKA can also be used to compare networks trained on different datasets. In Figure 7, we show that models trained on CIFAR-10 and CIFAR-100 develop similar representations in their early layers. These representations require training; similarity with untrained networks is much lower. We further explore similarity between layers of untrained networks in Appendix F.3.
6.4 Analysis of the Shared Subspace
Equation 14 suggests a way to further elucidate what CKA is measuring, based on the action of one representational similarity matrix (RSM) $YY^{\text{T}}$ applied to the eigenvectors $\mathbf{u}_{X}^{i}$ of the other RSM $XX^{\text{T}}$. By definition, $XX^{\text{T}}\mathbf{u}_{X}^{i}$ points in the same direction as $\mathbf{u}_{X}^{i}$, and its norm $\|XX^{\text{T}}\mathbf{u}_{X}^{i}\|_{2}$ is the corresponding eigenvalue. The degree of scaling and rotation by $YY^{\text{T}}$ thus indicates how similar the action of $YY^{\text{T}}$ is to $XX^{\text{T}}$, for each eigenvector of $XX^{\text{T}}$. For visualization purposes, this approach is somewhat less useful than the CKA summary statistic, since it does not collapse the similarity to a single number, but it provides a more complete picture of what CKA measures. Figure 8 shows that, for eigenvectors with large eigenvalues, $XX^{\text{T}}$ and $YY^{\text{T}}$ have similar actions, but the rank of the subspace where this holds is substantially lower than the dimensionality of the activations. In the penultimate (global average pooling) layer, the dimensionality of the shared subspace is approximately 10, which is the number of classes in the CIFAR-10 dataset.
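A rough sketch of this analysis is given below. It is one possible reading of the procedure, not the exact computation behind Figure 8: for each of the top $k$ eigenvectors of $XX^{\text{T}}$, it records the norm of $YY^{\text{T}}\mathbf{u}_{X}^{i}$ (scaling) and the cosine of the angle between $YY^{\text{T}}\mathbf{u}_{X}^{i}$ and $\mathbf{u}_{X}^{i}$ (rotation). The function name `rsm_action` and the parameter `k` are hypothetical.

```python
import numpy as np

def rsm_action(X, Y, k):
    """Apply YY^T to the top-k eigenvectors of XX^T.

    Returns the eigenvalues of XX^T, the norms ||YY^T u||_2, and the cosines
    of the angles between YY^T u and u, one entry per eigenvector.
    """
    Ux, sx, _ = np.linalg.svd(X, full_matrices=False)
    U = Ux[:, :k]                        # eigenvectors of XX^T, largest first
    eigenvalues = sx[:k] ** 2
    V = Y @ (Y.T @ U)                    # YY^T u without forming the n x n matrix
    norms = np.linalg.norm(V, axis=0)
    cosines = np.sum(U * V, axis=0) / norms
    return eigenvalues, norms, cosines

rng = np.random.default_rng(0)           # arbitrary seed
n, p = 100, 10                           # assumed sizes
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
Y = rng.standard_normal((n, p))          # unrelated representation
Y -= Y.mean(axis=0)

# Identical representations: no rotation, scaling equals the eigenvalue.
eig_same, norms_same, cos_same = rsm_action(X, X, k=5)

# Unrelated representations: the action of YY^T rotates the eigenvectors.
eig_diff, norms_diff, cos_diff = rsm_action(X, Y, k=5)

print(cos_same)
print(cos_diff)
```

Counting the eigenvectors whose cosine stays close to 1 gives a crude estimate of the dimensionality of the shared subspace.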
7 Conclusion and Future Work
Measuring similarity between the representations learned by neural networks is an ill-defined problem, since it is not entirely clear what aspects of the representation a similarity index should focus on. Previous work has suggested that there is little similarity between intermediate layers of neural networks trained from different random initializations (Raghu et al., 2017; Wang et al., 2018). We propose CKA as a method for comparing representations of neural networks, and show that it consistently identifies correspondences between layers, not only in the same network trained from different initializations, but across entirely different architectures, whereas other methods do not. We also provide a unified framework for understanding the space of similarity indexes, as well as an empirical framework for evaluation.
We show that CKA captures intuitive notions of similarity, i.e. that neural networks trained from different initializations should be similar to each other. However, it remains an open question whether there exist kernels beyond the linear and RBF kernels that would be better for analyzing neural network representations. Moreover, there are other potential choices of weighting in Equation 14 that may be more appropriate in certain settings. We leave these questions as future work. Nevertheless, CKA seems to be much better than previous methods at finding correspondences between the learned representations in hidden layers of neural networks.
Acknowledgements
We thank Gamaleldin Elsayed, Jaehoon Lee, Paul-Henri Mignot, Maithra Raghu, Samuel L. Smith, and Alex Williams for comments on the manuscript, Rishabh Agarwal for ideas, and Aliza Elkin for support.
References
 Advani & Saxe (2017) Advani, M. S. and Saxe, A. M. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.
 Amari et al. (2018) Amari, S.-i., Ozeki, T., Karakida, R., Yoshida, Y., and Okada, M. Dynamics of learning in MLP: Natural gradient and singularity revisited. Neural Computation, 30(1):1–33, 2018.
 Björck & Golub (1973) Björck, Å. and Golub, G. H. Numerical methods for computing angles between linear subspaces. Mathematics of Computation, 27(123):579–594, 1973.
 Chen et al. (1993) Chen, A. M., Lu, H.-m., and Hecht-Nielsen, R. On the geometry of feedforward neural network error surfaces. Neural Computation, 5(6):910–927, 1993.
 Connolly et al. (2012) Connolly, A. C., Guntupalli, J. S., Gors, J., Hanke, M., Halchenko, Y. O., Wu, Y.-C., Abdi, H., and Haxby, J. V. The representation of biological classes in the human brain. Journal of Neuroscience, 32(8):2608–2618, 2012.
 Cortes et al. (2012) Cortes, C., Mohri, M., and Rostamizadeh, A. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13(Mar):795–828, 2012.
 Cristianini et al. (2002) Cristianini, N., Shawe-Taylor, J., Elisseeff, A., and Kandola, J. S. On kernel-target alignment. In Advances in Neural Information Processing Systems, pp. 367–373, 2002.
 Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. In International Conference on Learning Representations, 2017.
 Dumoulin et al. (2017) Dumoulin, V., Shlens, J., and Kudlur, M. A learned representation for artistic style. International Conference on Learning Representations, 2, 2017.
 Edelman (1998) Edelman, S. Representation is representation of similarities. Behavioral and Brain Sciences, 21(4):449–467, 1998.
 Elsayed et al. (2016) Elsayed, G. F., Lara, A. H., Kaufman, M. T., Churchland, M. M., and Cunningham, J. P. Reorganization between preparatory and movement population responses in motor cortex. Nature Communications, 7:13239, 2016.
 Freiwald & Tsao (2010) Freiwald, W. A. and Tsao, D. Y. Functional compartmentalization and viewpoint generalization within the macaque face-processing system. Science, 330(6005):845–851, 2010.
 Garriga-Alonso et al. (2019) Garriga-Alonso, A., Rasmussen, C. E., and Aitchison, L. Deep convolutional networks as shallow gaussian processes. In International Conference on Learning Representations, 2019.
 Gatys et al. (2016) Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414–2423, 2016.
 Golub & Zha (1995) Golub, G. H. and Zha, H. The canonical correlations of matrix pairs and their numerical computation. In Linear Algebra for Signal Processing, pp. 27–49. Springer, 1995.
 Gretton et al. (2005) Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory, pp. 63–77. Springer, 2005.
 Haxby et al. (2001) Haxby, J. V., Gobbini, M. I., Furey, M. L., Ishai, A., Schouten, J. L., and Pietrini, P. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, 293(5539):2425–2430, 2001.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.
 Jacobsen et al. (2018) Jacobsen, J.-H., Smeulders, A. W., and Oyallon, E. i-RevNet: Deep invertible networks. In International Conference on Learning Representations, 2018.
 Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580, 2018.
 Johnson et al. (2016) Johnson, J., Alahi, A., and Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711. Springer, 2016.
 Khaligh-Razavi & Kriegeskorte (2014) Khaligh-Razavi, S.-M. and Kriegeskorte, N. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology, 10(11):e1003915, 2014.
 Kriegeskorte et al. (2008a) Kriegeskorte, N., Mur, M., and Bandettini, P. A. Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:4, 2008a.
 Kriegeskorte et al. (2008b) Kriegeskorte, N., Mur, M., Ruff, D. A., Kiani, R., Bodurka, J., Esteky, H., Tanaka, K., and Bandettini, P. A. Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron, 60(6):1126–1141, 2008b.
 Kuss & Graepel (2003) Kuss, M. and Graepel, T. The geometry of kernel canonical correlation analysis. Technical report, Max Planck Institute for Biological Cybernetics, 2003.
 Laakso & Cottrell (2000) Laakso, A. and Cottrell, G. Content and cluster analysis: assessing representational similarity in neural systems. Philosophical Psychology, 13(1):47–76, 2000.
 LeCun et al. (1991) LeCun, Y., Kanter, I., and Solla, S. A. Second order properties of error surfaces: Learning time and generalization. In Advances in Neural Information Processing Systems, pp. 918–924, 1991.
 Lee et al. (2018) Lee, J., Sohl-Dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.
 Li et al. (2015) Li, Y., Yosinski, J., Clune, J., Lipson, H., and Hopcroft, J. Convergent learning: Do different neural networks learn the same representations? In Storcheus, D., Rostamizadeh, A., and Kumar, S. (eds.), Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015, volume 44 of Proceedings of Machine Learning Research, pp. 196–212, Montreal, Canada, 11 Dec 2015. PMLR.
 Lorenzo-Seva & Ten Berge (2006) Lorenzo-Seva, U. and Ten Berge, J. M. Tucker’s congruence coefficient as a meaningful index of factor similarity. Methodology, 2(2):57–64, 2006.
 Morcos et al. (2018) Morcos, A., Raghu, M., and Bengio, S. Insights on representational similarity in neural networks with canonical correlation. Advances in Neural Information Processing Systems 31, pp. 5732–5741, 2018.
 Mroueh et al. (2015) Mroueh, Y., Marcheret, E., and Goel, V. Asymmetrically weighted CCA and hierarchical kernel sentence embedding for multimodal retrieval. arXiv preprint arXiv:1511.06267, 2015.
 Novak et al. (2019) Novak, R., Xiao, L., Bahri, Y., Lee, J., Yang, G., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Bayesian deep convolutional networks with many channels are gaussian processes. In International Conference on Learning Representations, 2019.
 Orhan & Pitkow (2018) Orhan, E. and Pitkow, X. Skip connections eliminate singularities. In International Conference on Learning Representations, 2018.
 Press (2011) Press, W. H. Canonical correlation clarified by singular value decomposition, 2011. URL http://numerical.recipes/whp/notes/CanonCorrBySVD.pdf.
 Raghu et al. (2017) Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein, J. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 6076–6085. Curran Associates, Inc., 2017.
 Ramsay et al. (1984) Ramsay, J., ten Berge, J., and Styan, G. Matrix correlation. Psychometrika, 49(3):403–423, 1984.
 Robert & Escoufier (1976) Robert, P. and Escoufier, Y. A unifying tool for linear multivariate statistical methods: the RV-coefficient. Applied Statistics, pp. 257–265, 1976.
 Romero et al. (2015) Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. Fitnets: Hints for thin deep nets. In International Conference on Learning Representations, 2015.
 SAS Institute (2015) SAS Institute. Introduction to Regression Procedures. 2015. URL https://support.sas.com/documentation/onlinedoc/stat/141/introreg.pdf.
 Saxe et al. (2014) Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations, 2014.
 Sejdinovic et al. (2013) Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, pp. 2263–2291, 2013.
 Smith et al. (2017) Smith, S. L., Turban, D. H., Hamblin, S., and Hammerla, N. Y. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In International Conference on Learning Representations, 2017.
 Song et al. (2007) Song, L., Smola, A., Gretton, A., Borgwardt, K. M., and Bedo, J. Supervised feature selection via dependence estimation. In Proceedings of the 24th international conference on Machine learning, pp. 823–830. ACM, 2007.
 Springenberg et al. (2015) Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations Workshop, 2015.
 StataCorp (2015) StataCorp. Stata Multivariate Statistics Reference Manual. 2015. URL https://www.stata.com/manuals14/mv.pdf.
 Sussillo et al. (2015) Sussillo, D., Churchland, M. M., Kaufman, M. T., and Shenoy, K. V. A neural network that finds a naturalistic solution for the production of muscle activity. Nature Neuroscience, 18(7):1025, 2015.
 Tucker (1951) Tucker, L. R. A method for synthesis of factor analysis studies. Technical report, Educational Testing Service, Princeton, NJ, 1951.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
 Vinod (1976) Vinod, H. D. Canonical ridge and econometrics of joint production. Journal of Econometrics, 4(2):147–166, 1976.
 Wang et al. (2018) Wang, L., Hu, L., Gu, J., Wu, Y., Hu, Z., He, K., and Hopcroft, J. E. Towards understanding learning representations: To what extent do different neural networks learn the same representation. In Advances in Neural Information Processing Systems, pp. 9607–9616, 2018.
 Yamins et al. (2014) Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., and DiCarlo, J. J. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, 2014.
 Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. In Richard C. Wilson, E. R. H. and Smith, W. A. P. (eds.), Proceedings of the British Machine Vision Conference (BMVC), pp. 87.1–87.12. BMVA Press, September 2016. ISBN 1-901725-59-6. doi: 10.5244/C.30.87.
Appendix A Proof of Theorem 1
Theorem.
Let $X$ and $Y$ be $n\times p$ matrices. Suppose $s$ is invariant to invertible linear transformation in the first argument, i.e. $s(X,Z)=s(XA,Z)$ for arbitrary $Z$ and any $A$ with $\text{rank}(A)=p$. If $\text{rank}(X)=\text{rank}(Y)=n$, then $s(X,Z)=s(Y,Z)$.
Proof.
Let
$$X^{\prime}=\left[\begin{array}{c} X \\ K_{X} \end{array}\right], \qquad Y^{\prime}=\left[\begin{array}{c} Y \\ K_{Y} \end{array}\right],$$
where $K_{X}$ is a basis for the null space of the rows of $X$ and $K_{Y}$ is a basis for the null space of the rows of $Y$. Then let $A={X^{\prime}}^{-1}Y^{\prime}$.
$$\left[\begin{array}{c} X \\ K_{X} \end{array}\right]A=\left[\begin{array}{c} Y \\ K_{Y} \end{array}\right]\implies XA=Y.$$
Because ${X}^{\prime}$ and ${Y}^{\prime}$ have rank $p$ by construction, $A$ also has rank $p$. Thus, $s(X,Z)=s(XA,Z)=s(Y,Z)$. ∎
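The construction in this proof can be checked numerically. The sketch below is illustrative only: the helper names `null_space_rows` and `matrix_mapping`, the random test matrices, and the choice of an SVD to obtain the null-space bases are assumptions made for the example.

```python
import numpy as np

def null_space_rows(M):
    """Rows form an orthonormal basis of {v : M v = 0}; assumes M has full row rank."""
    n = M.shape[0]
    _, _, vt = np.linalg.svd(M, full_matrices=True)
    return vt[n:]                      # the last p - n right singular vectors

def matrix_mapping(X, Y):
    """Return A = X'^{-1} Y', where X' and Y' stack each matrix on its null-space basis."""
    X_aug = np.vstack([X, null_space_rows(X)])   # p x p, invertible
    Y_aug = np.vstack([Y, null_space_rows(Y)])   # p x p, invertible
    return np.linalg.solve(X_aug, Y_aug)

# Hypothetical example with more features than data points (n < p).
rng = np.random.default_rng(0)
n, p = 5, 8
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, p))
A = matrix_mapping(X, Y)
```

Here `X @ A` reproduces `Y` up to floating-point error even though `X` and `Y` were drawn independently, which is why an index invariant to invertible linear transformation cannot distinguish them when $n \le p$.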
Appendix B Orthogonalization and Invariance to Invertible Linear Transformation
Here we show that any similarity index that is invariant to orthogonal transformation can be made invariant to invertible linear transformation by orthogonalizing the columns of the input.
Proposition 1.
Let $X$ be an $n\times p$ matrix of full column rank and let $A$ be an invertible $p\times p$ matrix. Let $X=Q_{X}R_{X}$ and $XA=Q_{XA}R_{XA}$, where $Q_{X}^{\text{T}}Q_{X}=Q_{XA}^{\text{T}}Q_{XA}=I$ and $R_{X}$ and $R_{XA}$ are invertible. If $s(\cdot,\cdot)$ is invariant to orthogonal transformation, then $s(Q_{X},Y)=s(Q_{XA},Y)$.
Proof.
Let $B=R_{X}AR_{XA}^{-1}$. Then $Q_{X}B=Q_{XA}$, and $B$ is an orthogonal transformation:
$${B}^{\text{T}}B={B}^{\text{T}}{Q}_{X}^{\text{T}}{Q}_{X}B={Q}_{XA}^{\text{T}}{Q}_{XA}=I.$$ 
Thus $s({Q}_{X},Y)=s({Q}_{X}B,Y)=s({Q}_{XA},Y)$. ∎
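As a numerical illustration of Proposition 1, the sketch below builds $B$ from two thin QR decompositions and checks that it is orthogonal. The index `s` used here (a squared Frobenius norm of a cross-product) is just one example of an index that is invariant to orthogonal transformation of its first argument; it and the random matrices are assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 4
X = rng.standard_normal((n, p))
A = rng.standard_normal((p, p))        # a random matrix is invertible with probability 1
Y = rng.standard_normal((n, 3))

Q_x, R_x = np.linalg.qr(X)             # thin QR: Q_x is n x p, R_x is p x p
Q_xa, R_xa = np.linalg.qr(X @ A)
B = R_x @ A @ np.linalg.inv(R_xa)      # B = R_X A R_XA^{-1}

def s(Q, Y):
    """Example index, unchanged when Q is multiplied on the right by an orthogonal matrix."""
    return np.linalg.norm(Y.T @ Q) ** 2
```

Because `Q_x @ B` equals `Q_xa` and `B` is orthogonal, `s(Q_x, Y)` and `s(Q_xa, Y)` agree, so orthogonalizing the input makes this index invariant to the invertible transformation `A`.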
Appendix C CCA and Linear Regression
C.1 Linear Regression
Consider the linear regression fit of the columns of an $n\times m$ matrix $C$ with an $n\times p$ matrix $A$:
$$\widehat{B}=\underset{B}{\arg\min}\,\|C-AB\|_{\text{F}}^{2}=(A^{\text{T}}A)^{-1}A^{\text{T}}C.$$
Let $A=QR$, the thin QR decomposition of $A$. Then the fitted values are given by:
$$\begin{aligned}
\widehat{C} &= A\widehat{B} \\
&= A(A^{\text{T}}A)^{-1}A^{\text{T}}C \\
&= QR(R^{\text{T}}Q^{\text{T}}QR)^{-1}R^{\text{T}}Q^{\text{T}}C \\
&= QRR^{-1}(R^{\text{T}})^{-1}R^{\text{T}}Q^{\text{T}}C \\
&= QQ^{\text{T}}C.
\end{aligned}$$
The residuals $E=C-\widehat{C}$ are orthogonal to the fitted values, i.e.
$$\begin{aligned}
E^{\text{T}}\widehat{C} &= (C-QQ^{\text{T}}C)^{\text{T}}QQ^{\text{T}}C \\
&= C^{\text{T}}QQ^{\text{T}}C-C^{\text{T}}QQ^{\text{T}}C=0.
\end{aligned}$$
Thus:
$$\begin{aligned}
\|E\|_{\text{F}}^{2} &= \text{tr}(E^{\text{T}}E) \\
&= \text{tr}(E^{\text{T}}C-E^{\text{T}}\widehat{C}) \\
&= \text{tr}((C-\widehat{C})^{\text{T}}C) \\
&= \text{tr}(C^{\text{T}}C)-\text{tr}(C^{\text{T}}QQ^{\text{T}}C) \\
&= \|C\|_{\text{F}}^{2}-\|Q^{\text{T}}C\|_{\text{F}}^{2}. && \text{(15)}
\end{aligned}$$
Assuming that $C$ was centered by subtracting its column means prior to the linear regression fit, the total fraction of variance explained by the fit is:
$$R^{2}=1-\frac{\|E\|_{\text{F}}^{2}}{\|C\|_{\text{F}}^{2}}=1-\frac{\|C\|_{\text{F}}^{2}-\|Q^{\text{T}}C\|_{\text{F}}^{2}}{\|C\|_{\text{F}}^{2}}=\frac{\|Q^{\text{T}}C\|_{\text{F}}^{2}}{\|C\|_{\text{F}}^{2}}. \tag{16}$$
Although we have assumed that $Q$ is obtained from QR decomposition, any orthonormal basis with the same span will suffice, because orthogonal transformations do not change the Frobenius norm.
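Equation 16 can be compared against a direct least-squares fit. The following is a minimal sketch; the function name `r_squared_qr` and the synthetic data are assumptions for the example, and both matrices are centered before fitting.

```python
import numpy as np

def r_squared_qr(A, C):
    """Fraction of variance in C explained by A via Equation 16; expects centered inputs."""
    Q, _ = np.linalg.qr(A)             # thin QR: columns of Q span the columns of A
    return np.linalg.norm(Q.T @ C) ** 2 / np.linalg.norm(C) ** 2

# Hypothetical data: C is a noisy linear function of A.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
C = A @ rng.standard_normal((3, 4)) + 0.5 * rng.standard_normal((50, 4))
A -= A.mean(axis=0)
C -= C.mean(axis=0)

# Direct route: fit B by least squares and measure the residual.
B_hat, *_ = np.linalg.lstsq(A, C, rcond=None)
residual = C - A @ B_hat
r2_direct = 1.0 - np.linalg.norm(residual) ** 2 / np.linalg.norm(C) ** 2
r2_qr = r_squared_qr(A, C)
```

The two values agree up to floating-point error, and the QR route never forms $(A^{\text{T}}A)^{-1}$ explicitly.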
C.2 CCA
Let $X$ be an $n\times p_{1}$ matrix and $Y$ be an $n\times p_{2}$ matrix, and let $p=\text{min}(p_{1},p_{2})$. Given the thin QR decompositions of $X$ and $Y$, $X=Q_{X}R_{X}$, $Y=Q_{Y}R_{Y}$ such that $Q_{X}^{\text{T}}Q_{X}=I$, $Q_{Y}^{\text{T}}Q_{Y}=I$, the canonical correlations $\rho_{i}$ are the singular values of $A=Q_{X}^{\text{T}}Q_{Y}$ (Björck & Golub, 1973; Press, 2011) and thus the square roots of the eigenvalues of $A^{\text{T}}A$. The squared canonical correlations $\rho_{i}^{2}$ are the eigenvalues of $A^{\text{T}}A=Q_{Y}^{\text{T}}Q_{X}Q_{X}^{\text{T}}Q_{Y}$. Their sum is $\sum_{i=1}^{p}\rho_{i}^{2}=\text{tr}(A^{\text{T}}A)=\|Q_{Y}^{\text{T}}Q_{X}\|_{\text{F}}^{2}$.
Now consider the linear regression fit of the columns of $Q_{X}$ with $Y$. Assume that $Q_{X}$ has zero mean. Substituting $Q_{Y}$ for $Q$ and $Q_{X}$ for $C$ in Equation 16, and noting that $\|Q_{X}\|_{\text{F}}^{2}=p_{1}$:
$$R^{2}=\frac{\|Q_{Y}^{\text{T}}Q_{X}\|_{\text{F}}^{2}}{p_{1}}=\frac{\sum_{i=1}^{p}\rho_{i}^{2}}{p_{1}}. \tag{17}$$
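The QR-then-SVD route to the canonical correlations can be sketched as below. The function name `canonical_correlations` and the synthetic data are assumptions for the example; inputs are centered inside the function so that the zero-mean assumption on $Q_{X}$ holds.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations as the singular values of Q_X^T Q_Y (inputs centered here)."""
    Q_x, _ = np.linalg.qr(X - X.mean(axis=0))
    Q_y, _ = np.linalg.qr(Y - Y.mean(axis=0))
    return np.linalg.svd(Q_x.T @ Q_y, compute_uv=False)

# Hypothetical data: Y depends on only two of the four columns of X.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
Y = X[:, :2] @ rng.standard_normal((2, 6)) + rng.standard_normal((100, 6))

rho = canonical_correlations(X, Y)
r2_cca = np.sum(rho ** 2) / X.shape[1]   # Equation 17 with p_1 = 4
```

Since the statistic depends on $X$ only through $Q_{X}$, replacing `X` by `X @ M` for any invertible `M` leaves `rho` unchanged.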
C.3 Projection-Weighted CCA
Morcos et al. (2018) proposed to compute projection-weighted canonical correlation as:
$$\overline{\rho}_{\text{PW}}=\frac{\sum_{i=1}^{c}\alpha_{i}\rho_{i}}{\sum_{i=1}^{c}\alpha_{i}}, \qquad \alpha_{i}=\sum_{j}|\langle\mathbf{h}_{i},\mathbf{x}_{j}\rangle|,$$
where the ${\mathbf{x}}_{j}$ are the columns of $X$, and the ${\mathbf{h}}_{i}$ are the canonical variables formed by projecting $X$ to the canonical coordinate frame. Below, we show that if we modify ${\overline{\rho}}_{\text{PW}}$ by squaring the dot products and ${\rho}_{i}$, we recover linear regression. Specifically:
$$R_{\text{MPW}}^{2}=\frac{\sum_{i=1}^{c}\alpha_{i}^{\prime}\rho_{i}^{2}}{\sum_{i=1}^{c}\alpha_{i}^{\prime}}=R_{\text{LR}}^{2}, \qquad \alpha_{i}^{\prime}=\sum_{j}\langle\mathbf{h}_{i},\mathbf{x}_{j}\rangle^{2}.$$
Our derivation begins by forming the SVD ${Q}_{X}^{\text{T}}{Q}_{Y}=U\mathrm{\Sigma}{V}^{\text{T}}$. $\mathrm{\Sigma}$ is a diagonal matrix of the canonical correlations ${\rho}_{i}$, and the matrix of canonical variables $H={Q}_{X}U$. Then ${R}_{\text{MPW}}^{2}$ is:
$$\begin{aligned}
R_{\text{MPW}}^{2} &= \frac{\|X^{\text{T}}H\Sigma\|_{\text{F}}^{2}}{\|X^{\text{T}}H\|_{\text{F}}^{2}} && \text{(18)} \\
&= \frac{\text{tr}((X^{\text{T}}H\Sigma)^{\text{T}}(X^{\text{T}}H\Sigma))}{\text{tr}((X^{\text{T}}H)^{\text{T}}(X^{\text{T}}H))} \\
&= \frac{\text{tr}(\Sigma H^{\text{T}}XX^{\text{T}}H\Sigma)}{\text{tr}(H^{\text{T}}XX^{\text{T}}H)} \\
&= \frac{\text{tr}(X^{\text{T}}H\Sigma^{2}H^{\text{T}}X)}{\text{tr}(X^{\text{T}}HH^{\text{T}}X)} \\
&= \frac{\text{tr}(R_{X}^{\text{T}}Q_{X}^{\text{T}}H\Sigma^{2}H^{\text{T}}Q_{X}R_{X})}{\text{tr}(R_{X}^{\text{T}}Q_{X}^{\text{T}}Q_{X}UU^{\text{T}}Q_{X}^{\text{T}}Q_{X}R_{X})}.
\end{aligned}$$
Noting that ${Q}_{X}^{\text{T}}H=U$ and $U\mathrm{\Sigma}={Q}_{X}^{\text{T}}{Q}_{Y}V$:
$$\begin{aligned}
R_{\text{MPW}}^{2} &= \frac{\text{tr}(R_{X}^{\text{T}}U\Sigma^{2}U^{\text{T}}R_{X})}{\text{tr}(R_{X}^{\text{T}}Q_{X}^{\text{T}}Q_{X}R_{X})} \\
&= \frac{\text{tr}(R_{X}^{\text{T}}Q_{X}^{\text{T}}Q_{Y}V\Sigma U^{\text{T}}R_{X})}{\text{tr}(X^{\text{T}}X)} \\
&= \frac{\text{tr}(X^{\text{T}}Q_{Y}Q_{Y}^{\text{T}}Q_{X}R_{X})}{\text{tr}(X^{\text{T}}X)} \\
&= \frac{\text{tr}(X^{\text{T}}Q_{Y}Q_{Y}^{\text{T}}X)}{\text{tr}(X^{\text{T}}X)} \\
&= \frac{\|Q_{Y}^{\text{T}}X\|_{\text{F}}^{2}}{\|X\|_{\text{F}}^{2}}.
\end{aligned}$$
Substituting ${Q}_{Y}$ for $Q$ and $X$ for $C$ in Equation 16:
$$R_{\text{LR}}^{2}=\frac{\|Q_{Y}^{\text{T}}X\|_{\text{F}}^{2}}{\|X\|_{\text{F}}^{2}}=R_{\text{MPW}}^{2}.$$
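The equality between the modified projection-weighted statistic and the linear regression $R^{2}$ can be verified numerically. In this sketch the data are synthetic and $p_{1}\le p_{2}$ is assumed so that the matrix $U$ returned by the reduced SVD is square, as the derivation requires.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 60, 3, 5                    # assumption: p1 <= p2, so U below is p1 x p1
X = rng.standard_normal((n, p1))
Y = X @ rng.standard_normal((p1, p2)) + rng.standard_normal((n, p2))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

Q_x, _ = np.linalg.qr(X)
Q_y, _ = np.linalg.qr(Y)
U, rho, _ = np.linalg.svd(Q_x.T @ Q_y, full_matrices=False)
H = Q_x @ U                             # canonical variables, one per column

alpha = np.sum((H.T @ X) ** 2, axis=1)  # alpha'_i = sum_j <h_i, x_j>^2
r2_mpw = np.sum(alpha * rho ** 2) / np.sum(alpha)
r2_lr = np.linalg.norm(Q_y.T @ X) ** 2 / np.linalg.norm(X) ** 2
```

Both quantities come out equal up to floating-point error, matching the last line of the derivation.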
Appendix D Notes on Other Methods
D.1 Canonical Ridge
Beyond CCA, we could also consider the “canonical ridge” regularized CCA objective (Vinod, 1976):
$$\sigma_{i}=\underset{\mathbf{w}_{X}^{i},\mathbf{w}_{Y}^{i}}{\max}\ \frac{(X\mathbf{w}_{X}^{i})^{\text{T}}(Y\mathbf{w}_{Y}^{i})}{\sqrt{\|X\mathbf{w}_{X}^{i}\|^{2}+\kappa_{X}\|\mathbf{w}_{X}^{i}\|_{2}^{2}}\sqrt{\|Y\mathbf{w}_{Y}^{i}\|^{2}+\kappa_{Y}\|\mathbf{w}_{Y}^{i}\|^{2}}} \tag{19}$$
$\mathrm{subject}\mathrm{to}$  $$  
$$ 
Given the singular value decompositions $X=U_{X}\Sigma_{X}V_{X}^{\text{T}}$ and $Y=U_{Y}\Sigma_{Y}V_{Y}^{\text{T}}$, one can form “partially orthogonalized” bases $\tilde{Q}_{X}=U_{X}\Sigma_{X}(\Sigma_{X}^{2}+\kappa_{X}I)^{-1/2}$ and $\tilde{Q}_{Y}=U_{Y}\Sigma_{Y}(\Sigma_{Y}^{2}+\kappa_{Y}I)^{-1/2}$. Given the singular value decomposition of their product $\tilde{U}\tilde{\Sigma}\tilde{V}^{\text{T}}=\tilde{Q}_{X}^{\text{T}}\tilde{Q}_{Y}$, the canonical weights are given by $W_{X}=V_{X}(\Sigma_{X}^{2}+\kappa_{X}I)^{-1/2}\tilde{U}$ and $W_{Y}=V_{Y}(\Sigma_{Y}^{2}+\kappa_{Y}I)^{-1/2}\tilde{V}$, as previously shown by Mroueh et al. (2015). As in the unregularized case (Equation 13), there is a convenient expression for the sum of the squared singular values $\sum\tilde{\sigma}_{i}^{2}$ in terms of the eigenvalues and eigenvectors of $XX^{\text{T}}$ and $YY^{\text{T}}$. Let the $i^{\text{th}}$ left-singular vector of $X$ (eigenvector of $XX^{\text{T}}$) be indexed as $\mathbf{u}_{X}^{i}$ and let the $i^{\text{th}}$ eigenvalue of $XX^{\text{T}}$ (squared singular value of $X$) be indexed as $\lambda_{X}^{i}$, and similarly let the left-singular vectors of $YY^{\text{T}}$ be indexed as $\mathbf{u}_{Y}^{i}$ and the eigenvalues as $\lambda_{Y}^{i}$. Then:
$$\begin{aligned}
\sum_{i=1}^{p_{1}}\tilde{\sigma}_{i}^{2} &= \|\tilde{Q}_{Y}^{\text{T}}\tilde{Q}_{X}\|_{\text{F}}^{2} && \text{(20)} \\
&= \|(\Sigma_{Y}^{2}+\kappa_{Y}I)^{-1/2}\Sigma_{Y}U_{Y}^{\text{T}}U_{X}\Sigma_{X}(\Sigma_{X}^{2}+\kappa_{X}I)^{-1/2}\|_{\text{F}}^{2} && \text{(21)} \\
&= \sum_{i=1}^{p_{1}}\sum_{j=1}^{p_{2}}\frac{\lambda_{X}^{i}\lambda_{Y}^{j}}{(\lambda_{X}^{i}+\kappa_{X})(\lambda_{Y}^{j}+\kappa_{Y})}\langle\mathbf{u}_{X}^{i},\mathbf{u}_{Y}^{j}\rangle^{2}. && \text{(22)}
\end{aligned}$$
Unlike in the unregularized case, the singular values ${\sigma}_{i}$ do not measure the correlation between the canonical variables. Instead, they become arbitrarily small as ${\kappa}_{X}$ or ${\kappa}_{Y}$ increase. Thus, we need to normalize the statistic to remove the dependency on the regularization parameters.
Applying von Neumann’s trace inequality yields a bound:
$$\begin{aligned}
\sum_{i=1}^{p_{1}}\tilde{\sigma}_{i}^{2} &= \text{tr}(\tilde{Q}_{Y}\tilde{Q}_{Y}^{\text{T}}\tilde{Q}_{X}\tilde{Q}_{X}^{\text{T}}) && \text{(23)} \\
&= \text{tr}((U_{Y}\Sigma_{Y}^{2}(\Sigma_{Y}^{2}+\kappa_{Y}I)^{-1}U_{Y}^{\text{T}})(U_{X}\Sigma_{X}^{2}(\Sigma_{X}^{2}+\kappa_{X}I)^{-1}U_{X}^{\text{T}})) && \text{(24)} \\
&\le \sum_{i=1}^{p_{1}}\frac{\lambda_{X}^{i}\lambda_{Y}^{i}}{(\lambda_{X}^{i}+\kappa_{X})(\lambda_{Y}^{i}+\kappa_{Y})}. && \text{(25)}
\end{aligned}$$
Applying the Cauchy-Schwarz inequality to (25) yields the alternative bounds:
$$\begin{aligned}
\sum_{i=1}^{p_{1}}\tilde{\sigma}_{i}^{2} &\le \sqrt{\sum_{i=1}^{p_{1}}\left(\frac{\lambda_{X}^{i}}{\lambda_{X}^{i}+\kappa_{X}}\right)^{2}}\sqrt{\sum_{i=1}^{p_{1}}\left(\frac{\lambda_{Y}^{i}}{\lambda_{Y}^{i}+\kappa_{Y}}\right)^{2}} && \text{(26)} \\
&\le \sqrt{\sum_{i=1}^{p_{1}}\left(\frac{\lambda_{X}^{i}}{\lambda_{X}^{i}+\kappa_{X}}\right)^{2}}\sqrt{\sum_{i=1}^{p_{2}}\left(\frac{\lambda_{Y}^{i}}{\lambda_{Y}^{i}+\kappa_{Y}}\right)^{2}}. && \text{(27)}
\end{aligned}$$
A normalized form of (22) could be produced by dividing by any of (25), (26), or (27).
If ${\kappa}_{X}={\kappa}_{Y}=0$, then (25) and (26) are equal to ${p}_{1}$. In this case, (22) is simply the sum of the squared canonical correlations, so normalizing by either of these bounds recovers ${R}_{\text{CCA}}^{2}$.
If ${\kappa}_{Y}=0$, then as ${\kappa}_{X}\to \mathrm{\infty}$, normalizing by the bound from (25) recovers ${R}^{2}$:
$$\begin{aligned}
&\lim_{\kappa_{X}\to\infty}\frac{\sum_{i=1}^{p_{1}}\sum_{j=1}^{p_{2}}\frac{\lambda_{X}^{i}\lambda_{Y}^{j}}{(\lambda_{X}^{i}+\kappa_{X})(\lambda_{Y}^{j}+0)}\langle\mathbf{u}_{X}^{i},\mathbf{u}_{Y}^{j}\rangle^{2}}{\sum_{i=1}^{p_{1}}\frac{\lambda_{X}^{i}\lambda_{Y}^{i}}{(\lambda_{X}^{i}+\kappa_{X})(\lambda_{Y}^{i}+0)}} && \text{(28)} \\
={}&\lim_{\kappa_{X}\to\infty}\frac{\sum_{i=1}^{p_{1}}\sum_{j=1}^{p_{2}}\frac{\lambda_{X}^{i}}{\left(\frac{\lambda_{X}^{i}}{\kappa_{X}}+1\right)}\langle\mathbf{u}_{X}^{i},\mathbf{u}_{Y}^{j}\rangle^{2}}{\sum_{i=1}^{p_{1}}\frac{\lambda_{X}^{i}}{\left(\frac{\lambda_{X}^{i}}{\kappa_{X}}+1\right)}} && \text{(29)} \\
={}&\frac{\sum_{i=1}^{p_{1}}\sum_{j=1}^{p_{2}}\lambda_{X}^{i}\langle\mathbf{u}_{X}^{i},\mathbf{u}_{Y}^{j}\rangle^{2}}{\sum_{i=1}^{p_{1}}\lambda_{X}^{i}} && \text{(30)} \\
={}&\frac{\|U_{Y}^{\text{T}}U_{X}\Sigma_{X}\|_{\text{F}}^{2}}{\|X\|_{\text{F}}^{2}}=\frac{\|Q_{Y}^{\text{T}}X\|_{\text{F}}^{2}}{\|X\|_{\text{F}}^{2}}=R_{\text{LR}}^{2}. && \text{(31)}
\end{aligned}$$
The bound from (27) differs from the bounds in (25) and (26) because it is multiplicatively separable in $X$ and $Y$. Normalizing by this bound leads to $\text{CKA}({\stackrel{~}{Q}}_{X}{\stackrel{~}{Q}}_{X}^{\text{T}},{\stackrel{~}{Q}}_{Y}{\stackrel{~}{Q}}_{Y}^{\text{T}})$:
$$\begin{aligned}
&\frac{\sum_{i=1}^{p_{1}}\sum_{j=1}^{p_{2}}\frac{\lambda_{X}^{i}\lambda_{Y}^{j}}{(\lambda_{X}^{i}+\kappa_{X})(\lambda_{Y}^{j}+\kappa_{Y})}\langle\mathbf{u}_{X}^{i},\mathbf{u}_{Y}^{j}\rangle^{2}}{\sqrt{\sum_{i=1}^{p_{1}}\left(\frac{\lambda_{X}^{i}}{\lambda_{X}^{i}+\kappa_{X}}\right)^{2}}\sqrt{\sum_{i=1}^{p_{2}}\left(\frac{\lambda_{Y}^{i}}{\lambda_{Y}^{i}+\kappa_{Y}}\right)^{2}}} && \text{(32)} \\
={}&\frac{\|\tilde{Q}_{Y}^{\text{T}}\tilde{Q}_{X}\|_{\text{F}}^{2}}{\|\tilde{Q}_{X}^{\text{T}}\tilde{Q}_{X}\|_{\text{F}}\|\tilde{Q}_{Y}^{\text{T}}\tilde{Q}_{Y}\|_{\text{F}}}=\text{CKA}(\tilde{Q}_{X}\tilde{Q}_{X}^{\text{T}},\tilde{Q}_{Y}\tilde{Q}_{Y}^{\text{T}}). && \text{(33)}
\end{aligned}$$
Moreover, setting ${\kappa}_{X}={\kappa}_{Y}=\kappa $ and taking the limit as $\kappa \to \mathrm{\infty}$, the normalization from (27) leads to $\text{CKA}(X{X}^{\text{T}},Y{Y}^{\text{T}})$:
$$\begin{aligned}
&\lim_{\kappa\to\infty}\frac{\sum_{i=1}^{p_{1}}\sum_{j=1}^{p_{2}}\frac{\lambda_{X}^{i}\lambda_{Y}^{j}}{(\lambda_{X}^{i}+\kappa)(\lambda_{Y}^{j}+\kappa)}\langle\mathbf{u}_{X}^{i},\mathbf{u}_{Y}^{j}\rangle^{2}}{\sqrt{\sum_{i=1}^{p_{1}}\left(\frac{\lambda_{X}^{i}}{\lambda_{X}^{i}+\kappa}\right)^{2}}\sqrt{\sum_{i=1}^{p_{2}}\left(\frac{\lambda_{Y}^{i}}{\lambda_{Y}^{i}+\kappa}\right)^{2}}} && \text{(34)} \\
={}&\lim_{\kappa\to\infty}\frac{\sum_{i=1}^{p_{1}}\sum_{j=1}^{p_{2}}\frac{\lambda_{X}^{i}\lambda_{Y}^{j}}{\left(\frac{\lambda_{X}^{i}}{\kappa}+1\right)\left(\frac{\lambda_{Y}^{j}}{\kappa}+1\right)}\langle\mathbf{u}_{X}^{i},\mathbf{u}_{Y}^{j}\rangle^{2}}{\sqrt{\sum_{i=1}^{p_{1}}\left(\frac{\lambda_{X}^{i}}{\frac{\lambda_{X}^{i}}{\kappa}+1}\right)^{2}}\sqrt{\sum_{i=1}^{p_{2}}\left(\frac{\lambda_{Y}^{i}}{\frac{\lambda_{Y}^{i}}{\kappa}+1}\right)^{2}}} && \text{(35)} \\
={}&\frac{\sum_{i=1}^{p_{1}}\sum_{j=1}^{p_{2}}\lambda_{X}^{i}\lambda_{Y}^{j}\langle\mathbf{u}_{X}^{i},\mathbf{u}_{Y}^{j}\rangle^{2}}{\sqrt{\sum_{i=1}^{p_{1}}\left(\lambda_{X}^{i}\right)^{2}}\sqrt{\sum_{i=1}^{p_{2}}\left(\lambda_{Y}^{i}\right)^{2}}} && \text{(36)} \\
={}&\text{CKA}(XX^{\text{T}},YY^{\text{T}}).
\end{aligned}$$
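The limiting behavior in (34) to (36) can be checked numerically. The sketch below is an illustration under assumptions: the function names `linear_cka` and `ridge_normalized_similarity` are hypothetical, the data are synthetic and centered, and a very large finite $\kappa$ stands in for the limit.

```python
import numpy as np

def linear_cka(X, Y):
    """CKA(X X^T, Y Y^T) written in feature space; inputs are centered here."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X) ** 2
    return num / (np.linalg.norm(X.T @ X) * np.linalg.norm(Y.T @ Y))

def ridge_normalized_similarity(X, Y, kappa_x, kappa_y):
    """Right-hand side of (33), built from the partially orthogonalized bases."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    U_x, s_x, _ = np.linalg.svd(X, full_matrices=False)
    U_y, s_y, _ = np.linalg.svd(Y, full_matrices=False)
    Qt_x = U_x * (s_x / np.sqrt(s_x ** 2 + kappa_x))   # U Sigma (Sigma^2 + kappa I)^{-1/2}
    Qt_y = U_y * (s_y / np.sqrt(s_y ** 2 + kappa_y))
    num = np.linalg.norm(Qt_y.T @ Qt_x) ** 2
    return num / (np.linalg.norm(Qt_x.T @ Qt_x) * np.linalg.norm(Qt_y.T @ Qt_y))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
Y = X @ rng.standard_normal((4, 6)) + rng.standard_normal((50, 6))

cka = linear_cka(X, Y)
ridge_large_kappa = ridge_normalized_similarity(X, Y, 1e12, 1e12)
```

With $\kappa$ this large the two values agree to many digits, while smaller values of $\kappa$ interpolate toward the orthogonalized case.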
Overall, the hyperparameters of the canonical ridge objective make it less useful for exploratory analysis. These hyperparameters could be selected by cross-validation, but this is computationally expensive, and the resulting estimator would be biased by sample size. Moreover, our goal is not to map representations of networks to a common space, but to measure the similarity between networks. Appropriately chosen regularization will improve out-of-sample performance of the mapping, but it makes the meaning of “similarity” more ambiguous.
D.2 The Orthogonal Procrustes Problem
The orthogonal Procrustes problem consists of finding an orthogonal rotation in feature space that produces the smallest error:
$$\widehat{Q}=\underset{Q}{\arg\min}\,\|Y-XQ\|_{\text{F}}^{2}\quad\text{subject to}\quad Q^{\text{T}}Q=I. \tag{37}$$
The objective can be written as:
$$\begin{aligned}
\|Y-XQ\|_{\text{F}}^{2} &= \text{tr}((Y-XQ)^{\text{T}}(Y-XQ)) \\
&= \text{tr}(Y^{\text{T}}Y)-\text{tr}(Y^{\text{T}}XQ)-\text{tr}(Q^{\text{T}}X^{\text{T}}Y)+\text{tr}(Q^{\text{T}}X^{\text{T}}XQ) \\
&= \|Y\|_{\text{F}}^{2}+\|X\|_{\text{F}}^{2}-2\,\text{tr}(Y^{\text{T}}XQ). && \text{(38)}
\end{aligned}$$
Thus, an equivalent objective is:
$$\widehat{Q}=\underset{Q}{\arg\max}\,\text{tr}(Y^{\text{T}}XQ)\quad\text{subject to}\quad Q^{\text{T}}Q=I. \tag{39}$$
The solution is $\widehat{Q}=U{V}^{\text{T}}$ where $U\mathrm{\Sigma}{V}^{\text{T}}={X}^{\text{T}}Y$, the singular value decomposition. At the maximum of (39):
$$\text{tr}(Y^{\text{T}}X\widehat{Q})=\text{tr}(V\Sigma U^{\text{T}}UV^{\text{T}})=\text{tr}(\Sigma)=\|X^{\text{T}}Y\|_{*}=\|Y^{\text{T}}X\|_{*}, \tag{40}$$
which is similar to what we call “dot product-based similarity” (Equation 1), but with the squared Frobenius norm of $Y^{\text{T}}X$ (the sum of the squared singular values) replaced by the nuclear norm (the sum of the singular values). The Frobenius norm of $Y^{\text{T}}X$ can be obtained as the solution to a similar optimization problem:
$$\|Y^{\text{T}}X\|_{\text{F}}=\underset{W}{\max}\,\text{tr}(Y^{\text{T}}XW)\quad\text{subject to}\quad\text{tr}(W^{\text{T}}W)=1. \tag{41}$$
In the context of neural networks, Smith et al. (2017) previously proposed using the solution to the orthogonal Procrustes problem to align word embeddings from different languages, and demonstrated that it outperformed CCA.
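The identities in Equations 38 to 41 can be checked numerically. The following is a minimal NumPy sketch, assuming random Gaussian matrices $X$ and $Y$ with the same number of columns (so that $Q$ is square); the function name `orthogonal_procrustes` and all variable names are illustrative choices, not part of the original analysis.

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Return the orthogonal Q maximizing tr(Y^T X Q), from the SVD of X^T Y."""
    U, sigma, Vt = np.linalg.svd(X.T @ Y)  # X^T Y = U diag(sigma) V^T
    return U @ Vt, sigma

# Assumed toy data: 50 examples, 4 features in each representation.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
Y = rng.standard_normal((50, 4))

Q_hat, sigma = orthogonal_procrustes(X, Y)
M = Y.T @ X

# Equation 40: the trace at the optimum equals the nuclear norm of Y^T X.
trace_at_opt = np.trace(M @ Q_hat)
nuclear_norm = np.linalg.norm(M, ord="nuc")

# Equation 38: squared error expressed through the norms and the trace term.
err = np.linalg.norm(Y - X @ Q_hat, "fro") ** 2
err_from_formula = (np.linalg.norm(Y, "fro") ** 2
                    + np.linalg.norm(X, "fro") ** 2
                    - 2 * trace_at_opt)

# A random orthogonal matrix should not achieve a larger trace than Q_hat.
Q_rand, _ = np.linalg.qr(rng.standard_normal((4, 4)))
trace_rand = np.trace(M @ Q_rand)

# Equation 41: W = (Y^T X)^T / ||Y^T X||_F has unit Frobenius norm
# and attains tr(Y^T X W) = ||Y^T X||_F.
W = M.T / np.linalg.norm(M, "fro")
frobenius_value = np.trace(M @ W)
```

The only difference between the two problems is the constraint on the free matrix: an orthogonality constraint yields the sum of the singular values of ${Y}^{\text{T}}X$, whereas a unit Frobenius norm constraint yields the square root of the sum of their squares.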
Appendix E Architecture Details
All non-ResNet architectures are based on All-CNN-C (Springenberg et al., 2015), but none are architecturally identical. The Plain-10 model is very similar, but we place the final linear layer after the average pooling layer and use batch normalization because these are common choices in modern architectures. We use these models because they train in minutes on modern hardware.
Tiny-10 

$3\times 3$ conv. 16-BN-ReLu $\times 2$ 
$3\times 3$ conv. 32 stride 2-BN-ReLu 
$3\times 3$ conv. 32-BN-ReLu $\times 2$ 
$3\times 3$ conv. 64 stride 2-BN-ReLu 
$3\times 3$ conv. 64 valid padding-BN-ReLu 
$1\times 1$ conv. 64-BN-ReLu 
Global average pooling 
Logits 
Plain-$\left(8n+2\right)$ 

$3\times 3$ conv. 96-BN-ReLu $\times \left(3n-1\right)$ 
$3\times 3$ conv. 96 stride 2-BN-ReLu 
$3\times 3$ conv. 192-BN-ReLu $\times \left(3n-1\right)$ 
$3\times 3$ conv. 192 stride 2-BN-ReLu 
$3\times 3$ conv. 192-BN-ReLu $\times \left(n-1\right)$ 
$3\times 3$ conv. 192 valid padding-BN-ReLu 
$1\times 1$ conv. 192-BN-ReLu $\times n$ 
Global average pooling 
Logits 
Width-$n$ 

$3\times 3$ conv. $n$-BN-ReLu $\times 2$ 
$3\times 3$ conv. $n$ stride 2-BN-ReLu 
$3\times 3$ conv. $n$-BN-ReLu $\times 2$ 
$3\times 3$ conv. $n$ stride 2-BN-ReLu 
$3\times 3$ conv. $n$ valid padding-BN-ReLu 
$1\times 1$ conv. $n$-BN-ReLu 
Global average pooling 
Logits 
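To make the repetition counts in the Plain-$\left(8n+2\right)$ table concrete, the sketch below expands the table into an ordered list of layer descriptions for a given $n$. The helper name `plain_layers` is hypothetical. Counting the pooling and logits rows as entries gives $8n+2$ items, which matches the model name; whether depth is meant to be counted this way is an assumption.

```python
def plain_layers(n):
    """Expand the Plain-(8n+2) table into an ordered list of layer descriptions.

    Hypothetical helper for illustration only; the strings mirror the table rows.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    layers = []
    layers += ["3x3 conv. 96-BN-ReLu"] * (3 * n - 1)
    layers += ["3x3 conv. 96 stride 2-BN-ReLu"]
    layers += ["3x3 conv. 192-BN-ReLu"] * (3 * n - 1)
    layers += ["3x3 conv. 192 stride 2-BN-ReLu"]
    layers += ["3x3 conv. 192-BN-ReLu"] * (n - 1)
    layers += ["3x3 conv. 192 valid padding-BN-ReLu"]
    layers += ["1x1 conv. 192-BN-ReLu"] * n
    layers += ["Global average pooling", "Logits"]
    return layers

# Plain-10 corresponds to n = 1 and Plain-18 to n = 2.
plain10 = plain_layers(1)
plain18 = plain_layers(2)
```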
Appendix F Additional Experiments
F.1 Sanity Check for Transformer Encoders
When applied to Transformer encoders, all similarity indexes we investigated passed the sanity check described in Section 6.1. In Figure F.2, we show similarity between the 12 sublayers of the encoders of 10 Transformer models (45 pairs) (Vaswani et al., 2017) trained from different random initializations to perform English-to-German translation. Each Transformer sublayer contains four operations, shown in Figure F.2, and results vary based on which operation the representation is taken after. Table F.1 shows the accuracy with which we can identify corresponding layers between network pairs by maximal similarity.
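As a minimal sketch of how such an accuracy can be computed for a single pair of networks, assume a precomputed matrix whose entry $(i,j)$ is the similarity between layer $i$ of the first network and layer $j$ of the second. The function name `layer_match_accuracy` and the toy matrix are illustrative assumptions, and averaging over network pairs is omitted.

```python
import numpy as np

def layer_match_accuracy(similarity):
    """Fraction of layers whose most similar layer in the other network has the same index.

    similarity[i, j] is assumed to hold the similarity between layer i of
    network A and layer j of network B (square matrix, same architecture).
    """
    similarity = np.asarray(similarity)
    best_match = similarity.argmax(axis=1)  # most similar layer of B for each layer of A
    return float(np.mean(best_match == np.arange(similarity.shape[0])))

# Toy 3-layer example: layers 0 and 1 are matched correctly,
# while layer 2 of A is most similar to layer 1 of B.
toy_similarity = np.array([[0.9, 0.4, 0.1],
                           [0.5, 0.8, 0.3],
                           [0.2, 0.7, 0.6]])
accuracy = layer_match_accuracy(toy_similarity)
```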
The Transformer architecture alternates between self-attention and feed-forward network sublayers. The checkerboard pattern in similarity plots for the Attention/FFN layer in Figure F.2 indicates that representations of feed-forward network sublayers are more similar to other feed-forward network sublayers than to self-attention sublayers, and similarly, representations of self-attention sublayers are more similar to other self-attention sublayers than to feed-forward network layers. CKA also reveals a checkerboard pattern for activations after the channel-wise scale operation (before the self-attention/feed-forward network operation) that other methods do not. Because CCA is invariant to non-isotropic scaling, CCA similarities before and after channel-wise scaling are identical. Thus, CCA cannot capture this structure, even though it is consistent across different networks.
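The invariance argument can be illustrated on synthetic data. The sketch below assumes linear CKA computed as $\|Y^{\text{T}}X\|_{\text{F}}^{2}/(\|X^{\text{T}}X\|_{\text{F}}\|Y^{\text{T}}Y\|_{\text{F}})$ on centered matrices and a mean squared CCA correlation computed from orthonormal bases obtained by QR decomposition. The toy data, the per-channel scale factors, and all function names are assumptions made for illustration and do not come from the Transformer experiments.

```python
import numpy as np

def center(A):
    """Subtract the column means so that each feature has zero mean."""
    return A - A.mean(axis=0, keepdims=True)

def linear_cka(X, Y):
    """Linear CKA between two (examples x features) matrices."""
    X, Y = center(X), center(Y)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

def mean_squared_cca(X, Y):
    """Mean squared canonical correlation from orthonormal bases of the column spaces."""
    Qx, _ = np.linalg.qr(center(X))
    Qy, _ = np.linalg.qr(center(Y))
    return np.linalg.norm(Qy.T @ Qx, "fro") ** 2 / min(X.shape[1], Y.shape[1])

# Assumed toy data: Y depends only on the first two channels of X.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = X[:, :2] + 0.1 * rng.standard_normal((200, 2))

# Non-isotropic channel-wise scaling: each column of X gets its own factor.
scale = np.array([10.0, 10.0, 1.0, 1.0, 1.0])
X_scaled = X * scale

cca_before, cca_after = mean_squared_cca(X, Y), mean_squared_cca(X_scaled, Y)
cka_before, cka_after = linear_cka(X, Y), linear_cka(X_scaled, Y)
```

In this toy setting, scaling leaves the column space of $X$ unchanged, so the CCA-based value is the same before and after, whereas CKA responds to the change in how much variance each channel carries.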