To understand neural network behavior, recent works quantitatively compare different networks' learned representations using canonical correlation analysis (CCA), centered kernel alignment (CKA), and other dissimilarity measures. Unfortunately, these widely used measures often disagree on fundamental observations, such as whether deep networks differing only in random initialization learn similar representations. These disagreements raise the question: which, if any, of these dissimilarity measures should we believe? We provide a framework to ground this question through a concrete test: measures should have sensitivity to changes that affect functional behavior, and specificity against changes that do not. We quantify this through a variety of functional behaviors including probing accuracy and robustness to distribution shift, and examine changes such as varying random initialization and deleting principal components. We find that current metrics exhibit different weaknesses, note that a classical baseline performs surprisingly well, and highlight settings where all metrics appear to fail, thus providing a challenge set for further improvement.