Abstract
We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms. However, unlike conventional non-parametric models, we let the model learn end-to-end from the data how to make use of other datapoints for prediction. Empirically, our models solve cross-datapoint lookup and complex reasoning tasks unsolvable by traditional deep learning models. We show highly competitive results on tabular data, early results on CIFAR-10, and give insight into how the model makes use of the interactions between points.
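
To illustrate what self-attention between datapoints means, the following is a minimal sketch and not the architecture evaluated in this work. It assumes a single attention head and uses the raw feature vectors directly as queries, keys, and values, where a trained model would use learned projections. The names `softmax` and `attend_between_datapoints` are hypothetical and chosen only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_between_datapoints(dataset):
    """Single-head scaled dot-product attention across the rows of a dataset.

    dataset: list of n datapoints, each a list of d floats.
    Returns n new representations; each one is a weighted average of ALL
    datapoints, so every output depends on the whole dataset.

    Assumption: queries, keys, and values are the raw rows (identity
    projections). A learned model would apply trainable projections first.
    """
    d = len(dataset[0])
    scale = math.sqrt(d)
    outputs = []
    for query in dataset:
        # Similarity of this datapoint to every datapoint (itself included).
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in dataset]
        weights = softmax(scores)
        # Mix the features of all datapoints according to the weights.
        mixed = [sum(w * row[j] for w, row in zip(weights, dataset))
                 for j in range(d)]
        outputs.append(mixed)
    return outputs


if __name__ == "__main__":
    toy_dataset = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    for row in attend_between_datapoints(toy_dataset):
        print(row)
```

In this toy setting the first and third datapoints are identical, so they receive identical output representations, and each output is pulled toward the datapoints most similar to it. This is the sense in which a prediction for one input can depend on the rest of the dataset.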