Abstract
Kolmogorov-Arnold networks (KANs) are a remarkable innovation consisting of learnable activation functions with the potential to capture more complex relationships from data. Although KANs are useful for finding symbolic representations and for continual learning of one-dimensional functions, their effectiveness in diverse machine learning (ML) tasks, such as vision, remains questionable. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep network architectures, including advanced architectures such as vision Transformers (ViTs). In this paper, we are the first to design a general learnable Kolmogorov-Arnold Attention (KArAt) for vanilla ViTs that can operate on any choice of basis. However, the compute and memory costs of training it motivated us to propose a more modular version, and we designed a particular learnable attention, called Fourier-KArAt. Fourier-KArAt and its variants either outperform their ViT counterparts or show comparable performance on the CIFAR-10, CIFAR-100, and ImageNet-1K datasets. We dissect these architectures' performance and generalization capacity by analyzing their loss landscapes, weight distributions, optimizer paths, attention visualizations, and spectral behavior, and contrast them with vanilla ViTs. The goal of this paper is not to produce parameter- and compute-efficient attention, but to encourage the community to explore KANs in conjunction with more advanced architectures that require a careful understanding of learnable activations. Our open-source code and implementation details are available at: https://subhajitmaity.me/KArAt