pLSTM: parallelizable Linear Source Transition Mark networks

Abstract

Modern recurrent architectures, such as xLSTM and Mamba, have recently challenged the Transformer in language modeling. However, their structure constrains their applicability to sequences only or requires processing multi-dimensional data structures, such as images or molecular graphs, in a pre-defined sequential order. In contrast, Multi-Dimensional RNNs (MDRNNs) are well suited for data with a higher-level structure, like 2D grids, trees, and directed acyclic graphs (DAGs). In this work, we extend the notion of multi-dimensionality to linear RNNs. We introduce parallelizable Linear Source Transition Mark networks (pLSTMs) using Source, Transition, and Mark gates that act on the line graph of a general DAG. This enables parallelization in analogy to parallel associative scans and the chunkwise-recurrent form of sequential linear RNNs, but for DAGs. For regular grids (1D and 2D), like images, this scheme can be efficiently implemented using einsum operations, concatenations, and padding in logarithmic time. pLSTMs tackle the vanishing/exploding activation/gradient problem for long distances in DAGs via two distinct modes: a directed propagation mode (P-mode) and a diffusive distribution mode (D-mode). To showcase the long-range capabilities of pLSTM, we introduce arrow-pointing extrapolation as a synthetic computer vision task that contains long-distance directional information. We demonstrate that pLSTMs generalize well to larger image sizes, whereas Transformers struggle to extrapolate. On established molecular graph and computer vision benchmarks, pLSTMs also show strong performance. Code and datasets are available at: https://github.com/ml-jku/plstm_experiments.
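
For readers unfamiliar with the parallel associative scans the abstract refers to, the following is a minimal sketch of the idea for an ordinary 1D scalar linear recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0. It is not the pLSTM algorithm and does not cover the DAG or 2D grid case; the function names (combine, sequential_scan, doubling_scan) and the scalar gates a_t, b_t are assumptions chosen for illustration. The point it shows is that each step is an affine map, that composing affine maps is associative, and that all prefixes can therefore be obtained in a logarithmic number of rounds whose inner work is independent per position.

```python
import math


def combine(left, right):
    """Compose two affine maps h -> a*h + b, with `left` applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b):
    """Reference implementation: one step after another, starting from h = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def doubling_scan(a, b):
    """Inclusive scan by recursive doubling (illustrative, runs serially here).

    The while loop runs about log2(T) times. Within one round, every list
    entry depends only on values from the previous round, so on parallel
    hardware the entries of a round could be computed at the same time.
    """
    elems = list(zip(a, b))
    offset = 1
    while offset < len(elems):
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(len(elems))
        ]
        offset *= 2
    # With h_0 = 0, the state at step t is the offset part of the prefix map.
    return [b_t for _, b_t in elems]


if __name__ == "__main__":
    a = [0.5, 0.9, 1.0, 0.1, 0.7]   # hypothetical transition gates
    b = [1.0, 2.0, -1.0, 0.5, 3.0]  # hypothetical inputs
    ref = sequential_scan(a, b)
    par = doubling_scan(a, b)
    print(all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(ref, par)))
```

The doubling variant does more total work than the sequential loop; it is used here only because it makes the logarithmic depth easy to see.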
