Abstract
Graph Transformers (GTs) have emerged as a powerful paradigm for graph representation learning due to their ability to model diverse node interactions. However, existing GTs often rely on intricate architectural designs tailored to specific interactions, limiting their flexibility. To address this, we propose a unified hierarchical mask framework that reveals an underlying equivalence between model architecture and attention mask construction. This framework enables a consistent modeling paradigm by capturing diverse interactions through carefully designed attention masks. Theoretical analysis under this framework demonstrates that the probability of correct classification positively correlates with the receptive field size and label consistency, leading to a fundamental design principle: an effective attention mask should ensure both a sufficiently large receptive field and a high level of label consistency. While no single existing mask satisfies this principle across all scenarios, our analysis reveals that hierarchical masks offer complementary strengths, motivating their effective integration. Then, we introduce M3Dphormer, a Mixture-of-Experts-based Graph Transformer with Multi-Level Masking and Dual Attention Computation. M3Dphormer incorporates three theoretically grounded hierarchical masks and employs a bi-level expert routing mechanism to adaptively integrate multi-level interaction information. To ensure scalability, we further introduce a dual attention computation scheme that dynamically switches between dense and sparse modes based on local mask sparsity. Extensive experiments across multiple benchmarks demonstrate that M3Dphormer achieves state-of-the-art performance, validating the effectiveness of our unified framework and model design.
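To make the idea that an interaction pattern can be expressed as an attention mask concrete, the following is a minimal illustrative sketch, not the M3Dphormer implementation. It assumes a toy 4-node path graph, unprojected dot-product scores, and two hypothetical masks (`local_mask` for 1-hop neighbors plus self, `global_mask` for all pairs). The function name `masked_attention` and all variable names are assumptions introduced for illustration only.

```python
import numpy as np

def masked_attention(x, mask):
    """Single-head self-attention restricted by a boolean mask.

    x:    (n, d) node features
    mask: (n, n) boolean; mask[i, j] is True if node i may attend to node j.
          Every row is assumed to contain at least one True entry.
    """
    d = x.shape[1]
    scores = (x @ x.T) / np.sqrt(d)            # illustrative: no learned projections
    scores = np.where(mask, scores, -np.inf)   # masked-out pairs receive zero weight
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

# Toy path graph 0 - 1 - 2 - 3 (assumed example, not from the paper).
adj = np.array(
    [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=bool,
)

local_mask = adj | np.eye(4, dtype=bool)   # 1-hop neighbors plus self
global_mask = np.ones((4, 4), dtype=bool)  # every node attends to every node

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))

local_out, local_w = masked_attention(x, local_mask)
global_out, global_w = masked_attention(x, global_mask)

# Receptive field size per node = number of allowed attention targets.
local_rf = local_mask.sum(axis=1)
global_rf = global_mask.sum(axis=1)
```

In this sketch the same attention routine yields a neighborhood-restricted model or a fully global one purely by swapping the mask, and the row sums of each mask give the receptive field size that the design principle refers to. Label consistency would additionally require node labels, which are omitted here.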