Abstract
To make the foundation model more efficient and effective, we combine sequence transformation and state transformation. First, we prove that rotary position embedding is applicable in the state space duality algorithm, which reduces the perplexity of the hybrid of quadratic causal self-attention and state space duality by more than 4%, ensuring that the combined sequence transformation uses a unified position encoding. Second, we propose dynamic mask attention, which maintains 100% accuracy on the more challenging multi-query associative recall task, an improvement of more than 150% over quadratic causal self-attention and state space duality, ensuring that the combined sequence transformation selectively filters relevant information. Third, we design cross domain mixture of experts, which makes expert retrieval with more than 1024 experts 8 to 10 times faster than the mixture of experts, ensuring that the combined state transformation retrieves the mixture quickly. Finally, we summarize these matrix algorithms, which together can form a foundation model: Wonderful Matrices, which can be a competitor to popular model architectures.