Abstract
We present hyper-connections, a simple yet effective method that can serve as an alternative to residual connections. This approach specifically addresses common drawbacks observed in residual connection variants, such as the seesaw effect between gradient vanishing and representation collapse. Theoretically, hyper-connections allow the network to adjust the strength of connections between features at different depths and dynamically rearrange layers. We conduct experiments focusing on the pre-training of large language models, including dense and sparse models, where hyper-connections show significant performance improvements over residual connections. Additional experiments conducted on vision tasks also demonstrate similar improvements. We anticipate that this method will be broadly applicable and beneficial across a wide range of AI problems.