Abstract
We introduce FFN Fusion, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization. Our key insight is that sequences of Feed-Forward Network (FFN) layers, particularly those remaining after the removal of specific attention layers, can often be parallelized with minimal accuracy impact. We develop a principled methodology for identifying and fusing such sequences, transforming them into parallel operations that significantly reduce inference latency while preserving model behavior. Applying these techniques to Llama-3.1-405B-Instruct, we create Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base), an efficient and soon-to-be publicly available model that achieves a 1.71X speedup in inference latency and 35X lower per-token cost while maintaining strong performance across benchmarks. Through extensive experiments on models from 49B to 253B parameters, we demonstrate that FFN Fusion becomes increasingly effective at larger scales and can complement existing optimization techniques like quantization and pruning. Most intriguingly, we find that even full transformer blocks containing both attention and FFN layers can sometimes be parallelized, suggesting new directions for neural architecture design.
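As a rough illustration of the idea, the sketch below shows why a run of FFN layers evaluated in parallel on the same input can be collapsed into one wider FFN. It is not the authors' implementation: it assumes a plain two-matrix FFN with a SiLU activation and a residual connection, omits normalization and gating, and uses hypothetical helper names (`ffn`, `fuse`, `rand_matrix`) with small random weights. Under those assumptions, stacking the up-projections by rows and concatenating the down-projections by columns reproduces the parallel sum exactly, while the original sequential computation generally differs by a small amount.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def silu(z):
    return z / (1.0 + math.exp(-z))


def ffn(W_up, W_down, x):
    """Simplified FFN (assumption): W_down @ silu(W_up @ x), no gating or norm."""
    hidden = [silu(h) for h in matvec(W_up, x)]
    return matvec(W_down, hidden)


def fuse(ffns):
    """Fuse several FFNs into one wider FFN.

    Up-projections are stacked by rows, down-projections are
    concatenated by columns, so the fused output is the sum of the
    individual FFN outputs on the same input.
    """
    W_up = [row for up, _ in ffns for row in up]
    d = len(ffns[0][1])
    W_down = [[v for _, down in ffns for v in down[r]] for r in range(d)]
    return W_up, W_down


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, h = 4, 8  # toy model width and hidden width (illustrative values)
ffns = [(rand_matrix(h, d, rng), rand_matrix(d, h, rng)) for _ in range(3)]
x = [rng.uniform(-1.0, 1.0) for _ in range(d)]

# Sequential residual FFNs: each layer sees the previous layer's output.
seq = list(x)
for up, down in ffns:
    seq = [s + o for s, o in zip(seq, ffn(up, down, seq))]

# Parallel version: every FFN reads the same input x.
parallel = list(x)
for up, down in ffns:
    parallel = [p + o for p, o in zip(parallel, ffn(up, down, x))]

# Fused version: one wide FFN, mathematically equal to the parallel sum.
fused_up, fused_down = fuse(ffns)
fused = [a + b for a, b in zip(x, ffn(fused_up, fused_down, x))]

max_fused_err = max(abs(a - b) for a, b in zip(parallel, fused))
max_seq_gap = max(abs(a - b) for a, b in zip(parallel, seq))
print("parallel vs fused (float rounding only):", max_fused_err)
print("parallel vs sequential (approximation gap):", max_seq_gap)
```

The fused layer replaces three dependent matrix-multiply stages with one wider stage, which is where the latency reduction would come from; the gap between the sequential and parallel outputs is the quantity that has to stay small for fusion to preserve model behavior.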