Abstract
We present InterACT: Inter-dependency aware Action Chunking with Hierarchical Attention Transformers, a novel imitation learning framework for bimanual manipulation that integrates hierarchical attention to capture inter-dependencies between dual-arm joint states and visual inputs. InterACT consists of a Hierarchical Attention Encoder and a Multi-arm Decoder, both designed to enhance information aggregation and coordination. The encoder processes multi-modal inputs through segment-wise and cross-segment attention mechanisms, while the decoder leverages synchronization blocks to refine individual action predictions, providing the counterpart's prediction as context. Our experiments on a variety of simulated and real-world bimanual manipulation tasks demonstrate that InterACT significantly outperforms existing methods. Detailed ablation studies validate the contributions of key components of our work, including the impact of CLS tokens, cross-segment encoders, and synchronization blocks.
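
To make the encoder description more concrete, the following is a minimal NumPy sketch of one way segment-wise attention followed by cross-segment attention over CLS tokens could be arranged. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the layer layout, so the single-head attention, the residual connections, the random projection weights, the token counts, and the names `self_attention` and `hierarchical_block` are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (n, d) tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def hierarchical_block(segments, n_cls, params):
    """Hypothetical block. segments: list of (n_cls + n_i, d) arrays, CLS tokens first."""
    # 1) Segment-wise attention: each segment attends only to its own tokens.
    segments = [s + self_attention(s, *params["segment"]) for s in segments]
    # 2) Cross-segment attention: only the CLS tokens of all segments
    #    attend to one another (assumed role of the CLS tokens).
    cls = np.concatenate([s[:n_cls] for s in segments], axis=0)
    cls = cls + self_attention(cls, *params["cross"])
    # 3) Write the updated CLS tokens back into their segments.
    out = []
    for i, s in enumerate(segments):
        s = s.copy()
        s[:n_cls] = cls[i * n_cls:(i + 1) * n_cls]
        out.append(s)
    return out

rng = np.random.default_rng(0)
d, n_cls = 16, 2  # assumed embedding size and CLS tokens per segment
params = {
    name: [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]
    for name in ("segment", "cross")
}
# Assumed segments: left-arm joints, right-arm joints, visual features.
segments = [rng.normal(size=(n_cls + n, d)) for n in (7, 7, 10)]
out = hierarchical_block(segments, n_cls, params)
print([o.shape for o in out])  # [(9, 16), (9, 16), (12, 16)]
```

In this sketch, information crosses segment boundaries only through the CLS tokens: within a single block, the non-CLS tokens of one arm are unaffected by changes to the visual segment, while that arm's CLS tokens are.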