We present TrackFormer, an end-to-end multi-object tracking and segmentation model based on an encoder-decoder Transformer architecture. Our approach introduces track query embeddings which follow objects through a video sequence in an autoregressive fashion. New track queries are spawned by the DETR object detector and embed the position of their corresponding object over time. The Transformer decoder adjusts track query embeddings from frame to frame, thereby following the changing object positions. TrackFormer achieves a seamless data association between frames in a new tracking-by-attention paradigm by self- and encoder-decoder attention mechanisms which simultaneously reason about location, occlusion, and object identity. TrackFormer yields state-of-the-art performance on the tasks of multi-object tracking (MOT17) and segmentation (MOTS20). We hope our unified way of performing detection and tracking will foster future research in multi-object tracking and video understanding. Code will be made publicly available.
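
The autoregressive life cycle of track queries can be illustrated with a toy sketch. This is not the TrackFormer implementation: the Transformer decoder and its attention mechanisms are replaced here by a nearest-position stub, and every name (`TrackQuery`, `step`, `track`, `radius`) is a hypothetical introduced only for illustration. The sketch shows only the control flow: existing queries are carried to the next frame and updated, queries that find no object are dropped, and unexplained objects spawn new queries.

```python
from dataclasses import dataclass
from itertools import count
from math import dist


@dataclass
class TrackQuery:
    track_id: int
    position: tuple  # stand-in for a learned embedding (assumption)


def step(track_queries, detections, next_id, radius=2.0):
    """Stub for the decoder: update, drop, or spawn track queries.

    Nearest-position matching is a placeholder for attention, not
    the mechanism described in the abstract.
    """
    remaining = list(detections)
    updated = []
    for query in track_queries:
        if not remaining:
            continue  # no object left to follow: the track ends
        nearest = min(remaining, key=lambda p: dist(p, query.position))
        if dist(nearest, query.position) <= radius:
            remaining.remove(nearest)
            # Same identity, adjusted position.
            updated.append(TrackQuery(query.track_id, nearest))
    # Objects not explained by any existing track start new tracks.
    for position in remaining:
        updated.append(TrackQuery(next(next_id), position))
    return updated


def track(frames, radius=2.0):
    """Run the loop over a sequence, feeding each frame's output
    queries back in as the next frame's input queries."""
    next_id = count()
    queries = []
    history = []
    for detections in frames:
        queries = step(queries, detections, next_id, radius)
        history.append({q.track_id: q.position for q in queries})
    return history
```

Calling `track` on a list of per-frame position lists returns one dictionary per frame mapping track identity to position, so an identity that persists across frames appears under the same key in consecutive dictionaries.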