Abstract
Recent research (arXiv:2410.15027) has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the fidelity of the generated images remains suboptimal. In this study, we reevaluate and streamline this framework by hypothesizing that text-to-image DiTs inherently possess in-context generation capabilities, requiring only minimal tuning to activate them. Through diverse task experiments, we qualitatively demonstrate that existing text-to-image DiTs can effectively perform in-context generation without any tuning. Building on this insight, we propose a remarkably simple pipeline to leverage the in-context abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint captioning of multiple images, and (3) apply task-specific LoRA tuning using small datasets (e.g., $20\sim 100$ samples) instead of full-parameter tuning with large datasets. We name our models In-Context LoRA (IC-LoRA). This approach requires no modifications to the original DiT models, only changes to the training data. Remarkably, our pipeline generates high-fidelity image sets that better adhere to prompts. While task-specific in terms of tuning data, our framework remains task-agnostic in architecture and pipeline, offering a powerful tool for the community and providing valuable insights for further research on product-level task-agnostic generation systems. We release our code, data, and models at https://github.com/ali-vilab/In-Context-LoRA.
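
Steps (1) and (2) of the pipeline amount to a data-preparation change, which the following minimal sketch illustrates. It is not taken from the released code: the helper names `concat_panels` and `joint_caption`, the side-by-side layout, and the `[IMAGE1]`-style markers in the merged prompt are all assumptions made here for illustration.

```python
import numpy as np


def concat_panels(images):
    """Concatenate same-height H x W x C arrays side by side into one panel.

    Hypothetical helper: the horizontal layout is an assumption of this sketch.
    """
    heights = {img.shape[0] for img in images}
    if len(heights) != 1:
        raise ValueError("all images must share the same height")
    return np.concatenate(images, axis=1)


def joint_caption(overall, per_image):
    """Merge one overall description and per-image descriptions into a single prompt.

    Hypothetical helper: the [IMAGEi] marker format is an assumption of this sketch.
    """
    parts = [overall.strip()]
    for i, text in enumerate(per_image, start=1):
        parts.append(f"[IMAGE{i}] {text.strip()}")
    return " ".join(parts)


# Two dummy images standing in for one training sample of an image set.
left = np.zeros((64, 48, 3), dtype=np.uint8)
right = np.full((64, 48, 3), 255, dtype=np.uint8)

panel = concat_panels([left, right])
caption = joint_caption(
    "Two views of the same product.",
    ["Front view.", "Side view."],
)
```

Under these assumptions, each (panel, caption) pair would serve as one ordinary text-to-image training example for step (3), so the DiT itself needs no architectural change.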