In-Context Translation: Towards Unifying Image Recognition, Processing, and Generation

Abstract

We propose In-Context Translation (ICT), a general learning framework to unify visual recognition (e.g., semantic segmentation), low-level image processing (e.g., denoising), and conditional image generation (e.g., edge-to-image synthesis). Thanks to this unification, ICT significantly reduces the inherent inductive bias that comes with designing models for specific tasks, and it maximizes mutual enhancement across similar tasks. However, unifying a large number of tasks is non-trivial because their data formats and training pipelines differ. To this end, ICT introduces two designs. First, it standardizes the input-output data of different tasks into RGB image pairs; e.g., semantic segmentation data pairs an RGB image with its segmentation mask in the same RGB format. This turns different tasks into a general translation task between two RGB images. Second, it standardizes the training of different tasks into a general in-context learning scheme, where "in-context" means the input comprises an example input-output pair of the target task and a query image. The learning objective is to generate the "missing" data paired with the query. The implicit translation process is thus between the query and the generated image. In experiments, ICT unifies ten vision tasks and showcases impressive performance on their respective benchmarks. Notably, compared to its competitors, e.g., Painter and PromptDiffusion, ICT trained on only 4 RTX 3090 GPUs is shown to be more efficient and less costly in training.
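The two designs can be illustrated with a minimal sketch. It is not the authors' implementation: the palette, the function names `mask_to_rgb` and `build_in_context_sample`, and the nested-list image representation are all assumptions made for illustration. The first function renders a class mask as an RGB image, so that a segmentation sample becomes an RGB image pair; the second assembles an in-context sample whose context is the example pair plus the query and whose target is the "missing" image paired with the query.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels

# Hypothetical palette (class id -> RGB colour); not taken from the paper.
PALETTE: Dict[int, Pixel] = {0: (0, 0, 0), 1: (255, 0, 0), 2: (0, 255, 0)}


def mask_to_rgb(mask: List[List[int]]) -> Image:
    """Render an integer class mask as an RGB image using PALETTE."""
    return [[PALETTE[class_id] for class_id in row] for row in mask]


def build_in_context_sample(
    example_in: Image, example_out: Image, query_in: Image, query_out: Image
) -> Tuple[List[Image], Image]:
    """Return (context, target) for one training sample.

    context: the example input-output pair followed by the query image.
    target:  the image paired with the query, which the model must generate.
    """
    context = [example_in, example_out, query_in]
    return context, query_out


# Toy 2x2 data: two RGB images and their class masks.
example_image: Image = [[(10, 10, 10), (20, 20, 20)], [(30, 30, 30), (40, 40, 40)]]
query_image: Image = [[(50, 50, 50), (60, 60, 60)], [(70, 70, 70), (80, 80, 80)]]
example_mask = [[0, 1], [2, 1]]
query_mask = [[1, 1], [0, 2]]

context, target = build_in_context_sample(
    example_image, mask_to_rgb(example_mask), query_image, mask_to_rgb(query_mask)
)
```

Under this framing, other tasks (e.g., denoising or edge-to-image synthesis) would only change how the paired images are produced; the sample layout stays the same.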
