Text-conditional diffusion models are able to generate high-fidelity images with diverse contents. However, linguistic representations frequently give ambiguous descriptions of the intended target image, requiring additional control signals to bolster the efficacy of text-guided diffusion models. In this work, we propose Cocktail, a pipeline to mix various modalities into one embedding, combined with a generalized ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a spatial guidance sampling method, to realize multi-modal and spatially-refined control for text-conditional diffusion models. Specifically, we introduce a hyper-network, gControlNet, dedicated to the alignment and infusion of control signals from disparate modalities into the pre-trained diffusion model. gControlNet is capable of accepting flexible modality signals, encompassing the simultaneous reception of any combination of modality signals or the supplementary fusion of multiple modality signals. The control signals are then fused and injected into the backbone model according to our proposed ControlNorm. Furthermore, our spatial guidance sampling method incorporates the control signal into the designated region, thereby avoiding the appearance of undesired objects in the generated image. We demonstrate the results of our method in controlling various modalities, showing high-quality synthesis and fidelity to multiple external signals.