Introducing an additional mask prompt to better model the relationship between foreground and background, enabling the diffusion model to generate higher-quality and more controllable images that maintain higher fidelity to the reference image.
Visual Autoregressive (VAR) modeling redefines autoregressive learning on images as a coarse-to-fine "next-scale prediction" strategy, which allows autoregressive transformers to learn visual distributions fast and generalize well, surpassing diffusion models in image synthesis.