Core Concept
The paper proposes a method that lets pre-trained latent diffusion models reach state-of-the-art results on the image harmonization task by addressing the image distortion caused by VAE compression.
Summary
The paper presents a method called DiffHarmony that adapts a pre-trained latent diffusion model, specifically Stable Diffusion, to the image harmonization task. The key challenges addressed are:
- Computational resource consumption of training diffusion models from scratch: DiffHarmony leverages the pre-trained Stable Diffusion model to quickly converge on the image harmonization task.
- Reconstruction error induced by the VAE compression in latent diffusion models: Two strategies are proposed to mitigate this issue:
  - Performing inference at higher resolutions (512px or 1024px) to generate higher quality initial harmonized images.
  - Introducing an additional refinement stage using a simple U-Net model to further enhance the clarity of the harmonized images.
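To make the effect of higher-resolution inference concrete, the following minimal sketch computes the size of the latent feature map for a few image sizes. The 1/8 spatial reduction comes from the paper's description of the VAE; the 4 latent channels and the 256px comparison point are assumptions based on the usual Stable Diffusion setup, and `latent_grid` is a hypothetical helper name, not part of DiffHarmony.

```python
def latent_grid(height, width, downsample=8, latent_channels=4):
    """Return (channels, h, w) of the VAE latent for an image of the given size.

    downsample=8 follows the 1/8 reduction quoted from the paper;
    latent_channels=4 is an assumption (the usual Stable Diffusion VAE).
    """
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)


# 256px is an assumed baseline; 512px and 1024px are the resolutions named above.
for side in (256, 512, 1024):
    c, h, w = latent_grid(side, side)
    print(f"{side}px image -> {c}x{h}x{w} latent ({h * w} spatial positions)")
```

Doubling the image side quadruples the number of latent positions, so each position has to summarize a smaller part of the scene. This is one plausible reading of why inference at 512px or 1024px reduces the visible reconstruction error.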
Extensive experiments on the iHarmony4 dataset demonstrate the superiority of the proposed DiffHarmony method compared to state-of-the-art image harmonization approaches. The method achieves the best overall performance in terms of PSNR, MSE, and foreground MSE metrics. Further analysis shows that DiffHarmony particularly excels when the foreground region is large, compensating for the reconstruction loss from the VAE compression.
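The three metrics named above can be sketched in a few lines. This is an illustrative implementation using the standard definitions, not the paper's evaluation code: it assumes images are given as flat sequences of 8-bit pixel values, that the mask marks foreground pixels with 1, and that foreground MSE averages the squared error over foreground pixels only.

```python
import math


def mse(pred, target):
    """Mean squared error over all pixels (flat sequences of equal length)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def fmse(pred, target, mask):
    """MSE restricted to foreground pixels (mask value 1). Assumed definition."""
    fg = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    if not fg:
        raise ValueError("mask has no foreground pixels")
    return sum(fg) / len(fg)


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB; max_value=255 assumes 8-bit images."""
    err = mse(pred, target)
    return math.inf if err == 0 else 10.0 * math.log10(max_value ** 2 / err)


# Toy example: only the two foreground pixels differ from the ground truth.
target = [100, 100, 100, 100]
pred = [100, 100, 110, 90]
mask = [0, 0, 1, 1]
print(mse(pred, target), fmse(pred, target, mask), round(psnr(pred, target), 2))
```

In the toy example the foreground covers half the image, so the foreground MSE is twice the whole-image MSE. With a small foreground the gap grows, which is why the two metrics are reported separately.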
Statistics
The composite image $I_c$ and foreground mask $M$ are concatenated as image conditions and input to the adapted Stable Diffusion model.
The harmonized image $\tilde{I}_h$ generated by DiffHarmony is further refined using a U-Net model.
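The concatenation of the composite image and the foreground mask as image conditions can be illustrated with a shape-level sketch. The summary does not give the exact layout, so everything here is an assumption modeled on inpainting-style latent diffusion models: the denoiser input stacks a noisy latent (4 channels), the VAE-encoded composite image (4 channels), and the mask downsampled to latent resolution (1 channel). Tensors are plain nested lists, and `downsample_mask` and `build_denoiser_input` are hypothetical names.

```python
def downsample_mask(mask, factor=8):
    """Nearest-neighbour downsampling: keep every `factor`-th pixel (assumed scheme)."""
    return [row[::factor] for row in mask[::factor]]


def build_denoiser_input(noisy_latent, composite_latent, mask, factor=8):
    """Concatenate conditions along the channel axis.

    Each tensor is a list of 2D channels. The 4 + 4 + 1 channel layout is an
    assumption, not something stated in the summary.
    """
    small_mask = downsample_mask(mask, factor)
    h, w = len(small_mask), len(small_mask[0])
    for channel in noisy_latent + composite_latent:
        if len(channel) != h or any(len(row) != w for row in channel):
            raise ValueError("latent and mask sizes disagree")
    return noisy_latent + composite_latent + [small_mask]


def zeros(c, h, w):
    """Placeholder tensor standing in for a real latent."""
    return [[[0.0] * w for _ in range(h)] for _ in range(c)]


# 16x16 image whose left half is foreground; its latents are 2x2.
mask = [[1 if x < 8 else 0 for x in range(16)] for _ in range(16)]
x = build_denoiser_input(zeros(4, 2, 2), zeros(4, 2, 2), mask)
print(len(x), len(x[0]), len(x[0][0]))  # 9 2 2
```

Under this layout the second stage would then take the decoded output of the diffusion model and pass it through the refinement U-Net; that stage is not sketched here because the summary gives no details about its inputs.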
Quotes
"Directly applying the above diffusion models to the image harmonization task faces the significant challenge of enormous computational resource consumption due to training from scratch."
"The latent diffusion model takes as its input a feature map of an image that has undergone KL-reg VAE encoding (compressing) process, resulting in a reduced resolution of 1/8 relative to the original image."