SCP-Diff | Project Page

SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior

Huan-ang Gao^*1, Mingju Gao^*1, Jiaju Li^1,2, Wenyi Li¹,
Rong Zhi³, Hao Tang⁴, Hao Zhao^†1

¹ Institute for AI Industry Research (AIR), Tsinghua University
² University of Chinese Academy of Sciences ³ Mercedes-Benz Group China Ltd. ⁴ Carnegie Mellon University
^*Indicates Equal Contribution ^†Indicates Corresponding Author

Abstract

Semantic image synthesis (SIS) shows good promises for sensor simulation. However, current best practices in this field, based on GANs, have not yet reached the desired level of quality. As latent diffusion models make significant strides in image generation, we are prompted to evaluate ControlNet, a notable method for its dense control capabilities. Our investigation uncovered two primary issues with its results: the presence of weird sub-structures within large semantic areas and the misalignment of content with the semantic mask. Through empirical study, we pinpointed the cause of these problems as a mismatch between the noised training data distribution and the standard normal prior applied at the inference stage. To address this challenge, we developed specific noise priors for SIS, encompassing spatial, categorical, and a novel spatial-categorical joint prior for inference. This approach, which we have named SCP-Diff, has yielded exceptional results, achieving an FID of 10.53 on Cityscapes and 12.66 on ADE20K.

Our code release is undergoing a review process within the company of our co-authors due to regulations. If you meet with problems when trying to reproduce our results or have problems with our implementation, feel free to contact us :)

Results

Quantitative Comparison of Different Noise Priors
Method	Cityscapes	ADE20K
mIoU ↑	Acc ↑	FID ↓	mIoU ↑	Acc ↑	FID ↓
Normal Prior	65.14 (+0.00)	94.14 (+0.00)	23.35 (+0.00)	20.73 (+0.00)	61.14 (+0.00)	20.58 (+0.00)
Spatial Prior	66.77 (+1.63)	94.29 (+0.15)	12.83 (-10.52)	20.86 (+0.13)	64.46 (+3.32)	16.03 (-4.55)
Categorical Prior	66.86 (+1.72)	94.54 (+0.40)	11.63 (-11.72)	21.86 (+1.13)	66.63 (+5.49)	16.56 (-4.02)
Joint Prior	67.92 (+2.78)	94.65 (+0.51)	10.53 (-12.82)	25.61 (+4.88)	71.79 (+10.65)	12.66 (-7.92)

Method

Cityscapes

ADE20K

mIoU ↑

Acc ↑

FID ↓

mIoU ↑

Acc ↑

FID ↓

Normal Prior

65.14 (+0.00)

94.14 (+0.00)

23.35 (+0.00)

20.73 (+0.00)

61.14 (+0.00)

20.58 (+0.00)

Spatial Prior

66.77 (+1.63)

94.29 (+0.15)

12.83 (-10.52)

20.86 (+0.13)

64.46 (+3.32)

16.03 (-4.55)

Categorical Prior

66.86 (+1.72)

94.54 (+0.40)

11.63 (-11.72)

21.86 (+1.13)

66.63 (+5.49)

16.56 (-4.02)

Joint Prior

67.92 (+2.78)

94.65 (+0.51)

10.53 (-12.82)

25.61 (+4.88)

71.79 (+10.65)

12.66 (-7.92)

Quantitative Comparison of Different Noise Priors

@article{gao2024scp, title={SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior}, author={Gao, Huan-ang and Gao, Mingju and Li, Jiaju and Li, Wenyi and Zhi, Rong and Tang, Hao and Zhao, Hao}, journal={arXiv preprint arXiv:2403.09638}, year={2024} }

SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior

Abstract

Method

Overall pipeline for our proposed method.

Results

BibTeX