Hoodie: Hierarchical Point Cloud and Latent Code Diffusion for Dressed Avatar Generation

Guiyu Zhang*1, Zhenyu Ding*1, Huan-ang Gao1, Xiaoxue Chen1,
Zhaoxin Fan2, Jianzhu Ma1, Jian Zhao†3,4, Hao Zhao†1
1 Institute for AI Industry Research (AIR), Tsinghua University    
2 Renmin University of China    3 EVOL Lab, TeleAI, China Telecom    4 Northwestern Polytechnical University   
*Indicates Equal Contribution
†Indicates Corresponding Author


Abstract

Diffusion probabilistic models have been successfully adapted to generate 3D point clouds with impressive fidelity. However, existing methods can only generate a single point cloud from noise, leaving joint generation and conditional generation out of reach. Both demands arise in real-world problems, and we take dressed avatar generation as an example. Directly generating a dressed avatar as a single point cloud cannot support changing garments, while generating the garment and the avatar as two independent point clouds inevitably leads to mismatch. Meanwhile, generating a matched garment for an undressed avatar is not possible with current diffusion techniques. To this end, we present Hoodie, the first method that resolves these issues. Technically, Hoodie first trains two separate point cloud diffusion models with global latent codes, then trains a latent code diffusion model on the concatenation of the human and garment latents. This hierarchical architecture supports not only the joint generation of a human and a matched garment, but also conditional inference that generates a matched garment for a given 3D human input. In addition, we integrate a point cloud upsampling GAN to improve the uniformity of the generated point clouds. Large-scale quantitative and qualitative evaluations show that Hoodie achieves strong performance on the new tasks it enables. Code, data, and models will be publicly available.
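The two-stage hierarchy described above can be sketched in a few lines. Everything here (latent sizes, the noise schedule, and all variable names) is an illustrative assumption rather than the authors' implementation; the point is only that stage 1 attaches a global latent code to each point cloud diffusion model, and stage 2 runs standard forward diffusion on the concatenation of paired latents:

```python
import numpy as np

# Illustrative sketch of Hoodie's hierarchy (names and sizes are assumptions):
# stage 1 trains two point cloud diffusion models, each with a global latent code;
# stage 2 trains a diffusion model over the concatenation of the paired latents.

LATENT_DIM = 8          # per-shape latent size (illustrative)
T = 50                  # number of diffusion steps (illustrative)

betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, noise):
    """Forward diffusion: noise a clean latent x0 to step t."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

# Paired human/garment latents would come from the two stage-1 models;
# here we fabricate a correlated pair just to show the concatenation.
rng = np.random.default_rng(0)
z_human = rng.normal(size=LATENT_DIM)
z_garment = z_human + 0.1 * rng.normal(size=LATENT_DIM)  # matched garment latent
z_joint = np.concatenate([z_human, z_garment])           # stage-2 training target

noise = rng.normal(size=z_joint.shape)
z_noisy = q_sample(z_joint, T - 1, noise)
# A denoiser eps_theta(z_noisy, t) would be trained to regress `noise`.
print(z_joint.shape)  # (16,)
```

A trained stage-2 denoiser then captures the correlation between the two halves of `z_joint`, which is what makes matched generation possible.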

Motivation & Methods

Explanation of the underlying mechanisms of our proposed method. (a) visualizes the latent spaces of different point clouds, where paired human and garment latents are assigned identical colors. The challenge is that existing point cloud diffusion methods cannot bridge these two spaces to generate paired outputs. To address this gap, we present Hoodie, a hierarchical diffusion-based framework for dressed avatar generation that models the joint distribution across the two latent spaces. Hoodie supports not only the joint generation of humans and their matched garments, as shown in (b), but also conditional garment generation for a given human point cloud, as shown in (c).


The overall pipeline of Hoodie and the implementation of joint generation.
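Joint generation can be sketched as reverse diffusion in the concatenated latent space followed by splitting the result into the two per-shape latents. The denoiser below is a placeholder and all names are assumptions, not Hoodie's API; the real `eps_theta` is the trained stage-2 network, and the decoded latents would condition the two stage-1 point cloud diffusion models:

```python
import numpy as np

# Sketch of joint inference: run DDPM-style reverse diffusion over the joint
# latent, then split it into a human latent and a garment latent.
D, T = 16, 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)
rng = np.random.default_rng(1)

def eps_theta(z, t):
    """Placeholder denoiser; the real model is a trained network."""
    return 0.1 * z

def sample_joint():
    z = rng.normal(size=D)                       # start from pure noise
    for t in range(T - 1, -1, -1):
        eps = eps_theta(z, t)
        mean = (z - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(alphas[t])
        z = mean + (np.sqrt(betas[t]) * rng.normal(size=D) if t > 0 else 0.0)
    return z

z_joint = sample_joint()
z_human, z_garment = z_joint[:D // 2], z_joint[D // 2:]
# Decoding z_human / z_garment with the two stage-1 point cloud diffusion
# models would yield the matched human and garment point clouds.
```

Because the two halves are denoised together by one model, their correlation is preserved, which is what keeps the generated garment matched to the generated human.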


Illustration of the implementation of our proposed conditional generation.
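One common way to realize this kind of conditioning is replacement (inpainting-style) sampling: at every reverse step, the human half of the joint latent is clamped to a forward-noised version of the known human latent, so only the garment half is actually generated. Whether Hoodie uses exactly this scheme is an assumption, and all names below are illustrative:

```python
import numpy as np

# Hedged sketch of conditional garment generation via replacement sampling.
D, T = 16, 50
H = D // 2               # human half of the joint latent
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)
rng = np.random.default_rng(2)

def eps_theta(z, t):
    return 0.1 * z       # placeholder for the trained joint-latent denoiser

def sample_garment_given_human(z_human):
    z = rng.normal(size=D)
    for t in range(T - 1, -1, -1):
        # clamp the human half to a forward-noised copy of the known latent
        noise = rng.normal(size=H)
        z[:H] = np.sqrt(alphas_bar[t]) * z_human + np.sqrt(1.0 - alphas_bar[t]) * noise
        eps = eps_theta(z, t)
        mean = (z - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(alphas[t])
        z = mean + (np.sqrt(betas[t]) * rng.normal(size=D) if t > 0 else 0.0)
    z[:H] = z_human      # final replacement with the exact human latent
    return z[H:]         # only the garment latent is newly generated

z_human = rng.normal(size=H)   # would come from encoding the input 3D human
z_garment = sample_garment_given_human(z_human)
```

The returned garment latent would then condition the stage-1 garment point cloud diffusion model to produce a garment matched to the given human.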

Results

    
    
    
    
    
    

Method             | Cityscapes                                  | ADE20K
                   | mIoU ↑        Acc ↑         FID ↓           | mIoU ↑        Acc ↑          FID ↓
Normal Prior       | 65.14 (+0.00) 94.14 (+0.00) 23.35 (+0.00)   | 20.73 (+0.00) 61.14 (+0.00)  20.58 (+0.00)
Spatial Prior      | 66.77 (+1.63) 94.29 (+0.15) 12.83 (-10.52)  | 20.86 (+0.13) 64.46 (+3.32)  16.03 (-4.55)
Categorical Prior  | 66.86 (+1.72) 94.54 (+0.40) 11.63 (-11.72)  | 21.86 (+1.13) 66.63 (+5.49)  16.56 (-4.02)
Joint Prior        | 67.92 (+2.78) 94.65 (+0.51) 10.53 (-12.82)  | 25.61 (+4.88) 71.79 (+10.65) 12.66 (-7.92)
Quantitative comparison of different noise priors (deltas are relative to the Normal Prior).

BibTeX

If you find our work useful in your research, please consider citing:
@article{gao2024scp,
  title={SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior},
  author={Gao, Huan-ang and Gao, Mingju and Li, Jiaju and Li, Wenyi and Zhi, Rong and Tang, Hao and Zhao, Hao},
  journal={arXiv preprint arXiv:2403.09638},
  year={2024}
}