Ultraman: Single Image 3D Human Reconstruction
with Ultra Speed and Detail

Mingjin Chen*1,2, Junhao Chen*1,3, Xiaojun Ye1,4, Huan-ang Gao1, Xiaoxue Chen1, Zhaoxin Fan5, Hao Zhao†1

1Institute for AI Industry Research (AIR), Tsinghua University
2Beijing Normal University - Hong Kong Baptist University United International College
3Tsinghua Shenzhen International Graduate School, Tsinghua University
4College of Computer Science, Zhejiang University
5State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
*Indicates Equal Contribution
†Indicates Corresponding Author

Without any 3D or 2D pre-training, our proposed Ultraman quickly synthesizes complete, realistic, and highly detailed 3D avatars from a single input RGB image.


3D human body reconstruction has been a challenge in the field of computer vision. Previous methods are often time-consuming and struggle to capture the detailed appearance of the human body. In this paper, we propose Ultraman, a new method for fast reconstruction of textured 3D human models from a single image. Compared to existing techniques, Ultraman greatly improves reconstruction speed and accuracy while preserving high-quality texture details. We present a new framework for human reconstruction consisting of three parts: geometric reconstruction, texture generation, and texture mapping. First, a mesh reconstruction framework accurately extracts the 3D human shape from a single image. At the same time, we propose a method to generate multi-view consistent images of the human body from a single image. Finally, these are combined with a novel texture mapping method that optimizes texture details and ensures color consistency during reconstruction. Through extensive experiments and evaluations, we demonstrate the superior performance of Ultraman on various standard datasets. In addition, Ultraman outperforms state-of-the-art methods in terms of human rendering quality and speed. Upon acceptance of the article, we will make the code and data publicly available.


Overview of the Ultraman framework. Ultraman takes an image $I_0$ of a human as input. The blue rounded dashed rectangle denotes the Prompt Generation Module, in which the responses to the questions are generated by GPT4V. The red rounded dashed rectangles denote the mesh reconstruction modules, which generate the mesh and UV map. The yellow rounded dashed rectangle denotes the multi-view texture generation module. It includes a control model $\mathcal{M}_c$, which controls texture generation for the current viewpoint by taking as input the prompt for that viewpoint, the depth map rendered from the mesh, and the input image. The Texturing module pastes the corresponding texture back onto the mesh according to the generation mask.
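
The final step of the caption, pasting a generated texture back according to a generation mask, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the generated view has already been projected into UV space, that the texture is an H x W x 3 array, and that the mask is a boolean H x W array. The function name `paste_texture` and all parameter names are hypothetical.

```python
import numpy as np


def paste_texture(uv_texture, view_texture, generation_mask):
    """Return a copy of uv_texture with masked texels taken from view_texture.

    Hypothetical helper for illustration only.
    uv_texture:      (H, W, 3) current texture map.
    view_texture:    (H, W, 3) texture generated for the current viewpoint,
                     assumed to be already projected into UV space.
    generation_mask: (H, W) array; truthy where the current view should write.
    """
    if uv_texture.shape != view_texture.shape:
        raise ValueError("uv_texture and view_texture must have the same shape")
    if generation_mask.shape != uv_texture.shape[:2]:
        raise ValueError("generation_mask must match the texture height and width")

    mask = generation_mask.astype(bool)
    updated = uv_texture.copy()          # leave the caller's texture untouched
    updated[mask] = view_texture[mask]   # overwrite only the masked texels
    return updated


if __name__ == "__main__":
    texture = np.zeros((2, 2, 3))
    generated = np.ones((2, 2, 3))
    mask = np.array([[True, False], [False, True]])
    print(paste_texture(texture, generated, mask)[..., 0])
```

Calling this once per viewpoint, each time with that viewpoint's mask, accumulates a full texture map. How overlapping regions between views are resolved is a design choice that this sketch leaves to the mask.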

Visual Comparisons

Animated Avatars

DIY clothes

You can make DIY designs for your clothes according to your needs.


  title={Ultraman: Single Image 3D Human Reconstruction with Ultra Speed and Detail}, 
  author={Mingjin Chen and Junhao Chen and Xiaojun Ye and Huan-ang Gao and Xiaoxue Chen and Zhaoxin Fan and Hao Zhao},