1Institute for AI Industry Research (AIR), Tsinghua University
2Tsinghua Shenzhen International Graduate School, Tsinghua University
3School of Software and Microelectronics, Peking University
4College of Computer Science, Zhejiang University
5College of Computer Science and Technology, Harbin Engineering University
6State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
*Indicates Equal Contribution
†Indicates Corresponding Author
Transforming high-level, abstract ideas into 3D models is a challenging yet crucial task in computer vision and graphics. Existing methods, such as Text-to-3D (T-2-3D), often struggle with intricate and nuanced language inputs and usually require extensive user interaction. Our paper introduces a new framework, Idea-to-3D (Idea-2-3D), which harnesses the power of Large Multimodal Models (LMMs) to address these challenges. Specifically, powered by GPT-4V(ision), Idea-2-3D is adept at interpreting multimodal inputs, including interleaved image-text sequences, to automatically design and generate 3D models. Despite the potential of LMMs in enhancing T-2-3D, integrating and deciphering multimodal inputs presents its own set of difficulties. Idea-2-3D resolves these issues with a distinctive iterative self-refinement technique, a memory module that logs iterative progress, and the capacity to craft high-fidelity 3D models from user ideas. Our extensive user preference experiments substantiate Idea-2-3D's effectiveness, particularly in complex design tasks. The user study shows that in 94.2\% of cases Idea-2-3D meets users' requirements, a degree of match between the IDEA and the resulting 3D models that is 2.3 times higher than that of existing T-2-3D methods. Moreover, in 93.5\% of cases, users judged Idea-2-3D to be better than T-2-3D. Code, data, and models will be made publicly available.
Overview of the Idea-2-3D framework, which employs an LMM to exploit the T-2-3D model's potential through multimodal iterative self-refinement, producing valid T-2-3D prompts for the input user IDEA. Green rounded rectangles indicate steps completed by GPT-4V. Purple rounded rectangles indicate T-2-3D modules, including T2I models and I-2-3D models. The yellow rounded rectangle indicates the off-the-shelf 3D-model multi-view generation algorithm. To obtain a better reconstruction, we remove the image background between steps 2 and 3. Blue indicates the memory module, which stores all feedback from previous rounds, the best 3D model, and the best text prompt.
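To make the loop in this figure concrete, below is a minimal Python sketch of the iterative self-refinement process. All function names (propose_prompt, text_to_image, remove_background, image_to_3d, render_multiview, critique_views) are hypothetical stand-ins for GPT-4V, the T2I model, the I-2-3D model, and the multi-view renderer; they do not come from the Idea-2-3D codebase, and the sketch is an illustration of the workflow rather than a definitive implementation.

# Hypothetical sketch of the Idea-2-3D self-refinement loop; component wrappers
# are assumptions, not the paper's actual API.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Memory:
    """Memory module: keeps all past feedback, the best 3D model, and its prompt."""
    feedback_log: list = field(default_factory=list)
    best_prompt: str = ""
    best_model: Any = None
    best_score: float = float("-inf")

    def update(self, prompt: str, model: Any, score: float, feedback: str) -> None:
        self.feedback_log.append(feedback)
        if score > self.best_score:
            self.best_prompt, self.best_model, self.best_score = prompt, model, score

def idea_to_3d(idea: str,
               propose_prompt: Callable[[str, Memory], str],
               text_to_image: Callable[[str], Any],
               remove_background: Callable[[Any], Any],
               image_to_3d: Callable[[Any], Any],
               render_multiview: Callable[[Any], list],
               critique_views: Callable[[str, list, Memory], tuple],
               rounds: int = 5) -> Any:
    """Run the iterative loop and return the best 3D model found so far."""
    memory = Memory()
    for _ in range(rounds):
        prompt = propose_prompt(idea, memory)      # GPT-4V drafts/refines the T-2-3D prompt
        image = text_to_image(prompt)              # T2I model
        image = remove_background(image)           # clean-up between steps 2 and 3
        model_3d = image_to_3d(image)              # I-2-3D model
        views = render_multiview(model_3d)         # off-the-shelf multi-view generation
        score, feedback = critique_views(idea, views, memory)  # GPT-4V feedback vs. the IDEA
        memory.update(prompt, model_3d, score, feedback)
    return memory.best_model

In this reading, the memory module is what lets each new prompt be conditioned on all earlier feedback while guaranteeing that the returned model never regresses below the best one seen so far.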
The supplementary experiments (sixth and seventh iterations) show that, once our proposed module is removed, increasing the number of iterations brings no significant improvement in the quality of the 3D model. The quantitative analysis shows that our proposed memory module effectively prevents the 3D model generated in each iteration from deviating from the user input. Our proposed feedback effectively speeds up iterative convergence, i.e., the generated 3D model matches the user's input within fewer iteration rounds. If the information about previous models stored in the memory module is removed, the match between the generated 3D model and the user input converges to a lower upper bound.
@article{chen2024idea23d,
title={Idea-2-3D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs},
author={Junhao Chen and Xiang Li and Xiaojun Ye and Chao Li and Zhaoxin Fan and Hao Zhao},
year={2024},
eprint={2404.04363},
archivePrefix={arXiv},
primaryClass={cs.CV}
}