Idea-2-3D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs

Junhao Chen*1,2, Xiang Li*3, Xiaojun Ye1,4, Chao Li5, Zhaoxin Fan6, Hao Zhao†1

1Institute for AI Industry Research (AIR), Tsinghua University
2Tsinghua Shenzhen International Graduate School, Tsinghua University
3School of Software and Microelectronics, Peking University
4College of Computer Science, Zhejiang University
5College of Computer Science and Technology, Harbin Engineering University
6State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
*Indicates Equal Contribution
†Indicates Corresponding Author

The Idea-2-3D framework synergizes the capabilities of Large Multimodal Models (LMMs), Text-to-Image (T2I), and Image-to-3D (I23D) models to transform complex multimodal IDEA inputs into tangible 3D models. The process begins with the user articulating a high-level 3D design requirement (the IDEA). The LMM then generates a textual prompt (Prompt Generation), which is converted into a 3D model. The resulting models are evaluated through a Multiview Image Generation and Evaluation step, leading to the Selection of an Optimal 3D Model. The Text-to-3D (T-2-3D) prompt is subsequently refined (Feedback Generation) using insights from GPT-4V. An integrated memory module (see Sec. Memory Module), while not depicted here, records each iteration, enabling a multimodal, iterative self-refinement cycle within the framework.
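The loop below is a minimal Python sketch of this cycle. The callables lmm, t2i, i23d, and render_multiview are hypothetical stand-ins for the GPT-4V calls and the off-the-shelf T2I/I23D models; the sketch illustrates the control flow only and is not the released implementation.

# Minimal sketch of the iterative self-refinement cycle. The callables
# lmm, t2i, i23d and render_multiview are hypothetical stand-ins for
# GPT-4V and the off-the-shelf Text-to-Image / Image-to-3D models.
def idea_to_3d(idea, lmm, t2i, i23d, render_multiview, num_iters=5):
    memory = {"feedback": [], "best_model": None, "best_prompt": None}

    for _ in range(num_iters):
        # 1. Prompt Generation: turn the multimodal IDEA plus accumulated
        #    memory into a textual T-2-3D prompt.
        prompt = lmm.generate_prompt(idea, memory)

        # 2. T-2-3D: text -> image -> 3D model.
        model_3d = i23d(t2i(prompt))

        # 3. Multiview Image Generation and Evaluation.
        views = render_multiview(model_3d)
        is_better = lmm.evaluate(idea, views, memory["best_model"])

        # 4. Selection of an Optimal 3D Model.
        if memory["best_model"] is None or is_better:
            memory["best_model"], memory["best_prompt"] = model_3d, prompt

        # 5. Feedback Generation: critique the current result so the next
        #    prompt can be refined.
        memory["feedback"].append(lmm.feedback(idea, views, prompt))

    return memory["best_model"], memory["best_prompt"]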

Abstract


Transforming high-level, abstract ideas into 3D models is a challenging yet crucial task in computer vision and graphics. Existing methods, such as Text-to-3D (T-2-3D), often perform poorly on intricate and nuanced language inputs and usually require extensive user interaction. Our paper introduces a new framework, Idea-to-3D (Idea-2-3D), which harnesses the power of Large Multimodal Models (LMMs) to address these challenges. Specifically, powered by GPT-4V(ision), Idea-2-3D interprets multimodal inputs, including interleaved image-text sequences, to automatically design and generate 3D models. Despite the potential of LMMs to enhance T-2-3D, integrating and interpreting multimodal inputs presents its own set of difficulties. Idea-2-3D resolves these issues with a distinctive iterative self-refinement technique, a memory module that logs iterative progress, and the capacity to craft high-fidelity 3D models from user ideas. Our extensive user preference experiments substantiate Idea-2-3D's effectiveness, particularly on complex design tasks. The user study shows that in 94.2% of cases Idea-2-3D meets users' requirements, a degree of match between IDEA and 3D model that is 2.3 times higher than existing T-2-3D methods. Moreover, in 93.5% of cases, users judged Idea-2-3D to be better than T-2-3D. Code, data, and models will be made publicly available.

Method


Overview of the Idea-2-3D framework, which employs an LMM to explore the T-2-3D model's potential through multimodal iterative self-refinement, producing valid T-2-3D prompts for the user's input IDEA. Green rounded rectangles indicate steps completed by GPT-4V. Purple rounded rectangles indicate the T-2-3D modules, comprising T2I and I23D models. The yellow rounded rectangle indicates the off-the-shelf 3D-model multi-view generation algorithm. To obtain a better reconstruction, we remove the image background between steps 2 and 3. Blue indicates the memory module, which stores all feedback from previous rounds, the best 3D model, and the best text prompt.
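One plausible shape for the memory module described above is sketched below in Python. The dataclass fields and the as_context serialization are assumptions about how the feedback history, best 3D model, and best text prompt could be stored and fed back to GPT-4V in the next round; they are not the authors' exact data structures.

from dataclasses import dataclass, field

@dataclass
class IterationRecord:
    t23d_prompt: str      # text prompt fed to the T2I + I23D chain
    feedback: str         # GPT-4V critique of the rendered multi-views
    improved: bool        # whether this round beat the stored best

@dataclass
class Memory:
    history: list[IterationRecord] = field(default_factory=list)
    best_prompt: str | None = None
    best_model_path: str | None = None  # best 3D asset produced so far

    def log(self, record: IterationRecord, model_path: str) -> None:
        self.history.append(record)
        if record.improved:
            self.best_prompt = record.t23d_prompt
            self.best_model_path = model_path

    def as_context(self) -> str:
        # Flatten the feedback history into text that is prepended to the
        # next GPT-4V query, together with the best prompt found so far.
        lines = [f"Round {i}: prompt={r.t23d_prompt!r}, feedback={r.feedback!r}"
                 for i, r in enumerate(self.history, start=1)]
        if self.best_prompt is not None:
            lines.append(f"Best prompt so far: {self.best_prompt!r}")
        return "\n".join(lines)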

Ablation Study

The supplementary experiments (sixth and seventh iterations) show that, once our proposed modules are removed, increasing the number of iterations yields no significant improvement in 3D model quality. The quantitative analysis shows that the proposed memory module effectively prevents the 3D model generated at each iteration from drifting away from the user input. The proposed feedback mechanism accelerates iterative convergence, i.e., the generated 3D model matches the user's input within fewer iterations. If the stored information about previous models is removed from the memory module, the quality to which the generated 3D model can converge toward the user input is capped at a lower level.
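For concreteness, the sketch below shows how the two ablations discussed here (removing feedback, removing the memory of previous rounds) could be expressed as switches on the iterative loop; all names are hypothetical stand-ins, and the snippet only illustrates where each ablation intervenes, not the paper's released code.

# Sketch of the two ablation variants as switches on the iterative loop.
def idea_to_3d_ablated(idea, lmm, t2i, i23d, render_multiview,
                       num_iters=7, use_feedback=True, use_memory=True):
    memory = {"feedback": [], "best_model": None, "best_prompt": None}
    empty = {"feedback": [], "best_model": None, "best_prompt": None}

    for _ in range(num_iters):
        # Memory ablation: hide all previous rounds from prompt generation.
        prompt = lmm.generate_prompt(idea, memory if use_memory else empty)

        model_3d = i23d(t2i(prompt))
        views = render_multiview(model_3d)

        if memory["best_model"] is None or lmm.evaluate(idea, views, memory["best_model"]):
            memory["best_model"], memory["best_prompt"] = model_3d, prompt

        # Feedback ablation: skip the GPT-4V critique entirely.
        if use_feedback:
            memory["feedback"].append(lmm.feedback(idea, views, prompt))

    return memory["best_model"]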



Citation


@article{chen2024idea23d,
  title={Idea-2-3D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs}, 
  author={Junhao Chen and Xiang Li and Xiaojun Ye and Chao Li and Zhaoxin Fan and Hao Zhao},
  year={2024},
  eprint={2404.04363},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}