Gen3DEval: Using vLLMs for automatic evaluation of generated 3D objects

Meta AI · University College London
🎉 Accepted to CVPR 2025! 🎉

Gen3DEval is a holistic ranking metric that assesses the quality of generated 3D objects along three dimensions (appearance, surface quality, and text fidelity), using a vision large language model (vLLM) trained to choose the better of two objects on each dimension.

Abstract

Rapid advancements in text-to-3D generation require robust and scalable evaluation metrics that align closely with human judgment, a need unmet by current metrics such as PSNR and CLIP, which require ground-truth data or focus only on prompt fidelity. To address this, we introduce Gen3DEval, a novel evaluation framework that leverages vision large language models (vLLMs) specifically fine-tuned for 3D object quality assessment. Gen3DEval evaluates text fidelity, appearance, and surface quality by analyzing 3D surface normals, without requiring ground-truth comparisons, bridging the gap between automated metrics and user preferences. Compared to state-of-the-art task-agnostic models, Gen3DEval demonstrates superior performance in user-aligned evaluations, establishing it as a comprehensive and accessible benchmark for future research on text-to-3D generation.

Overview of the Method

[Figure: Overview of the two-stage Gen3DEval pipeline]

In stage 1, we train a vLLM to choose which of two objects is better in terms of appearance, surface quality, or text fidelity. This stage has two parts: in pre-training, we train the vision-to-language projector using image-summary VQA; in supervised fine-tuning (SFT), we use comparison data to train for instruction following and preference evaluation. In stage 2, we compute a ranking metric for a set of methods by applying the trained vLLM from stage 1 pairwise on Gen3DEval-Bench prompts.
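To make the stage-2 ranking concrete, below is a minimal Python sketch of how pairwise judgements could be aggregated into a ranking. It assumes a hypothetical `vllm_judge(prompt, asset_a, asset_b, dimension)` wrapper around the trained model that returns "A" or "B" for the preferred object; the exact aggregation used by Gen3DEval may differ.

```python
# Minimal sketch of the stage-2 ranking step (illustrative, not the paper's code).
from itertools import combinations
from collections import defaultdict

def rank_methods(methods, prompts, assets, vllm_judge, dimension="appearance"):
    """Rank methods by pairwise win rate on a set of benchmark prompts.

    assets[method][prompt] is assumed to hold the rendered views (RGB and
    surface normals) of the object that `method` generated for `prompt`.
    """
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for prompt in prompts:
        for m_a, m_b in combinations(methods, 2):
            verdict = vllm_judge(prompt, assets[m_a][prompt], assets[m_b][prompt], dimension)
            winner = m_a if verdict == "A" else m_b
            wins[winner] += 1
            comparisons[m_a] += 1
            comparisons[m_b] += 1
    # Higher win rate -> better rank (1 = best).
    win_rate = {m: wins[m] / max(comparisons[m], 1) for m in methods}
    return sorted(methods, key=lambda m: win_rate[m], reverse=True)
```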

Training Dataset

We use single- and multi-view RGB and surface-normal renderings of 3D objects generated from prompts. We then perturb these objects to simulate the appearance, surface, and text-related artefacts commonly produced by generative 3D methods.
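As an illustration (not the paper's exact pipeline), one way to turn such perturbed renderings into comparison training samples is sketched below; `perturbations` maps a dimension name to a hypothetical perturbation function, and the clean asset is always the preferred answer.

```python
import random

def make_comparison_sample(prompt, clean_views, perturbations):
    """Build one comparison sample from clean renderings of a 3D asset.

    clean_views:    RGB + surface-normal renderings of the unperturbed object.
    perturbations:  dict mapping a dimension name ("appearance",
                    "surface quality", "text fidelity") to a function that
                    returns an artefact-laden copy of the renderings.
    """
    dimension, perturb = random.choice(list(perturbations.items()))
    question = (f"Prompt: {prompt}\n"
                f"Which object is better in terms of {dimension}: A or B?")
    clean_is_a = random.random() < 0.5  # randomise which side holds the clean asset
    if clean_is_a:
        pair, answer = (clean_views, perturb(clean_views)), "A"
    else:
        pair, answer = (perturb(clean_views), clean_views), "B"
    return {"question": question, "objects": pair, "answer": answer}
```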

Pre-Training Dataset

[Figure: Pre-training dataset examples (image-summary VQA)]

Supervised Fine-Tuning Dataset

[Figure: Supervised fine-tuning dataset examples (comparison data)]

Benchmark: Gen3DEval-Bench

We create Gen3DEval-Bench, a diverse set of 80 prompts covering a range of objects, textures, and levels of composition. We keep the benchmark deliberately small because text-to-3D generation is time- and computation-intensive, so that it remains easy to run in full. The prompts are split evenly between 40 animate objects (humanoids, animals) and 40 inanimate objects, and into 43 single-object and 37 composite prompts, i.e., prompts combining multiple objects. The average prompt length is 12.86 words.

Benchmark       | Num. Prompts | Avg. Words per Prompt | Animate Objects | Inanimate Objects | Single-Object Prompts | Multi-Object Prompts
T3Bench         | 300          | 7.98                  | 36              | 264               | 100                   | 200
ChatGPTEval3D   | 110          | 11.49                 | 18              | 92                | 65                    | 45
DreamFusion     | 404          | 6.98                  | 211             | 192               | 154                   | 250
Gen3DEval-Bench | 80           | 12.86                 | 40              | 40                | 43                    | 37

Comparing Gen3DEval-Bench with existing 3D generation prompt benchmarks.
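For reference, the Gen3DEval-Bench row of the table can be recomputed from the prompt list with a small script like the one below, assuming each prompt is annotated with `animate` and `composite` flags (hypothetical field names, not part of the released benchmark format).

```python
def benchmark_stats(prompts):
    """prompts: list of dicts like {"text": str, "animate": bool, "composite": bool}."""
    n = len(prompts)
    avg_words = sum(len(p["text"].split()) for p in prompts) / n
    return {
        "num_prompts": n,
        "avg_words_per_prompt": round(avg_words, 2),
        "animate": sum(p["animate"] for p in prompts),
        "inanimate": sum(not p["animate"] for p in prompts),
        "single_object": sum(not p["composite"] for p in prompts),
        "multi_object": sum(p["composite"] for p in prompts),
    }
```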

Asset Comparison Examples

Leaderboard Rankings

Method | Appearance | Text Fidelity | Overall
Gen3DEval applied to 3D generation methods on Gen3DEval-Bench. Methods are ranked from Best (1) to Worst (8) on their text fidelity and appearance scores; these two scores are combined into the overall score. Image-to-3D methods are denoted with *.
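A minimal sketch of the leaderboard aggregation follows, under the assumption that the overall score is a simple average of the appearance and text-fidelity scores; the exact weighting used in the paper may differ.

```python
def overall_ranking(appearance_scores, text_fidelity_scores):
    """Both arguments map method name -> per-dimension score (higher is better)."""
    overall = {m: 0.5 * (appearance_scores[m] + text_fidelity_scores[m])
               for m in appearance_scores}
    # Rank methods from Best (1) to Worst (N) by overall score.
    ordered = sorted(overall, key=overall.get, reverse=True)
    return {m: rank for rank, m in enumerate(ordered, start=1)}
```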

BibTeX


@inproceedings{maiti2025gen3DEval,
  author    = {Shalini Maiti and Lourdes Agapito and Filippos Kokkinos},
  title     = {Gen3DEval: Using vLLMs for automatic evaluation of generated 3D objects},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025}
}