Rapid advancements in text-to-3D generation require robust and scalable evaluation metrics that align closely with human judgment, a need unmet by current metrics such as PSNR and CLIP, which either require ground-truth data or focus only on prompt fidelity. To address this, we introduce Gen3DEval, a novel evaluation framework that leverages vision large language models (vLLMs) specifically fine-tuned for 3D object quality assessment. Gen3DEval evaluates text fidelity, appearance, and surface quality, analyzing 3D surface normals without requiring ground-truth comparisons, and thereby bridges the gap between automated metrics and user preferences. Compared to state-of-the-art task-agnostic models, Gen3DEval demonstrates superior performance in user-aligned evaluations, establishing it as a comprehensive and accessible benchmark for future research on text-to-3D generation.
In stage 1, we train a vLLM to choose which of two objects is better in terms of appearance, surface quality, or text fidelity. This stage has two parts: in pre-training, we train the vision-to-language projector using image-summary VQA; in supervised fine-tuning (SFT), we use comparison data to train for instruction following and preference evaluation. In stage 2, we compute a ranking metric over a set of methods by applying the trained vLLM from stage 1 pairwise on Gen3DEval-Bench prompts.
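As a minimal sketch of how such a pairwise ranking could be aggregated, assume the stage-1 vLLM is exposed through a hypothetical `compare(prompt, method_a, method_b, axis)` call that returns which of the two generated objects is preferred. The win-rate aggregation below is an illustrative assumption, not necessarily the exact metric used by Gen3DEval:

```python
from itertools import combinations
from collections import defaultdict

def rank_methods(methods, prompts, compare, axis="appearance"):
    """Rank text-to-3D methods by pairwise win rate over a prompt set.

    `compare(prompt, method_a, method_b, axis)` is a hypothetical wrapper
    around the fine-tuned vLLM that returns "a" or "b" for the preferred
    object on the given evaluation axis.
    """
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for prompt in prompts:
        for m_a, m_b in combinations(methods, 2):
            winner = compare(prompt, m_a, m_b, axis)
            wins[m_a if winner == "a" else m_b] += 1
            comparisons[m_a] += 1
            comparisons[m_b] += 1
    # Win rate = fraction of pairwise comparisons a method wins.
    return sorted(methods, key=lambda m: wins[m] / comparisons[m], reverse=True)
```

Any aggregation over pairwise preferences (win rate, Elo, Bradley-Terry) could be substituted here; the key point is that the ranking is derived purely from the vLLM's pairwise judgments.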
We use single- and multi-view RGB and surface-normal renderings of 3D objects generated from prompts. We then perturb these objects to simulate common appearance, surface, and text-related artefacts produced by generative 3D methods.
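To make the kind of perturbations concrete, the sketch below applies simple synthetic corruptions to rendered RGB images and normal maps (Gaussian blur as an appearance artefact, per-pixel noise on normals as a surface artefact). The specific corruption types and strengths are illustrative assumptions rather than the exact augmentations used to build our training data:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def perturb_appearance(rgb, sigma=2.0):
    """Simulate a blurry, washed-out appearance artefact via Gaussian blur.
    `rgb` is an HxWx3 float array in [0, 1]; `sigma` is an illustrative strength."""
    return np.clip(gaussian_filter(rgb, sigma=(sigma, sigma, 0)), 0.0, 1.0)

def perturb_surface(normals, noise_std=0.15, seed=0):
    """Simulate a bumpy, noisy surface artefact by jittering unit normals
    and re-normalising. `normals` is an HxWx3 array of unit vectors."""
    rng = np.random.default_rng(seed)
    noisy = normals + rng.normal(0.0, noise_std, size=normals.shape)
    return noisy / np.linalg.norm(noisy, axis=-1, keepdims=True)
```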
We create Gen3DEval-Bench, a diverse set of 80 prompts covering a range of objects, textures, and levels of composition. We keep the benchmark small because text-to-3D generation is time- and compute-intensive, so that it remains easily accessible. The prompts are split evenly between 40 animate (humanoids, animals) and 40 inanimate objects, and into 43 single-object and 37 composite prompts, i.e., prompts combining multiple objects. The average prompt length is 12.86 words.
Benchmark | Prompts | Avg. Words per Prompt | Animate Objects | Inanimate Objects | Single-Object Prompts | Multi-Object Prompts |
---|---|---|---|---|---|---|
T3Bench | 300 | 7.98 | 36 | 264 | 100 | 200 |
ChatGPTEval3D | 110 | 11.49 | 18 | 92 | 65 | 45 |
DreamFusion | 404 | 6.98 | 211 | 192 | 154 | 250 |
Gen3DEval-Bench | 80 | 12.86 | 40 | 40 | 43 | 37 |
Comparing Gen3DEval-Bench with existing 3D generation prompt benchmarks.
Per-method results on Appearance, Text Fidelity, and Overall quality.
@inproceedings{maiti2025gen3DEval,
  author    = {Shalini Maiti and Lourdes Agapito and Filippos Kokkinos},
  title     = {Gen3DEval: Using vLLMs for automatic evaluation of generated 3D objects},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025}
}