Rapid advancements in text-to-3D generation require robust and scalable evaluation metrics that align closely with human judgment, a need unmet by current metrics such as PSNR and CLIP, which either require ground-truth data or focus only on prompt fidelity. To address this, we introduce Gen3DEval, a novel evaluation framework that leverages vision large language models (vLLMs) specifically fine-tuned for 3D object quality assessment. Gen3DEval evaluates text fidelity, appearance, and surface quality, analyzing 3D surface normals without requiring ground-truth comparisons, and thereby bridges the gap between automated metrics and user preferences. Compared to state-of-the-art task-agnostic models, Gen3DEval demonstrates superior performance in user-aligned evaluations, establishing it as a comprehensive and accessible benchmark for future research on text-to-3D generation.
In stage 1, we train a vLLM to choose which of two objects is better in terms of appearance, surface quality, or text fidelity. This stage has two parts: in pre-training, we train the vision-to-language projector using image-summary VQA; in supervised fine-tuning (SFT), we use comparison data to train for instruction following and preference evaluation. In stage 2, we compute a ranking metric over a set of methods by applying the trained vLLM from stage 1 pairwise on Gen3DEval-Bench prompts.
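As a minimal sketch of how such a pairwise ranking could be aggregated, assume the stage-1 vLLM is exposed through a hypothetical `compare(prompt, method_a, method_b, axis)` call that returns which of the two generated objects is preferred. The win-rate aggregation below is an illustrative assumption, not necessarily the exact metric used by Gen3DEval:

```python
from itertools import combinations
from collections import defaultdict

def rank_methods(methods, prompts, compare, axis="appearance"):
    """Rank text-to-3D methods by pairwise win rate over a prompt set.

    `compare(prompt, method_a, method_b, axis)` is a hypothetical wrapper
    around the fine-tuned vLLM that returns "a" or "b" for the preferred
    object on the given evaluation axis.
    """
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for prompt in prompts:
        for m_a, m_b in combinations(methods, 2):
            winner = compare(prompt, m_a, m_b, axis)
            wins[m_a if winner == "a" else m_b] += 1
            comparisons[m_a] += 1
            comparisons[m_b] += 1
    # Win rate = fraction of pairwise comparisons a method wins.
    return sorted(methods, key=lambda m: wins[m] / comparisons[m], reverse=True)
```

Any aggregation over pairwise preferences (win rate, Elo, Bradley-Terry) could be substituted here; the key point is that the ranking is derived purely from the vLLM's pairwise judgments.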
We use single- and multi-view RGB and surface-normal renderings of 3D objects generated from prompts. We then perturb these objects to simulate common appearance, surface, and text-related artefacts produced by generative 3D methods.
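To make the kind of perturbations concrete, the sketch below applies simple synthetic corruptions to rendered RGB images and normal maps (Gaussian blur as an appearance artefact, per-pixel noise on normals as a surface artefact). The specific corruption types and strengths are illustrative assumptions rather than the exact augmentations used to build our training data:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def perturb_appearance(rgb, sigma=2.0):
    """Simulate a blurry, washed-out appearance artefact via Gaussian blur.
    `rgb` is an HxWx3 float array in [0, 1]; `sigma` is an illustrative strength."""
    return np.clip(gaussian_filter(rgb, sigma=(sigma, sigma, 0)), 0.0, 1.0)

def perturb_surface(normals, noise_std=0.15, seed=0):
    """Simulate a bumpy, noisy surface artefact by jittering unit normals
    and re-normalising. `normals` is an HxWx3 array of unit vectors."""
    rng = np.random.default_rng(seed)
    noisy = normals + rng.normal(0.0, noise_std, size=normals.shape)
    return noisy / np.linalg.norm(noisy, axis=-1, keepdims=True)
```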
We create Gen3DEval-Bench, a diverse set of 80 prompts covering a range of objects, textures, and levels of composition. We keep the benchmark small because text-to-3D generation is time- and compute-intensive, so that it remains easily accessible. The prompts are split evenly between 40 animate (humanoids, animals) and 40 inanimate objects, and into 43 single-object and 37 composite prompts, i.e., prompts combining multiple objects. The average prompt length is 12.86 words.
Benchmark | Prompts | Avg. Words per Prompt | Animate Objects | Inanimate Objects | Single-Object Prompts | Multi-Object Prompts |
---|---|---|---|---|---|---|
T3Bench | 300 | 7.98 | 36 | 264 | 100 | 200 |
ChatGPTEval3D | 110 | 11.49 | 18 | 92 | 65 | 45 |
DreamFusion | 404 | 6.98 | 211 | 192 | 154 | 250 |
Gen3DEval-Bench | 80 | 12.86 | 40 | 40 | 43 | 37 |
Comparing Gen3DEval-Bench with existing 3D generation prompt benchmarks.
Per-method results on Appearance, Text Fidelity, and Overall quality.
@inproceedings{maiti2025gen3DEval,
  author    = {Shalini Maiti and Lourdes Agapito and Filippos Kokkinos},
  title     = {Gen3DEval: Using vLLMs for automatic evaluation of generated 3D objects},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025}
}