The TSAIL team, led by Professor Zhu Jun of the Department of Computer Science at Tsinghua University, recently published the paper "ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation". The paper introduces ProlificDreamer, a method that generates high-fidelity 3D content from text alone.

ProlificDreamer is a significant advance in text-to-3D generation. Given the prompt "a pineapple", it produces a realistic, high-resolution 3D pineapple, as in the example below.

A harder prompt, such as "Michelangelo style statue of dog reading news on a cellphone", is not a problem either.

In the fields of digital creation and virtual reality, Text-to-3D technology has important value and wide application potential. This technology can generate concrete 3D models from simple text descriptions, providing powerful tools for designers, game developers and digital artists.

However, in order to generate accurate 3D models from text, traditional methods require large datasets of labeled 3D models. These datasets need to contain many different types and styles of 3D models, and each model needs to be associated with a corresponding textual description. Creating such datasets requires a lot of time and human resources, and no large-scale datasets are currently available.

DreamFusion, proposed by Google, was the first to achieve open-domain text-to-3D synthesis without any 3D data, by using a pretrained 2D text-to-image diffusion model. However, results produced by its Score Distillation Sampling (SDS) algorithm suffer from serious problems such as oversaturation, over-smoothing, and a lack of detail. High-quality 3D content generation remains a difficult frontier problem.

The ProlificDreamer paper proposes Variational Score Distillation (VSD), which reformulates text-to-3D generation from the perspective of Bayesian modeling and variational inference. Specifically, VSD treats the 3D parameters as a random variable with a probability distribution and minimizes the distance between the distribution of its rendered 2D images and the distribution defined by a pretrained 2D diffusion model. The paper shows that updating the 3D parameters under VSD approximates sampling from this 3D distribution, which addresses the oversaturation, over-smoothing, and lack of diversity seen with DreamFusion's SDS. In addition, SDS typically needs a very large classifier-free guidance (CFG) weight of 100, while VSD is reported to be the first such algorithm that works with a normal CFG weight of 7.5.
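For reference, the SDS update that VSD revises is commonly written as the gradient below. This is a sketch in the notation used by the update steps later in this article, not a quotation from either paper: \(g(\theta, c)\) is the differentiable renderer for 3D parameters \(\theta\) and camera pose \(c\), \(x_t\) is the rendered image noised to diffusion time \(t\) with Gaussian noise \(\epsilon\), \(\epsilon_{\mathrm{pretrain}}\) is the pretrained model's noise prediction for the prompt \(y\), and \(\omega(t)\) is a time-dependent weight.

\[
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta) = \mathbb E_{t, \epsilon, c} \left[ \omega(t) \left( \epsilon_{\mathrm{pretrain}}(x_t, t, y) - \epsilon \right) \frac{\partial g(\theta, c)}{\partial \theta} \right]
\]

VSD keeps the same form but replaces the sampled noise \(\epsilon\) with a learned prediction \(\epsilon_\phi(x_t, t, c, y)\) that is fitted to the current renderings.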

Unlike previous methods, ProlificDreamer does not simply optimize a single 3D object, but optimizes the probability distribution corresponding to the 3D object. In general, given a valid text input, there exists a probability distribution covering all possible 3D objects described by the text.

The VSD algorithm is illustrated in the flow chart below. Each iterative update of the 3D object uses two models: a pretrained 2D diffusion model (such as Stable Diffusion) and a LoRA (low-rank adaptation) of that pretrained model. The LoRA estimates the score function of the distribution of 2D images rendered from the current 3D object, and this estimate is then used to update the 3D object. The algorithm in effect simulates a Wasserstein gradient flow, and the distribution it converges to is guaranteed to minimize the KL divergence to the pretrained 2D diffusion model.

First, randomly sample the 3D parameters \(\theta\) of the network from the current distribution, together with a camera pose \(c\).

Then use a differentiable renderer to render the corresponding 2D image \(x_0 = g(\theta, c)\).

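Then update the parameters \(\theta\) of the 3D representation as \(\theta - \eta_1 \mathbb E_{t, \epsilon, c} \left[ \omega(t) \left( \epsilon_{\mathrm{pretrain}}(x_t, t, y) - \epsilon_\phi (x_t, t, c, y) \right) \frac{\partial g(\theta, c)}{\partial \theta} \right]\), where \(x_t\) is the rendered image \(x_0\) noised to diffusion time \(t\) with Gaussian noise \(\epsilon\).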

Finally, update the LoRA parameters \(\phi\), which implicitly encode the current 3D distribution, as \(\phi - \eta_2 \nabla_\phi \mathbb E_{t, \epsilon} \|\epsilon_\phi(x_t, t, c, y) - \epsilon\|_2^2\), so that \(\epsilon_\phi\) more closely matches the distribution \(q_t^{\mu_\tau}\) of noised renderings. Repeat these steps until convergence.
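The four steps above can be sketched on a toy problem. The example below is a minimal illustration under strong assumptions, not the authors' implementation: the "scene" is a single number, the renderer \(g\) is the identity (so there is no camera pose and \(\partial g / \partial \theta = 1\)), there is one fixed noise level with \(\omega(t) = 1\), the pretrained model is replaced by the exact noise predictor of a Gaussian, and LoRA is replaced by a two-parameter linear model. All names (`run_vsd`, `eps_pretrain`, `render`, `mean_and_std`) and constants are hypothetical.

```python
import math
import random

# Toy constants (illustrative assumptions, not values from the paper).
MU, S = 2.0, 0.5                      # stand-in "pretrained" distribution N(MU, S^2)
ALPHA = 0.8                           # one fixed noise level: x_t = ALPHA*x_0 + SIGMA*eps
SIGMA = math.sqrt(1.0 - ALPHA ** 2)


def eps_pretrain(x_t):
    """Exact noise prediction for data drawn from N(MU, S^2) at this noise level."""
    var_t = ALPHA ** 2 * S ** 2 + SIGMA ** 2
    return SIGMA * (x_t - ALPHA * MU) / var_t


def render(theta):
    """Stand-in for the renderer g(theta, c): the identity, so dg/dtheta = 1."""
    return theta


def run_vsd(n_particles=64, steps=3000, eta1=0.01, eta2=0.05, seed=0):
    rng = random.Random(seed)
    # Particles approximating the distribution over 3D parameters.
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    # phi = (a, b): linear stand-in for LoRA, eps_phi(x_t) = a * x_t + b.
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for i in range(n_particles):
            theta = thetas[i]
            # Steps 1-2: take a particle, render it, and add noise.
            eps = rng.gauss(0.0, 1.0)
            x_t = ALPHA * render(theta) + SIGMA * eps
            eps_phi = a * x_t + b
            # Step 3: move theta along (eps_pretrain - eps_phi) * dg/dtheta.
            thetas[i] = theta - eta1 * (eps_pretrain(x_t) - eps_phi)
            # Accumulate the gradient of the denoising loss (eps_phi - eps)^2.
            resid = eps_phi - eps
            grad_a += 2.0 * resid * x_t / n_particles
            grad_b += 2.0 * resid / n_particles
        # Step 4: gradient step on phi.
        a -= eta2 * grad_a
        b -= eta2 * grad_b
    return thetas


def mean_and_std(values):
    m = sum(values) / len(values)
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
```

Calling `mean_and_std(run_vsd())` gives the mean and spread of the final particles. Under these assumptions the mean should drift toward `MU` while the spread stays well above zero, which mirrors the claim that VSD samples from a distribution instead of collapsing onto a single mode.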

According to the paper, ProlificDreamer can generate “meticulously detailed and photo-realistic 3D textured meshes”, “high rendering resolution (i.e., 512 × 512) and high-fidelity NeRF with rich structures and complex effects”, and “diverse and semantically correct 3D scenes given the same text”.

The paper compares ProlificDreamer's results with several baselines; see the links below for details.

Paper: https://arxiv.org/abs/2305.16213

Project: https://ml.cs.tsinghua.edu.cn/prolificdreamer/
