BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model

Han, Yuci; Toth, Charles; Yilmaz, Alper; Anderson, John E.; Shuart, William J.

doi:10.5194/isprs-annals-XI-2-2026-269-2026

Articles | Volume XI-2-2026

https://doi.org/10.5194/isprs-annals-XI-2-2026-269-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

03 Jul 2026

Yuci Han, Charles Toth, Alper Yilmaz, John E. Anderson, and William J. Shuart

Keywords: 3D Gaussian Splatting, Video Diffusion Model, Novel View Synthesis

Abstract. We present BetterScene, an approach that enhances novel view synthesis (NVS) quality for diverse real-world scenes from extremely sparse, unconstrained photos. BetterScene uses the production-ready Stable Video Diffusion (SVD) model, pretrained on billions of frames, as a strong backbone to mitigate artifacts and recover view-consistent details at inference time. Existing methods have developed similar diffusion-based solutions to these NVS challenges. Despite significant improvements, these methods typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping the other components frozen, which still leads to inconsistent details and artifacts even when geometry-aware regularizations such as depth or semantic conditions are incorporated. To address this, we investigate the latent space of the diffusion model and introduce two components: (1) temporal equivariance regularization and (2) vision foundation model-aligned representation, both applied to the variational autoencoder (VAE) module within the SVD pipeline. BetterScene integrates a feed-forward 3D Gaussian Splatting (3DGS) model to render features as inputs for the SVD enhancer and to generate continuous, artifact-free, consistent novel views. We evaluate on the challenging DL3DV-10K dataset and demonstrate significant visual-quality improvements over previous state-of-the-art diffusion-based methods on NVS tasks.
