ISPRS-Annals

ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci.

2194-9050

Copernicus Publications

Göttingen, Germany

10.5194/isprs-annals-XI-2-2026-269-2026

BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model

Yuci Han ¹

Charles Toth ² (https://orcid.org/0000-0001-9461-4887)

Alper Yilmaz ²

John E. Anderson ³

William J. Shuart

¹ Dept. of Electrical and Computer Engineering, The Ohio State University, USA

² Dept. of Civil, Environmental and Geodetic Engineering, The Ohio State University, USA

³ Geospatial Research Laboratory (GRL), Engineer Research and Development Center (ERDC), US Army Corps of Engineers (USACE), USA

03 07 2026

Volume XI-2-2026, pages 269-276

2026

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/

This article is available from https://isprs-annals.copernicus.org/articles/XI-2-2026/269/2026/isprs-annals-XI-2-2026-269-2026.html

The full text article is available as a PDF file from https://isprs-annals.copernicus.org/articles/XI-2-2026/269/2026/isprs-annals-XI-2-2026-269-2026.pdf

We present BetterScene, an approach that enhances novel view synthesis (NVS) quality for diverse real-world scenes from extremely sparse, unconstrained photos. BetterScene uses the production-ready Stable Video Diffusion (SVD) model, pretrained on billions of frames, as a strong backbone to mitigate artifacts and recover view-consistent details at inference time. Prior work has proposed similar diffusion-based solutions to these NVS challenges. Despite significant improvements, these methods typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping the other components frozen, which still leads to inconsistent details and artifacts even when geometry-aware regularizations such as depth or semantic conditions are incorporated. To address this, we investigate the latent space of the diffusion model and introduce two components: (1) temporal equivariance regularization and (2) vision foundation model-aligned representation, both applied to the variational autoencoder (VAE) module within the SVD pipeline. BetterScene integrates a feed-forward 3D Gaussian Splatting (3DGS) model to render features as inputs for the SVD enhancer, and it generates continuous, artifact-free, consistent novel views. We evaluate on the challenging DL3DV-10K dataset and demonstrate significant visual quality improvements over previous state-of-the-art diffusion-based methods on NVS tasks.
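
The abstract names the two VAE-side components but does not give their form. As a rough illustration only, the sketch below shows what an equivariance penalty and a feature-alignment penalty can look like in general. It assumes temporal reversal as the transform and cosine similarity as the alignment measure; neither choice is confirmed by the text, and every function name here is hypothetical. The toy encoders stand in for a real VAE encoder.

```python
import math

def encode_framewise(video):
    """Toy per-frame encoder (assumption): frame -> [mean, range]."""
    return [[sum(f) / len(f), max(f) - min(f)] for f in video]

def encode_causal(video):
    """Toy encoder that averages each latent with its predecessor.
    It depends on frame order, so it is not reversal-equivariant."""
    lat = encode_framewise(video)
    out = []
    for t, z in enumerate(lat):
        prev = lat[t - 1] if t > 0 else z
        out.append([0.5 * (a + b) for a, b in zip(z, prev)])
    return out

def reverse_time(seq):
    """Temporal transform used in this sketch: reverse the frame order."""
    return seq[::-1]

def equivariance_loss(encoder, video, transform):
    """Mean squared gap between encode(transform(x)) and transform(encode(x))."""
    a = encoder(transform(video))
    b = transform(encoder(video))
    sq = [(u - v) ** 2 for za, zb in zip(a, b) for u, v in zip(za, zb)]
    return sum(sq) / len(sq)

def alignment_loss(latents, reference):
    """1 - mean cosine similarity between latents and reference features
    (the reference stands in for features from a vision foundation model)."""
    total = 0.0
    for z, r in zip(latents, reference):
        dot = sum(u * v for u, v in zip(z, r))
        norm = math.sqrt(sum(u * u for u in z)) * math.sqrt(sum(v * v for v in r))
        total += dot / norm if norm > 0 else 0.0
    return 1.0 - total / len(latents)

# Three tiny "frames" of two pixels each (made-up values).
video = [[0.0, 1.0], [2.0, 4.0], [5.0, 9.0]]
loss_framewise = equivariance_loss(encode_framewise, video, reverse_time)
loss_causal = equivariance_loss(encode_causal, video, reverse_time)
loss_align = alignment_loss(encode_framewise(video), encode_framewise(video))
```

In this toy setting the per-frame encoder incurs no equivariance penalty because it commutes with frame reordering, while the order-dependent encoder does. In a training setup, penalties of this kind would typically be added to the VAE reconstruction objective with weighting factors; how BetterScene weights or defines them is not stated in the abstract.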