CATSplat

Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image

ICCV 2025
¹Korea University         ²Google         ³Purdue University

Qualitative Results

RealEstate10K

ACID

NYU

Abstract

TL;DR: We present CATSplat, a novel generalizable transformer-based 3D scene reconstruction framework designed to break through the inherent constraints in monocular settings.

Recently, generalizable feed-forward methods based on 3D Gaussian Splatting have gained significant attention for their potential to reconstruct 3D scenes using finite resources. These approaches create a 3D radiance field, parameterized by per-pixel 3D Gaussian primitives, from just a few images in a single forward pass. Unlike multi-view methods that benefit from cross-view correspondences, 3D scene reconstruction with a single-view image remains an underexplored area. In this work, we introduce CATSplat, a novel generalizable transformer-based framework designed to break through the inherent constraints in monocular settings. First, we propose leveraging textual guidance from a visual-language model to complement insufficient information from single-view image features. By incorporating scene-specific contextual details from text embeddings through cross-attention, we pave the way for context-aware 3D scene reconstruction beyond relying solely on visual cues. Moreover, we advocate utilizing spatial guidance from 3D point features toward comprehensive geometric understanding under monocular settings. With 3D priors, image features can capture rich structural insights for predicting 3D Gaussians without multi-view techniques. Extensive experiments on large-scale datasets demonstrate the state-of-the-art performance of CATSplat in single-view 3D scene reconstruction with high-quality novel view synthesis.

How does CATSplat work?

Paper pipeline.

CATSplat takes an image \( \mathcal{I} \) and predicts 3D Gaussian primitives \( \{(\mu_j, \alpha_j, \Sigma_j, c_j)\}_{j=1}^{J} \) to construct a scene-representative 3D radiance field in a single forward pass. Our primary goal is to go beyond the finite knowledge inherent in single-view image features by leveraging two innovative priors. Through cross-attention layers, we enhance the image features \( F_i^{\mathcal{I}} \) to be highly informative by incorporating valuable insights: contextual cues from text features \( F_i^C \) and spatial cues from 3D point features \( F_i^S \).
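For intuition, below is a minimal PyTorch-style sketch of the per-pixel Gaussian prediction step. The module name, the ray-based placement of centers, and the scale-plus-rotation parameterization of \( \Sigma_j \) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Decodes enhanced image features into per-pixel 3D Gaussian parameters (sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        # depth (1) + opacity (1) + scale (3) + rotation quaternion (4) + RGB (3)
        self.proj = nn.Linear(dim, 1 + 1 + 3 + 4 + 3)

    def forward(self, feats, rays):
        # feats: (B, H*W, dim) enhanced image features; rays: (B, H*W, 3) unit camera rays.
        depth, alpha, scale, rot, color = self.proj(feats).split([1, 1, 3, 4, 3], dim=-1)
        mu = rays * F.softplus(depth)       # Gaussian centers placed along the pixel rays
        alpha = torch.sigmoid(alpha)        # opacities in (0, 1)
        scale = F.softplus(scale)           # positive per-axis scales (Sigma from scale + rotation)
        rot = F.normalize(rot, dim=-1)      # unit quaternions for the rotation
        color = torch.sigmoid(color)        # RGB colors in (0, 1)
        return mu, alpha, scale, rot, color
```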

Module architecture.

(Left) Detailed architecture of the transformer pipeline.
In the $i$-th layer, we first apply cross-attention between $F_i^{\mathcal{I}}$ and $F_i^C$, and then apply cross-attention with $F_i^S$.
We also use a ratio $\gamma$ to preserve visual information from $F_i^{\mathcal{I}}$ while incorporating the extra cues from $F_i^C$ and $F_i^S$, as sketched below.
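A minimal sketch of one such layer follows, assuming the ratio $\gamma$ linearly blends each cross-attended output back with the image tokens; the module names, normalization placement, and exact blending form are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ContextSpatialLayer(nn.Module):
    """One fusion layer: contextual cross-attention, then spatial cross-attention (sketch)."""
    def __init__(self, dim=256, num_heads=8, gamma=0.5):
        super().__init__()
        self.gamma = gamma
        self.ctx_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spa_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, f_img, f_ctx, f_spa):
        # f_img: (B, HW, dim) image tokens; f_ctx: (B, T, dim) text tokens; f_spa: (B, N, dim) point tokens.
        ctx, _ = self.ctx_attn(self.norm1(f_img), f_ctx, f_ctx)   # image queries attend to text features
        f_img = (1 - self.gamma) * f_img + self.gamma * ctx       # keep visual info, add contextual cues
        spa, _ = self.spa_attn(self.norm2(f_img), f_spa, f_spa)   # image queries attend to 3D point features
        f_img = (1 - self.gamma) * f_img + self.gamma * spa       # keep visual info, add spatial cues
        return f_img
```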

(Right) Detailed architecture of the 3D point feature extraction from a monocular input image $\mathcal{I}$.
Our point cloud encoder takes back-projected points $P$ and produces point features $F^S$ based on the PointNet structure. Here, T-Net denotes an affine transformation network.
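The sketch below illustrates this spatial branch, assuming the points are back-projected from an estimated depth map with camera intrinsics $K$; the depth source, the layer sizes, and the simplified T-Net are illustrative assumptions rather than the actual encoder.

```python
import torch
import torch.nn as nn

def backproject(depth, K):
    # depth: (B, H, W), K: (B, 3, 3) -> points P: (B, H*W, 3) in camera space.
    B, H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()   # homogeneous pixel coords (H, W, 3)
    pix = pix.reshape(1, -1, 3).expand(B, -1, -1)                   # (B, HW, 3)
    rays = torch.einsum("bij,bnj->bni", torch.inverse(K), pix)      # unproject through K^{-1}
    return rays * depth.reshape(B, -1, 1)

class PointEncoder(nn.Module):
    """PointNet-style encoder producing per-point features F^S (sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        self.tnet = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 9))
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, pts):
        # Predict a per-cloud 3x3 affine transform (T-Net) and align the points before encoding.
        T = self.tnet(pts).mean(dim=1).reshape(-1, 3, 3) + torch.eye(3)
        aligned = torch.einsum("bij,bnj->bni", T, pts)
        return self.mlp(aligned)                                    # (B, N, dim) point features
```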

Quantitative Results

Comparisons of Novel View Synthesis (NVS) performance with state-of-the-art single-view 3D reconstruction approaches on the RE10K dataset. Following the standard protocol, we evaluate NVS metrics on unseen target frames located $n$ frames away from the input source frame. In addition, we randomly sample an extra target frame within 30 frames of the source frame.
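As a concrete reading of this protocol, a small sketch of the target-frame sampling is given below; the helper name and the clamping to the clip length are illustrative assumptions.

```python
import random

def sample_target_frames(source_idx, n, num_frames, max_gap=30):
    """Pick the fixed target n frames ahead plus one random target within max_gap frames."""
    fixed = min(source_idx + n, num_frames - 1)
    extra = min(source_idx + random.randint(1, max_gap), num_frames - 1)
    return fixed, extra
```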

Comparisons of Novel View Synthesis performance with state-of-the-art few-view 3D reconstruction approaches on the RE10K dataset. Although we mainly focus on comparing with the leading single-view method, Flash3D, we also report scores of two-view methods for additional reference. Following Flash3D, we adopt the interpolation and extrapolation protocols from previous works, pixelSplat and latentSplat, respectively.

Qualitative Results

Qualitative comparisons of NVS performance between Flash3D and ours, alongside the ground truth, on novel-view frames from RE10K and ACID (cross-dataset).

BibTeX

@article{roh2024catsplat,
  title={CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image},
  author={Roh, Wonseok and Jung, Hwanhee and Kim, Jong Wook and Lee, Seunggwan and Yoo, Innfarn and Lugmayr, Andreas and Chi, Seunggeun and Ramani, Karthik and Kim, Sangpil},
  journal={arXiv preprint arXiv:2412.12906},
  year={2024}
}