2Xplat: Two Experts Are Better Than One Generalist


Abstract

Pose-free feed-forward 3D Gaussian Splatting (3DGS) has opened a new frontier for rapid 3D modeling, enabling high-quality Gaussian representations to be generated from uncalibrated multi-view images in a single forward pass. The dominant approach in this space adopts unified monolithic architectures, often built on geometry-centric 3D foundation models, to jointly estimate camera poses and synthesize 3DGS representations within a single network. While architecturally streamlined, such "all-in-one" designs may be suboptimal for high-fidelity 3DGS generation, as they entangle geometric reasoning and appearance modeling within a shared representation. In this work, we introduce 2Xplat, a pose-free feed-forward 3DGS framework based on a two-expert design that explicitly separates geometry estimation from Gaussian generation. A dedicated geometry expert first predicts camera poses, which are then explicitly passed to a powerful appearance expert that synthesizes 3D Gaussians. Despite its conceptual simplicity, this approach, largely underexplored in prior work, proves highly effective. In fewer than 5K training iterations, the proposed two-expert pipeline substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods. These results challenge the prevailing unified paradigm and suggest the potential advantages of modular design principles for complex 3D geometric estimation and appearance synthesis tasks.

Method Overview

2Xplat follows a two-expert design that explicitly decouples geometry estimation from appearance modeling for pose-free feed-forward 3D Gaussian Splatting. A dedicated geometry expert first predicts camera poses from uncalibrated input views, and those estimated poses are then passed to an appearance expert that synthesizes the final 3D Gaussian representation. This decomposition allows each component to focus on a more specialized role, leading to stronger reconstruction quality and more robust feed-forward 3DGS generation.
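The two-stage decomposition described above can be sketched as plain function composition: the geometry expert maps images to poses, and the appearance expert consumes the images together with those explicitly passed poses. The sketch below is a minimal illustration of this interface only; the expert functions, data types, and outputs are hypothetical stand-ins, not the learned networks in 2Xplat.

```python
# Minimal sketch of the two-expert decomposition.
# All interfaces here are hypothetical stand-ins for the learned experts.
from dataclasses import dataclass
from typing import List

@dataclass
class Pose:
    """A camera pose: flattened 3x3 rotation and a translation vector."""
    rotation: List[float]
    translation: List[float]

@dataclass
class Gaussian:
    """One 3D Gaussian primitive (position, scale, opacity, color)."""
    position: List[float]
    scale: List[float]
    opacity: float
    color: List[float]

def geometry_expert(images: List[List[float]]) -> List[Pose]:
    """Stand-in for the geometry expert: predicts one camera pose per view."""
    identity = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
    return [Pose(rotation=identity, translation=[0.0, 0.0, float(i)])
            for i, _ in enumerate(images)]

def appearance_expert(images: List[List[float]],
                      poses: List[Pose]) -> List[Gaussian]:
    """Stand-in for the appearance expert: takes the images *and* the
    explicitly passed poses, and emits a set of 3D Gaussians."""
    return [Gaussian(position=list(p.translation), scale=[1.0, 1.0, 1.0],
                     opacity=1.0, color=[0.5, 0.5, 0.5])
            for p in poses]

def two_xplat_forward(images: List[List[float]]) -> List[Gaussian]:
    poses = geometry_expert(images)          # stage 1: geometry only
    return appearance_expert(images, poses)  # stage 2: appearance, given poses

views = [[0.0], [0.0], [0.0]]  # placeholder "images"
gaussians = two_xplat_forward(views)
print(len(gaussians))  # this toy sketch emits one Gaussian per view -> 3
```

The key design point the sketch highlights is the explicit hand-off: poses are a first-class intermediate output of stage one, rather than an entangled feature inside a single shared network.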

Method overview figure

Quantitative Comparison

2Xplat consistently achieves higher PSNR than the previous state-of-the-art method across different view settings and backbone architectures while maintaining comparable or significantly faster inference speed, demonstrating its robustness and efficiency. The results shown in the plot below are measured on DL3DV at a resolution of 224×224, with inference speed evaluated on a single NVIDIA RTX 3090 GPU.

Quantitative comparison figure
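PSNR here is the standard peak signal-to-noise ratio between a rendered image and the ground-truth view, PSNR = 10 log10(MAX² / MSE), where MAX is the peak pixel value (1.0 for normalized images); higher is better. As a self-contained reminder of the metric (not the paper's evaluation code), it can be computed as:

```python
import math

def psnr(rendered, reference, max_val=1.0):
    """Peak signal-to-noise ratio between two equal-length pixel sequences:
    PSNR = 10 * log10(max_val**2 / MSE). Higher values mean closer images."""
    assert len(rendered) == len(reference) and len(rendered) > 0
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

print(round(psnr([0.5, 0.5, 0.5], [0.5, 0.5, 0.6]), 2))  # -> 24.77
```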

As shown in the graph below, across all view settings (32, 64, and 128 views) on DL3DV at a resolution of 960×540, our pose-free method, 2Xplat, achieves performance consistently comparable to pose-dependent approaches, with only a small gap to the best-performing baseline. This demonstrates that our model can recover high-quality renderings without access to ground-truth camera poses.

Quantitative comparison figure

Qualitative Results

RE10K — 224×224, 6 context views
DL3DV — 224×224, 6 / 12 / 24 context views
DL3DV — 960×540, 32 context views

Qualitative Comparison

Comparison between 2Xplat and YoNoSplat on the DL3DV dataset using 12 context views with a resolution of 224×224.

BibTeX

@article{jeong2026twoxplat,
  title={Two Experts Are Better Than One Generalist: Decoupling Geometry and Appearance for Feed-Forward 3D Gaussian Splatting},
  author={Hwasik Jeong and Seungryong Lee and Gyeongjin Kang and Seungkwon Yang and Xiangyu Sun and Seungtae Nam and Eunbyung Park},
  journal={arXiv preprint arXiv:2603.21064},
  year={2026}
}