Pose-free feed-forward 3D Gaussian Splatting (3DGS) has opened a new frontier for rapid 3D modeling, enabling high-quality Gaussian representations to be generated from uncalibrated multi-view images in a single forward pass. The dominant approach in this space adopts unified monolithic architectures, often built on geometry-centric 3D foundation models, to jointly estimate camera poses and synthesize 3DGS representations within a single network. While architecturally streamlined, such ``all-in-one'' designs may be suboptimal for high-fidelity 3DGS generation, as they entangle geometric reasoning and appearance modeling within a shared representation. In this work, we introduce 2Xplat, a pose-free feed-forward 3DGS framework based on a two-expert design that explicitly separates geometry estimation from Gaussian generation. A dedicated geometry expert first predicts camera poses, which are then explicitly passed to a powerful appearance expert that synthesizes 3D Gaussians. Despite its conceptual simplicity, this approach, largely underexplored in prior work, proves highly effective. In fewer than 5K training iterations, the proposed two-expert pipeline substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods. These results challenge the prevailing unified paradigm and suggest the potential advantages of modular design principles for complex 3D geometric estimation and appearance synthesis tasks.
2Xplat follows a two-expert design that explicitly decouples geometry estimation from appearance modeling for pose-free feed-forward 3D Gaussian Splatting. A dedicated geometry expert first predicts camera poses from uncalibrated input views, and those estimated poses are then passed to an appearance expert that synthesizes the final 3D Gaussian representation. This decomposition allows each component to focus on a more specialized role, leading to stronger reconstruction quality and more robust feed-forward 3DGS generation.
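The two-stage flow described above can be sketched as a minimal stand-in pipeline. This is an illustrative assumption only: the function names, tensor shapes, and Gaussian parameterization below are hypothetical placeholders, not the released implementation, which uses learned networks for both experts.

```python
import numpy as np

# Hypothetical sketch of a two-expert pipeline (illustrative only,
# not the authors' implementation).

def geometry_expert(images: np.ndarray) -> np.ndarray:
    """Stage 1: predict one camera pose (4x4 extrinsic) per input view.
    Stand-in: returns identity poses; a real expert would regress them."""
    n_views = images.shape[0]
    return np.stack([np.eye(4) for _ in range(n_views)])

def appearance_expert(images: np.ndarray, poses: np.ndarray) -> dict:
    """Stage 2: synthesize 3D Gaussians conditioned on the estimated poses.
    Stand-in: one Gaussian per pixel with placeholder parameters."""
    n_views, h, w, _ = images.shape
    n = n_views * h * w
    return {
        "means": np.zeros((n, 3)),       # 3D centers
        "scales": np.ones((n, 3)),       # per-axis scales
        "rotations": np.zeros((n, 4)),   # quaternions
        "opacities": np.ones((n, 1)),
        "colors": images.reshape(n, 3),  # initialized from pixel colors
    }

def twoxplat_forward(images: np.ndarray) -> dict:
    """Single feed-forward pass: geometry first, then appearance."""
    poses = geometry_expert(images)               # pose estimation
    return appearance_expert(images, poses)       # 3DGS synthesis

# Example: 12 uncalibrated context views at 224x224
views = np.random.rand(12, 224, 224, 3)
splats = twoxplat_forward(views)
```

The key design point the sketch captures is the explicit hand-off: the appearance expert receives the estimated poses as an input rather than sharing a representation with the geometry branch.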
2Xplat consistently achieves higher PSNR than the previous state-of-the-art method across different view settings and backbone architectures while maintaining comparable or significantly faster inference speed, demonstrating its robustness and efficiency. The results shown in the plot below are measured on DL3DV at a resolution of 224×224, with inference speed evaluated on a single NVIDIA RTX 3090 GPU.
As shown in the graph below, across all view settings (32, 64, and 128 views) on DL3DV at a resolution of 960×540, our "pose-free" method, 2Xplat, achieves performance that is consistently comparable to "pose-dependent" approaches, with only a small gap to the best-performing baseline. This demonstrates that our model can effectively recover high-quality rendering without relying on explicit camera pose supervision.
Comparison between 2Xplat and YoNoSplat on the DL3DV dataset using 12 context views with a resolution of 224×224.
@article{jeong2026twoxplat,
title={Two Experts Are Better Than One Generalist: Decoupling Geometry and Appearance for Feed-Forward 3D Gaussian Splatting},
author={Hwasik Jeong and Seungryong Lee and Gyeongjin Kang and Seungkwon Yang and Xiangyu Sun and Seungtae Nam and Eunbyung Park},
journal={arXiv preprint arXiv:2603.21064},
year={2026}
}