Divide-and-Conquer: The Dual-Hierarchical Optimization for Semantic 4D Gaussian Splatting

Zhiying Yan1,2,*, Yiyuan Liang1,2,*, Shilv Cai3, Tao Zhang1,2,
Sheng Zhong1,2, Luxin Yan1,2, Xu Zou1,2,†
1Huazhong University of Science and Technology
2National Key Laboratory of Multispectral Information Intelligent Processing Technology
3Nanyang Technological University, Singapore

*Indicates Equal Contribution, †Indicates Corresponding Author

ICME 2025


Our method is dedicated to high-quality novel view rendering and accurate semantic understanding of dynamic scenes, while supporting downstream tasks in 4D scenarios.

Abstract

Semantic 4D Gaussians can be used to reconstruct and understand dynamic scenes captured from a monocular camera, handling objects with temporal variations better than static-scene representations can. However, most recent works focus on the semantics of static scenes, and applying them directly to dynamic scenes is impractical: they fail to capture the temporal behaviors and features of dynamic objects. To the best of our knowledge, few existing 3DGS-based methods address the semantic comprehension of dynamic scenes. While these methods demonstrate promising capabilities in simple scenes, they struggle to achieve high-fidelity rendering and accurate semantic features in scenarios where the static background contains significant noise and the dynamic foreground exhibits substantial deformation with intricate textures. The root cause is that the same update strategy is employed for both dynamic and static parts, regardless of the distinction and interaction between their corresponding Gaussians, which leads to artifacts and noise, especially at the boundaries between the semantic masks of the dynamic foreground and the static background. To address these limitations, we propose Dual-Hierarchical Optimization (DHO), which consists of Hierarchical Gaussian Flow and Hierarchical Gaussian Guidance in a divide-and-conquer manner. The former effectively divides static and dynamic rendering and their features. The latter conquers the rendering distortion of the dynamic foreground in scenes whose static background contains complex noise (e.g., the "Broom" scene in the HyperNeRF dataset). Extensive experiments show that our method consistently outperforms baselines on both synthetic and real-world datasets.


Method Overview

The overall pipeline of our model. We add semantic properties to each Gaussian and obtain the geometric deformation of each Gaussian at every timestamp t through the deformation field. In the coarse stage, Gaussians are subjected to geometric constraints; in the fine stage, the geometric constraints are relaxed and semantic feature constraints are introduced, ensuring foreground-background separation. We utilize dynamic foreground masks obtained from scene priors for hierarchical Gaussian guidance of the scene, enhancing the rendering quality of dynamic foregrounds against complex backgrounds.
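The coarse-to-fine schedule can be summarized as a staged loss. The sketch below is a minimal PyTorch illustration, not the released implementation: the function name dho_loss, the specific loss terms, and the weights lambda_sem and lambda_fg are our assumptions.

```python
import torch.nn.functional as F

def dho_loss(pred_img, gt_img, pred_feat, teacher_feat, fg_mask,
             stage="coarse", lambda_sem=0.2, lambda_fg=1.0):
    """Staged loss: geometry-only in the coarse stage, semantics plus
    mask guidance in the fine stage (illustrative sketch).

    pred_img, gt_img : (3, H, W) rendered / ground-truth images
    pred_feat        : (C, H, W) rendered semantic feature map
    teacher_feat     : (C, H, W) 2D teacher features for supervision
    fg_mask          : (1, H, W) dynamic-foreground mask from scene priors
    """
    photometric = F.l1_loss(pred_img, gt_img)
    if stage == "coarse":
        # Coarse stage: Gaussians are driven by geometric
        # (photometric) constraints only.
        return photometric
    # Fine stage: relax geometry and add semantic feature constraints,
    # encouraging foreground-background separation ("divide").
    semantic = F.l1_loss(pred_feat, teacher_feat)
    # Hierarchical Gaussian guidance: up-weight the residual inside the
    # dynamic-foreground mask so the foreground stays sharp against a
    # noisy background ("conquer").
    fg_term = (fg_mask * (pred_img - gt_img).abs()).mean()
    return photometric + lambda_sem * semantic + lambda_fg * fg_term
```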


Visual Results

The following results show rendered novel views and the extracted semantic feature maps produced by our method, evaluated on both the real-world HyperNeRF dataset and the synthetic D-NeRF dataset. The feature maps are visualized with PCA for dimensionality reduction; a minimal recipe for this visualization is sketched after the galleries below.

Split-Cookie · ChickChicken · Americano · Torchocolate

Jumpingjacks · Standup · Trex · Hook
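For reference, the standard PCA recipe for displaying a high-dimensional feature map as an RGB image looks like the following NumPy sketch; the exact normalization used for the figures above may differ.

```python
import numpy as np

def pca_visualize(feat_map):
    """Project a (C, H, W) feature map onto its top-3 principal
    components and normalize to [0, 1] for display as RGB."""
    C, H, W = feat_map.shape
    X = feat_map.reshape(C, -1).T              # (H*W, C): pixels as samples
    X = X - X.mean(axis=0, keepdims=True)      # center each feature channel
    # Top principal directions via SVD of the centered data matrix.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    rgb = X @ Vt[:3].T                         # project onto top-3 components
    # Min-max normalize each channel into [0, 1] for display.
    rgb = (rgb - rgb.min(axis=0)) / (rgb.max(axis=0) - rgb.min(axis=0) + 1e-8)
    return rgb.reshape(H, W, 3)
```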


Segmentation on Synthetic Dataset

Our method achieves excellent semantic segmentation performance not only on real-world datasets but also on synthetic datasets.
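One common way to obtain segmentation masks from a rendered semantic feature map is per-pixel matching against a set of class embeddings. The sketch below assumes this formulation; the source of the class embeddings and the cosine-similarity decision rule are our assumptions, not necessarily the exact procedure used here.

```python
import torch.nn.functional as F

def segment_from_features(feat_map, class_embeds):
    """Assign each pixel the class whose embedding it matches best.

    feat_map     : (C, H, W) rendered semantic feature map
    class_embeds : (K, C) one embedding per class
    returns      : (H, W) integer label map
    """
    C, H, W = feat_map.shape
    pixels = F.normalize(feat_map.reshape(C, -1).T, dim=1)  # (H*W, C)
    protos = F.normalize(class_embeds, dim=1)               # (K, C)
    sim = pixels @ protos.T                                 # cosine similarity
    return sim.argmax(dim=1).reshape(H, W)
```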


Comparison with Baseline

Our method outperforms the baseline in terms of rendering quality, semantic feature completeness, and semantic segmentation accuracy (our method on the left, baseline on the right).


Multi-Scale Semantic Feature and Segmentation

Visualization results of multi-scale dynamic semantic segmentation.

Multi-Scale "ChickChicken" · Multi-Scale "Broom"

Semantic Editing

Visual illustration of our method’s ability to semantically edit objects; a sketch of one way to implement such edits follows the examples below.

Remove "Cookie" Turn "Cookie" into "Pizza"