Transformer-based 3D pose estimation pipeline with modular SmoothNet integration for animation generation

Traditional 3D human pose estimation and animation generation methods remain computationally expensive and temporally unstable. To address both issues, this study develops a Transformer-based 3D pose estimation pipeline for spatial prediction and integrates SmoothNet as an external, modular post-processing component for temporal smoothing. The backbone extracts image features with ResNet, encodes global dependencies with Transformer blocks, and outputs 3D keypoints for downstream reconstruction. SmoothNet is then applied to the predicted pose sequences to reduce frame-level jitter. We evaluate the full pipeline on H3WB, AGORA, and AthletePose3D using MPJPE, PA-MPJPE, MPVE, and acceleration error. In addition, we introduce a fairness-controlled protocol on H3WB that compares MHFormer and the proposed backbone under identical SmoothNet post-processing settings, which separates backbone effects from smoothing effects. Animation quality is further validated by deploying the results in Blender and Unreal Engine (UE).