Publications
The symbol (†) denotes student co-authors.
Beyond Pixel Histories: World Models with Persistent 3D State
Samuel Garcin, Thomas Walker, Steven McDonagh, Tim Pearce, Hakan Bilen, Tianyu He, Kaixin Wang, Jiang Bian
arXiv preprint arXiv:2603.03482 2026
A new world-model paradigm that simulates the evolution of a latent 3D scene comprising environment, camera, and renderer.
Luminark: Training-free, Probabilistically-Certified Watermarking for General Vision Generative Models
Jiayi Xu†, Zhang Zhang†, Yuanrui Zhang†, Ruitao Chen†, Yixian Xu†, Tianyu He, Di He
arXiv preprint arXiv:2601.01085 2026
A training-free and probabilistically-certified watermarking method for general vision generative models.
Quotient-Space Diffusion Models
Yixian Xu†, Yusong Wang†, Shengjie Luo†, Kaiyuan Gao†, Tianyu He, Di He, Chang Liu
International Conference on Learning Representations (ICLR) Oral 2026
A formal framework for diffusion modeling on a general quotient space.
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Haoyu Wu†, Diankun Wu†, Tianyu He, Junliang Guo, Yang Ye†, Yueqi Duan, Jiang Bian
International Conference on Learning Representations (ICLR) 2026
Combining video diffusion with 3D representation for geometrically consistent world modeling.
Fast Autoregressive Video Generation with Diagonal Decoding
Yang Ye†, Junliang Guo, Haoyu Wu†, Tianyu He, Tim Pearce, Tabish Rashid, Katja Hofmann, Jiang Bian
Findings of the Computer Vision and Pattern Recognition Conference (CVPR) 2026
Accelerating autoregressive video generation with a diagonal decoding strategy.
Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft
Junchao Huang†, Xinting Hu, Boyao Han†, Shaoshuai Shi, Zhuotao Tian, Tianyu He, Li Jiang
arXiv preprint arXiv:2510.03198 2025
A learning framework that pairs training protocols with a geometry-indexed spatial memory.
Reinforcement Learning with Inverse Rewards for World Model Post-training
Yang Ye†, Tianyu He, Shuo Yang†, Jiang Bian
arXiv preprint arXiv:2509.23958 2025
A post-training framework that derives verifiable reward signals by recovering input actions from generated videos using an Inverse Dynamics Model.
3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer
Jiajun Deng, Tianyu He, Li Jiang, Tianyu Wang, Feras Dayoub, Ian Reid
Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) 2025
A simple yet powerful 3D LMM designed to act as an intelligent assistant for comprehending, reasoning about, and interacting with the 3D world.
VidTwin: Video VAE with Decoupled Structure and Dynamics
Yuchi Wang†, Junliang Guo, Xinyi Xie†, Tianyu He, Xu Sun, Jiang Bian
Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) 2025
A video VAE approach that decouples structure and dynamics for improved video representation.
Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators
Wentao Zhang†, Junliang Guo, Tianyu He, Li Zhao, Linli Xu, Jiang Bian
International Conference on Learning Representations (ICLR) 2025
Demonstrating that autoregressive transformers can perform zero-shot video imitation via in-context learning.
InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation
Yuchi Wang†, Junliang Guo, Jianhong Bai†, Runyi Yu†, Tianyu He, Xu Tan, Xu Sun, Jiang Bian
Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 2025
Text-guided emotion and motion control for avatar generation.
UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing
Jianhong Bai†, Tianyu He, Yuchi Wang, Junliang Guo, Haoji Hu, Zuozhu Liu, Jiang Bian
Proceedings of the ACM International Conference on Multimedia (ACM MM) 2025
A unified tuning-free framework for both video motion and appearance editing.
IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI
Xiaoyu Chen†, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, Jiang Bian
arXiv preprint arXiv:2411.00785 2024
Proposing latent action models as atomic control units for embodied AI foundation models.
End-to-End Rate-Distortion Optimized 3D Gaussian Representation
Henan Wang†, Hanxin Zhu†, Tianyu He, Runsen Feng†, Jiajun Deng, Jiang Bian, Zhibo Chen
The European Conference on Computer Vision (ECCV) 2024
Formulating the compact 3D Gaussian learning as an end-to-end Rate-Distortion Optimization (RDO) problem.
Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis?
Hanxin Zhu†, Tianyu He, Xin Li, Bingchen Li†, Zhibo Chen
Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) 2024
Investigating the sufficiency of vanilla MLP architectures in NeRF for few-shot view synthesis.
GAIA: Zero-Shot Talking Avatar Generation
Tianyu He, Junliang Guo, Runyi Yu†, Yuchi Wang†, Jialiang Zhu†, Kaikai An†, Leyi Li†, Xu Tan, Chunyu Wang, Han Hu, HsiangTao Wu, Sheng Zhao, Jiang Bian
International Conference on Learning Representations (ICLR) 2024
Zero-shot talking avatar generation without subject-specific fine-tuning.
DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder
Chenpeng Du†, Qi Chen†, Tianyu He, Xu Tan, Xie Chen, Kai Yu, Sheng Zhao, Jiang Bian
Proceedings of the ACM International Conference on Multimedia (ACM MM) 2023
High-fidelity speech-driven talking face generation using a diffusion autoencoder.
HiFace: High-Fidelity 3D Face Reconstruction by Learning Static and Dynamic Details
Zenghao Chai†, Tianke Zhang†, Tianyu He, Xu Tan, Tadas Baltrušaitis, HsiangTao Wu, Runnan Li, Sheng Zhao, Chun Yuan, Jiang Bian
Proceedings of the International Conference on Computer Vision (ICCV) 2023
High-fidelity 3D face reconstruction with both static and dynamic facial details.