Publications

The symbol (†) denotes student co-authors.

Beyond Pixel Histories: World Models with Persistent 3D State

Samuel Garcin, Thomas Walker, Steven McDonagh, Tim Pearce, Hakan Bilen, Tianyu He, Kaixin Wang, Jiang Bian

arXiv preprint arXiv:2603.03482 2026

A new world-model paradigm that simulates the evolution of a latent 3D scene: environment, camera, and renderer.

LIVE: Long-horizon Interactive Video World Modeling

Junchao Huang, Ziyang Ye, Xinting Hu, Tianyu He, Guiyu Zhang, Shaoshuai Shi, Jiang Bian, Li Jiang

arXiv preprint arXiv:2602.03747 2026

A framework that alleviates autoregressive error accumulation via a cycle-consistency constraint.

Luminark: Training-free, Probabilistically-Certified Watermarking for General Vision Generative Models

Jiayi Xu, Zhang Zhang, Yuanrui Zhang, Ruitao Chen, Yixian Xu, Tianyu He, Di He

arXiv preprint arXiv:2601.01085 2026

A training-free and probabilistically-certified watermarking method for general vision generative models.

Quotient-Space Diffusion Models

Yixian Xu, Yusong Wang, Shengjie Luo, Kaiyuan Gao, Tianyu He, Di He, Chang Liu

International Conference on Learning Representations (ICLR) Oral 2026

A formal framework for diffusion modeling on a general quotient space.

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian

International Conference on Learning Representations (ICLR) 2026

Combining video diffusion with 3D representation for geometrically consistent world modeling.

Fast Autoregressive Video Generation with Diagonal Decoding

Yang Ye, Junliang Guo, Haoyu Wu, Tianyu He, Tim Pearce, Tabish Rashid, Katja Hofmann, Jiang Bian

Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) 2026

Accelerating autoregressive video generation with a diagonal decoding strategy.

AR4D: Autoregressive 4D Generation from Monocular Videos

Hanxin Zhu, Tianyu He, Xiqian Yu, Junliang Guo, Zhibo Chen, Jiang Bian

Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) 2026

Autoregressive 4D content generation from monocular video input.

Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft

Junchao Huang, Xinting Hu, Boyao Han, Shaoshuai Shi, Zhuotao Tian, Tianyu He, Li Jiang

arXiv preprint arXiv:2510.03198 2025

A learning framework that pairs training protocols with a geometry-indexed spatial memory.

Reinforcement Learning with Inverse Rewards for World Model Post-training

Yang Ye, Tianyu He, Shuo Yang, Jiang Bian

arXiv preprint arXiv:2509.23958 2025

A post-training framework that derives verifiable reward signals by recovering input actions from generated videos using an Inverse Dynamics Model.

Sonic4D: Spatial Audio Generation for Immersive 4D Scene Exploration

Siyi Xie, Hanxin Zhu, Xinyi Chen, Tianyu He, Xin Li, Zhibo Chen

Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 2025

Generating spatial audio for immersive 4D scene exploration.

Playing with Transformer at 30+ FPS via Next-Frame Diffusion

Xinle Cheng, Tianyu He, Jiayi Xu, Junliang Guo, Di He, Jiang Bian

arXiv preprint arXiv:2506.01380 2025

Achieving autoregressive video generation at 30+ FPS through next-frame diffusion.

MineWorld: A Real-Time and Open-Source Interactive World Model on Minecraft

Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, Jiang Bian

arXiv preprint arXiv:2504.08388 2025

A real-time and open-source interactive world model built on Minecraft.

3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer

Jiajun Deng, Tianyu He, Li Jiang, Tianyu Wang, Feras Dayoub, Ian Reid

Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) 2025

A simple yet powerful 3D LMM designed to act as an intelligent assistant for comprehending, reasoning about, and interacting with the 3D world.

VidTwin: Video VAE with Decoupled Structure and Dynamics

Yuchi Wang, Junliang Guo, Xinyi Xie, Tianyu He, Xu Sun, Jiang Bian

Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) 2025

A video VAE approach that decouples structure and dynamics for improved video representation.

Video In-Context Learning: Autoregressive Transformers are Zero-Shot Video Imitators

Wentao Zhang, Junliang Guo, Tianyu He, Li Zhao, Linli Xu, Jiang Bian

International Conference on Learning Representations (ICLR) 2025

Demonstrating that autoregressive transformers can perform zero-shot video imitation via in-context learning.

InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

Yuchi Wang, Junliang Guo, Jianhong Bai, Runyi Yu, Tianyu He, Xu Tan, Xu Sun, Jiang Bian

Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 2025

Text-guided emotion and motion control for avatar generation.

UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing

Jianhong Bai, Tianyu He, Yuchi Wang, Junliang Guo, Haoji Hu, Zuozhu Liu, Jiang Bian

Proceedings of the ACM International Conference on Multimedia (ACM MM) 2025

A unified tuning-free framework for both video motion and appearance editing.

VidTok: A Versatile and Open-Source Video Tokenizer

Anni Tang, Tianyu He, Junliang Guo, Xinle Cheng, Li Song, Jiang Bian

arXiv preprint arXiv:2412.13061 2024

A versatile and open-source video tokenizer for video understanding and generation.

IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI

Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, Jiang Bian

arXiv preprint arXiv:2411.00785 2024

Proposing latent action models for embodied AI foundation models.

Compositional 3D-aware Video Generation with LLM Director

Hanxin Zhu, Tianyu He, Anni Tang, Junliang Guo, Zhibo Chen, Jiang Bian

Advances in Neural Information Processing Systems (NeurIPS) 2024

Compositional 3D-aware video generation directed by large language models.

End-to-End Rate-Distortion Optimized 3D Gaussian Representation

Henan Wang, Hanxin Zhu, Tianyu He, Runsen Feng, Jiajun Deng, Jiang Bian, Zhibo Chen

European Conference on Computer Vision (ECCV) 2024

Formulating the compact 3D Gaussian learning as an end-to-end Rate-Distortion Optimization (RDO) problem.

Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis?

Hanxin Zhu, Tianyu He, Xin Li, Bingchen Li, Zhibo Chen

Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) 2024

Investigating the sufficiency of vanilla MLP architectures in NeRF for few-shot view synthesis.

GAIA: Zero-Shot Talking Avatar Generation

Tianyu He, Junliang Guo, Runyi Yu, Yuchi Wang, Jialiang Zhu, Kaikai An, Leyi Li, Xu Tan, Chunyu Wang, Han Hu, HsiangTao Wu, Sheng Zhao, Jiang Bian

International Conference on Learning Representations (ICLR) 2024

Zero-shot talking avatar generation without subject-specific fine-tuning.

Memories are One-to-Many Mapping Alleviators in Talking Face Generation

Anni Tang, Tianyu He, Xu Tan, Jun Ling, Li Song

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2024

Using memory mechanisms to alleviate one-to-many mapping in talking face generation.

DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder

Chenpeng Du, Qi Chen, Tianyu He, Xu Tan, Xie Chen, Kai Yu, Sheng Zhao, Jiang Bian

Proceedings of the ACM International Conference on Multimedia (ACM MM) 2023

High fidelity speech-driven talking face generation using diffusion autoencoder.

HiFace: High-Fidelity 3D Face Reconstruction by Learning Static and Dynamic Details

Zenghao Chai, Tianke Zhang, Tianyu He, Xu Tan, Tadas Baltrušaitis, HsiangTao Wu, Runnan Li, Sheng Zhao, Chun Yuan, Jiang Bian

Proceedings of the International Conference on Computer Vision (ICCV) 2023

High-fidelity 3D face reconstruction with both static and dynamic facial details.