Tianyu He

Senior Researcher

Microsoft Research Asia

Research Interests

Multimodal Foundation Models
Interactive Video World Models

About

I am a Senior Researcher at Neo Lab, currently in stealth. Previously, I was a Senior Researcher in the Machine Learning Group at Microsoft Research Asia. My work focuses on the intersection of multimodality and generative modeling, specifically aimed at advancing how AI understands and creates complex, real-world content.

I graduated from the University of Science and Technology of China (USTC) in 2019. Prior to joining Microsoft in 2022, I spent three years at Alibaba DAMO Academy (AliStar) developing industry-leading algorithms. My research is regularly published in top-tier venues such as NeurIPS, ICLR, CVPR, ICCV, and ECCV.

Driven by a long-term goal of building intelligence for the physical world, my current research focuses on:

  • Multimodal Foundation Models: Engineering unified architectures with a focus on data/model scaling and reasoning to power next-generation agents and robotics.
  • Interactive Video World Models: Exploring real-time causal generation, interactive control, and consistent world modeling to bridge the gap between digital synthesis and physical simulation.

🚀 I am hiring! If you are interested in a position focused on Multimodal Foundation Models or World Models, please email me.

Selected Projects

View All →

LIVE: Long-horizon Interactive Video World Modeling

Junchao Huang†, Ziyang Ye†, Xinting Hu, Tianyu He, Guiyu Zhang†, Shaoshuai Shi, Jiang Bian, Li Jiang

arXiv preprint arXiv:2602.03747 (2026)

A framework that alleviates autoregressive error accumulation via a cycle-consistency constraint.

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Haoyu Wu†, Diankun Wu†, Tianyu He, Junliang Guo, Yang Ye†, Yueqi Duan, Jiang Bian

International Conference on Learning Representations (ICLR) (2026)

Combining video diffusion with 3D representation for geometrically consistent world modeling.

Reinforcement Learning with Inverse Rewards for World Model Post-training

Yang Ye†, Tianyu He, Shuo Yang†, Jiang Bian

arXiv preprint arXiv:2509.23958 (2025)

A post-training framework that derives verifiable reward signals by recovering input actions from generated videos using an Inverse Dynamics Model.

Playing with Transformer at 30+ FPS via Next-Frame Diffusion

Xinle Cheng†, Tianyu He, Jiayi Xu†, Junliang Guo, Di He, Jiang Bian

arXiv preprint arXiv:2506.01380 (2025)

Achieving autoregressive video generation at 30+ FPS through next-frame diffusion.

MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft

Junliang Guo, Yang Ye†, Tianyu He, Haoyu Wu†, Yushu Jiang†, Tim Pearce, Jiang Bian

arXiv preprint arXiv:2504.08388 (2025)

A real-time and open-source interactive world model built on Minecraft.

VidTok: A Versatile and Open-Source Video Tokenizer

Anni Tang†, Tianyu He, Junliang Guo, Xinle Cheng†, Li Song, Jiang Bian

arXiv preprint arXiv:2412.13061 (2024)

A versatile and open-source video tokenizer for video understanding and generation.

IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI

Xiaoyu Chen†, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, Jiang Bian

arXiv preprint arXiv:2411.00785 (2024)

Proposing latent action models for embodied AI foundation models.

GAIA: Zero-Shot Talking Avatar Generation

Tianyu He, Junliang Guo, Runyi Yu†, Yuchi Wang†, Jialiang Zhu†, Kaikai An†, Leyi Li†, Xu Tan, Chunyu Wang, Han Hu, HsiangTao Wu, Sheng Zhao, Jiang Bian

International Conference on Learning Representations (ICLR) (2024)

Zero-shot talking avatar generation without subject-specific fine-tuning.