Tianyu He

Senior Researcher

Microsoft Research Asia

Research Interests

Multimodal Foundation Models
Interactive Video World Models

About

I am a Senior Researcher at Neo Lab, currently in stealth. Previously, I was a Senior Researcher in the Machine Learning Group at Microsoft Research Asia. My work focuses on the intersection of multimodality and generative modeling, specifically aimed at advancing how AI understands and creates complex, real-world content.

I graduated from the University of Science and Technology of China (USTC) in 2019. Prior to joining Microsoft in 2022, I spent three years at Alibaba DAMO Academy (AliStar) developing industry-leading algorithms. My research is regularly published in top-tier venues such as NeurIPS, ICLR, CVPR, ICCV, and ECCV.

Driven by a long-term goal of building intelligence for the physical world, my current research focuses on:

  • Multimodal Foundation Models: Engineering unified architectures with a focus on data/model scaling and reasoning to power next-generation agents and robotics.
  • Interactive Video World Models: Exploring real-time causal generation, interactive control, and consistent world modeling to bridge the gap between digital synthesis and physical simulation.

🚀 I am hiring! If you are interested in a position focused on Multimodal Foundation Models or World Models, please email me.

Selected Projects

View All →

LIVE: Long-horizon Interactive Video World Modeling

Junchao Huang†, Ziyang Ye†, Xinting Hu, Tianyu He, Guiyu Zhang†, Shaoshuai Shi, Jiang Bian, Li Jiang

arXiv preprint arXiv:2602.03747 (2026)

A framework that alleviates autoregressive error accumulation via a cycle-consistency constraint.

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Haoyu Wu†, Diankun Wu†, Tianyu He, Junliang Guo, Yang Ye†, Yueqi Duan, Jiang Bian

International Conference on Learning Representations (ICLR) (2026)

Combining video diffusion with 3D representation for geometrically consistent world modeling.

Reinforcement Learning with Inverse Rewards for World Model Post-training

Yang Ye†, Tianyu He, Shuo Yang†, Jiang Bian

arXiv preprint arXiv:2509.23958 (2025)

A post-training framework that derives verifiable reward signals by recovering input actions from generated videos using an Inverse Dynamics Model.

Playing with Transformer at 30+ FPS via Next-Frame Diffusion

Xinle Cheng†, Tianyu He, Jiayi Xu†, Junliang Guo, Di He, Jiang Bian

arXiv preprint arXiv:2506.01380 (2025)

Achieving autoregressive video generation at 30+ FPS through next-frame diffusion.

MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft

Junliang Guo, Yang Ye†, Tianyu He, Haoyu Wu†, Yushu Jiang†, Tim Pearce, Jiang Bian

arXiv preprint arXiv:2504.08388 (2025)

A real-time and open-source interactive world model built on Minecraft.

VidTok: A Versatile and Open-Source Video Tokenizer

Anni Tang†, Tianyu He, Junliang Guo, Xinle Cheng†, Li Song, Jiang Bian

arXiv preprint arXiv:2412.13061 (2024)

A versatile and open-source video tokenizer for video understanding and generation.

IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI

Xiaoyu Chen†, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, Jiang Bian

arXiv preprint arXiv:2411.00785 (2024)

Proposing latent action models for embodied AI foundation models.

GAIA: Zero-Shot Talking Avatar Generation

Tianyu He, Junliang Guo, Runyi Yu†, Yuchi Wang†, Jialiang Zhu†, Kaikai An†, Leyi Li†, Xu Tan, Chunyu Wang, Han Hu, HsiangTao Wu, Sheng Zhao, Jiang Bian

International Conference on Learning Representations (ICLR) (2024)

Zero-shot talking avatar generation without subject-specific fine-tuning.