About

I’m a senior compute architect at NVIDIA, focused on accelerating the deep learning software stack on cutting-edge GPU architectures such as Hopper, Blackwell, and Rubin.

I currently develop a deep learning compiler and improve end-to-end training performance. My research interests span compiler optimization, high-performance computing, and AI systems. I work on pushing state-of-the-art deep learning models to industry-leading performance.

In my spare time, I follow emerging areas of deep learning, including embodied AI, AI4Science, LLMs, and AI4Graphics.


🔨 Deep Learning Compilers

Selected compiler contributions I’ve worked on:

  • Triton-to-TileIR (2024–Now): Bridged the Triton and CuTile ecosystems. [code]
  • CuTile (2023–Now): Implemented optimization passes and fixed bugs. [code] [blog] [GTC]
  • Fuser (2022): Enabled graph operations such as gather, scatter, and index_select. [code] [blog]

🚀 Deep Learning Models

Training optimizations for production systems:

⚡ High-Performance Kernels

Optimizations for GPU computing:


📬 Contact

Feel free to reach out via GitHub or email at cs.xinjie@gmail.com