About
I’m a senior compute architect at NVIDIA, focused on accelerating the deep learning software stack on cutting-edge GPUs such as Hopper, Blackwell, and Rubin.
I currently work on a deep learning compiler and on improving end-to-end training performance. My research interests span compiler optimization, high-performance computing, and AI systems, with the goal of pushing state-of-the-art deep learning models to industry-leading performance.
In my spare time, I keep up with emerging deep learning algorithms, including embodied AI, AI4Science, LLMs, and AI4Graphics.
🔨 Deep Learning Compilers
Selected compiler work; a minimal Triton kernel sketch follows the list:
- Triton-to-TileIR (2024–present): Bridged the Triton and CuTile ecosystems. [code]
- CuTile (2023–present): Implemented optimization passes and bug fixes. [code] [blog] [GTC]
- Fuser (2022): Enabled graph operations such as gather, scatter, and index_select. [code] [blog]
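For context, Triton and TileIR both express GPU programs at the tile level rather than per-thread. Here is a minimal vector-add kernel using the public Triton API, just to illustrate the programming model these compilers consume (not code from the projects above):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide tile of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged last tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Requires a CUDA GPU.
x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```

The compiler, not the programmer, decides how a tile maps onto threads and memory, which is what makes the tile abstraction a good target for optimization passes.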
🚀 Deep Learning Models
Training optimizations for production systems:
- OpenFold2 (2023): MLPerf Training HPC Benchmark Suite results, round v3.1. [code] [blog] [paper]
- GPT-3 (2023): MLPerf Training Benchmark Suite results, round v4.0. [megatron-lm] [nemo]
- GNN (2022): Added TorchScript support for the PyG community; see the toy scripting sketch below. [code]
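For the curious: TorchScript compiles a module's forward pass ahead of time, so every op in it (gathers, scatters, and the like) must be scriptable. A toy, hypothetical layer in plain PyTorch showing what that looks like (not actual PyG code):

```python
import torch
from torch import nn

class TinyMessagePassing(nn.Module):
    # Hypothetical toy layer: gathers source-node features and
    # scatter-adds them onto destination nodes.
    def __init__(self, dim: int):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        src, dst = edge_index[0], edge_index[1]
        msgs = x.index_select(0, src)                        # gather
        agg = torch.zeros_like(x).index_add_(0, dst, msgs)   # scatter-add
        return self.lin(agg)

layer = torch.jit.script(TinyMessagePassing(16))  # compile to TorchScript
x = torch.randn(10, 16)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]])
print(layer(x, edge_index).shape)  # torch.Size([10, 16])
```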
⚡ High-Performance Kernels
Optimizations for GPU computing:
- SpMM (2021): 🏆 Champion of the Graph Challenge; a reference SpMM sketch follows the list. [code] [paper] [blog]
- K-Truss (2021): GPU kernel optimization for k-truss decomposition. [paper]
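To pin down the operation: SpMM multiplies a sparse matrix by a dense one. A deliberately naive NumPy reference in CSR form, only to define the semantics; the Graph Challenge entry itself is a heavily tuned GPU kernel:

```python
import numpy as np

def spmm_csr(indptr, indices, data, B):
    """Reference SpMM: C = A @ B, with A given in CSR form."""
    n_rows = indptr.shape[0] - 1
    C = np.zeros((n_rows, B.shape[1]), dtype=B.dtype)
    for row in range(n_rows):
        # Nonzeros of this row live in data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            C[row] += data[k] * B[indices[k]]
    return C

# A = [[1, 0, 2], [0, 3, 0]] (2x3, CSR) times a dense 3x2 B.
indptr  = np.array([0, 2, 3])
indices = np.array([0, 2, 1])
data    = np.array([1.0, 2.0, 3.0])
B = np.arange(6, dtype=np.float64).reshape(3, 2)
print(spmm_csr(indptr, indices, data, B))  # [[ 8. 11.] [ 6.  9.]]
```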
📬 Contact
Feel free to reach out via GitHub or email at cs.xinjie@gmail.com.