About
I’m a senior compute architect at NVIDIA, focused on accelerating the deep learning software stack on cutting-edge GPUs such as Hopper, Blackwell, and Rubin.
I currently work on a deep learning compiler and on improving end-to-end training performance. My research interests span compiler optimization, high-performance computing, and AI systems, with the goal of pushing state-of-the-art deep learning models to industry-leading performance.
In my spare time, I keep up with emerging deep learning algorithms, including embodied AI, AI4Science, LLMs, and AI4Graphics.
🔨 Deep Learning Compilers
Selected compiler work; a minimal Triton kernel sketch follows the list:
- Triton-to-TileIR (2024–present): Bridged the Triton and CuTile ecosystems. [code]
- CuTile (2023–present): Implemented optimization passes and bug fixes. [code] [blog] [GTC]
- Fuser (2022): Enabled graph operations such as gather, scatter, and index_select. [code] [blog]
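For context, Triton and TileIR both express GPU programs at the tile level rather than per-thread. Here is a minimal vector-add kernel using the public Triton API, just to illustrate the programming model these compilers consume (not code from the projects above):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide tile of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged last tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Requires a CUDA GPU.
x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```

The compiler, not the programmer, decides how a tile maps onto threads and memory, which is what makes the tile abstraction a good target for optimization passes.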
🚀 Deep Learning Models
Training optimizations for production systems:
- OpenFold2 (2023): MLPerf Training HPC Benchmark Suite results, round v3.1. [code] [blog] [paper]
- GPT-3 (2023): MLPerf Training Benchmark Suite results, round v4.0. [megatron-lm] [nemo]
- GNN (2022): Added TorchScript support for the PyG community; see the toy scripting sketch below. [code]
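For the curious: TorchScript compiles a module's forward pass ahead of time, so every op in it (gathers, scatters, and the like) must be scriptable. A toy, hypothetical layer in plain PyTorch showing what that looks like (not actual PyG code):

```python
import torch
from torch import nn

class TinyMessagePassing(nn.Module):
    # Hypothetical toy layer: gathers source-node features and
    # scatter-adds them onto destination nodes.
    def __init__(self, dim: int):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        src, dst = edge_index[0], edge_index[1]
        msgs = x.index_select(0, src)                        # gather
        agg = torch.zeros_like(x).index_add_(0, dst, msgs)   # scatter-add
        return self.lin(agg)

layer = torch.jit.script(TinyMessagePassing(16))  # compile to TorchScript
x = torch.randn(10, 16)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]])
print(layer(x, edge_index).shape)  # torch.Size([10, 16])
```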
⚡ High-Performance Kernels
Optimizations for GPU computing:
- SpMM (2021): 🏆 Champion of the Graph Challenge; a reference SpMM sketch follows the list. [code] [paper] [blog]
- K-Truss (2021): GPU kernel optimization for k-truss decomposition. [paper]
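To pin down the operation: SpMM multiplies a sparse matrix by a dense one. A deliberately naive NumPy reference in CSR form, only to define the semantics; the Graph Challenge entry itself is a heavily tuned GPU kernel:

```python
import numpy as np

def spmm_csr(indptr, indices, data, B):
    """Reference SpMM: C = A @ B, with A given in CSR form."""
    n_rows = indptr.shape[0] - 1
    C = np.zeros((n_rows, B.shape[1]), dtype=B.dtype)
    for row in range(n_rows):
        # Nonzeros of this row live in data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            C[row] += data[k] * B[indices[k]]
    return C

# A = [[1, 0, 2], [0, 3, 0]] (2x3, CSR) times a dense 3x2 B.
indptr  = np.array([0, 2, 3])
indices = np.array([0, 2, 1])
data    = np.array([1.0, 2.0, 3.0])
B = np.arange(6, dtype=np.float64).reshape(3, 2)
print(spmm_csr(indptr, indices, data, B))  # [[ 8. 11.] [ 6.  9.]]
```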
📬 Contact
Feel free to reach out via GitHub or email at cs.xinjie@gmail.com.