About
I’m a compute architect at NVIDIA, focused on accelerating the deep learning software stack on cutting-edge GPUs such as Hopper and Blackwell. Currently, I am developing a deep learning compiler and improving end-to-end training performance. In my spare time, I keep up with emerging deep learning directions, including embodied intelligence, AI4Science, LLMs, and graphics.
A summary of my work:
Selected contributions to deep learning compilers:
- Fuser (2022): Added support for graph ops such as gather/scatter/index_select. code blog
- pytorch_geometric (2022): Added TorchScript support for the PyG community. code
Selected contributions to deep learning models and frameworks:
- OpenFold (2023): MLPerf Training HPC Benchmark Suite Results, round v3.1
- GPT-3 (2023): MLPerf Training Benchmark Suite Results, round v4.0
- Megatron-LM / NeMo: Support for arbitrary combinations of parallelism orders. megatron-lm nemo
Selected contributions to fast kernels:
- SpMM optimization (2021): Champion of the Graph Challenge 2021
- K-Truss decomposition (2021): paper