About Me
I am a PhD candidate in Computer Science & Engineering at Texas A&M University, focusing on high-performance and distributed machine learning systems. I specialize in optimizing sparse linear algebra for efficient ML pipelines. Previously, I interned at Amazon AWS, where I built a differentiable GPU kernel autotuner that reduced LLM kernel-tuning time from days to hours using transfer learning.
After completing my PhD (tentatively in Fall 2027), I plan to pursue industry roles as an ML systems engineer or researcher, focusing on building high-performance and scalable machine learning infrastructure.
I can:
- Build custom PyTorch GNN training pipelines backed by optimized sparse linear algebra (see the sketch after this list).
- Integrate custom C++/CUDA kernels into PyTorch with LibTorch & PyBind11.
- Develop distributed training systems with PyTorch Distributed and efficient disk-based data streaming.
- Profile Python/C++ pipelines to locate and remove performance bottlenecks.
- Apply ML to systems optimization (e.g., a neural GPU kernel autotuner with transfer-learning support).
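As a minimal sketch of the first item (illustrative layer name, sizes, and toy graph, not code from my libraries), a GNN layer whose neighbor aggregation runs through a single sparse-dense matmul looks roughly like this in PyTorch:

```python
import torch
import torch.nn as nn


class SpMMGraphLayer(nn.Module):
    """Graph-convolution-style layer: aggregation is one sparse-dense matmul."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # adj: sparse (N, N) normalized adjacency; x: dense (N, in_dim) features.
        h = torch.sparse.mm(adj, x)          # neighbor aggregation as SpMM
        return torch.relu(self.linear(h))


# Toy usage on a 3-node graph with a COO adjacency matrix.
indices = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])
adj = torch.sparse_coo_tensor(indices, torch.ones(4), (3, 3)).coalesce()
x = torch.randn(3, 8)
out = SpMMGraphLayer(8, 16)(adj, x)          # shape: (3, 16)
```

Because the whole aggregation is one torch.sparse.mm call, swapping in a faster SpMM backend speeds up training without touching the model code.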
Prior to my PhD, I taught undergraduate courses for over five years, including in tenure-track roles. My teaching spanned computer systems, theory, and core programming: Web Programming, Computer Architecture, Theory of Computation, Programming Languages (C/C++/Java), and Data Structures & Algorithms.
Research Background
My PhD research centers on scalable graph learning systems (GNNs, KGEs, GraphRAG) and custom sparse linear algebra.
At MLSys 2025, I presented a generalized method for expressing KGE training models through sparse matrix multiplication to improve large-scale efficiency, yielding training speedups of up to 5.3× on CPU and 4.2× on GPU, and up to 11.1× better CUDA memory efficiency.
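As a toy illustration of the general flavor of that kind of rewrite (made-up sizes and names, and not the paper's actual formulation), even the embedding gather inside a KGE training step can be expressed as a sparse-dense matmul, so an optimized SpMM kernel carries both the forward gather and the backward gradient accumulation:

```python
import torch

# Hypothetical sizes; real training uses far larger entity sets.
num_entities, dim, batch = 1000, 64, 4
entity_emb = torch.randn(num_entities, dim, requires_grad=True)

# A batch of head-entity indices, expressed as a sparse one-hot matrix
# of shape (batch, num_entities).
heads = torch.tensor([3, 17, 256, 999])
one_hot = torch.sparse_coo_tensor(
    torch.stack([torch.arange(batch), heads]),
    torch.ones(batch),
    (batch, num_entities),
).coalesce()

# The embedding gather becomes a single SpMM; gradients flow back into
# entity_emb through the same sparse matmul in the backward pass.
head_vecs = torch.sparse.mm(one_hot, entity_emb)   # (batch, dim)
loss = head_vecs.pow(2).sum()                      # stand-in objective
loss.backward()                                    # entity_emb.grad is populated
```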
I also built a high-performance CPU SpMM library (ACM WebConf 2024) that accelerates PyTorch GNN training by up to 93× across Intel, AMD, and ARM CPUs.
Additionally, during my summer 2025 internship at Amazon, I developed a differentiable GPU kernel autotuner that predicted optimal configurations with up to 60% higher accuracy than 16 other autotuning models.
Currently, I am collaborating with Oak Ridge National Lab on a distributed, differentiable framework for large-scale KGE training across hybrid compute tiers.
Leadership Roles
I led and coordinated various academic and technical initiatives, including curriculum revisions, postgraduate programs, and programming contests at UIU, MIST, and BUET, fostering a culture of innovation and academic excellence across departments. See all my leadership roles and initiatives here.
Exploration
I deepen my ML systems expertise by solving LeetGPU problems across PyTorch, CUDA, and Triton, building intuition for low-level kernel design and performance trade-offs. I stay current with ML infrastructure and systems trends by developing small experimental tools and continuously expanding my software skill set, with 12+ open-source and commercial projects in Python, C++, and Java. I also regularly read and distill recent ML systems papers, sharing relevant hands-on tutorials and technical walkthroughs on my blog to communicate systems concepts clearly and effectively.