
Md Saidul Hoque Anik

PhD Candidate at Texas A&M | ML Systems & Performance Optimization | Sparse Kernels for Graph Learning | Dual M.S. (ISE, CSE) — Indiana University Bloomington and BUET | Tenure-Track Faculty (On Sabbatical, UIU)
GitHub | LinkedIn

About Me

I am a PhD candidate in Computer Science & Engineering at Texas A&M University, focusing on high-performance and distributed machine learning systems. I specialize in optimizing sparse linear algebra for efficient ML pipelines. Previously, I interned at Amazon AWS, where I built a differentiable GPU kernel autotuner that reduced LLM kernel-tuning time from days to hours using transfer learning.

I can:

  • Build custom PyTorch GNN training pipelines backed by optimized sparse linear algebra.
  • Integrate custom C++/CUDA kernels into PyTorch with LibTorch & PyBind11 (a minimal sketch follows this list).
  • Develop distributed training systems with PyTorch Distributed and efficient disk-based data streaming.
  • Profile and optimize Python/C++ pipelines for performance bottlenecks.
  • Apply ML to systems optimization (e.g., developed a neural GPU kernel autotuner with transfer-learning support).
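
As a minimal sketch of what the LibTorch/PyBind11 integration looks like (not one of my production kernels; the toy operator, its name, and the extension name below are hypothetical), a tiny C++ operator can be compiled and bound into PyTorch via `torch.utils.cpp_extension.load_inline`, which generates the PyBind11 glue automatically:

```python
import torch
from torch.utils.cpp_extension import load_inline

cpp_source = r"""
// Hypothetical toy operator: y = x * a + b, written against the LibTorch tensor API.
torch::Tensor scaled_add(torch::Tensor x, double a, double b) {
    return x * a + b;
}
"""

# load_inline compiles the C++ source and generates the PyBind11 bindings,
# exposing `scaled_add` as a regular Python-callable function.
ext = load_inline(name="toy_ext", cpp_sources=cpp_source, functions=["scaled_add"])

x = torch.arange(4, dtype=torch.float32)
print(ext.scaled_add(x, 2.0, 1.0))  # tensor([1., 3., 5., 7.])
```

A real kernel would replace the one-line body with hand-optimized C++/CUDA, but the binding and build workflow stays the same.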

Research Background

My PhD research centers on scalable graph learning systems (GNNs, KGEs, GraphRAG) and custom sparse linear algebra.
At MLSys 2025, I presented a generalized method for expressing KGE model training through sparse matrix multiplication to improve large-scale efficiency. This led to training speedups of up to 5.3× on CPU and 4.2× on GPU, and up to an 11.1× improvement in CUDA memory efficiency.
I also built a high-performance CPU SpMM library (ACM WebConf 2024) that accelerates PyTorch GNN training by up to 93× across Intel, AMD, and ARM CPUs; a minimal sketch of the SpMM formulation behind both lines of work follows this paragraph.
Additionally, during my 2025 summer internship at Amazon, I developed a differentiable GPU kernel autotuner that achieved up to 60% higher accuracy in predicting optimal configurations than 16 other autotuning models.
Currently, I am collaborating with Oak Ridge National Lab on a distributed, differentiable framework for large-scale KGE training across hybrid compute tiers.
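
To make the core idea concrete, here is a small, self-contained sketch (not the published library or its kernels; the graph, shapes, and values are made up) showing how GNN neighbor aggregation, and analogously the batched gathers in SpMM-based KGE training, can be phrased as a single sparse-dense matrix multiplication:

```python
# Minimal illustration: neighbor aggregation expressed as one SpMM, so a tuned
# sparse-dense matrix multiplication kernel can replace per-edge gather/scatter.
# The graph below is a hypothetical 5-node example.
import torch

num_nodes, feat_dim = 5, 8
edges = torch.tensor([[0, 1, 2, 3, 4, 1],    # source nodes
                      [1, 2, 3, 4, 0, 3]])   # destination nodes

# Sparse adjacency (destination x source) in COO format; unit values give plain sum-aggregation.
adj = torch.sparse_coo_tensor(
    torch.stack([edges[1], edges[0]]),
    torch.ones(edges.shape[1]),
    size=(num_nodes, num_nodes),
).coalesce()

features = torch.randn(num_nodes, feat_dim)

# One SpMM performs the whole aggregation step: out[v] = sum of features[u] over in-neighbors u of v.
aggregated = torch.sparse.mm(adj, features)
print(aggregated.shape)  # torch.Size([5, 8])
```

Once training is written this way, all the performance-critical work lands in one SpMM call, which is exactly where an optimized sparse kernel can be swapped in.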

Featured Research

  • SC24 Best Poster finalist (top 6): Expressed TransE knowledge graph embedding (KGE) training using a faster sparse-dense matrix multiplication kernel.
  • MLSys 2025 paper: Generalized the training of 10 KGE models using a faster sparse-dense matrix multiplication kernel.
  • Amazon AWS internship, 2025: Developed a differentiable GPU kernel autotuner (a conceptual sketch follows this list).
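
Since the autotuner itself is internship work, the following is only a generic conceptual sketch of what "differentiable autotuning" can mean, not the system I built: a neural cost model predicts kernel runtime from a configuration vector, so candidate configurations can themselves be refined by gradient descent through the model. All names, shapes, and hyperparameters below are hypothetical.

```python
import torch
import torch.nn as nn

cost_model = nn.Sequential(            # predicts (log-)runtime from a config embedding
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
for p in cost_model.parameters():      # pretend the cost model is already fit on measured
    p.requires_grad_(False)            # (config, runtime) pairs, possibly via transfer learning

config = torch.randn(1, 4, requires_grad=True)   # e.g. tile sizes / unroll factors, relaxed to reals
opt = torch.optim.Adam([config], lr=0.05)

for _ in range(200):                   # gradient-based search over the relaxed config space
    opt.zero_grad()
    predicted_runtime = cost_model(config).squeeze()
    predicted_runtime.backward()       # gradients flow back into the configuration itself
    opt.step()

# A real autotuner would project `config` back to valid discrete choices and
# verify the winner by benchmarking the actual kernel.
print(config.detach())
```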

Explore all 20+ research projects here, or check out a quick progress overview here.

Leadership Roles

I have led and coordinated various academic and technical initiatives at UIU, MIST, and BUET, including curriculum revisions, postgraduate programs, and programming contests, fostering a culture of innovation and academic excellence across departments. See all my leadership roles and contributions here.