Blogs and Notes
Understanding the Roofline Model without the Roof (First)
A blog post explaining the Roofline model: it starts from basic processor–memory interaction to motivate the X-axis (arithmetic intensity) and Y-axis (performance), shows how optimizations move a kernel right or up on the plot, and finally introduces the memory-bandwidth and compute roofs that bound achievable performance.
Jan 25, 2026 Hands-on
SIMD Intrinsics in Practice: Measuring Scalar vs SSE2 vs AVX2 Performance
A simple 8-element vector add benchmark shows that AVX2 with proper alignment is fastest, followed by SSE2, unaligned AVX2, and finally scalar code.
Jan 19, 2026 Hands-on
Getting Started with TinyCUDA
TinyCUDA is a header-only C++17 wrapper that eliminates CUDA boilerplate: a Buffer class manages cudaMalloc/cudaMemcpy, CUDA_CHECK surfaces kernel errors, and KernelProfiler records timings. It is ideal for rapid 1D-buffer prototyping without full-framework overhead (no autograd or multi-dimensional tensors).
Dec 26, 2025 Project
[Paper Summary] REFRAG: Rethinking RAG based Decoding
This paper from Meta Superintelligence Labs proposes an efficient decoding framework that addresses the latency and memory bottlenecks long-context inputs impose on Retrieval-Augmented Generation (RAG). Exploiting attention sparsity, it combines compressed chunk embeddings with a reinforcement-learning-based selective expansion policy. The framework delivers substantial time-to-first-token (TTFT) acceleration and an extended effective context window while maintaining perplexity and downstream task accuracy.
Oct 24, 2025 Paper Reading
iSpLib: An Auto-tuned GNN Accelerator for PyTorch
[Published in WebConf 2024] iSpLib is a PyTorch library that accelerates Graph Neural Network (GNN) training by integrating auto-tuned sparse matrix operations from FusedMM, delivering up to 93× speedup on large graphs like Reddit and OGBN-Proteins. It features plug-and-play patching for PyTorch Geometric, backpropagation support for semirings, and caching for fixed adjacency matrices, boosting models like GCN (54×), GraphSAGE (23-32×), and GIN (51×).
Sep 14, 2024 Project