Differentiable GPU kernel autotuner with Transfer Learning

Autotuner
CUDA
Kernel
Differentiable
vLLM
Transfer Learning
Amazon Internship
PyTorch
SciPy
GPU
Developed a robust, end-to-end differentiable GPU kernel autotuner for vLLM that requires very little ground truth (n < 1000) for tuning.
Published

August 2025


Amazon AWS Internship Summer 2025

Amazon intern workshop day! James Basa on the right.

Overview

Developed a robust, end-to-end differentiable GPU kernel autotuner for vLLM that works well with as little as 1% of the total search space measured as ground truth. The autotuner also supports transfer learning: it can leverage tuning data from cheaper kernels to cut the tuning time for a new kernel from days to hours.
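The core idea can be illustrated with a small differentiable surrogate model: a network maps a numeric encoding of a kernel configuration to predicted latency, is trained on the small measured sample, and is then used to rank the untried configurations. The sketch below is a minimal, hypothetical illustration in PyTorch; the class and function names, feature encoding, and hyperparameters are assumptions, not the internship implementation.

```python
# Minimal sketch (not the actual internship code) of a differentiable
# surrogate: predict latency from a kernel-config encoding, train on a
# small measured sample (~1% of the search space), then rank candidates.
import torch
import torch.nn as nn

class ConfigSurrogate(nn.Module):
    """Predicts log-latency from a numeric encoding of a kernel config."""
    def __init__(self, num_features: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def fit_surrogate(configs: torch.Tensor, latencies: torch.Tensor,
                  epochs: int = 500, lr: float = 1e-3) -> ConfigSurrogate:
    """Fit on a small ground-truth sample of (config, latency) pairs."""
    model = ConfigSurrogate(configs.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    target = latencies.log()  # log-space stabilizes wide latency ranges
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(configs), target)
        loss.backward()
        opt.step()
    return model

def rank_candidates(model: ConfigSurrogate,
                    candidates: torch.Tensor) -> torch.Tensor:
    """Return candidate indices sorted by predicted latency (fastest first)."""
    with torch.no_grad():
        return torch.argsort(model(candidates))
```

Because the surrogate is differentiable end to end, it can be trained with ordinary gradient descent on very few measured configurations and then queried cheaply across the rest of the search space.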

Findings

  • Across six datasets, our autotuner achieved 0.85×–1.60× the cross-validation accuracy of 16 other ML and tabular-transformer models (commonly used for performance modeling) when trained on 1% ground truth. Its accuracy continued to improve as training data increased, outperforming all baselines.

  • In the transfer-learning setting, accuracy improved by up to 12.7% when an expensive CUDA kernel was tuned with only 100 config-performance pairs, augmented with tuning data from a similar but cheaper Triton kernel (see the sketch below).
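A hedged sketch of how this transfer-learning step could look, building on the surrogate sketch above: pretrain the surrogate on the cheaper Triton kernel's tuning data, then fine-tune it on the roughly 100 measured CUDA config-performance pairs. The function name, freezing strategy (none), and the small fine-tuning learning rate are illustrative assumptions.

```python
# Hedged sketch of transfer learning across kernels (assumptions, not the
# actual internship code): adapt a surrogate pretrained on cheap Triton
# tuning data to an expensive CUDA kernel using ~100 measured pairs.
import torch
import torch.nn as nn

def fine_tune(pretrained: nn.Module,
              cuda_configs: torch.Tensor,
              cuda_latencies: torch.Tensor,
              epochs: int = 200, lr: float = 1e-4) -> nn.Module:
    """Adapt a surrogate pretrained on Triton tuning data to a CUDA kernel."""
    model = pretrained                                  # reuse learned config->perf structure
    opt = torch.optim.Adam(model.parameters(), lr=lr)   # small lr: gentle adaptation
    target = cuda_latencies.log()
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(cuda_configs), target)
        loss.backward()
        opt.step()
    return model

# Hypothetical usage with the earlier sketch:
#   base  = fit_surrogate(triton_configs, triton_latencies)        # cheap data
#   tuned = fine_tune(base, cuda_configs_100, cuda_latencies_100)  # ~100 pairs
```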