Differentiable GPU Kernel Autotuner with Transfer Learning
Overview
Developed a robust, end-to-end differentiable GPU kernel autotuner for vLLM that performs well with ground-truth measurements covering as little as 1% of the total search space. The autotuner also supports transfer learning, leveraging tuning data from cheaper kernels to cut the tuning time for a new kernel from days to hours.
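As a rough illustration of the idea (not the project's actual code), a minimal PyTorch-style sketch is shown below: a small differentiable surrogate maps kernel configuration features to predicted latency and is fit on the sparse set of measured config-performance pairs. All class names, function names, and hyperparameters are assumptions for illustration only.

```python
# Minimal sketch, assuming a PyTorch-style differentiable surrogate model.
# All names and hyperparameters are illustrative, not the project's API.
import torch
import torch.nn as nn


class PerfSurrogate(nn.Module):
    """Differentiable performance model over kernel configuration features."""

    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, configs: torch.Tensor) -> torch.Tensor:
        # Predicted latency for each configuration.
        return self.net(configs).squeeze(-1)


def fit(model: PerfSurrogate, configs: torch.Tensor, latencies: torch.Tensor,
        epochs: int = 200, lr: float = 1e-3) -> PerfSurrogate:
    """Fit the surrogate on the measured (config, latency) pairs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(configs), latencies)
        loss.backward()
        opt.step()
    return model
```

Once fitted on a small sample of the search space, such a surrogate can rank the remaining configurations by predicted latency, so only the most promising candidates need to be benchmarked on hardware.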
Findings
Across six datasets, our autotuner achieved 0.85× to 1.60× the cross-validation accuracy of 16 other ML and tabular transformer models (commonly used for performance modeling) when trained on 1% ground truth. Its accuracy continued to improve as the training data increased, outperforming all baselines.
In the transfer-learning setting, an accuracy improvement of up to 12.7% was observed when an expensive CUDA kernel was tuned with only 100 config-performance pairs by leveraging tuning data from a similar but cheaper Triton kernel.
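A hedged sketch of that transfer-learning step, reusing the hypothetical PerfSurrogate and fit() from the sketch above; the synthetic tensors stand in for real tuning data and the shapes, counts, and learning rates are assumptions.

```python
# Illustrative transfer-learning sketch; data here is synthetic placeholders.
import torch

n_features = 8
# Abundant, cheap-to-collect tuning data from a similar Triton kernel.
triton_configs, triton_latencies = torch.rand(5000, n_features), torch.rand(5000)
# Only ~100 measured config-performance pairs from the expensive CUDA kernel.
cuda_configs, cuda_latencies = torch.rand(100, n_features), torch.rand(100)

# 1) Pretrain the surrogate on the cheap Triton data.
model = fit(PerfSurrogate(n_features), triton_configs, triton_latencies)

# 2) Fine-tune the pretrained surrogate on the small CUDA sample, using a
#    lower learning rate and fewer epochs to preserve the transferred weights.
model = fit(model, cuda_configs, cuda_latencies, epochs=50, lr=1e-4)
```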
