LeetGPU-1: Vector Addition
Note: This is part of a learning series on CUDA and Triton, focusing on correctness-first implementations rather than performance optimization.
Problem Statement
Implement a program that performs element-wise addition of two vectors containing 32-bit floating point numbers on a GPU. The program should take two input vectors of equal length and produce a single output vector containing their sum.
Example
Input: A = [1.0, 2.0, 3.0, 4.0]
B = [5.0, 6.0, 7.0, 8.0]
Output: C = [6.0, 8.0, 10.0, 12.0]
Constraints
- Input vectors `A` and `B` have identical lengths
- 1 ≤ `N` ≤ 100,000,000
Solution
PyTorch
For the signature `def solve(A: torch.Tensor, B: torch.Tensor, C: torch.Tensor, N: int)`, the obvious `C = A + B` won't work, because `C` is passed in as an output argument: assigning to it only rebinds the local name to a new tensor. We need to copy the result into `C` instead of creating a new variable `C`.
- Option 1: `C.copy_(A + B)`
- Option 2: `torch.add(A, B, out=C)`
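To make the difference concrete, here is a small CPU-only sketch (the tensor values and helper names are illustrative, not part of the graded solution): rebinding the parameter leaves the caller's `C` untouched, while `copy_` writes into the caller's buffer.

```python
import torch

def solve_wrong(A, B, C, N):
    # Rebinds the local name C to a brand-new tensor; the caller's C is untouched.
    C = A + B

def solve_right(A, B, C, N):
    # Writes the sum into the caller's buffer in place.
    C.copy_(A + B)

A = torch.tensor([1.0, 2.0, 3.0, 4.0])
B = torch.tensor([5.0, 6.0, 7.0, 8.0])
C = torch.zeros(4)

solve_wrong(A, B, C, 4)
print(C)  # tensor([0., 0., 0., 0.]) -- still zeros

solve_right(A, B, C, 4)
print(C)  # tensor([ 6.,  8., 10., 12.])
```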
Solution
import torch
# A, B, C are tensors on the GPU
def solve(A: torch.Tensor, B: torch.Tensor, C: torch.Tensor, N: int):
    torch.add(A, B, out=C)

Runtime: 1.25ms
Triton
The Triton kernel works like a SIMD-style vectorized load and store: each program instance loads one block of elements, adds them, and stores the result.

- `tl.program_id(0)`: id of the current block (program instance).
- `tl.arange(0, BLOCK_SIZE)`: similar to `np.arange()`; the list of offsets to add to the block start.
- `tl.load(a + block_offset, mask=mask)`: `a` is the pointer to the start of the data; `mask` handles the uneven last block (see the sketch below).
- `tl.store(c + block_offset, output, mask=mask)`: the middle argument is the data to write.
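The following plain-Python sketch (with `N` and `BLOCK_SIZE` values chosen purely for illustration) traces how each program's offsets and mask work out, including the final block that runs past the end of the data.

```python
# Illustrative values only: N = 10 elements, BLOCK_SIZE = 4.
N, BLOCK_SIZE = 10, 4
num_programs = (N + BLOCK_SIZE - 1) // BLOCK_SIZE  # triton.cdiv(N, BLOCK_SIZE) == 3

for pid in range(num_programs):  # pid plays the role of tl.program_id(0)
    block_start = pid * BLOCK_SIZE
    block_offset = [block_start + i for i in range(BLOCK_SIZE)]  # tl.arange(0, BLOCK_SIZE)
    mask = [off < N for off in block_offset]
    print(pid, block_offset, mask)

# pid 2 covers offsets [8, 9, 10, 11]; only 8 and 9 are real elements, so its
# mask is [True, True, False, False], keeping the loads and stores in bounds.
```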
Solution
import torch
import triton
import triton.language as tl
@triton.jit
def vector_add_kernel(a, b, c, n_elements, BLOCK_SIZE: tl.constexpr):
    block_start = tl.program_id(0) * BLOCK_SIZE
    block_offset = block_start + tl.arange(0, BLOCK_SIZE)
    mask = block_offset < n_elements
    a_val = tl.load(a + block_offset, mask=mask)
    b_val = tl.load(b + block_offset, mask=mask)
    output = a_val + b_val
    tl.store(c + block_offset, output, mask=mask)
# a, b, c are tensors on the GPU
def solve(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor, N: int):
    BLOCK_SIZE = 1024
    grid = (triton.cdiv(N, BLOCK_SIZE),)
    vector_add_kernel[grid](a, b, c, N, BLOCK_SIZE)

Runtime: 1.34ms
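As a quick sanity check, the Triton `solve` can be compared against PyTorch's own addition. This assumes a CUDA-capable GPU and that the kernel and `solve` above are already defined; the size `N` here is arbitrary.

```python
import torch

N = 1_000_000
a = torch.rand(N, device="cuda", dtype=torch.float32)
b = torch.rand(N, device="cuda", dtype=torch.float32)
c = torch.empty(N, device="cuda", dtype=torch.float32)

solve(a, b, c, N)                # launch the Triton kernel
assert torch.allclose(c, a + b)  # matches the reference result
```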
CUDA
- `blockIdx.x`: current block id; similar to `tl.program_id(0)`.
- `blockDim.x`: block size.
- `threadIdx.x`: thread id inside a block. Generated by the CUDA runtime, so no `arange` is needed.
- `A[threadId]` is the same as `*(A + threadId)` in the C programming language.
Solution
#include <cuda_runtime.h>
__global__ void vector_add(const float* A, const float* B, float* C, int N) {
    int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadId >= N)
        return;
    C[threadId] = A[threadId] + B[threadId];
}
// A, B, C are device pointers (i.e. pointers to memory on the GPU)
extern "C" void solve(const float* A, const float* B, float* C, int N) {
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vector_add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
    cudaDeviceSynchronize();
}

Runtime: 1.18ms
`C[threadId] = A[threadId]; C[threadId] += B[threadId];` is actually slower (1.24ms) than `C[threadId] = A[threadId] + B[threadId];`, since it involves two write operations to HBM instead of one. But both are faster than the PyTorch and Triton versions!
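The same one-write-versus-two-writes trade-off can be sketched at the PyTorch level as an analogy (this is not the CUDA code that was timed, just an illustration of the pattern):

```python
import torch

A = torch.rand(1_000_000, device="cuda")
B = torch.rand(1_000_000, device="cuda")
C = torch.empty(1_000_000, device="cuda")

# One write of C: the sum is computed and stored directly into C.
torch.add(A, B, out=C)

# Two writes of C: A is first copied into C, then B is added in place.
C.copy_(A)
C.add_(B)
```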