LeetGPU-1: Vector Addition

LeetGPU
Notes and solutions in PyTorch, Triton, and CUDA. Runtimes shown are for a T4 GPU.
Author

Md Saidul Hoque Anik

Published

December 2025

Note: This is part of a learning series on CUDA and Triton, focusing on correctness-first implementations rather than performance optimization.

Problem Statement

Implement a program that performs element-wise addition of two vectors containing 32-bit floating point numbers on a GPU. The program should take two input vectors of equal length and produce a single output vector containing their sum.

Example

Input:  A = [1.0, 2.0, 3.0, 4.0]
        B = [5.0, 6.0, 7.0, 8.0]
Output: C = [6.0, 8.0, 10.0, 12.0]

Constraints

  • Input vectors A and B have identical lengths

  • 1 ≤ N ≤ 100,000,000

Solution

PyTorch

Note

For this signature def solve(A: torch.Tensor, B: torch.Tensor, C: torch.Tensor, N: int), the solution C = A + B won't work because C is passed in as an argument: the assignment only rebinds the local name C to a new tensor and leaves the caller's output tensor untouched. We need to copy the result into C instead of creating a new variable C.

  • Option-1: C.copy_(A + B)
  • Option-2: torch.add(A, B, out=C)
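
To see why the naive assignment fails, here is a minimal sketch (run on the CPU for simplicity; the tensor values are hypothetical):

import torch

A = torch.tensor([1.0, 2.0, 3.0, 4.0])
B = torch.tensor([5.0, 6.0, 7.0, 8.0])
C = torch.zeros_like(A)

def broken(A, B, C, N):
    C = A + B            # rebinds the local name C; the caller's tensor is untouched

def fixed(A, B, C, N):
    C.copy_(A + B)       # writes the result into the existing tensor C
    # equivalently: torch.add(A, B, out=C)

broken(A, B, C, A.numel())
print(C)                 # tensor([0., 0., 0., 0.])  -- unchanged
fixed(A, B, C, A.numel())
print(C)                 # tensor([ 6.,  8., 10., 12.])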

Solution

import torch

# A, B, C are tensors on the GPU
def solve(A: torch.Tensor, B: torch.Tensor, C: torch.Tensor, N: int):
    torch.add(A, B, out=C)

Runtime: 1.25ms

Triton

The Triton version is similar to a SIMD-style vectorized load and store: each program (block) loads a contiguous chunk of both inputs, adds them, and stores the result.

Note
  • tl.program_id(0): Get the current program (block) id along axis 0.
  • tl.arange(0, BLOCK_SIZE): Similar to np.arange(). The list of offsets to add to the block start.
  • tl.load(a + block_offset, mask=mask): a is the pointer to the start of the data; the mask disables out-of-bounds lanes when N is not a multiple of BLOCK_SIZE (see the sketch after this list).
  • tl.store(c + block_offset, output, mask=mask): the middle argument is the value to write.
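
A small host-side illustration of how the offsets and mask line up, assuming a hypothetical N = 10 and BLOCK_SIZE = 4 (plain NumPy standing in for the index arithmetic the kernel performs):

import numpy as np

N = 10
BLOCK_SIZE = 4

for pid in range((N + BLOCK_SIZE - 1) // BLOCK_SIZE):      # one iteration per program id
    block_start = pid * BLOCK_SIZE
    block_offset = block_start + np.arange(BLOCK_SIZE)     # indices this program touches
    mask = block_offset < N                                 # disable out-of-range lanes
    print(pid, block_offset, mask)

# pid 2 covers indices [8 9 10 11]; the mask turns off 10 and 11,
# so the corresponding loads and stores are skipped.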

Solution

import torch
import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(a, b, c, n_elements, BLOCK_SIZE: tl.constexpr):
    block_start = tl.program_id(0) * BLOCK_SIZE             # first index handled by this program
    block_offset = block_start + tl.arange(0, BLOCK_SIZE)   # indices covered by this program
    mask = block_offset < n_elements                        # guard the ragged last block
    a_val = tl.load(a + block_offset, mask=mask)
    b_val = tl.load(b + block_offset, mask=mask)
    output = a_val + b_val
    tl.store(c + block_offset, output, mask=mask)


# a, b, c are tensors on the GPU
def solve(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor, N: int):
    BLOCK_SIZE = 1024
    grid = (triton.cdiv(N, BLOCK_SIZE),)   # one program per BLOCK_SIZE elements
    vector_add_kernel[grid](a, b, c, N, BLOCK_SIZE)

Runtime: 1.34ms

CUDA

Note
  • blockIdx.x: Current block id. Similar to tl.program_id(0).
  • blockDim.x: Number of threads per block (the block size).
  • threadIdx.x: Thread id inside a block. Supplied by the CUDA runtime, so no arange is needed.
  • A[threadId] is the same as *(A + threadId) in the C programming language (see the sketch after this list).
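
The launch uses the same ceil-division and index arithmetic as the Triton version; a quick host-side sketch in plain Python, with a hypothetical N = 10 and 4 threads per block:

N = 10
threadsPerBlock = 4
blocksPerGrid = (N + threadsPerBlock - 1) // threadsPerBlock   # ceil(10 / 4) = 3

for blockIdx in range(blocksPerGrid):
    for threadIdx in range(threadsPerBlock):
        threadId = blockIdx * threadsPerBlock + threadIdx       # global element index
        if threadId >= N:
            continue                                             # indices 10 and 11 fall outside N
        # the kernel would do C[threadId] = A[threadId] + B[threadId] here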

Solution

#include <cuda_runtime.h>

__global__ void vector_add(const float* A, const float* B, float* C, int N) {
    int threadId = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (threadId >= N)                                      // guard the last partial block
        return;
    
    C[threadId] = A[threadId] + B[threadId];
}

// A, B, C are device pointers (i.e. pointers to memory on the GPU)
extern "C" void solve(const float* A, const float* B, float* C, int N) {
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    vector_add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
    cudaDeviceSynchronize();
}

Runtime: 1.18ms

C[threadId] = A[threadId]; C[threadId] += B[threadId]; is actually slower (1.24ms) than C[threadId] = A[threadId] + B[threadId]; because it performs two writes to HBM instead of one. Both variants are still faster than the PyTorch and Triton solutions!
