Today I want to better understand how GPUs work and how to code for them. We will barely scratch the surface of GPU programming! The optimizations one can do in terms of memory management go extremely deep. Still, you gotta start somewhere 🤷
CPU vs GPU
CPUs are optimized to minimize the latency1 of serial tasks. They are ideal for carrying out the main processing functions of a computer and tasks with little parallelization.
1 Time-to-complete.
2 The same program (sequence of operations) being applied to different data in parallel.
On the other hand, GPUs are optimized for data-parallel computing2. The main advantages of GPU over a CPU are:
High computational throughput: GPUs can do many more operations per second than CPUs3.
High memory bandwidth: GPUs can transfer information at a much faster rate than CPUs4.
3 An H100 GPU has ~30 TeraFLOPS (64-bit precision), while an AMD EPYC 9654 CPU (one of the best commercial CPUs) can theoretically only reach ~1.8 TeraFLOPS.
4 The H100 PCIe has ~2 TB/s peak memory bandwidth, while the AMD EPYC 9654 CPU has ~0.4 TB/s.
To better understand how this works, let’s take a look at each chip’s design:
5 Aka “ALUs”: Arithmetic Logic Units. In NVIDIA GPUs they are called “CUDA cores”.
In both GPUs and CPUs we find the same components. The main difference between both chips is the “space” allocated to each of them (see Figure 1):
Cores or ALUs: They are responsible for carrying out basic calculations like addition, multiplication, logic operations…
Control units: Orchestrate the operations of the chip by directing the flow of data between the chip and other components of the computer. Notice that CPUs have a control unit per core, whereas GPUs have one for many cores. GPU cores are all doing the same operation, so there is no need for a dedicated control unit for each of them.
Caches: Small, fast memories designed to store copies of frequently accessed data or instructions. The goal is to bridge the gap between lightning-fast compute units and much slower off-chip DRAM. Both chips employ a “multi-level cache hierarchy”: there exist different cache implementations of varying speed and size, located either at the core level or shared across cores, to ensure quick memory access6. They are usually implemented as SRAM (Static Random Access Memory).
DRAM: Where all program data is loaded. In NVIDIA GPUs this is often implemented as HBM (High-Bandwidth Memory), which achieves high parallel data transfer by stacking multiple DRAM dies and connecting them through a very wide interface.
Making better use of cache and DRAM is the main idea behind many modern ML model optimizations. E.g., “fused CUDA kernels” avoid materializing intermediate results in GPU DRAM (HBM) to overcome the data-copying bottleneck. A great example is FlashAttention.
6 Faster caches are more expensive and have higher energy requirements, thus they are smaller and “closer” to the core.
TLDR: Threads of the same process share memory; processes are independent.
Process: A process is an executing instance of a program, with its own private memory space, file descriptors, and system resources. Processes do not share address spaces; each maintains separate copies of code, data, and stacks, providing isolation and protection but at higher communication cost.
Thread: Smallest sequence of programmed instructions that can be managed independently by a scheduler, typically part of the operating system. Threads run within a process and share the process’s address space and resources, such as global variables and open files. Context switches between threads are lighter than process switches because threads share much of their environment.
GPU program
When talking about GPU programs, we often distinguish the “Host” (the CPU and its memory) and the “Device” (the GPU and its memory).
CUDA is C with a set of extensions that allow for “Heterogeneous Programming”7, implemented specifically for NVIDIA GPUs. In CUDA, the Host (CPU) is in control of the program; whenever we run into a part of the program that can be massively parallelized, we delegate it to the Device (GPU) and execute it there. Something to keep in mind is that data between the Host and the Device is transferred through the PCIe bus, which is usually slow (and thus a big bottleneck).
7 Combining CPU and GPU utilization.
CUDA programming model
Alright, so far we’ve seen that GPUs offer a massive amount of cores at our disposal. CUDA defines a “compute hierarchy” to easily organize them. Ordered by increasing complexity, we have:
Threads: Lowest-level execution unit of a program. When a CUDA function is launched, it runs as a parallel (and independent) set of threads, often in the order of thousands. Each thread is mapped to a single CUDA core in the GPU. They are executed in a SIMD fashion: Single-Instruction Multiple-Data8.
Blocks: Groups of threads9.
Grid: Group of blocks. One kernel launch is executed in one grid, which is mapped to the entire GPU.
8 CUDA people also call this SIMT: Single-Instruction Multiple-Thread
9 The limit should be around 1024 threads per block for modern GPUs
10 For instance, we can launch a block of 12 threads organized in a \(3 \times 4\) matrix as depicted in Figure 2
A block of threads can be organized into a 1D, 2D or 3D structure10. Analogously, a grid of blocks can also be organized in 1D, 2D or 3D structures.
This is a virtualization provided for coding convenience. For instance, if working with 2D data (such as images), it is useful for each thread to be associated with a pixel. In this case, it is nice to have a 2D coordinate system in place so each thread knows “where” it is and which data to access. See Tip 1 for a more detailed example. This will be much clearer after we see some actual code.
Say we have a \(720 \times 1024\) gray-scale image (stored as a 2D matrix) and we want to apply some gaussian blur to it (because we are bored, for instance).
We can then launch a 2D group of \(720 \times 1024\) threads, each of which maps to a single pixel. Naturally, we’ll associate the thread at position \((i, j)\) with the pixel at position \((i, j)\).
When running the CUDA function, each thread can query its coordinates within the block, making it very easy to access the desired pixel value and its surrounding ones to compute the blur.
This exemplifies how “pretending” that each thread has a location in a 2D grid can simplify code that works on 2D data. The same goes for vector operations on a 1D array or on 3D tensors, for instance.
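As a hedged sketch of this idea (with a simple 3×3 box average standing in for the Gaussian blur, and made-up names like `blur_kernel`), each thread recovers its 2D coordinates and processes one pixel:

```cuda
__global__ void blur_kernel(const float *in, float *out, int height, int width) {
    // Each thread computes its own 2D coordinates from block/thread indices
    int i = blockIdx.y * blockDim.y + threadIdx.y; // row
    int j = blockIdx.x * blockDim.x + threadIdx.x; // column
    if (i >= height || j >= width) return; // guard against out-of-range threads

    float sum = 0.0f;
    int count = 0;
    // Average the pixel and its valid neighbors (a 3x3 box blur)
    for (int di = -1; di <= 1; di++) {
        for (int dj = -1; dj <= 1; dj++) {
            int ni = i + di, nj = j + dj;
            if (ni >= 0 && ni < height && nj >= 0 && nj < width) {
                sum += in[ni * width + nj];
                count++;
            }
        }
    }
    out[i * width + j] = sum / count;
}
```

Note how the 2D indexing makes “my pixel and its neighbors” almost read like the math itself.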
Program example
Let’s see how all this really works through an example.
A pointer p holds the memory address of another variable. It “points” to where a variable can be found.
- We can access the variable’s value by doing `*p`
- We can access the memory address of a variable by doing `&var`
In code:
float var = 5.0; // var is a variable of type float
float *p = NULL; // p is a pointer (memory address) to a float type
// NOTE: p has type float* (memory address of type float)
// *p has type float
// It is good practice to initialize p as NULL so it doesn't point to a random address
// Stuff we can do:
printf("%p\n", &var); // Access (and print) the memory address of var
p = &var; // Set p to point to var's memory address
printf("%f\n", *p); // Prints 5.0, as p points to var's memory address now
printf("%p\n", p); // Prints the same as printf("%p\n", &var)
*p = 10.0; // Access the value at the memory address p and set it to 10.0
printf("%f\n", var); // Prints 10.0: var was modified through p

What about arrays? Pointers point to the first element of the array.
int arr[3] = {1, 2, 3}; // We have an array of 3 integers
int *r = arr; // Points to &arr[0] (memory address of first elem of array)
// r + 2 points to &arr[2]
// Thus:
printf("%d\n", *r); // prints 1
printf("%d\n", *(r + 1)); // prints 2
printf("%d\n", *(r + 2)); // prints 3
// Actually, in C, the expression arr[i] is already shorthand for *(arr + i)

We say that a function argument is passed:
- By reference, when we pass a pointer to the variable, instead of a copy of the variable.
- By value, when we pass a copy of the data.
Essentially, if we pass by reference, we will be working on the original variable instead of a local copy of it. Usually, when calling functions, we pass variables by reference (i.e. as pointers), unless they are small things such as integers.
In this example, we can see how to pass a pointer to a variable and modify its value:
#include <stdio.h>
void increment_ref(int *px) {
*px = *px + 1; // modify the original
}
int main(void) {
int a = 5;
increment_ref(&a); // Pass pointer to a
printf("%d\n", a); // prints 6
return 0;
}

How does it work for arrays?
void double_values(int arr[], int length) {
// int arr[] is actually the same as int* arr
for (int i = 0; i < length; i++) {
arr[i] = arr[i] * 2;
}
}
int main(void) {
int data[4] = { 1, 2, 3, 4 };
// Call the function, passing the array name 'data' and its size.
// 'data' automatically converts (aka 'decays') to a pointer to &data[0].
double_values(data, 4);
return 0;
}

Notice we could have also written the double_values function (less conveniently) as:
void double_values(int *p, int length) {
for (int i = 0; i < length; i++) {
*(p + i) = *(p + i) * 2;
}
}

Consider the following problem: we want a program that adds a constant value k to a given array of values arr in-place. In C, using a serial approach, this would look like:
void add_constant_function(int arr[], int array_len, int k) {
for (int i = 0; i < array_len; i++) {
arr[i] += k;
}
}

In CUDA, we can take advantage of the thread organization to ease the implementation11:
11 Notice this is an oversimplified example so we don’t get overwhelmed.
#include <stdio.h>
#include <cuda_runtime.h>
// GPU kernel: Function running in the GPU
__global__ void add_constant_kernel(int *arr, int k, int array_len) {
int i = threadIdx.x; // Get this thread's index within the (1D) block
arr[i] += k;
}
int main(void) {
const int array_len = 5;
const int k = 3;
int h_arr[array_len] = {1, 2, 3, 4, 5}; // in host (CPU RAM) memory
int *d_arr = NULL; // in device (GPU RAM) memory
// 1) Allocate and copy to device
cudaMalloc(&d_arr, array_len * sizeof(int));
cudaMemcpy(d_arr, h_arr, array_len * sizeof(int), cudaMemcpyHostToDevice);
// 2) Launch kernel with one block of array_len threads in a 1D-fashion
add_constant_kernel<<<1, array_len>>>(d_arr, k, array_len);
// 3) Copy result back to CPU RAM and clean up
cudaMemcpy(h_arr, d_arr, array_len * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_arr);
return 0;
}

Stuff worth mentioning:
- We precede host variables with `h_` and device variables with `d_` to keep track of where is what.
- We used the CUDA-specific functions:
  - `cudaMalloc`:
    - Reserves (aka allocates) an array of 5 ints in GPU memory (HBM)12.
    - Takes as input a pointer to the pointer `d_arr` 🤯. We are passing `d_arr` by reference so the function can modify (in-place) its content to point to the allocated address in GPU memory.
    - After calling `cudaMalloc`, `d_arr` holds a memory address of the GPU memory.
  - `cudaMemcpy`:
    - Copies data CPU -> GPU if given `cudaMemcpyHostToDevice`.
    - Copies data GPU -> CPU if given `cudaMemcpyDeviceToHost`.
- `add_constant_kernel<<<1, array_len>>>(d_arr, k, array_len);` launches the kernel with 1 block of array_len threads and passes the given arguments.
12 Same as malloc in standard C.
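The kernel above uses a single block, which caps the array length at the threads-per-block limit. A common pattern (sketched here with an arbitrary block size of 256) combines the block and thread indices to cover longer arrays:

```cuda
__global__ void add_constant_kernel(int *arr, int k, int array_len) {
    // Global index: which block we are in, times the block size,
    // plus our position within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < array_len) // Guard: the last block may have spare threads
        arr[i] += k;
}

// Launch: enough blocks of 256 threads to cover array_len elements
// int threads = 256;
// int blocks = (array_len + threads - 1) / threads; // ceiling division
// add_constant_kernel<<<blocks, threads>>>(d_arr, k, array_len);
```

The bounds check matters: the last block is launched whole, so some of its threads may fall past the end of the array.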
Design considerations
Before we wrap up, I’d like to go over a couple other hardware characteristics which are important to keep in mind when designing programs.
Threads are internally executed in groups of 3213 called warps.
More formally, a warp is the unit of scheduling on the Streaming Multiprocessor (SM):
13 In theory this depends on the GPU (it is exposed as warpSize), but in practice it is almost always 32.
- All 32 threads in a warp execute the same instruction at the same time (SIMT).
- If threads within a warp diverge (e.g., take different `if` branches), the SM serializes execution, which degrades performance.
Thus, when writing CUDA programs, we usually make the number of threads per block a multiple of 32, which ensures the most efficient utilization of threads14. However, one must watch per-block register and shared-memory usage so as not to exhaust them15.
14 If you were to make the block not a multiple of 32 threads, you’d be initializing warps with idle threads.
15 There are tools such as the CUDA Occupancy Calculator or Nsight which help debug this.
16 Usually, moving data around is slower than performing operations on it.
All this has another implication at the memory level. When processing information, data needs to be transferred from the slow-but-big global memory (HBM) to the fast-but-small cache (shared memory, SRAM). This cache memory is shared between all threads within a given block (even if they are in different warps), and this data transfer is the main bottleneck for most applications16.
It turns out that (because of the way this data transfer is implemented) reading and writing17 contiguous chunks of memory is much faster than transferring scattered elements. This is called coalesced memory access, and it is very important when designing kernels. We’ll always try to make sure threads within the same block:
17 Between global memory and shared memory.
- Access the same data from global memory, so the cost of bringing it into shared memory is amortized across threads.
- Access data that is coalesced (contiguous) in global memory, both when reading and when writing.
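To make this concrete, here is a sketch of two copy kernels (hypothetical names): in the first, consecutive threads of a warp read consecutive addresses, so their 32 loads coalesce into a few wide memory transactions; in the second, they read addresses a stride apart, which breaks coalescing:

```cuda
// Coalesced: thread i reads element i, so a warp touches one
// contiguous chunk of 32 floats per load instruction
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride, so the warp's 32 loads
// are scattered and may each require a separate memory transaction
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```

Same amount of useful data moved; very different effective bandwidth.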
In the next post, we’ll take on the journey of optimizing matrix multiplication with CUDA. Through this example, we’ll see the effect these design choices can have.
Conclusion
Today, we did this stuff:
- Contrasted CPU and GPU architectures
- Walked through the CUDA execution hierarchy (threads, blocks, grids)
- Saw a minimal example of offloading work to the device
- Noted some important hardware considerations when writing CUDA code.
In the following posts, we’ll go deeper, learning how to optimize and profile code execution.