This learning resource assumes that you understand what GPGPU is.
If not, find out more about GPGPU at se-edu's learning resource on GPGPU.
CUDA is a parallel computing platform and programming model from NVIDIA.[1]
It allows us to use a CUDA-enabled GPU for GPGPU. The list of GPUs that support CUDA can be found here: https://developer.nvidia.com/cuda-gpus.
Host
Refers to the CPU and its memory.
Host code
Code that runs on the CPU.
Device
Refers to the GPU and its memory.
Device code
Code that runs on the GPU.
CUDA GPUs have many parallel processors grouped into Streaming Multiprocessors (SMs). Each SM can run multiple concurrent thread blocks, which execute independently of each other.
CUDA C extends C by allowing the programmer to define C functions, called kernels, that are run by the GPU.
When called, kernels are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.[2]
CUDA executes kernels using a grid of blocks of threads.
Thread
An execution of a kernel with a given index. Each thread uses its index to access elements in an array, such that the collection of all threads cooperatively processes the entire data set.[3]
Block
A logical group of threads. The threads in a block can be arranged into 1D, 2D, or 3D arrays, and can be identified using 1D (x), 2D (x, y), or 3D (x, y, z) indexes, respectively.
Each block is assigned to and run by a single SM: a block will never be split up, and it will never be processed by more than one SM. Note, however, that the relationship is not one-to-one, as one SM can concurrently process more than one block. There is no synchronization between blocks, and they execute independently of each other.[3]
Grid
A logical group of blocks. The blocks in a grid can be arranged into 1D or 2D arrays, and can be identified using 1D (x) or 2D (x, y) indexes, respectively.
When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.[4]
A kernel is defined using the __global__ declaration specifier. This tells the CUDA C++ compiler that this is a function that runs on the GPU and can be called from CPU code.
The number of CUDA threads that execute that kernel for a given kernel call is specified using the <<<...>>> execution configuration.[2] A kernel called myKernel can be launched with the following syntax in C:
myKernel<<<numBlocks, numThreadsPerBlock>>>(args)
Definitions:
numBlocks refers to the number of thread blocks in the grid.
numThreadsPerBlock refers to the number of threads in each thread block.
Note that different GPUs work optimally with different dimension parameters.[5]
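For multi-dimensional blocks and grids, the execution configuration can also take CUDA's built-in dim3 type instead of plain integers. The following is a minimal sketch (the kernel name and dimensions are illustrative, not taken from the cited sources) that launches a 2D grid of 2D blocks and has each thread print its position using the built-in index variables described further below:

#include <stdio.h>

// Illustrative kernel: each thread prints the 2D index of its block within
// the grid, and its own 2D index within that block.
__global__ void ShowIndexes()
{
    printf("block (%d, %d), thread (%d, %d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
}

int main()
{
    dim3 numThreadsPerBlock(4, 4);  // each block is a 4 x 4 arrangement of threads
    dim3 numBlocks(2, 2);           // the grid is a 2 x 2 arrangement of blocks
    ShowIndexes<<<numBlocks, numThreadsPerBlock>>>();
    cudaDeviceSynchronize();        // wait for the GPU to finish before exiting
    return 0;
}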
Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
As an illustration, the following sample code adds two vectors A and B of size N and stores the result into vector results:[2]
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* results)
{
    int i = threadIdx.x;
    results[i] = A[i] + B[i];
}
int main()
{
    ...
    // Kernel invocation of 1D block with N threads
    VecAdd<<<1, N>>>(A, B, results);
    ...
}
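The "..." parts above omit the host-side setup. As a rough sketch of what a complete program might look like (this is not part of the cited example), the host typically allocates device memory with cudaMalloc, copies the inputs over with cudaMemcpy, launches the kernel, and copies the result back:

#include <stdio.h>

#define N 256

// Kernel definition (same as above)
__global__ void VecAdd(float* A, float* B, float* results)
{
    int i = threadIdx.x;
    results[i] = A[i] + B[i];
}

int main()
{
    // Host (CPU) data
    float A[N], B[N], results[N];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }

    // Allocate device (GPU) memory
    float *d_A, *d_B, *d_results;
    cudaMalloc(&d_A, N * sizeof(float));
    cudaMalloc(&d_B, N * sizeof(float));
    cudaMalloc(&d_results, N * sizeof(float));

    // Copy the inputs from host to device
    cudaMemcpy(d_A, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation of 1D block with N threads
    VecAdd<<<1, N>>>(d_A, d_B, d_results);

    // Copy the result from device back to host
    cudaMemcpy(results, d_results, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("results[10] = %.1f\n", results[10]);  // expected: 30.0

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_results);
    return 0;
}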
gridDim
Dimensions of the grid in blocks.
The first dimension is accessed with gridDim.x, and the second dimension with gridDim.y. gridDim.z is unused, as the blocks in a grid can only be arranged into 1D or 2D arrays.
blockDim
Dimensions of the block in threads.
The first dimension is accessed with blockDim.x, the second dimension with blockDim.y, and the third dimension with blockDim.z.
blockIdx
Block index within the grid. It is unique at the grid level.
threadIdx
Thread index within the block. It is unique only at the block level; threads in different blocks can have the same threadIdx.
For instance, the grid-level unique index of a thread in a 1D grid of 1D blocks of threads can be determined using
blockIdx.x * blockDim.x + threadIdx.x
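As a hedged illustration (a sketch extending the earlier example, not code from the cited sources), the VecAdd kernel can use this grid-level index so that the work is spread across many blocks, with a bounds check in case the total thread count exceeds the vector size:

// Each thread computes one element, identified by its grid-level index.
__global__ void VecAddMultiBlock(float* A, float* B, float* results, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard: the last block may contain surplus threads
        results[i] = A[i] + B[i];
}

// Launch with enough blocks to cover all n elements, for example:
//   int numThreadsPerBlock = 256;
//   int numBlocks = (n + numThreadsPerBlock - 1) / numThreadsPerBlock;  // round up
//   VecAddMultiBlock<<<numBlocks, numThreadsPerBlock>>>(d_A, d_B, d_results, n);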
Refer to the excellent tutorial by the Chief Technologist for GPU Computing Software at NVIDIA here: An Even Easier Introduction to CUDA.
Should you have trouble understanding any terms, you can refer to section 1.2 of this guide on CUDA terminologies.
PyCUDA lets you access NVIDIA's CUDA parallel computation API from Python.[6]
Note, however, that PyCUDA does not remove the need for CUDA C entirely. Kernels still have to be written in CUDA C, but PyCUDA provides helper interfaces that make writing CUDA code easier (e.g. by handling memory management and cleanup). Moreover, with the right helpers, you can accomplish many tasks without writing any C code at all. For instance, see this example from the Wiki that performs a 2D fast Fourier transform (FFT) without any CUDA C code.[7]
Thus, if you're not familiar with C, it may be easier to learn about CUDA through PyCUDA, especially if you have prior experience with Python.
Official installation instructions here: https://wiki.tiker.net/PyCuda/Installation
In PyCUDA's API, dtoh (as in memcpy_dtoh) means device to host, i.e. GPU to CPU, while htod (as in memcpy_htod) means host to device, i.e. CPU to GPU.
Official tutorial here: https://documen.tician.de/pycuda/tutorial.html
Note:
While still useful and relevant, this tutorial was originally written in 2008, and contains some outdated information.
Specifically, the point about how "most nVidia [sic] devices only support single precision" is incorrect. All NVIDIA GPUs since GT200 (released in Sep 2008), with compute capability 1.3 or higher, provide hardware support for double precision floating point values and operations.[8]
You can find out the compute capability of your NVIDIA card from the official website or this table from Wikipedia.
You can also find out more about the features and specifications of each compute capability version from the official documentation or these tables from Wikipedia.
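If you have the CUDA toolkit installed, you can also query the compute capability programmatically. A minimal sketch using the CUDA runtime API (compile with nvcc):

#include <stdio.h>

int main()
{
    // Query the properties of device 0 (the first GPU)
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // prop.major and prop.minor together form the compute capability, e.g. 6.1;
    // double precision is supported in hardware from compute capability 1.3 onwards.
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}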
OpenCL (Open Computing Language) is an open standard for cross-platform, parallel programming of diverse processors found in personal computers, servers, mobile devices and embedded platforms.[9]
First and foremost, the most distinctive difference is that CUDA is proprietary to NVIDIA, whereas OpenCL is an open standard maintained by the Khronos Group. In practice, this means that an NVIDIA GPU supports both CUDA and OpenCL, whereas a GPU from another vendor, such as AMD, supports only OpenCL.
Additionally, the two use different terminology. For instance, what NVIDIA refers to as a "thread", OpenCL refers to as a "work item".
This topic is widely covered. Here are some good references on what the differences are, and what the implications are when choosing one over the other:
[1]: https://www.nvidia.com/object/cuda_home_new.html
[2]: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#kernels
[3]: https://llpanorama.wordpress.com/2008/06/11/threads-and-blocks-and-grids-oh-my/
[4]: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#hardware-implementation
[5]: https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm
[6]: https://mathema.tician.de/software/pycuda/
[7]: https://stackoverflow.com/a/5957647
[8]: https://developer.nvidia.com/cuda-faq#Programming
[9]: https://www.khronos.org/opencl/