Writing functions directly in Python and executing them on the GPU can remove bottlenecks while keeping the code short and simple. In this introduction, we show one way to use CUDA in Python and explain some basic principles of CUDA programming.

We choose to use the open-source package Numba. Numba is a just-in-time compiler for Python that, among other things, lets us write CUDA kernels. Numba is freely available at https://github.com/numba/numba.

Note that there are other packages, such as PyCUDA, that also allow launching CUDA kernels from Python.

Preliminaries

Let’s first define some vocabulary:

The host is the CPU and its memory.
The device is the GPU and its memory.
A kernel is a function that is executed on the device but launched from the host.

Now let’s check the versions used in this notebook:

import sys
import numba 
import numpy

print("Python version:", sys.version)
print("Numba version:", numba.__version__)
print("Numpy version:", numpy.__version__)
Python version: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0]
Numba version: 0.38.1+1.gc42707d0f.dirty
Numpy version: 1.14.5
!nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

Launching our first CUDA kernel

Numba’s cuda module interacts with Python through numpy arrays, so we have to import both numpy and the cuda module:

from numba import cuda
import numpy as np

Let’s start by writing a function that adds 0.5 to each cell of a (1D) array. To tell Python that a function is a CUDA kernel, simply add the @cuda.jit decorator before its definition. Below is our first CUDA kernel!

@cuda.jit
def cudakernel0(array):
    # Note: every thread that executes this kernel loops over the whole array
    for i in range(array.size):
        array[i] += 0.5

Note that inside the definition of a CUDA kernel, only a subset of the Python language is allowed. The whole list is available at: http://numba.pydata.org/numba-doc/latest/cuda/cudapysupported.html.

Now let’s launch our kernel!

array = np.array([0, 1], np.float32)
print('Initial array:', array)

print('Kernel launch: cudakernel0[1, 1](array)')
cudakernel0[1, 1](array)

print('Updated array:', array)
Initial array: [0. 1.]
Kernel launch: cudakernel0[1, 1](array)
Updated array: [0.5 1.5]

More about kernel launches

You can see that we simply launched the previous kernel using the command cudakernel0[1, 1](array). But what is the meaning of [1, 1] after the kernel name?

On the GPU, the computations are executed in separate blocks, and each of these blocks runs many threads.
The set of all the blocks is called a grid, so the grid size is equal to the number of blocks used during a computation.
Furthermore, each block runs the same number of threads; we call this number the block size.

To launch a kernel over multiple threads, we use the following syntax:

kernelname[gridsize, blocksize](arguments)

This is equivalent to:

kernelname[number_of_blocks, number_of_threads_per_block](arguments)

The total number of threads used by such a kernel launch is, by definition: \[ number\_of\_threads\_per\_block \times number\_of\_blocks = blocksize \times gridsize \]
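In practice, one usually fixes the block size and derives the grid size from the problem size, so that the total number of threads covers all the data. Below is a minimal sketch of this common pattern; the values of n and blocksize are arbitrary choices made for illustration, not values used in this notebook:

import math

n = 1000000                          # hypothetical number of array cells to process
blocksize = 256                      # threads per block, an arbitrary but common choice
gridsize = math.ceil(n / blocksize)  # enough blocks so that gridsize * blocksize >= n

print('Number of blocks:', gridsize)
print('Total number of threads:', gridsize * blocksize)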

Therefore, in the previous example, [1, 1] indicates that we launched the kernel with a single thread (one block of one thread). That’s not very clever, since the whole point of CUDA kernels is to run computations over a very large number of threads.

Let’s test what happens if we launch this kernel with more threads:

array = np.array([0, 1], np.float32)
print('Initial array:', array)

gridsize = 1024
blocksize = 1024
print("Grid size: {}, Block size: {}".format(gridsize, blocksize))

print("Total number of threads:", gridsize * blocksize)

print('Kernel launch: cudakernel0[gridsize, blocksize](array)')
cudakernel0[gridsize, blocksize](array)

print('Updated array:', array)
Initial array: [0. 1.]
Grid size: 1024, Block size: 1024
Total number of threads: 1048576
Kernel launch: cudakernel0[gridsize, blocksize](array)
Updated array: [56.5 56. ]

The result when using so many threads is quite weird. Let’s execute the cell above a few more times… The result is not always the same!

The above kernel is launched with a large number of threads, but every thread tries to add 0.5 to exactly the same array. Since the threads run in parallel, it is impossible to know exactly when each thread reads or writes a value in the array.
Consider two threads \(t_0\) and \(t_1\) among these 1048576 threads. If \(t_0\) and \(t_1\) read a cell at the same time, they both compute the same updated value, and the cell ends up incremented only once. However, if \(t_0\) writes its result just before \(t_1\) reads the cell, then \(t_1\) starts from the already-updated value, and the cell ends up incremented twice.

That’s not a problem specific to CUDA programming but to parallel programming in general: it is a race condition. We have to modify our function so that it works correctly with multiple threads.

Toward parallel programming

We are now going to write a kernel better adapted to parallel programming. One way to proceed is to assign each thread to the update of a single array cell, and therefore to use as many threads as the array has cells. For that, we will use the cuda.grid() function, which returns the absolute position of the current thread inside the whole grid.

@cuda.jit
def cudakernel1(array):
    thread_position = cuda.grid(1)
    array[thread_position] += 0.5

In this kernel, we passed the argument 1 to the cuda.grid() function. We will explain in Part 2 why we used this value.
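As a preview, for a one-dimensional grid, cuda.grid(1) gives the same result as computing the absolute index by hand from the thread and block indices. Here is an equivalent kernel; the name cudakernel1_explicit is ours, for illustration only:

@cuda.jit
def cudakernel1_explicit(array):
    # Absolute position = index of the thread within its block
    #                     + offset of the block within the grid
    thread_position = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    array[thread_position] += 0.5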

To update an array of size 2, we use 2 threads as follows:

array = np.array([0, 1], np.float32)
print('Initial array:', array)

print('Kernel launch: cudakernel1[1, 2](array)')
cudakernel1[1, 2](array)

print('Updated array:', array)
Initial array: [0. 1.]
Kernel launch: cudakernel1[1, 2](array)
Updated array: [0.5 1.5]

Note that if we launch this kernel with a grid of only one thread, only the first cell of the array will be updated:

array = np.array([0, 1], np.float32)
print('Initial array:', array)

print('Kernel launch: cudakernel1[1, 1](array)')
cudakernel1[1, 1](array)

print('Updated array:', array)
Initial array: [0. 1.]
Kernel launch: cudakernel1[1, 1](array)
Updated array: [0.5 1. ]
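Conversely, launching this kernel with more threads than the array has cells would make the extra threads read and write out of bounds. A common protective pattern, sketched below as our own illustration rather than part of this notebook’s code, is to guard the access with a bounds check so that the extra threads do nothing:

@cuda.jit
def cudakernel1_guarded(array):
    thread_position = cuda.grid(1)
    # Threads whose absolute position falls outside the array do nothing
    if thread_position < array.size:
        array[thread_position] += 0.5

With this guard, the kernel gives a correct result for any launch configuration whose total number of threads covers the whole array.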

Let’s test this kernel with an array of size \(2^{20}\). We therefore need a total of \(2^{20} = 1024 \times 1024\) threads.

array = np.zeros(1024 * 1024, np.float32)
print('Initial array:', array)

print('Kernel launch: cudakernel1[1024, 1024](array)')
cudakernel1[1024, 1024](array)

print('Updated array:', array)

# Since it is a huge array, let's check that the result is correct:
print('The result is correct:', np.all(array == np.zeros(1024 * 1024, np.float32) + 0.5))
Initial array: [0. 0. 0. ... 0. 0. 0.]
Kernel launch: cudakernel1[1024, 1024](array)
Updated array: [0.5 0.5 0.5 ... 0.5 0.5 0.5]
The result is correct: True

Note that there are some limitations on the number of threads, depending on the hardware. For recent GPUs, the maximum number of threads per block is usually 1024.
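You can query this limit on your own hardware through Numba’s device attributes; for example, assuming a CUDA device is available:

device = cuda.get_current_device()
print('Maximum number of threads per block:', device.MAX_THREADS_PER_BLOCK)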

This is the end of the first part of this introduction, but you can learn more about CUDA programming in Python by reading Part 2!