An introduction to CUDA in Python (Part 2)

In the first part of this introduction, we saw how to launch a CUDA kernel in Python using the Open Source just-in-time compiler Numba. In this part, we will learn more about CUDA kernels.

Thread Indexing

Let’s get back to our latest kernel:

from numba import cuda
import numpy as np

@cuda.jit
def cudakernel1(array):
    thread_position = cuda.grid(1)
    array[thread_position] += 0.5

We got the thread position using cuda.grid(1). cuda.grid() is a convenience function provided by Numba.

In CUDA, blocks and grids are actually three dimensional. Each block has dimensions (cuda.blockDim.x, cuda.blockDim.y, cuda.blockDim.z) and the grid has dimensions (cuda.gridDim.x, cuda.gridDim.y, cuda.gridDim.z).

This means that each block has: \[number\_of\_threads\_per\_block = cuda.blockDim.x \times cuda.blockDim.y \times cuda.blockDim.z\]

And the grid has: \[number\_of\_blocks = cuda.gridDim.x \times cuda.gridDim.y \times cuda.gridDim.z.\]

The generic way to define the number of threads while launching a kernel is actually:

kernel_name[(griddimx, griddimy, griddimz), (blockdimx, blockdimy, blockdimz)](arguments)

Launching a kernel specifying only two integers like we did in Part 1, e.g. in cudakernel1[1024, 1024](array), is equivalent to launching a kernel with y and z dimensions equal to 1, e.g. cudakernel1[(1024, 1, 1), (1024, 1, 1)](array).

CUDA provides the following values to identify each thread:

cuda.threadIdx.x, cuda.threadIdx.y, cuda.threadIdx.z that give the (x, y, z) positions of the current thread inside the current block,
cuda.blockIdx.x, cuda.blockIdx.y, cuda.blockIdx.z that give the (x, y, z) positions of the current block inside the grid.

Therefore, the absolute position of a thread inside the grid is given by:

(cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x, 
 cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y,
 cuda.blockIdx.z * cuda.blockDim.z + cuda.threadIdx.z)

The convenience function cuda.grid(3) returns this whole tuple, whereas cuda.grid(2) returns the tuple (cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x, cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y) and cuda.grid(1) returns the integer cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x. The argument is the dimension in which we are working.

Therefore our previous kernel is equivalent to:

@cuda.jit
def cudakernel1b(array):
    thread_position = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    array[thread_position] += 0.5

Timing comparison on polynomial evaluation

Let’s now write a kernel that requires a bit more computations. We are going to evaluate a polynomial function on each cell of an array.

On CPU with Numpy only

In Numpy, a polynomial can be evaluated over an array thanks to the function polyval. Be careful that there are two functions polyval in Numpy, one is numpy.polyval and the other one numpy.polynomial.polynomial.polyval, and they are not the same! In particular they don’t take the same arguments!

a = np.random.randn(2048 * 1024).astype(np.float32)  # our array
p = np.float32(range(1, 10))   # the coefficients of a polynomial in descending order
c = p[::-1] # the coefficients of the same polynomial in ascending order

# Let's evaluate the polynomial X**8 + 2 X**7 + 3 X**6 + ... + 8 X + 9 on each cell of the array:
%timeit np.polyval(p, a)  # coefficients first, in descending order, array next
%timeit np.polynomial.polynomial.polyval(a, c) # array first, coefficients next, in ascending order

# Let's check if the two functions provide the same result:
print('Maximum absolute difference:', np.max(np.abs(np.polyval(p, a) - np.polynomial.polynomial.polyval(a, c))))

# Finally, let's check if we can improve the speed of the second function by using its argument 'tensor'.
# Indeed we don't need to use tensors, so let's set it to False.
%timeit  np.polynomial.polynomial.polyval(a, c, tensor= False)

27.1 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
30.5 ms ± 2.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Maximum absolute difference: 0.0
31.4 ms ± 4.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

When providing arrays of float32, the timings are equivalent, around 30ms per loop.

On GPU with CUDA

Below is a simple kernel to evaluate a polynomial over an array:

@cuda.jit
def cuda_polyval(result, array, coeffs):
    # Evaluate a polynomial function over an array with Horner's method.
    # The coefficients are given in descending order.
    i = cuda.grid(1) # equivalent to i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    val = coeffs[0]
    for coeff in coeffs[1:]:
        val = val * array[i] + coeff
    result[i] = val

Let’s test our function:

array = np.random.randn(2048 * 1024).astype(np.float32)
coeffs = np.float32(range(1, 10))
result = np.empty_like(array)
cuda_polyval[2048, 1024](result, array, coeffs)
numpy_result = np.polyval(coeffs, array)
print('Maximum relative error compared to numpy.polyval:', np.max(np.abs(numpy_result - result) / np.abs(numpy_result)))

Maximum relative error compared to numpy.polyval: 5.870621e-07

Ok, it works! Let’s time it:

%timeit cuda_polyval[2048, 1024](result, array, coeffs)

7.38 ms ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This is approximately \(4\times\) faster than the Numpy version. That’s not really impressive since the Numpy version runs on only one CPU core. However, in this time is included the transfer of the data from the host memory to the device memory before the computation and the transfer from the device memory back to the host memory at the end of the computation.

Let’s now time the computation on the GPU only. We first transfer all the data to the device using the function cuda.to_device and we next execute the kernel with the device data as arguments:

d_array = cuda.to_device(array)
d_coeffs = cuda.to_device(coeffs)
d_result = cuda.to_device(result)

%timeit cuda_polyval[2048, 1024](d_result, d_array, d_coeffs)

359 µs ± 143 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The computation is very fast: when we previously executed the kernel on the host arrays with cuda_polyval[2048, 1024](result, array, coeffs), only 5% of the time was spent on kernel computation, the 95% remaining were data transfer between host and device and their API calls!

Let’s confirm this result by timing together the transfer of data to the device and the computation, but without including the transfer back of the data to the host. Since the computational time on the device is less than 0.5ms and the total time when executing on the host arrays is around 7.5ms, we expect the timing to be around 4ms.

%timeit \
d_array = cuda.to_device(array);\
d_coeffs = cuda.to_device(coeffs);\
d_result = cuda.to_device(result);\
cuda_polyval[2048, 1024](d_result, d_array, d_coeffs)

4.25 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Ok, everything looks fine!

On CPU with Numba

One difference between evaluating the kernel execution time on the device and on the host using Numpy is that Numpy does not preallocate memory for the result. To have a better comparison, let’s write an host function that uses a preallocated array. We are going to use numba.jit:

from numba import jit

@jit
def host_polyval(result, array, coeffs):
    for i in range(len(array)):
        val = coeffs[0]
        for coeff in coeffs[1:]:
            val = val * array[i] + coeff
        result[i] = val

Let’s check that the result is correct:

host_polyval(result, array, coeffs)
print('Maximum absolute difference with numpy.polyval:', np.max(np.abs(np.polyval(coeffs, array) - result)))

Maximum absolute difference with numpy.polyval: 0.0

Let’s now time it:

%timeit host_polyval(result, array, coeffs)

36.3 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Conclusion

We saw that on the host our computation takes around 30ms, whereas on the device it takes around 0.4ms. That’s a huge difference!

However, we also saw that sending the data from the host to the device and from the device to the host is very slow. When we take these transfers into consideration, executing our computation on the device is only four times faster than executing on the host.

We should avoid as much as possible data transfers between host and device since they tend to cancel the advantages of using a CUDA device.

Learn more about CUDA in Python in Part 3 of this introduction!