In the first part of this introduction, we saw how to launch a CUDA kernel in Python using the Open Source just-in-time compiler Numba. In this part, we will learn more about CUDA kernels.
Let’s get back to our latest kernel:
from numba import cuda import numpy as np @cuda.jit def cudakernel1(array): thread_position = cuda.grid(1) array[thread_position] += 0.5
We got the thread position using
cuda.grid() is a convenience function provided by Numba.
In CUDA, blocks and grids are actually three dimensional.
Each block has dimensions
(cuda.blockDim.x, cuda.blockDim.y, cuda.blockDim.z) and the grid has dimensions
(cuda.gridDim.x, cuda.gridDim.y, cuda.gridDim.z).
This means that each block has: \[number\_of\_threads\_per\_block = cuda.blockDim.x \times cuda.blockDim.y \times cuda.blockDim.z\]
And the grid has: \[number\_of\_blocks = cuda.gridDim.x \times cuda.gridDim.y \times cuda.gridDim.z.\]
The generic way to define the number of threads while launching a kernel is actually:
kernel_name[(griddimx, griddimy, griddimz), (blockdimx, blockdimy, blockdimz)](arguments)
Launching a kernel specifying only two integers like we did in Part 1, e.g. in
cudakernel1[1024, 1024](array), is equivalent to launching a kernel with y and z dimensions equal to 1, e.g.
cudakernel1[(1024, 1, 1), (1024, 1, 1)](array).
CUDA provides the following values to identify each thread:
cuda.threadIdx.x, cuda.threadIdx.y, cuda.threadIdx.zthat give the (x, y, z) positions of the current thread inside the current block,
cuda.blockIdx.x, cuda.blockIdx.y, cuda.blockIdx.zthat give the (x, y, z) positions of the current block inside the grid.
Therefore, the absolute position of a thread inside the grid is given by:
(cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x, cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y, cuda.blockIdx.z * cuda.blockDim.z + cuda.threadIdx.z)
The convenience function
cuda.grid(3) returns this whole tuple, whereas
cuda.grid(2) returns the tuple
(cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x, cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y) and
cuda.grid(1) returns the integer
cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x. The argument is the dimension in which we are working.
Therefore our previous kernel is equivalent to:
@cuda.jit def cudakernel1b(array): thread_position = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x array[thread_position] += 0.5
Timing comparison on polynomial evaluation
Let’s now write a kernel that requires a bit more computations. We are going to evaluate a polynomial function on each cell of an array.
On CPU with Numpy only
In Numpy, a polynomial can be evaluated over an array thanks to the function
polyval. Be careful that there are two functions
polyval in Numpy, one is
numpy.polyval and the other one
numpy.polynomial.polynomial.polyval, and they are not the same! In particular they don’t take the same arguments!
a = np.random.randn(2048 * 1024).astype(np.float32) # our array p = np.float32(range(1, 10)) # the coefficients of a polynomial in descending order c = p[::-1] # the coefficients of the same polynomial in ascending order # Let's evaluate the polynomial X**8 + 2 X**7 + 3 X**6 + ... + 8 X + 9 on each cell of the array: %timeit np.polyval(p, a) # coefficients first, in descending order, array next %timeit np.polynomial.polynomial.polyval(a, c) # array first, coefficients next, in ascending order # Let's check if the two functions provide the same result: print('Maximum absolute difference:', np.max(np.abs(np.polyval(p, a) - np.polynomial.polynomial.polyval(a, c)))) # Finally, let's check if we can improve the speed of the second function by using its argument 'tensor'. # Indeed we don't need to use tensors, so let's set it to False. %timeit np.polynomial.polynomial.polyval(a, c, tensor= False)
27.1 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 30.5 ms ± 2.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) Maximum absolute difference: 0.0 31.4 ms ± 4.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
When providing arrays of float32, the timings are equivalent, around 30ms per loop.
On GPU with CUDA
Below is a simple kernel to evaluate a polynomial over an array:
@cuda.jit def cuda_polyval(result, array, coeffs): # Evaluate a polynomial function over an array with Horner's method. # The coefficients are given in descending order. i = cuda.grid(1) # equivalent to i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x val = coeffs for coeff in coeffs[1:]: val = val * array[i] + coeff result[i] = val
Let’s test our function:
array = np.random.randn(2048 * 1024).astype(np.float32) coeffs = np.float32(range(1, 10)) result = np.empty_like(array) cuda_polyval[2048, 1024](result, array, coeffs) numpy_result = np.polyval(coeffs, array) print('Maximum relative error compared to numpy.polyval:', np.max(np.abs(numpy_result - result) / np.abs(numpy_result)))
Maximum relative error compared to numpy.polyval: 5.870621e-07
Ok, it works! Let’s time it:
%timeit cuda_polyval[2048, 1024](result, array, coeffs)
7.38 ms ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is approximately \(4\times\) faster than the Numpy version. That’s not really impressive since the Numpy version runs on only one CPU core. However, in this time is included the transfer of the data from the host memory to the device memory before the computation and the transfer from the device memory back to the host memory at the end of the computation.
Let’s now time the computation on the GPU only. We first transfer all the data to the device using the function
cuda.to_device and we next execute the kernel with the device data as arguments:
d_array = cuda.to_device(array) d_coeffs = cuda.to_device(coeffs) d_result = cuda.to_device(result)
%timeit cuda_polyval[2048, 1024](d_result, d_array, d_coeffs)
359 µs ± 143 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The computation is very fast: when we previously executed the kernel on the host arrays with
cuda_polyval[2048, 1024](result, array, coeffs), only 5% of the time was spent on kernel computation, the 95% remaining were data transfer between host and device and their API calls!
Let’s confirm this result by timing together the transfer of data to the device and the computation, but without including the transfer back of the data to the host. Since the computational time on the device is less than 0.5ms and the total time when executing on the host arrays is around 7.5ms, we expect the timing to be around 4ms.
%timeit \ d_array = cuda.to_device(array);\ d_coeffs = cuda.to_device(coeffs);\ d_result = cuda.to_device(result);\ cuda_polyval[2048, 1024](d_result, d_array, d_coeffs)
4.25 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Ok, everything looks fine!
On CPU with Numba
One difference between evaluating the kernel execution time on the device and on the host using Numpy is that Numpy does not preallocate memory for the result. To have a better comparison, let’s write an host function that uses a preallocated array. We are going to use
from numba import jit @jit def host_polyval(result, array, coeffs): for i in range(len(array)): val = coeffs for coeff in coeffs[1:]: val = val * array[i] + coeff result[i] = val
Let’s check that the result is correct:
host_polyval(result, array, coeffs) print('Maximum absolute difference with numpy.polyval:', np.max(np.abs(np.polyval(coeffs, array) - result)))
Maximum absolute difference with numpy.polyval: 0.0
Let’s now time it:
%timeit host_polyval(result, array, coeffs)
36.3 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
We saw that on the host our computation takes around 30ms, whereas on the device it takes around 0.4ms. That’s a huge difference!
However, we also saw that sending the data from the host to the device and from the device to the host is very slow. When we take these transfers into consideration, executing our computation on the device is only four times faster than executing on the host.
We should avoid as much as possible data transfers between host and device since they tend to cancel the advantages of using a CUDA device.
Learn more about CUDA in Python in Part 3 of this introduction!