In Part 4 of this introduction, we saw that the performance of our convolution kernel is limited by memory bandwidth. In this part, we will see how to improve performance by using shared memory.
In this part, we will learn how to profile a CUDA kernel using both nvprof and nvvp, the Visual Profiler. We will profile the convolution kernel from Part 3 and use the results to find out how to improve it.
This is the third part of an introduction to CUDA in Python. If you missed the beginning, you are welcome to go back to Part 1 or Part 2. In this part, we will write a convolution kernel to filter an image.
In the first part of this introduction, we saw how to launch a CUDA kernel in Python using the open-source just-in-time compiler Numba. In this part, we will learn more about CUDA kernels.
Writing functions directly in Python that run on the GPU may let you remove bottlenecks while keeping the code short and simple. In this introduction, we show one way to use CUDA in Python and explain some basic principles of CUDA programming.