Hands-On GPU Programming with Python and CUDA
eBook - ePub

Explore high-performance parallel computing with CUDA

Dr. Brian Tuomanen

  • 310 pages
  • English
  • ePUB (mobile-friendly)

About this book

Build real-world applications with Python 2.7, CUDA 9, and CUDA 10. We suggest the use of Python 2.7 over Python 3.x, since Python 2.7 has stable support across all the libraries we use in this book.

Key Features

  • Expand your background in GPU programming—PyCUDA, scikit-cuda, and Nsight
  • Effectively use CUDA libraries such as cuBLAS, cuFFT, and cuSolver
  • Apply GPU programming to modern data science applications

Book Description

Hands-On GPU Programming with Python and CUDA hits the ground running: you'll start by learning how to apply Amdahl's Law, use a code profiler to identify bottlenecks in your Python code, and set up an appropriate GPU programming environment. You'll then see how to "query" the GPU's features and copy arrays of data to and from the GPU's own memory.

As you make your way through the book, you'll launch code directly onto the GPU and write full-blown GPU kernels and device functions in CUDA C. You'll get to grips with profiling GPU code effectively, and fully test and debug your code using the Nsight IDE. Next, you'll explore some of the more well-known NVIDIA libraries, such as cuFFT and cuBLAS.

With a solid background in place, you will now apply your new-found knowledge to develop your very own GPU-based deep neural network from scratch. You'll then explore advanced topics, such as warp shuffling, dynamic parallelism, and PTX assembly. In the final chapter, you'll see some topics and applications related to GPU programming that you may wish to pursue, including AI, graphics, and blockchain.

By the end of this book, you will be able to apply GPU programming to problems related to data science and high-performance computing.

What you will learn

  • Launch GPU code directly from Python
  • Write effective and efficient GPU kernels and device functions
  • Use libraries such as cuFFT, cuBLAS, and cuSolver
  • Debug and profile your code with Nsight and Visual Profiler
  • Apply GPU programming to data science problems
  • Build a GPU-based deep neural network from scratch
  • Explore advanced GPU hardware features, such as warp shuffling

Who this book is for

Hands-On GPU Programming with Python and CUDA is for developers and data scientists who want to learn the basics of effective GPU programming to improve performance using Python code. You should have an understanding of first-year college or university-level engineering mathematics and physics, and have some experience with Python as well as in any C-based programming language such as C, C++, Go, or Java.


Information

Year: 2018
ISBN: 9781788995221

Kernels, Threads, Blocks, and Grids

In this chapter, we'll see how to write effective CUDA kernels. In GPU programming, a kernel (which we interchangeably use with terms such as CUDA kernel or kernel function) is a parallel function that can be launched directly from the host (the CPU) onto the device (the GPU), while a device function is a function that can only be called from a kernel function or another device function. (Generally speaking, device functions look and act like normal serial C/C++ functions, only they are running on the GPU and are called in parallel from kernels.)
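To make this distinction concrete, here is a minimal sketch of our own (the function names are illustrative, not taken from the book) of a device function being called from a kernel; the SourceModule function used to compile it is covered shortly in this chapter:
import pycuda.autoinit
from pycuda.compiler import SourceModule

mod = SourceModule("""
// A device function: it runs on the GPU and can only be called from a kernel
// or from another device function; it reads like ordinary serial C code.
__device__ float scale_and_shift(float x, float a, float b)
{
    return a * x + b;
}

// A kernel: it is launched from the host and executed in parallel by many
// threads, each of which calls the device function on its own element.
__global__ void scale_and_shift_kernel(float *outvec, float *vec, float a, float b)
{
    int i = threadIdx.x;
    outvec[i] = scale_and_shift(vec[i], a, b);
}
""")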
We'll then get an understanding of how CUDA uses the notion of threads, blocks, and grids to abstract away some of the underlying technical details of the GPU (such as cores, warps, and streaming multiprocessors, which we'll cover later in this book), and how we can use these notions to ease the cognitive overhead in parallel programming. We'll learn about thread synchronization (both block-level and grid-level), and inter-thread communication in CUDA using both global and shared memory. Finally, we'll delve into the technical details of how to implement our own parallel prefix-type algorithms on the GPU (that is, the scan/reduce-type functions we covered in the last chapter), which will allow us to put all of the principles we learn in this chapter into practice.
The learning outcomes for this chapter are as follows:
  • Understanding the difference between a kernel and a device function
  • How to compile and launch a kernel in PyCUDA and use a device function within a kernel
  • Effectively using threads, blocks, and grids in the context of launching a kernel and how to use threadIdx and blockIdx within a kernel
  • How and why to synchronize threads within a kernel, using __syncthreads() to synchronize all of the threads within a single block, and returning to the host to synchronize all of the threads across an entire grid of blocks
  • How to use device global and shared memory for inter-thread communication
  • How to use all of our newly acquired knowledge about kernels to properly implement a GPU version of the parallel prefix sum

Technical requirements

A Linux or Windows 10 PC with a modern NVIDIA GPU (2016 onward) is required for this chapter, with all necessary GPU drivers and the CUDA Toolkit (9.0 onward) installed. A suitable Python 2.7 installation (such as Anaconda Python 2.7) with the PyCUDA module is also required.
This chapter's code is also available on GitHub at:
https://github.com/PacktPublishing/Hands-On-GPU-Programming-with-Python-and-CUDA
For more information about the prerequisites, check the Preface of this book; for the software and hardware requirements, check the README section in https://github.com/PacktPublishing/Hands-On-GPU-Programming-with-Python-and-CUDA.
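As a quick sanity check of your setup (a suggestion of ours rather than a step from the book), you can confirm from a Python prompt that PyCUDA can see your CUDA installation and your GPU:
import pycuda.autoinit                  # creates a CUDA context on the default GPU
import pycuda.driver as drv

print(drv.get_version())                # CUDA version PyCUDA was built against, e.g. (9, 0, 0)
print(pycuda.autoinit.device.name())    # name of the GPU that was detected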

Kernels

As in the last chapter, we'll be learning how to write CUDA kernel functions as inline CUDA C in our Python code and launch them onto our GPU using PyCUDA. In the last chapter, we used templates provided by PyCUDA to write kernels that fall into particular design patterns; in contrast, we'll now see how to write our own kernels from the ground up, so that we can write a versatile variety of kernels that may not fall into any design pattern covered by PyCUDA, and so that we can gain finer control over our kernels. Of course, these gains come at the cost of greater programming complexity: in particular, we'll have to understand threads, blocks, and grids and their role in kernels, how to synchronize the threads on which our kernel is executing, and how to exchange data among threads.
Let's start simple and try to re-create some of the element-wise operations we saw in the last chapter, but this time without using the ElementwiseKernel function; we'll now be using the SourceModule function. This is a very powerful function in PyCUDA that allows us to build a kernel from scratch, so as usual it's best to start simple.

The PyCUDA SourceModule function

We'll use the SourceModule function from PyCUDA to compile raw inline CUDA C code into usable kernels that we can launch from Python. We should note that SourceModule actually compiles code into a CUDA module; this is like a Python module or Windows DLL, only it contains a collection of compiled CUDA code. This means we'll have to "pull out" a reference to the kernel we want to use with PyCUDA's get_function before we can actually launch it. Let's start with a basic example of how to use a CUDA kernel with SourceModule.
As before, we'll start by writing one of the simplest kernel functions possible: one that multiplies a vector by a scalar. We'll start with the imports:
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda import gpuarray
from pycuda.compiler import SourceModule
Now we can immediately dive into writing our kernel:
ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
    int i = threadIdx.x;
    outvec[i] = scalar*vec[i];
}
""")
So, let's stop and contrast this with how it was done in ElementwiseKernel. First, when we declare a kernel function in CUDA C proper, we precede it with the __global__ keyword. This distinguishes the function as a kernel to the compiler. We'll always declare a kernel as a void function, because we'll always get our output values by passing a pointer to some empty chunk of memory as a parameter. We can declare the parameters as we would with any standard C function: first we have outvec, our scaled output vector, which is of course a floating-point array pointer. Next, we have scalar, which is represented with a mere float; notice that this is not a pointer! If we wish to pass simple singleton input values to our kernel, we can always do so without using pointers. Finally, we have our input vector, vec, which is of course another floating-point array pointer.
Singleton input parameters to a kernel function can be passed in directly from the host without using pointers or allocated device memory.
Let's peer into the kernel before we continue with testing it. We recall that ElementwiseKernel automatically parallelized over multiple GPU threads by a value, i, which was set for us by PyCUDA; the identification of each individual thread is given by the threadIdx value, which we retrieve as follows: int i = threadIdx.x;.
threadIdx is used to tell each individual thread its identity. This is usually used to determine an index for what values should be processed on the input and output data arrays. (This can also be used for assigning particular threads different tasks than others with standard C control flow statements such as if or switch.)
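As a small illustration of that last point (an example of ours, not the book's, reusing the SourceModule import from above), a kernel can branch on threadIdx so that different threads perform different tasks on the same data:
branch_ker = SourceModule("""
__global__ void split_work_kernel(float *outvec, float *vec)
{
    int i = threadIdx.x;
    if (i % 2 == 0)
        outvec[i] = vec[i] * vec[i];   // even-numbered threads square their element
    else
        outvec[i] = -vec[i];           // odd-numbered threads negate theirs
}
""")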
Now, we are ready to perform our scalar multiplication in parallel as before: outvec[i] = scalar*vec[i];.
Now, let's test this code: first, we must pull out a reference to our compiled kernel function from the CUDA module we just compiled with SourceModule. We can get this kernel reference with the module's get_function method as follows:
scalar_multiply_gpu = ker.get_function("scalar_multiply_kernel")
Now, we have to put some data on the GPU to actually test our kernel. Let's set up a floating-point array of 512 random values, and then copy these into an array in the GPU's global memory using the gpuarray.to_gpu function. (We're going to multiply this random vector by a scalar both on the GPU and on the host, so that we can check whether the results agree.)
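Here is a minimal sketch of how the test can be completed with the pieces we have so far (the scalar value 2.0 and the use of np.allclose are our choices; the block and grid launch parameters are explained later in this chapter):
vec = np.random.randn(512).astype(np.float32)   # 512 random single-precision values
vec_gpu = gpuarray.to_gpu(vec)                  # copy the input vector to device memory
out_gpu = gpuarray.empty_like(vec_gpu)          # allocate device memory for the output

# Launch one block of 512 threads, so each thread handles exactly one element;
# the scalar is passed directly from the host as a NumPy float32.
scalar_multiply_gpu(out_gpu, np.float32(2.0), vec_gpu,
                    block=(512, 1, 1), grid=(1, 1, 1))

# Compare the GPU result with the same multiplication done on the host.
print(np.allclose(out_gpu.get(), 2.0 * vec))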
