Hands-On GPU Programming with Python and CUDA
eBook - ePub

Explore high-performance parallel computing with CUDA

Dr. Brian Tuomanen

  1. 310 pages
  2. English
  3. ePub

About the Book

Build real-world applications with Python 2.7, CUDA 9, and CUDA 10. We suggest the use of Python 2.7 over Python 3.x, since Python 2.7 has stable support across all the libraries we use in this book.

Key Features

  • Expand your background in GPU programming—PyCUDA, scikit-cuda, and Nsight
  • Effectively use CUDA libraries such as cuBLAS, cuFFT, and cuSolver
  • Apply GPU programming to modern data science applications

Book Description

Hands-On GPU Programming with Python and CUDA hits the ground running: you'll start by learning how to apply Amdahl's Law, use a code profiler to identify bottlenecks in your Python code, and set up an appropriate GPU programming environment. You'll then see how to "query" the GPU's features and copy arrays of data to and from the GPU's own memory.

As you make your way through the book, you'll launch code directly onto the GPU and write full-blown GPU kernels and device functions in CUDA C. You'll get to grips with profiling GPU code effectively, and fully test and debug your code using the Nsight IDE. Next, you'll explore some of the more well-known NVIDIA libraries, such as cuFFT and cuBLAS.

With a solid background in place, you will now apply your new-found knowledge to develop your very own GPU-based deep neural network from scratch. You'll then explore advanced topics, such as warp shuffling, dynamic parallelism, and PTX assembly. In the final chapter, you'll see some topics and applications related to GPU programming that you may wish to pursue, including AI, graphics, and blockchain.

By the end of this book, you will be able to apply GPU programming to problems related to data science and high-performance computing.

What you will learn

  • Launch GPU code directly from Python
  • Write effective and efficient GPU kernels and device functions
  • Use libraries such as cuFFT, cuBLAS, and cuSolver
  • Debug and profile your code with Nsight and Visual Profiler
  • Apply GPU programming to data science problems
  • Build a GPU-based deep neural network from scratch
  • Explore advanced GPU hardware features, such as warp shuffling

Who this book is for

Hands-On GPU Programming with Python and CUDA is for developers and data scientists who want to learn the basics of effective GPU programming to improve performance using Python code. You should have an understanding of first-year college or university-level engineering mathematics and physics, and have some experience with Python as well as with a C-based programming language such as C, C++, Go, or Java.


Information

Year: 2018
ISBN: 9781788995221
Edition: 1

Kernels, Threads, Blocks, and Grids

In this chapter, we'll see how to write effective CUDA kernels. In GPU programming, a kernel (a term we use interchangeably with CUDA kernel or kernel function) is a parallel function that can be launched directly from the host (the CPU) onto the device (the GPU), while a device function is a function that can only be called from a kernel function or another device function. (Generally speaking, device functions look and act like normal serial C/C++ functions, except that they run on the GPU and are called in parallel from kernels.)
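For concreteness, here is a minimal sketch of this distinction, not taken from the book's examples (the names scale and scale_kernel are illustrative), written in the same kind of inline CUDA C we'll compile with PyCUDA later in this chapter:
__device__ float scale(float x, float scalar)
{
    // Device function: callable only from a kernel or another device function.
    return scalar * x;
}

__global__ void scale_kernel(float *outvec, float scalar, float *vec)
{
    // Kernel: launchable directly from the host; each thread handles one element.
    int i = threadIdx.x;
    outvec[i] = scale(vec[i], scalar);
}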
We'll then get an understanding of how CUDA uses the notion of threads, blocks, and grids to abstract away some of the underlying technical details of the GPU (such as cores, warps, and streaming multiprocessors, which we'll cover later in this book), and how we can use these notions to ease the cognitive overhead of parallel programming. We'll learn about thread synchronization (both block-level and grid-level) and inter-thread communication in CUDA using both global and shared memory. Finally, we'll delve into the technical details of how to implement our own parallel prefix-type algorithms on the GPU (that is, the scan/reduce-type functions we covered in the last chapter), allowing us to put all of the principles we'll learn in this chapter into practice.
The learning outcomes for this chapter are as follows:
  • Understanding the difference between a kernel and a device function
  • How to compile and launch a kernel in PyCUDA and use a device function within a kernel
  • How to use threads, blocks, and grids effectively when launching a kernel, and how to use threadIdx and blockIdx within a kernel
  • How and why to synchronize threads within a kernel, using __syncthreads() to synchronize all threads within a single block and the host to synchronize all threads across an entire grid of blocks
  • How to use device global and shared memory for inter-thread communication
  • How to use all of our newly acquired knowledge about kernels to properly implement a GPU version of the parallel prefix sum

Technical requirements

A Linux or Windows 10 PC with a modern NVIDIA GPU (2016 onward) is required for this chapter, with all necessary GPU drivers and the CUDA Toolkit (9.0 onward) installed. A suitable Python 2.7 installation (such as Anaconda Python 2.7) with the PyCUDA module is also required.
This chapter's code is also available on GitHub at:
https://github.com/PacktPublishing/Hands-On-GPU-Programming-with-Python-and-CUDA
For more information about the prerequisites, check the Preface of this book; for the software and hardware requirements, check the README section in https://github.com/PacktPublishing/Hands-On-GPU-Programming-with-Python-and-CUDA.
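As a quick sanity check of the setup (a minimal sketch, not part of the book's code), you can have PyCUDA report the name and compute capability of the default CUDA device:
import pycuda.autoinit
import pycuda.driver as drv

# Print the name and compute capability of the first CUDA device.
dev = drv.Device(0)
print(dev.name())
print(dev.compute_capability())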

Kernels

As in the last chapter, we'll be learning how to write CUDA kernel functions as inline CUDA C in our Python code and launch them onto our GPU using PyCUDA. In the last chapter, we used templates provided by PyCUDA to write kernels that fall into particular design patterns; in contrast, we'll now see how to write our own kernels from the ground up, so that we can write a wide variety of kernels that may not fall into any design pattern covered by PyCUDA, and so that we can gain more fine-grained control over our kernels. Of course, these gains will come at the expense of greater complexity in programming; in particular, we'll have to understand threads, blocks, and grids and their role in kernels, how to synchronize the threads in which our kernel is executing, and how to exchange data among those threads.
Let's start simple and try to re-create some of the element-wise operations we saw in the last chapter, but this time without using the ElementwiseKernel function; we'll now be using the SourceModule function. This is a very powerful function in PyCUDA that allows us to build a kernel from scratch, so as usual it's best to start simple.

The PyCUDA SourceModule function

We'll use the SourceModule function from PyCUDA to compile raw inline CUDA C code into usable kernels that we can launch from Python. We should note that SourceModule actually compiles code into a CUDA module; this is like a Python module or Windows DLL, except that it contains a collection of compiled CUDA code. This means we'll have to "pull out" a reference to the kernel we want to use with PyCUDA's get_function before we can actually launch it. Let's start with a basic example of how to use a CUDA kernel with SourceModule.
As before, we'll begin by making one of the simplest kernel functions possible—one that multiplies a vector by a scalar. We'll start with the imports:
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda import gpuarray
from pycuda.compiler import SourceModule
Now we can immediately dive into writing our kernel:
ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
    int i = threadIdx.x;
    outvec[i] = scalar*vec[i];
}
""")
So, let's stop and contrast this with how it was done in ElementwiseKernel. First, when we declare a kernel function in CUDA C proper, we precede it with the __global__ keyword. This distinguishes the function as a kernel to the compiler. We'll always declare this as a void function, because we'll always get our output values by writing to some empty chunk of memory whose pointer we pass in as a parameter. We can declare the parameters as we would with any standard C function: first we have outvec, our output scaled vector, which is of course a floating-point array pointer. Next, we have scalar, which is represented with a mere float; notice that this is not a pointer! If we wish to pass simple singleton input values to our kernel, we can always do so without using pointers. Finally, we have our input vector, vec, which is of course another floating-point array pointer.
Singleton input parameters to a kernel function can be passed in directly from the host without using pointers or allocated device memory.
Let's peer into the kernel before we continue with testing it. We recall that ElementwiseKernel automatically parallelized over multiple GPU threads by a value, i, which was set for us by PyCUDA; the identification of each individual thread is given by the threadIdx value, which we retrieve as follows: int i = threadIdx.x;.
threadIdx is used to tell each individual thread its identity. This is usually used to determine an index for what values should be processed on the input and output data arrays. (This can also be used for assigning particular threads different tasks than others with standard C control flow statements such as if or switch.)
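As an illustrative aside (not one of the book's examples), a kernel could branch on threadIdx so that different threads perform different work:
__global__ void branching_kernel(float *outvec, float *vec)
{
    int i = threadIdx.x;
    if (i % 2 == 0)
        outvec[i] = 2.0f * vec[i];   // even-numbered threads double their element
    else
        outvec[i] = 0.0f;            // odd-numbered threads zero theirs out
}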
Now, we are ready to perform our scalar multiplication in parallel as before: outvec[i] = scalar*vec[i];.
Now, let's test this code: we must first pull out a reference to our compiled kernel function from the CUDA module we just compiled with SourceModule. We can get this kernel reference with the module's get_function method as follows:
scalar_multiply_gpu = ker.get_function("scalar_multiply_kernel")
Now, we have to put some data on the GPU to actually test our kernel. Let's set up a floating-point array of 512 random values, and then copy these into an array in the GPU's global memory using the gpuarray.to_gpu function. (We're going to multiply this random vector by a scalar both on the GPU and on the host with NumPy, so that we can check whether the results match.)
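A minimal sketch of how such a test might look follows (the variable names and the scalar value 2 are illustrative, not necessarily those used in the book): we copy the random vector to the GPU, allocate an empty output array there, launch the kernel over 512 threads in a single block, and check the result against the same multiplication done on the host with NumPy.
testvec = np.random.randn(512).astype(np.float32)
testvec_gpu = gpuarray.to_gpu(testvec)
outvec_gpu = gpuarray.empty_like(testvec_gpu)

# Launch with 512 threads in one block; the scalar must be passed as a NumPy float32.
scalar_multiply_gpu(outvec_gpu, np.float32(2), testvec_gpu, block=(512, 1, 1), grid=(1, 1, 1))

# Compare the GPU result with the host computation.
print('Does our kernel work correctly? : {}'.format(np.allclose(outvec_gpu.get(), 2 * testvec)))
Note that the scalar is passed directly as np.float32(2), without any pointer or device allocation, exactly as described above for singleton input parameters.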
