Hands-On GPU Programming with Python and CUDA
eBook - ePub

Explore high-performance parallel computing with CUDA

Dr. Brian Tuomanen

  • 310 pages
  • English
  • ePUB (mobile-friendly)

About this book

Build real-world applications with Python 2.7, CUDA 9, and CUDA 10. We suggest the use of Python 2.7 over Python 3.x, since Python 2.7 has stable support across all the libraries we use in this book.

Key Features

  • Expand your background in GPU programming—PyCUDA, scikit-cuda, and Nsight
  • Effectively use CUDA libraries such as cuBLAS, cuFFT, and cuSolver
  • Apply GPU programming to modern data science applications

Book Description

Hands-On GPU Programming with Python and CUDA hits the ground running: you'll start by learning how to apply Amdahl's Law, use a code profiler to identify bottlenecks in your Python code, and set up an appropriate GPU programming environment. You'll then see how to "query" the GPU's features and copy arrays of data to and from the GPU's own memory.

As you make your way through the book, you'll launch code directly onto the GPU and write full-blown GPU kernels and device functions in CUDA C. You'll get to grips with profiling GPU code effectively, and fully test and debug your code using the Nsight IDE. Next, you'll explore some of the more well-known NVIDIA libraries, such as cuFFT and cuBLAS.

With a solid background in place, you will now apply your new-found knowledge to develop your very own GPU-based deep neural network from scratch. You'll then explore advanced topics, such as warp shuffling, dynamic parallelism, and PTX assembly. In the final chapter, you'll see some topics and applications related to GPU programming that you may wish to pursue, including AI, graphics, and blockchain.

By the end of this book, you will be able to apply GPU programming to problems related to data science and high-performance computing.

What you will learn

  • Launch GPU code directly from Python
  • Write effective and efficient GPU kernels and device functions
  • Use libraries such as cuFFT, cuBLAS, and cuSolver
  • Debug and profile your code with Nsight and Visual Profiler
  • Apply GPU programming to data science problems
  • Build a GPU-based deep neural network from scratch
  • Explore advanced GPU hardware features, such as warp shuffling

Who this book is for

Hands-On GPU Programming with Python and CUDA is for developers and data scientists who want to learn the basics of effective GPU programming to improve performance using Python code. You should have an understanding of first-year college or university-level engineering mathematics and physics, and have some experience with Python as well as in any C-based programming language such as C, C++, Go, or Java.


Information

Year: 2018
ISBN: 9781788995221

Kernels, Threads, Blocks, and Grids

In this chapter, we'll see how to write effective CUDA kernels. In GPU programming, a kernel (which we interchangeably use with terms such as CUDA kernel or kernel function) is a parallel function that can be launched directly from the host (the CPU) onto the device (the GPU), while a device function is a function that can only be called from a kernel function or another device function. (Generally speaking, device functions look and act like normal serial C/C++ functions, only they are running on the GPU and are called in parallel from kernels.)
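To make this distinction concrete, here is a minimal sketch of our own (the function names are illustrative, not taken from the book) of a device function being called from a kernel; the SourceModule function used to compile it is covered shortly in this chapter:
import pycuda.autoinit
from pycuda.compiler import SourceModule

mod = SourceModule("""
// A device function: it runs on the GPU and can only be called from a kernel
// or from another device function; it reads like ordinary serial C code.
__device__ float scale_and_shift(float x, float a, float b)
{
    return a * x + b;
}

// A kernel: it is launched from the host and executed in parallel by many
// threads, each of which calls the device function on its own element.
__global__ void scale_and_shift_kernel(float *outvec, float *vec, float a, float b)
{
    int i = threadIdx.x;
    outvec[i] = scale_and_shift(vec[i], a, b);
}
""")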
We'll then get an understanding of how CUDA uses the notion of threads, blocks, and grids to abstract away some of the underlying technical details of the GPU (such as cores, warps, and streaming multiprocessors, which we'll cover later in this book), and how we can use these notions to ease the cognitive overhead in parallel programming. We'll learn about thread synchronization (both block-level and grid-level), and inter-thread communication in CUDA using both global and shared memory. Finally, we'll delve into the technical details of how to implement our own parallel prefix-type algorithms on the GPU (that is, the scan/reduce-type functions we covered in the last chapter), which will allow us to put all of the principles we learn in this chapter into practice.
The learning outcomes for this chapter are as follows:
  • Understanding the difference between a kernel and a device function
  • How to compile and launch a kernel in PyCUDA and use a device function within a kernel
  • Effectively using threads, blocks, and grids in the context of launching a kernel and how to use threadIdx and blockIdx within a kernel
  • How and why to synchronize threads within a kernel, using __syncthreads() to synchronize all of the threads within a single block, and returning to the host to synchronize all of the threads across an entire grid of blocks
  • How to use device global and shared memory for inter-thread communication
  • How to use all of our newly acquired knowledge about kernels to properly implement a GPU version of the parallel prefix sum

Technical requirements

A Linux or Windows 10 PC with a modern NVIDIA GPU (2016 onward) is required for this chapter, with all necessary GPU drivers and the CUDA Toolkit (9.0 onward) installed. A suitable Python 2.7 installation (such as Anaconda Python 2.7) with the PyCUDA module is also required.
This chapter's code is also available on GitHub at:
https://github.com/PacktPublishing/Hands-On-GPU-Programming-with-Python-and-CUDA
For more information about the prerequisites, check the Preface of this book; for the software and hardware requirements, check the README section in https://github.com/PacktPublishing/Hands-On-GPU-Programming-with-Python-and-CUDA.
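As a quick sanity check of your setup (a suggestion of ours rather than a step from the book), you can confirm from a Python prompt that PyCUDA can see your CUDA installation and your GPU:
import pycuda.autoinit                  # creates a CUDA context on the default GPU
import pycuda.driver as drv

print(drv.get_version())                # CUDA version PyCUDA was built against, e.g. (9, 0, 0)
print(pycuda.autoinit.device.name())    # name of the GPU that was detected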

Kernels

As in the last chapter, we'll be learning how to write CUDA kernel functions as inline CUDA C in our Python code and launch them onto our GPU using PyCUDA. In the last chapter, we used templates provided by PyCUDA to write kernels that fall into particular design patterns; in contrast, we'll now see how to write our own kernels from the ground up, so that we can write a versatile variety of kernels that may not fall into any design pattern covered by PyCUDA, and so that we can gain finer control over our kernels. Of course, these gains come at the cost of greater programming complexity: in particular, we'll have to understand threads, blocks, and grids and their role in kernels, how to synchronize the threads on which our kernel is executing, and how to exchange data among threads.
Let's start simple and try to re-create some of the element-wise operations we saw in the last chapter, but this time without using the ElementwiseKernel function; we'll now be using the SourceModule function. This is a very powerful function in PyCUDA that allows us to build a kernel from scratch, so as usual it's best to start simple.

The PyCUDA SourceModule function

We'll use the SourceModule function from PyCUDA to compile raw inline CUDA C code into usable kernels that we can launch from Python. We should note that SourceModule actually compiles code into a CUDA module; this is like a Python module or Windows DLL, only it contains a collection of compiled CUDA code. This means we'll have to "pull out" a reference to the kernel we want to use with PyCUDA's get_function before we can actually launch it. Let's start with a basic example of how to use a CUDA kernel with SourceModule.
As before, we'll start by writing one of the simplest kernel functions possible: one that multiplies a vector by a scalar. We'll start with the imports:
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda import gpuarray
from pycuda.compiler import SourceModule
Now we can immediately dive into writing our kernel:
ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
    int i = threadIdx.x;
    outvec[i] = scalar*vec[i];
}
""")
So, let's stop and contrast this with how it was done in ElementwiseKernel. First, when we declare a kernel function in CUDA C proper, we precede it with the __global__ keyword. This distinguishes the function as a kernel to the compiler. We'll always declare a kernel as a void function, because we'll always get our output values by passing a pointer to some empty chunk of memory as a parameter. We can declare the parameters as we would with any standard C function: first we have outvec, our scaled output vector, which is of course a floating-point array pointer. Next, we have scalar, which is represented with a mere float; notice that this is not a pointer! If we wish to pass simple singleton input values to our kernel, we can always do so without using pointers. Finally, we have our input vector, vec, which is of course another floating-point array pointer.
Singleton input parameters to a kernel function can be passed in directly from the host without using pointers or allocated device memory.
Let's peer into the kernel before we continue with testing it. We recall that ElementwiseKernel automatically parallelized over multiple GPU threads by a value, i, which was set for us by PyCUDA; the identification of each individual thread is given by the threadIdx value, which we retrieve as follows: int i = threadIdx.x;.
threadIdx is used to tell each individual thread its identity. This is usually used to determine an index for what values should be processed on the input and output data arrays. (This can also be used for assigning particular threads different tasks than others with standard C control flow statements such as if or switch.)
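As a small illustration of that last point (an example of ours, not the book's, reusing the SourceModule import from above), a kernel can branch on threadIdx so that different threads perform different tasks on the same data:
branch_ker = SourceModule("""
__global__ void split_work_kernel(float *outvec, float *vec)
{
    int i = threadIdx.x;
    if (i % 2 == 0)
        outvec[i] = vec[i] * vec[i];   // even-numbered threads square their element
    else
        outvec[i] = -vec[i];           // odd-numbered threads negate theirs
}
""")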
Now, we are ready to perform our scalar multiplication in parallel as before: outvec[i] = scalar*vec[i];.
Now, let's test this code: first, we must pull out a reference to our compiled kernel function from the CUDA module we just compiled with SourceModule. We can get this kernel reference with the module's get_function method as follows:
scalar_multiply_gpu = ker.get_function("scalar_multiply_kernel")
Now, we have to put some data on the GPU to actually test our kernel. Let's set up a floating-point array of 512 random values, and then copy these into an array in the GPU's global memory using the gpuarray.to_gpu function. (We're going to multiply this random vector by a scalar both on the GPU and on the host, so that we can check whether the results agree.)
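Here is a minimal sketch of how the test can be completed with the pieces we have so far (the scalar value 2.0 and the use of np.allclose are our choices; the block and grid launch parameters are explained later in this chapter):
vec = np.random.randn(512).astype(np.float32)   # 512 random single-precision values
vec_gpu = gpuarray.to_gpu(vec)                  # copy the input vector to device memory
out_gpu = gpuarray.empty_like(vec_gpu)          # allocate device memory for the output

# Launch one block of 512 threads, so each thread handles exactly one element;
# the scalar is passed directly from the host as a NumPy float32.
scalar_multiply_gpu(out_gpu, np.float32(2.0), vec_gpu,
                    block=(512, 1, 1), grid=(1, 1, 1))

# Compare the GPU result with the same multiplication done on the host.
print(np.allclose(out_gpu.get(), 2.0 * vec))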
