Hands-On GPU Programming with Python and CUDA

By Dr. Brian Tuomanen

Explore high-performance parallel computing with CUDA

  1. 310 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Build real-world applications with Python 2.7, CUDA 9, and CUDA 10. We recommend Python 2.7 over Python 3.x, since Python 2.7 has stable support across all the libraries used in this book.

Key Features

  • Expand your background in GPU programming—PyCUDA, scikit-cuda, and Nsight
  • Effectively use CUDA libraries such as cuBLAS, cuFFT, and cuSolver
  • Apply GPU programming to modern data science applications

Book Description

Hands-On GPU Programming with Python and CUDA hits the ground running: you'll start by learning how to apply Amdahl's Law, use a code profiler to identify bottlenecks in your Python code, and set up an appropriate GPU programming environment. You'll then see how to "query" the GPU's features and copy arrays of data to and from the GPU's own memory.

As you make your way through the book, you'll launch code directly onto the GPU and write full-blown GPU kernels and device functions in CUDA C. You'll get to grips with profiling GPU code effectively and fully test and debug your code using the Nsight IDE. Next, you'll explore some of the more well-known NVIDIA libraries, such as cuFFT and cuBLAS.

With a solid background in place, you will now apply your new-found knowledge to develop your very own GPU-based deep neural network from scratch. You'll then explore advanced topics, such as warp shuffling, dynamic parallelism, and PTX assembly. In the final chapter, you'll see some topics and applications related to GPU programming that you may wish to pursue, including AI, graphics, and blockchain.

By the end of this book, you will be able to apply GPU programming to problems related to data science and high-performance computing.

What you will learn

  • Launch GPU code directly from Python
  • Write effective and efficient GPU kernels and device functions
  • Use libraries such as cuFFT, cuBLAS, and cuSolver
  • Debug and profile your code with Nsight and Visual Profiler
  • Apply GPU programming to data science problems
  • Build a GPU-based deep neural network from scratch
  • Explore advanced GPU hardware features, such as warp shuffling

Who this book is for

Hands-On GPU Programming with Python and CUDA is for developers and data scientists who want to learn the basics of effective GPU programming to improve performance using Python code. You should have an understanding of first-year college or university-level engineering mathematics and physics, and have some experience with Python as well as with a C-based programming language such as C, C++, Go, or Java.


Kernels, Threads, Blocks, and Grids

In this chapter, we'll see how to write effective CUDA kernels. In GPU programming, a kernel (which we interchangeably use with terms such as CUDA kernel or kernel function) is a parallel function that can be launched directly from the host (the CPU) onto the device (the GPU), while a device function is a function that can only be called from a kernel function or another device function. (Generally speaking, device functions look and act like normal serial C/C++ functions, only they are running on the GPU and are called in parallel from kernels.)
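To make the distinction concrete, here is a small illustrative sketch in CUDA C (the function names are our own, not from the text): a __device__ function that can only be called from GPU code, and a __global__ kernel that the host can launch and that calls the device function once per thread:
// device function: runs on the GPU, callable only from kernels or other device functions
__device__ float square(float x)
{
    return x * x;
}
// kernel: launchable from the host, executed in parallel by many threads
__global__ void square_kernel(float *out, float *in)
{
    int i = threadIdx.x;
    out[i] = square(in[i]);
}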
We'll then get an understanding of how CUDA uses the notions of threads, blocks, and grids to abstract away some of the underlying technical details of the GPU (such as cores, warps, and streaming multiprocessors, which we'll cover later in this book), and how we can use these notions to ease the cognitive overhead of parallel programming. We'll learn about thread synchronization (both block-level and grid-level) and inter-thread communication in CUDA using both global and shared memory. Finally, we'll delve into the technical details of how to implement our own parallel prefix-type algorithms on the GPU (that is, the scan/reduce-type functions we covered in the last chapter), which will allow us to put all of the principles we learn in this chapter into practice.
The learning outcomes for this chapter are as follows:
  • Understanding the difference between a kernel and a device function
  • How to compile and launch a kernel in PyCUDA and use a device function within a kernel
  • Effectively using threads, blocks, and grids in the context of launching a kernel and how to use threadIdx and blockIdx within a kernel
  • How and why to synchronize threads within a kernel, using __syncthreads() to synchronize all threads within a single block, and using the host to synchronize all threads across an entire grid of blocks
  • How to use device global and shared memory for inter-thread communication
  • How to use all of our newly acquired knowledge about kernels to properly implement a GPU version of the parallel prefix sum

Technical requirements

A Linux or Windows 10 PC with a modern NVIDIA GPU (2016 onward) is required for this chapter, with all necessary GPU drivers and the CUDA Toolkit (9.0 onward) installed. A suitable Python 2.7 installation (such as Anaconda Python 2.7) with the PyCUDA module is also required.
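As a quick sanity check of your setup (our own suggestion, not a step from the text), you can confirm that PyCUDA sees your GPU with a few lines of Python:
import pycuda.autoinit       # initializes CUDA and creates a context on the default device
import pycuda.driver as drv
print(drv.Device(0).name())  # should print your GPU's model name
print(drv.get_version())     # the CUDA version reported by the driver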
This chapter's code is also available on GitHub at:
https://github.com/PacktPublishing/Hands-On-GPU-Programming-with-Python-and-CUDA
For more information about the prerequisites, check the Preface of this book; for the software and hardware requirements, check the README section in https://github.com/PacktPublishing/Hands-On-GPU-Programming-with-Python-and-CUDA.

Kernels

As in the last chapter, we'll be learning how to write CUDA kernel functions as inline CUDA C in our Python code and launch them onto our GPU using PyCUDA. In the last chapter, we used templates provided by PyCUDA to write kernels that fall into particular design patterns; in contrast, we'll now see how to write our own kernels from the ground up, so that we can write a wide variety of kernels that may not fall into any design pattern covered by PyCUDA, and so that we can gain finer control over our kernels. Of course, these gains will come at the expense of greater complexity in programming: we'll especially have to understand threads, blocks, and grids and their role in kernels, how to synchronize the threads in which our kernel executes, and how to exchange data among threads.
Let's start simple and try to re-create some of the element-wise operations we saw in the last chapter, but this time without using the ElementwiseKernel function; we'll now be using the SourceModule function. This is a very powerful function in PyCUDA that allows us to build a kernel from scratch, so as usual it's best to start simple.

The PyCUDA SourceModule function

We'll use the SourceModule function from PyCUDA to compile raw inline CUDA C code into usable kernels that we can launch from Python. We should note that SourceModule actually compiles code into a CUDA module; this is like a Python module or Windows DLL, only it contains a collection of compiled CUDA code. Before we can actually launch a kernel, we'll therefore have to "pull out" a reference to it with the module's get_function method. Let's start with a basic example of how to use a CUDA kernel with SourceModule.
As before, we'll start with making one of the most simple kernel functions possible—one that multiplies a vector by a scalar. We'll start with the imports:
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda import gpuarray
from pycuda.compiler import SourceModule
Now we can immediately dive into writing our kernel:
ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
    int i = threadIdx.x;
    outvec[i] = scalar*vec[i];
}
""")
So, let's stop and contrast this with how it was done with ElementwiseKernel. First, when we declare a kernel function in CUDA C proper, we precede it with the __global__ keyword; this distinguishes the function as a kernel to the compiler. We'll always declare a kernel as a void function, because we'll always get our output values by passing a pointer to some empty chunk of memory as a parameter. We can declare the parameters as we would with any standard C function: first we have outvec, our output scaled vector, which is of course a floating-point array pointer. Next, we have scalar, which is represented with a mere float; notice that this is not a pointer! If we wish to pass simple singleton input values to our kernel, we can always do so without using pointers. Finally, we have our input vector, vec, which is another floating-point array pointer.
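For comparison, the same operation could have been expressed with the ElementwiseKernel approach from the last chapter along these lines (a sketch from memory; the exact code in that chapter may differ):
from pycuda.elementwise import ElementwiseKernel
scalar_multiply_ew = ElementwiseKernel(
        "float *outvec, float scalar, float *vec",  # argument list, as a C-style string
        "outvec[i] = scalar*vec[i]",                # per-element operation; i is supplied by PyCUDA
        "scalar_multiply_ew")
Notice that with ElementwiseKernel, PyCUDA writes the surrounding kernel boilerplate and supplies the index i for us, whereas with SourceModule we write the whole function ourselves.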
Singleton input parameters to a kernel function can be passed in directly from the host without using pointers or allocated device memory.
Let's peer into the kernel before we continue with testing it. Recall that ElementwiseKernel automatically parallelized over multiple GPU threads by way of a value, i, which was set for us by PyCUDA; here, the identity of each individual thread is given by the threadIdx value, which we retrieve as follows: int i = threadIdx.x;.
threadIdx is used to tell each individual thread its identity. This is usually used to determine an index into the input and output data arrays, so that each thread knows which values to process. (It can also be used to assign particular threads different tasks from others, using standard C control-flow statements such as if or switch.)
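For example (our own illustration, not from the text), a kernel body might branch on threadIdx so that even- and odd-numbered threads do different work:
if (threadIdx.x % 2 == 0)
    outvec[threadIdx.x] = scalar * vec[threadIdx.x];    // even-numbered threads scale by +scalar
else
    outvec[threadIdx.x] = -scalar * vec[threadIdx.x];   // odd-numbered threads scale by -scalar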
Now, we are ready to perform our scalar multiplication in parallel as before: outvec[i] = scalar*vec[i];.
Now, let's test this code: first, we must pull out a reference to our compiled kernel function from the CUDA module we just compiled with SourceModule. We can get this kernel reference with the module's get_function method, as follows:
scalar_multiply_gpu = ker.get_function("scalar_multiply_kernel")
Now, we have to put some data on the GPU to actually test our kernel. Let's set up a floating-point array of 512 random values, and then copy these into an array in the GPU's global memory using the gpuarray.to_gpu function. (We're going to multiply this random vector by a scalar both on the GPU and on the host, and then check whether the two results agree.)
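A minimal sketch of how this test might proceed (our own variable names; the book's actual listing may differ):
vec = np.random.randn(512).astype(np.float32)   # 512 random single-precision values
vec_gpu = gpuarray.to_gpu(vec)                  # copy the input to the GPU's global memory
outvec_gpu = gpuarray.empty_like(vec_gpu)       # allocate the output array on the GPU
scalar_multiply_gpu(outvec_gpu, np.float32(2), vec_gpu,
                    block=(512, 1, 1), grid=(1, 1, 1))
# compare against the same computation done on the host with NumPy
print("Does our kernel work correctly? : {}".format(np.allclose(outvec_gpu.get(), 2 * vec)))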

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. Dedication
  4. About Packt
  5. Contributors
  6. Preface
  7. Why GPU Programming?
  8. Setting Up Your GPU Programming Environment
  9. Getting Started with PyCUDA
  10. Kernels, Threads, Blocks, and Grids
  11. Streams, Events, Contexts, and Concurrency
  12. Debugging and Profiling Your CUDA Code
  13. Using the CUDA Libraries with Scikit-CUDA
  14. The CUDA Device Function Libraries and Thrust
  15. Implementation of a Deep Neural Network
  16. Working with Compiled GPU Code
  17. Performance Optimization in CUDA
  18. Where to Go from Here
  19. Assessment
  20. Other Books You May Enjoy
