Part I - GPGPU Basics


=__Part I - Basic Information on How it Works:__=

What is CUDA?
CUDA, which stands for "Compute Unified Device Architecture," is "codesigned hardware and software to expose the computational horsepower of NVIDIA GPUs for GPU computing." [Note: There are other GPUs with other CUDA-like languages, but my use of an NVIDIA GPU necessitates my use of CUDA.] The CUDA programming model is such that the "GPU is a compute device which:
 * serves as a coprocessor for the host CPU
 * has its own device memory on the card
 * executes many threads in parallel"

CUDA's Version of C:
In this GPGPU environment, there is interaction between the CPU (the host) and the GPU (the device). Both the host and the device have their own memory and their own operations to perform. The GPU executes the parallel parts of an application via kernels - only one kernel can run on the GPU at a time, but each kernel can execute on many threads, in fact, thousands of threads. Because of this parallel thread environment, the CUDA software development kit (SDK) contains a CUDA runtime application programming interface (API). The following is a relatively brief breakdown of the types of function calls one can make, but the description is by no means exhaustive. Many more functions exist, along with variations of the ones below - the CUDA Programming Guide is a great resource, and the CUDA Reference Manual is another.
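For orientation, here is a minimal sketch of what a kernel definition looks like; the helloWorld name and its argument are placeholders (the same names reappear in the launch example further down):

code format="c"
/* the __global__ qualifier marks a function that runs on the device
   but is launched from the host */
__global__ void helloWorld(int *deviceMemory)
{
    /* every thread executes this body; here each thread simply writes its
       own thread index into the array (a single block is assumed) */
    deviceMemory[threadIdx.x] = threadIdx.x;
}
code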

A kernel executes over a grid, which is formed from multiple blocks; each block contains a certain number of threads. For instance, to create a grid of numBlocks blocks (an int variable), where each block has numThreadsPerBlock threads (also an int), one would call the following pair of commands (dim3 is CUDA's three-component dimension type; unspecified components default to 1):

code format="c"
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
code

The objects blockIdx and blockDim refer to the block ID number and the number of threads per block, respectively. Blocks can be indexed in two dimensions - that is, with x- and y-components, denoted blockIdx.x and blockIdx.y. Threads are also assigned thread ID numbers, threadIdx, which can be up to three-dimensional; thus, one might use threadIdx.x, threadIdx.y, and threadIdx.z. When no component is explicitly stated, it is equivalent to using just the one-dimensional x-component.
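For example, a common idiom (sketched here with a hypothetical kernel) combines these built-in variables into one unique global index per thread:

code format="c"
__global__ void scaleArray(float *data, float factor, int n)
{
    /* block offset plus position within the block gives a unique global index */
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)            /* guard threads that fall past the end of the array */
        data[idx] *= factor;
}
code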

[Figure: cropped slides from NVIDIA's CUDA Programming Guide, graphically depicting the relationship between grids, blocks, and threads.]

Some functions address device-to-host and host-to-device operations:
 * cudaMemcpy: copies memory between host and device; the direction is given by a flag such as cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost

Some functions are essentially functional copies of their C counterparts, except that they act on the device instead of the host:
 * cudaMalloc: similar to standard C malloc
 * cudaFree: similar to C's free
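As a rough sketch of how these calls fit together (the array name and size below are hypothetical, and error checking is omitted), a host program might allocate, copy, and free device memory like this:

code format="c"
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 256;                       /* hypothetical array size */
    size_t size = n * sizeof(float);

    float *hostData = (float *)malloc(size); /* ordinary C allocation on the host */
    float *deviceData = NULL;
    cudaMalloc((void **)&deviceData, size);  /* allocation in the device's global memory */

    /* copy input to the device, run kernels on it, then copy results back */
    cudaMemcpy(deviceData, hostData, size, cudaMemcpyHostToDevice);
    /* ... kernel launches operating on deviceData would go here ... */
    cudaMemcpy(hostData, deviceData, size, cudaMemcpyDeviceToHost);

    cudaFree(deviceData);                    /* device counterpart of free */
    free(hostData);
    return 0;
}
code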

Specific functions address the fact that there are multiple blocks and multiple threads:
 * helloWorld<<<dimGrid, dimBlock>>>(deviceMemory): launches the kernel, with the execution configuration (grid and block dimensions) given between the triple angle brackets (the kernel name differs per application)
 * cudaThreadSynchronize: blocks the host until the device has completed all preceding work
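Putting the pieces together, a launch might look like the following sketch, assuming the dimGrid, dimBlock, deviceMemory, and helloWorld names from earlier:

code format="c"
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);

/* every block runs numThreadsPerBlock copies of the helloWorld kernel */
helloWorld<<<dimGrid, dimBlock>>>(deviceMemory);

/* block the host until the device has finished */
cudaThreadSynchronize();
code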

It is worthwhile to note that CUDA C actually supports some asynchronous concurrent execution, by calling a device function and returning control to the host before the device has completed its action. While the asynchronous functions are mostly limited to special memory calls, they do also include kernel launches. Because of these asynchronous capabilities, one must program the application so that it checks for errors after some set of asynchronous functions. The CUDA Programming Guide states, the "only way to check for asynchronous errors just after some asynchronous function call is therefore to synchronize just after the call by calling cudaThreadSynchronize (or by using any other synchronization mechanisms described in Section 3.2.6) and checking the error code returned by cudaThreadSynchronize." Thus, one might use cudaThreadSynchronize followed by a method that displays the last error, if any:

code format="c"
// block until the device has completed
cudaThreadSynchronize();

// check if kernel execution generated an error
checkCUDAError("kernel execution");
code

The supporting function is defined as follows:

code format="c"
/* requires <stdio.h> and <stdlib.h> */
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
        exit(-1);
    }
}
code

This brings us into the subject of errors. For dealing with CUDA C errors, there exist the following elements:
 * cudaError_t is a CUDA error object
 * cudaGetLastError returns the last error as a cudaError_t object (and resets the runtime error variable)
 * cudaGetErrorString(cudaError_t error) returns a human-readable string describing the error

Interestingly, CUDA C maintains a runtime error variable that is reset by the method cudaGetLastError. Thus, if the application calls cudaGetLastError just before an asynchronous function, the variable is reset, and a second call to cudaGetLastError after the thread synchronization narrows the error down to that section of code.
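A minimal sketch of that pattern might look like the following, where myKernel and its argument are hypothetical:

code format="c"
cudaGetLastError();                     /* reset the runtime error variable */

myKernel<<<dimGrid, dimBlock>>>(deviceData);   /* asynchronous launch */

cudaThreadSynchronize();                /* wait for the device to finish */
cudaError_t err = cudaGetLastError();   /* any error now belongs to this launch */
if (cudaSuccess != err)
    fprintf(stderr, "Cuda error: %s.\n", cudaGetErrorString(err));
code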

Professor Peter Kogge of the University of Notre Dame noted that NVCC does not handle exceptions well. One way around this is to put the CUDA kernel and its caller in a .cu file (that is, a CUDA code file), put the prototype of the caller in a normal C header file, and then use the caller in the rest of the project, as sketched below.
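A minimal sketch of that layout might look like the following (all file and function names are hypothetical):

code format="c"
/* kernel.cu -- compiled with nvcc; holds the kernel and its C-callable caller */
__global__ void helloWorld(int *deviceMemory)
{
    deviceMemory[threadIdx.x] = threadIdx.x;
}

extern "C" void launchHelloWorld(int *deviceMemory, int numBlocks, int numThreadsPerBlock)
{
    helloWorld<<<numBlocks, numThreadsPerBlock>>>(deviceMemory);
    cudaThreadSynchronize();   /* wait here so errors surface in the CUDA file */
}

/* launch.h -- a normal C header included by the rest of the project:
 *     void launchHelloWorld(int *deviceMemory, int numBlocks, int numThreadsPerBlock);
 * Ordinary .c files include launch.h and call launchHelloWorld without seeing
 * any CUDA syntax, so only kernel.cu has to be compiled by nvcc. */
code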

Additional libraries are also available for use with CUDA C. These include CUBLAS, the CUDA port of the BLAS (Basic Linear Algebra Subprograms) library, and CUFFT, the CUDA Fast Fourier Transform (FFT) library, which handles FFTs of up to three dimensions.
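As a small sketch of what using CUFFT can look like (the transform size and function name are hypothetical, and the signal is assumed to already reside in device memory):

code format="c"
#include <cufft.h>

#define NX 256    /* hypothetical transform size */

/* in-place forward FFT of NX complex points already resident in device memory */
void forwardFFT(cufftComplex *deviceSignal)
{
    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_C2C, 1);   /* one 1D complex-to-complex transform */
    cufftExecC2C(plan, deviceSignal, deviceSignal, CUFFT_FORWARD);
    cufftDestroy(plan);
}
code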

GPU Memory
The design and performance of a program depend upon using the different types of memory available in the most effective manner. Since shared, local, and global memory all reside on the GPU, it is important to know which can be used when. Global memory is accessible, as one might imagine, on the global scale: any thread can access it. The memory used by the threads within one block is called shared memory and is accessible only by the threads within that block. This limitation can actually benefit performance, since calls to global memory take longer. The trade-offs must be weighed during design: one determines what information threads really need to access often versus what needs to be accessed by all threads, and designs accordingly. Then there is also local memory, which belongs to each individual thread. While I did not focus beyond those types of memory, there are also two additional read-only memory spaces accessible by all threads: constant and texture memory.
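To make the shared-versus-global distinction concrete, here is a sketch of a kernel in which each block stages its portion of a global array in shared memory before reducing it (the fixed array size is an assumption; blocks are assumed to have at most 256 threads):

code format="c"
__global__ void sumBlock(const float *input, float *blockSums)
{
    /* one copy of this array exists per block, visible only to that block's threads */
    __shared__ float partial[256];

    /* each thread stages one element from (slow) global memory into (fast) shared memory */
    partial[threadIdx.x] = input[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();                         /* wait until every thread in the block has written */

    if (threadIdx.x == 0) {                  /* one thread per block adds up the staged values */
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; ++i)
            sum += partial[i];
        blockSums[blockIdx.x] = sum;         /* result goes back out to global memory */
    }
}
code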

