Part III - GPGPU Improvements

=__Part III - Potential Improvements:__= In the following sections, I discuss potential improvements to the code and ideas presented here. In particular, I discuss how genetic algorithms running on the GPU could be optimized, followed by a discussion of using genetic algorithms in combination with other evolutionary algorithms in order to take advantage of both GPU and CPU computing capabilities.

Theoretical Optimization of Code:
Ideally, as many threads as possible should be running at all times until a solution is reached. That entails dividing the work among grids and blocks appropriately: I would still have to determine how many blocks are optimal, along with how many threads per block. As I apply GPGPU to genetic algorithms, the whole population structure will be chosen carefully so that the search model takes advantage of as many threads as possible.
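
To make this concrete, here is a minimal sketch of the mapping I have in mind, assuming a toy sum-of-genes fitness function. The names (fitnessKernel, POP_SIZE, GENOME_LEN) and the block size of 256 are hypothetical choices for illustration, not values from the project code: one thread evaluates one individual, and the grid is rounded up so the whole population is covered.

```
#include <cuda_runtime.h>

#define POP_SIZE   8192   // hypothetical population size
#define GENOME_LEN 64     // hypothetical genome length

// One thread evaluates one individual; the fitness here is a toy sum of genes.
__global__ void fitnessKernel(const float *genomes, float *fitness)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= POP_SIZE) return;            // guard for the rounded-up last block

    float score = 0.0f;
    for (int g = 0; g < GENOME_LEN; ++g)
        score += genomes[i * GENOME_LEN + g];
    fitness[i] = score;
}

int main()
{
    float *dGenomes, *dFitness;
    cudaMalloc(&dGenomes, POP_SIZE * GENOME_LEN * sizeof(float));
    cudaMalloc(&dFitness, POP_SIZE * sizeof(float));
    cudaMemset(dGenomes, 0, POP_SIZE * GENOME_LEN * sizeof(float));

    // 256 threads per block is a common starting point; the right number
    // is exactly what I would still have to tune for my device.
    int threadsPerBlock = 256;
    int blocksPerGrid   = (POP_SIZE + threadsPerBlock - 1) / threadsPerBlock;
    fitnessKernel<<<blocksPerGrid, threadsPerBlock>>>(dGenomes, dFitness);
    cudaDeviceSynchronize();

    cudaFree(dGenomes);
    cudaFree(dFitness);
    return 0;
}
```

Choosing the population size as a multiple of the block size keeps every block fully occupied, which is the "as many threads as possible" goal described above.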

Parallel Processes and Genetic Algorithms:
In __Parallel Metaheuristics: A New Class of Algorithms__, Enrique Alba discusses at great length the types of parallel implementations of various metaheuristics and their pros and cons. His work is important to include here because he discusses possible parallel implementations of evolutionary algorithms and, in particular, genetic algorithms. While parallel computing most often refers to multi-core processing on a CPU, the same principles can be applied to the parallel-running threads of a GPU grid.

Alba says that in metaheuristics there must be a balance between "diversification" of a neighborhood - i.e., exploring a neighborhood - and "intensification" - i.e., exploiting accumulated search experience to focus on a particular part of the neighborhood. (Alba, p. 7) Thus, he argues that it is important to quickly identify regions in the search space that have high-quality solutions (exploration) and not waste time in other regions. He states, "a successful hybridization is obtained by the integration of single-point search algorithms [aka trajectory methods] in population-based ones." (Alba, p. 8)

Termination criteria of such a hybridization may include "maximum CPU time, a maximum number of iterations, a solution s of sufficient quality, or reaching the maximum number of iterations without improvement." While I expected to terminate at a maximum number of iterations or after finding a sufficient solution, I had never before thought to include the other termination conditions, even though they make sense if the search algorithm is simply taking too long - perhaps because it is stuck in a region of the solution space that is a local maximum of fitness, but not the global maximum.
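
Here is a small host-side sketch of what checking all four of Alba's criteria might look like in a GA driver loop. Every threshold and name below is a hypothetical placeholder, not a value from my code.

```
#include <cstdio>
#include <ctime>

// Hypothetical thresholds for the four termination criteria Alba lists.
bool shouldTerminate(double elapsedSec, long iter,
                     double bestFitness, long stagnantIters)
{
    const double MAX_SECONDS        = 60.0;    // maximum CPU time
    const long   MAX_ITERATIONS     = 10000;   // maximum number of iterations
    const double TARGET_FITNESS     = 0.99;    // solution of sufficient quality
    const long   MAX_STAGNANT_ITERS = 500;     // iterations without improvement

    return elapsedSec    >= MAX_SECONDS
        || iter          >= MAX_ITERATIONS
        || bestFitness   >= TARGET_FITNESS
        || stagnantIters >= MAX_STAGNANT_ITERS;
}

int main()
{
    clock_t start = clock();
    double best = 0.0;
    long stagnant = 0;

    for (long iter = 0; ; ++iter) {
        // ... run one GA generation here, updating best and stagnant ...
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (shouldTerminate(elapsed, iter, best, stagnant))
            break;
    }
    printf("search halted; best fitness so far = %f\n", best);
    return 0;
}
```

The stagnation check is what rescues the search from the local-maximum trap described above: if the best fitness stops improving, the loop gives up rather than spinning forever.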

Thus, not only is there great potential in using GPGPU for genetic algorithms, but also in hybridized evolutionary algorithms. Splitting the computationally-intensive work among GPU threads might yield a sizable increase in computational power compared to using a multi-core CPU alone.
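
As a rough sketch of what such a hybrid might look like in CUDA, the fragment below scores the whole population on the GPU and then lets the CPU run a single-point hill climb (a trajectory method, per Alba) on one individual between generations. The objective function and all names here are hypothetical stand-ins.

```
#include <cstdio>
#include <cuda_runtime.h>

#define POP 1024
#define LEN 32

// Toy objective, callable from both device (population scoring) and host
// (single-point local search): maximize -sum(x^2), so the best is the origin.
__host__ __device__ float fitness(const float *genome)
{
    float s = 0.0f;
    for (int g = 0; g < LEN; ++g) s += genome[g] * genome[g];
    return -s;
}

// GPU side: the population-based part, one thread per individual.
__global__ void scorePopulation(const float *genomes, float *scores)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < POP) scores[i] = fitness(&genomes[i * LEN]);
}

// CPU side: a single-point hill climb on one genome.
void hillClimb(float *genome, int steps)
{
    for (int s = 0; s < steps; ++s)
        for (int g = 0; g < LEN; ++g) {
            float old  = genome[g];
            float base = fitness(genome);
            genome[g] = old + 0.01f;                              // try up
            if (fitness(genome) <= base) genome[g] = old - 0.01f; // else down
            if (fitness(genome) <= base) genome[g] = old;         // else stay
        }
}

int main()
{
    float *dGenomes, *dScores;
    cudaMalloc(&dGenomes, POP * LEN * sizeof(float));
    cudaMalloc(&dScores,  POP * sizeof(float));
    cudaMemset(dGenomes, 0, POP * LEN * sizeof(float));

    scorePopulation<<<POP / 256, 256>>>(dGenomes, dScores);
    cudaDeviceSynchronize();

    // Copy back one individual (index 0 as a stand-in for the best) and
    // intensify around it on the CPU while the GPU could do other work.
    float best[LEN];
    cudaMemcpy(best, dGenomes, LEN * sizeof(float), cudaMemcpyDeviceToHost);
    hillClimb(best, 100);
    printf("hill-climbed fitness: %f\n", fitness(best));

    cudaFree(dGenomes);
    cudaFree(dScores);
    return 0;
}
```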

Better Hardware? / Higher Compute Capability:
One additional option is to use an even faster GPU, such as NVIDIA's Tesla. With the GPU I use, I am limited by what NVIDIA calls the "compute capability," which is tied directly to which GPU one is using. Currently, the version of capability I can use is 1.2, whereas the most advanced NVIDIA hardware runs at compute capability 2.0. One major advance in 2.0 is its greatly improved double-precision floating-point performance; my compute capability 1.2 device offers only single-precision arithmetic (double precision first appeared in compute capability 1.3). In addition, some devices of compute capability 2.0 can execute up to four kernels concurrently, whereas devices of compute capability 1.x can only run one kernel at a time.
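
To illustrate, here is a sketch of launching four kernels into separate CUDA streams; the kernel itself is a throwaway example of mine. On a compute capability 2.0 device the four launches become candidates for concurrent execution, while on my 1.2 device the driver simply runs them one after another.

```
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int N = 1 << 20;
    float *d[4];
    cudaStream_t stream[4];

    // Kernels launched into different streams may overlap on compute
    // capability 2.0 hardware; on 1.x they execute serially.
    for (int k = 0; k < 4; ++k) {
        cudaMalloc(&d[k], N * sizeof(float));
        cudaStreamCreate(&stream[k]);
        busyKernel<<<N / 256, 256, 0, stream[k]>>>(d[k], N);
    }

    cudaDeviceSynchronize();
    for (int k = 0; k < 4; ++k) {
        cudaStreamDestroy(stream[k]);
        cudaFree(d[k]);
    }
    return 0;
}
```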

Interestingly, I don't have to wait to program for compute capability 2.0. If I want to hit the ground running as soon as I get access to compute capability 2.0 hardware, then I need to write programs now that are compiled into what is called PTX, essentially a lower-level assembly language that can be compiled "just-in-time" for any device. As the programming guide states, "Any application that wants to run on future device architectures must load PTX, not binary code. This is because binary code is architecture-specific and therefore incompatible with future architectures, whereas PTX code is compiled to binary code at load time by the driver." Whereas the CUDA runtime's usual path compiles device code into an architecture-specific binary (basically a finished product), the CUDA driver API lets an application carry its device code as PTX and defer compilation until the driver loads it, so the application benefits from the latest compiler improvements. I see this as a major mechanism for preventing programmers from having to rewrite their applications for each newer device, and therefore for each major revision of compute capability. Forget backwards-compatibility. We have forward-compatibility.
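
As a sketch of the driver-API side of this, the fragment below loads a PTX file and lets the driver JIT-compile it for whatever device is present. "kernel.ptx" and "fitnessKernel" are hypothetical stand-ins; the PTX could come from nvcc's -ptx option, with the kernel declared extern "C" so its name is not mangled.

```
#include <cstdio>
#include <cuda.h>

int main()
{
    cuInit(0);

    CUdevice  dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // The driver compiles the PTX to device-specific binary at load time,
    // so the same file keeps working on future architectures.
    CUmodule mod;
    if (cuModuleLoad(&mod, "kernel.ptx") != CUDA_SUCCESS) {
        printf("could not load kernel.ptx\n");
        return 1;
    }

    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "fitnessKernel");

    // ... set up arguments and launch fn with cuLaunchKernel ...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```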
