
92                         10 Fast and Distributed Genetic Programming


            that could be compiled for the GPU on the host PC. The compiled pro-
            grams were transferred one at a time to a GPU for fitness evaluation. Both
            groups obtained impressive speedups by running many test cases in parallel.
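The idea of evaluating a single evolved program on many test cases at once can be sketched as follows. This is an illustrative Python sketch, not any of the cited groups' actual GPU code: the toy program, the problem, and all names are assumptions, and the data-parallel GPU execution is simulated here with an ordinary `map`, where on a GPU each test case would run on its own thread.

```python
# Hypothetical sketch of data-parallel fitness evaluation: one
# evolved program is applied to every test case "in parallel"
# (simulated here with map; on a GPU each case would occupy its
# own thread).  The toy program and problem are illustrative.

def evolved_program(x):
    # Stands in for one compiled GP individual, here a slightly
    # wrong approximation of the target function x*x.
    return x * x + 1

def fitness(program, cases, targets):
    # Run every test case, then sum the absolute errors.
    outputs = list(map(program, cases))  # data-parallel step on a GPU
    return sum(abs(o - t) for o, t in zip(outputs, targets))

cases = [0, 1, 2, 3]
targets = [x * x for x in cases]  # target function: x squared
print(fitness(evolved_program, cases, targets))  # error of 1 per case -> 4
```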
               Langdon and Banzhaf (2008) and Langdon and Harrison (2008) created
            a SIMD interpreter (Juille and Pollack, 1996) using RapidMind’s GNU C++
            OpenGL framework to simultaneously run up to a quarter of a million GP
trees^5 on an NVIDIA GPU (see Figure 10.1). As discussed in Section 7.1.2,
            GP trees can be linearised. This avoids pointers and yields a very compact
            data structure; reducing the amount of memory needed in turn facilitates
            the use of large populations. To avoid recursive calls in the interpreter,
            Langdon used reverse polish notation (RPN), i.e., a post-fix rather than
            a pre-fix notation. Only small modifications are needed to crossover and
            mutation so that they act directly on the RPN expressions. This means the
            same representation is used on both the host and the GPU. Almost a billion
            GP primitives can be interpreted by a single graphics card per second. In
            both Cartesian and tree-based GP the genetic operations are done by the
host CPU. Wong, Wong, and Fok (2005) showed, for a genetic algorithm,
that these too can be done by the GPU.
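The linear, pointer-free representation described above can be sketched in a few lines. This is a minimal illustrative example, not Langdon's actual RapidMind/C++ interpreter: a GP tree is stored as a flat postfix (RPN) token list and evaluated with an explicit stack, so no recursion or tree pointers are needed.

```python
# Minimal postfix (RPN) interpreter sketch: the GP tree for
# (x + 1) * x is stored as a flat token list, avoiding pointers,
# and evaluated with an explicit stack rather than recursion.
# Token names and the function set are illustrative only.

def eval_rpn(program, x):
    """Evaluate a linearised postfix program for one input value x."""
    stack = []
    for token in program:
        if token == "x":
            stack.append(x)           # terminal: the input variable
        elif isinstance(token, (int, float)):
            stack.append(token)       # terminal: a constant
        elif token == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif token == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()                # final result is left on the stack

# (x + 1) * x written in postfix: x 1 + x *
program = ["x", 1, "+", "x", "*"]
print(eval_rpn(program, 3))  # (3 + 1) * 3 = 12
```

Because the program is just a flat list, crossover can act on it directly by swapping sublists that correspond to complete subexpressions, which is why only small changes to the genetic operators are needed.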
               Although each of the GPU’s processors may be individually quite fast
            and the manufacturers claim huge aggregate FLOPS ratings, the GPUs are
            optimised for graphics work. In practice, it is hard to keep all the processors
fully loaded. Nevertheless, 30 GFLOPS has been achieved (Langdon and
            Harrison, 2008). Given the differences in CPU and GPU architectures and
            clock speeds, often the speedup from using a GPU rather than the host
CPU is the most useful statistic. This is obviously determined by many
factors, including the relative importance of the amount of computation
and the size of the data. The measured RPN tree speedups were 7.6-fold (Langdon and
            Harrison, 2008) and 12.6-fold (Langdon and Banzhaf, 2008).

            10.4.3   GP on FPGAs

            Field programmable gate arrays (FPGAs) are chips which contain large ar-
            rays of simple logic processing units whose functionality and connectivity
            can be changed via software in microseconds by simply writing a configu-
            ration into a static memory. Once an FPGA is configured it can update
            all of its thousands of logic elements in parallel at the clock speed of the
            circuit. Although an FPGA’s clock speed is often an order of magnitude
            slower than that of a modern CPU, its massive parallelism makes it a very
powerful computational device. Because of this and their flexibility, there
            has been significant interest in using FPGAs in GP.
               Work has ranged from the use of FPGAs to speed up fitness evaluation

   ^5 Bigger populations, e.g. five million programs (Langdon and Harrison, 2008), are
possible by loading them onto the GPU in 256k units.