Page 115 - 49A Field Guide to Genetic Programming
P. 115
11.3 Bloat 101
11.3 Bloat
Starting in the early 1990s, researchers began to notice that in addition to
progressively increasing their mean and best fitness, GP populations also
showed certain other dynamics. In particular, it was noted that very often
the average size (number of nodes) of the programs in a population, after a
certain number of generations in which it was largely static, at some point
would start growing at a rapid pace. Typically the increase in program size
was not accompanied by any corresponding increase in fitness. The origin
of this phenomenon, which is known as bloat, has effectively been a mystery
for over a decade.
Note that there are situations where one would expect to see program
growth as part of the process of solving a problem. For example, GP runs
typically start from populations of small random programs, and it may be
necessary for the programs to grow in complexity for them to be able to
comply with all the fitness cases (a situation which often arises in continuous
symbolic regression problems). So, we should not equate growth with bloat
and we should define bloat as program growth without (significant) return in
terms of fitness.
Bloat is not only surprising, it also has significant practical effects: large
programs are computationally expensive to evolve and later use, can be hard
to interpret, and may exhibit poor generalisation. For these reasons bloat
has been a subject of intense study in GP. Over the years, many theories
have been proposed to explain various aspects of bloat, and while great
strides have been made, we still lack a single, universally-accepted unifying
theory to explain the broad range of empirical observations. We review the
key theoretical results on bloat in Section 11.3.1.
While discussions on the causes of bloat were going on, practitioners have
still had to face the reality of combating bloat in their runs. Consequently,
a variety of effective practical techniques have been proposed to counteract
bloat. We review these in Section 11.3.2, where we will particularly focus on
the parsimony pressure method (Koza, 1992; Zhang and M¨uhlenbein, 1993,
1995; Zhang et al., 1997), which is perhaps the simplest and most frequently
used method to control bloat in genetic programming.
11.3.1 Bloat in Theory
As mentioned above, there are several theories of bloat. Let us start by
looking at three of the oldest ones: the replication accuracy theory, the
removal bias theory and the nature of program search spaces theory.