Different EDAs use different models for the probability distribution that
controls the sampling (see (Larrañaga, 2002; Larrañaga and Lozano, 2002)
for more information). For example, population-based incremental learning
(PBIL) (Baluja and Caruana, 1995) and the univariate marginal distribution
algorithm (UMDA) (Mühlenbein and Mahnig, 1999a,b) assume that
each variable is independent of the other variables. Consequently, these
algorithms need to store and adjust only a linear array of probabilities, one
for each variable. This works well for problems with weak interactions between
variables. Since no relationship between the variables is stored or learned,
however, PBIL and UMDA may have difficulties solving problems where the
interactions between variables are significant.
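To make the univariate case concrete, here is a minimal Python sketch in
the spirit of PBIL (the function name, parameter values and update rule are
illustrative, not taken from Baluja and Caruana's paper). It keeps a single
probability per variable, samples a population from them, and nudges each
probability towards the bits of the best sample:

    import random

    def pbil_sketch(fitness, n_vars, pop_size=50, lr=0.1, generations=100):
        # One independent probability per variable: P(x_i = 1).
        p = [0.5] * n_vars
        for _ in range(generations):
            # Sample a population from the univariate model.
            pop = [[1 if random.random() < p[i] else 0 for i in range(n_vars)]
                   for _ in range(pop_size)]
            # Pick the fittest sample of this generation.
            best = max(pop, key=fitness)
            # Shift each probability towards the best individual's bits.
            p = [(1 - lr) * p[i] + lr * best[i] for i in range(n_vars)]
        return p

    # Example: OneMax (count of ones), whose optimum is the all-ones string.
    print(pbil_sketch(fitness=sum, n_vars=5))

Note that the model is just the linear array p: no pairwise or higher-order
statistics are recorded, which is exactly why such algorithms struggle when
variables interact strongly.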
Naturally, higher order models are possible. For example, the MIMIC
algorithm of de Bonet, Isbell, and Viola (1997) uses second-order statistics.
It is also possible to use flexible models where interactions of different
orders are captured. The Bayesian optimisation algorithm (BOA)
(Pelikan, Goldberg, and Cantú-Paz, 1999) uses Bayesian networks to
represent generic sampling distributions, while the extended compact genetic
algorithm (eCGA) (Harik, 1999) clusters genes into groups where the genes
in each group are assumed to be linked but groups are considered
independent. The sampling distribution is then taken to be the product of the
distributions modelling the groups.
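The product-of-groups idea can be illustrated with a small Python sketch
(the partition into groups and the probability tables below are entirely
hypothetical, chosen only to show the mechanics, not taken from Harik's
eCGA): each linked group is sampled as a unit from its own joint table, and
because the groups are independent, the distribution over whole strings is
the product of the group tables.

    import random

    # Hypothetical partition of five genes into two linked groups.
    groups = [(0, 1, 2), (3, 4)]

    # One joint probability table per group (illustrative numbers only);
    # each table maps group configurations to probabilities summing to 1.
    tables = [
        {(0, 0, 0): 0.45, (1, 1, 1): 0.45, (0, 1, 0): 0.10},
        {(0, 0): 0.5, (1, 1): 0.5},
    ]

    def sample_individual():
        x = [0] * 5
        # Groups are independent, so sample each group as a unit;
        # the overall distribution is the product of the group tables.
        for genes, table in zip(groups, tables):
            configs = list(table.keys())
            weights = list(table.values())
            setting = random.choices(configs, weights=weights)[0]
            for gene, value in zip(genes, setting):
                x[gene] = value
        return x

    print(sample_individual())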
EDAs have been very successful. However, they are often unable to represent
both the overall structure of the distribution and its local details,
typically being more successful at the former. This is because EDAs
represent the sampling distribution using models with an inevitably limited
number of degrees of freedom. For example, suppose the optimal sampling
distribution has multiple peaks, corresponding to different local optima,
separated by large unfit areas. An EDA can then either represent only one
peak, or represent all of them together with the unfit areas between them. If
the EDA chooses the wrong local peak, it may get stuck and never find the
global optimum. Conversely, if it takes the wider view, it wastes many
trials sampling irrelevant, poor solutions.
Consider, for example, a scenario where there are five binary variables,
x_1, x_2, x_3, x_4 and x_5, and two promising regions: one near the string of all
zeros, i.e., (x_1, x_2, x_3, x_4, x_5) = (0, 0, 0, 0, 0), and the other near the string
of all ones, i.e., (x_1, x_2, x_3, x_4, x_5) = (1, 1, 1, 1, 1). One option for a (simple)
EDA is to focus on one of the two regions, e.g., setting each variable x_i
to 0 with high probability (say, 90%). This, however, fails to explore the
other region, and risks missing the global optimum. The other option is to
maintain both regions as possibilities by setting all the probabilities to 50%,
i.e., making each variable x_i as likely to be 0 as 1. These probabilities will
generate samples in both of the promising regions. For example, the strings
(0, 0, 0, 0, 0) and (1, 1, 1, 1, 1) will each be generated with a 3.125%
probability.
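The 3.125% figure follows directly from the independence assumption: with
every bit equally likely to be 0 or 1, each specific 5-bit string is sampled
with probability 0.5^5 = 1/32. A quick check in Python:

    from itertools import product

    # Under the univariate model with every P(x_i = 1) = 0.5, all
    # 2**5 = 32 strings are equally likely.
    p = 0.5 ** 5
    print(p)  # 0.03125, i.e. 3.125%, for (0,0,0,0,0) and (1,1,1,1,1) alike

    # Sanity check: the 32 equal probabilities sum to 1.
    print(sum(p for _ in product((0, 1), repeat=5)))  # 1.0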