A Field Guide to Genetic Programming
12 Applications
goal is to find a function whose output has some desired property, e.g., the
function matches some target values (as in the example given in Section 4.1).
This is generally known as a symbolic regression problem.
Many people are familiar with the notion of regression. Regression means
finding the coefficients of a predefined function such that the function best
fits some data. A problem with regression analysis is that, if the fit is not
good, the experimenter has to keep trying different functions by hand until
a good model for the data is found. Not only is this laborious, but also
the results of the analysis depend very much on the skills and inventiveness
of the experimenter. Furthermore, even expert users tend to have strong
mental biases when choosing functions to fit. For example, in many application
areas there is a considerable tradition of using only linear or quadratic
models, even when the data might be better fit by a more complex model.
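The trial-and-error loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic (generated from 3x^2 + 2, a formula chosen purely for this example), and the names fit_line and sse are hypothetical helpers, not part of any library. The experimenter first commits to the form y = a*x + b, finds the fit poor, and then has to propose a second form, y = a*x^2 + b, by hand.

```python
def fit_line(us, ys):
    """Least-squares fit of y = a*u + b; returns the coefficients (a, b)."""
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    a = suy / suu
    return a, mean_y - a * mean_u


def sse(predict, xs, ys):
    """Sum of squared errors of a fitted model over the data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys))


# Synthetic data (an assumption for illustration only).
xs = [i / 10 for i in range(-10, 11)]
ys = [3 * x * x + 2 for x in xs]

# First guess at the model form: y = a*x + b.
a1, b1 = fit_line(xs, ys)
sse_linear = sse(lambda x: a1 * x + b1, xs, ys)

# Second guess, chosen by hand: y = a*x^2 + b (a line in u = x^2).
a2, b2 = fit_line([x * x for x in xs], ys)
sse_quadratic = sse(lambda x: a2 * x * x + b2, xs, ys)

print("linear form error:   ", sse_linear)
print("quadratic form error:", sse_quadratic)
```

In both cases only the coefficients are optimised. The structure of the function is supplied by the experimenter, which is exactly the step that symbolic regression tries to automate.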
Symbolic regression attempts to go beyond this. It consists of finding
a function that fits the given data points without making any assumptions
about the structure of that function. Since GP makes no such assumption,
it is well suited to this sort of discovery task. Symbolic regression was one
of the earliest applications of GP (Koza, 1992), and continues to be widely
studied (Cai, Pacheco-Vega, Sen, and Yang, 2006; Gustafson, Burke, and
Krasnogor, 2005; Keijzer, 2004; Lew, Spencer, Scarpa, Worden, Rutherford,
and Hemez, 2006).
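To make the idea concrete, here is a deliberately small sketch of an evolutionary search over expression trees. It is a simplification and not the GP system described in this book: it uses truncation selection and subtree mutation only, with no crossover. The function set, the terminal set, the parameter values, and the target data (sampled from x^2 + x + 1) are all assumptions made for the example.

```python
import operator
import random

FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
TERMINALS = ["x", 1.0]


def random_tree(rng, depth):
    """Grow a random expression tree of at most the given depth."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(list(FUNCTIONS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, x):
    """Evaluate a tree for one value of the input variable x."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree


def error(tree, points):
    """Fitness: sum of absolute errors over the data points (lower is better)."""
    return sum(abs(evaluate(tree, x) - y) for x, y in points)


def mutate(rng, tree):
    """Subtree mutation: replace one randomly chosen subtree with a new one."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, 2)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(rng, left), right)
    return (op, left, mutate(rng, right))


def to_string(tree):
    """Render a tree as an infix expression."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return f"({to_string(left)} {op} {to_string(right)})"
    return str(tree)


def evolve(points, generations=30, pop_size=50, seed=0):
    """Keep the best fifth of each generation and refill by mutation."""
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(pop_size)]
    history = []  # best error seen in each generation
    for _ in range(generations):
        population.sort(key=lambda t: error(t, points))
        history.append(error(population[0], points))
        survivors = population[: pop_size // 5]
        children = [mutate(rng, rng.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    best = min(population, key=lambda t: error(t, points))
    return best, history


# Assumed target data: 21 points sampled from x^2 + x + 1 on [-1, 1].
points = [(i / 10, (i / 10) ** 2 + i / 10 + 1) for i in range(-10, 11)]
best, history = evolve(points)
print(to_string(best), error(best, points))
```

Note that nothing in the search fixes the form of the answer in advance: both the structure of the expression and its constants emerge from the search. Because the survivors are copied unchanged into the next generation, the best error never gets worse from one generation to the next.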
The steps necessary to solve symbolic regression problems include the five
preparatory steps mentioned in Chapter 2. We practiced them in the example
in Chapter 4, which was an instance of a symbolic regression problem.
There is an important difference here, however: the data points provided in
Chapter 4 were computed using a simple formula, while in most realistic situations
each point represents the measured values taken by some variables
at a certain time in some dynamic process, in a repetition of an experiment,
and so on. So, the collection of an appropriate set of data points for symbolic
regression is an important and sometimes complex task.
For instance, consider the case of using GP to evolve a soft sensor (Jordaan,
Kordon, Chiang, and Smits, 2004). The intent is to evolve a function
that will provide a reasonable estimate of what a sensor (in an industrial
production facility) would report, based on data from other actual sensors
in the system. This is typically done in cases where placing an actual sensor
in that location would be difficult or expensive. However, it is necessary to
place at least one instance of such a sensor in a working system in order to
collect the data needed to train and test the GP system. Once the sensor
is placed, one would collect the values reported by that sensor and by all
the other real sensors that are available to the evolved function, at various
times, covering the various conditions under which the evolved system will
be expected to act.
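The data-collection step just described might be organised as in the following sketch. Everything specific here is an assumption made for illustration: the column names (temp_in, flow, target), the synthetic readings, and the 25% test fraction are invented, and build_cases and split_cases are hypothetical helpers. Each logged row holds the readings of the permanently installed sensors together with the reading of the temporarily installed sensor at the same time. The rows are turned into (inputs, target) fitness cases and divided into a training set and a test set.

```python
import random


def build_cases(rows, input_names, target_name):
    """Turn logged rows into (inputs, target) fitness cases."""
    return [([row[name] for name in input_names], row[target_name])
            for row in rows]


def split_cases(cases, test_fraction=0.25, seed=0):
    """Shuffle the cases and hold out a fraction of them for testing."""
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic log, one row per sampling time (all values are made up).
rows = [
    {
        "time": t,
        "temp_in": 300.0 + 2.0 * t,   # a permanently installed sensor
        "flow": 5.0 - 0.1 * t,        # another permanently installed sensor
        "target": 0.5 * (300.0 + 2.0 * t) - 3.0 * (5.0 - 0.1 * t),
        # "target" stands in for the temporarily installed sensor
    }
    for t in range(20)
]

cases = build_cases(rows, ["temp_in", "flow"], "target")
train, test = split_cases(cases)
print(len(train), "training cases,", len(test), "test cases")
```

The evolved function would see only the input columns. The target column is used to compute fitness during training and to check the result on the held-out cases.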
Such experimental data typically come in large tables where numerous