12.2 Curve Fitting, Data Modelling and Symbolic Regression
quantities are reported. Usually we know which variable we want to predict
(e.g., the soft sensor value), and which other quantities we can use to make
the prediction (e.g., the hard sensor values). If this is not known, then experimenters must decide, before applying GP, which variables are to be treated as dependent. In practical situations the data tables sometimes include hundreds or even thousands of variables. It is well known that in these cases the efficiency and effectiveness of any machine learning or program induction method, including GP, can drop dramatically, since most of the variables are typically redundant or irrelevant. This forces the system to waste considerable effort isolating the key features. To avoid this,
it is necessary to perform some form of feature selection, i.e., we need to
decide which independent variables to keep and which to leave out. There
are many techniques to do this, but these are beyond the scope of this book.
However, it is worth noting that GP itself can be used to do feature selection
as shown by Langdon and Buxton (2004).
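To make the latter point concrete, the following Python sketch shows one simple way GP runs can inform feature selection: the input variables are ranked by how often they appear in the best-of-run trees from a number of independent runs, and variables that never (or rarely) appear become candidates for removal. This is only a minimal sketch, not the specific technique of Langdon and Buxton (2004); the nested-tuple tree encoding and the example trees are hypothetical.

    from collections import Counter

    def count_variable_usage(best_trees, variable_names):
        """Count how often each input variable appears in a set of evolved trees."""
        usage = Counter()

        def visit(node):
            if isinstance(node, tuple):        # internal node: (function_name, child, ...)
                for child in node[1:]:
                    visit(child)
            elif node in variable_names:       # leaf naming an input variable
                usage[node] += 1

        for tree in best_trees:
            visit(tree)
        return usage

    # Hypothetical best-of-run trees from several GP runs over inputs x0..x9.
    best_trees = [("add", ("mul", "x3", "x3"), "x7"),
                  ("sub", "x3", ("sin", "x1"))]
    variables = {f"x{i}" for i in range(10)}
    print(count_variable_usage(best_trees, variables).most_common())
    # -> [('x3', 3), ('x7', 1), ('x1', 1)]: x3 looks important; the unused inputs do not.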
There are problems where more than one output (prediction) is required.
For example, Table 12.1 shows a data set with four variables controlled
during data collection (left) and six dependent variables (right). The data
were collected for the purpose of solving an inverse kinematics problem in the
Elvis robot (Langdon and Nordin, 2001). The robot is shown in Figure 12.1
during the acquisition of a data sample. The roles of the independent and
dependent variables are swapped when GP is given the task of controlling the arm using data from the robot's eyes.
There are several GP techniques which might be used to deal with applications where multiple outputs are required: GP individuals including
multiple trees (as in Figure 2.2, page 11), linear GP with multiple output
registers (see Section 7.1), graph-based GP with multiple output nodes (see
Section 7.2), a single GP tree with primitives operating on vectors, and so
forth.
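As a minimal illustration of the first of these options, the sketch below represents a multi-output individual simply as a list of trees, one per output, all evaluated on the same inputs. It reuses the nested-tuple encoding from the earlier sketch; the function set and the example individual are hypothetical.

    import math

    FUNCTIONS = {"add": lambda a, b: a + b,
                 "sub": lambda a, b: a - b,
                 "mul": lambda a, b: a * b,
                 "sin": lambda a: math.sin(a)}

    def eval_tree(node, inputs):
        """Recursively evaluate one expression tree on a dictionary of input values."""
        if isinstance(node, tuple):            # internal node: apply the named function
            func = FUNCTIONS[node[0]]
            return func(*(eval_tree(child, inputs) for child in node[1:]))
        return inputs[node]                    # leaf: look up an independent variable

    def eval_individual(trees, inputs):
        """A multi-output individual: a list of trees, one per dependent variable."""
        return [eval_tree(tree, inputs) for tree in trees]

    # Hypothetical individual predicting two outputs from inputs x0 and x1.
    individual = [("add", "x0", ("sin", "x1")),   # tree for the first output
                  ("mul", "x0", "x1")]            # tree for the second output
    print(eval_individual(individual, {"x0": 2.0, "x1": 0.5}))   # -> [2.479..., 1.0]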
Once a suitable data set is available, its independent variables must all
be represented in the primitive set. What other terminals and functions are
included depends very much on the type of the data being processed (are
they numeric? are they strings? etc.) and is often guided by the information available to the experimenter about the process that generated the data. If
something is known (or strongly suspected) about the desired structure of
the function to be evolved, it may be very beneficial to use this information
(or to apply some constraints, like those discussed in Section 6.2). For
example, if the data are known to be periodic, then the function set should
probably include something like the sine function.
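For instance, the sketch below builds such a primitive set with the DEAP library's gp module, chosen here purely as one freely available GP implementation (not a system used elsewhere in this book). The variable names, the protected division convention and the decision to include sin are illustrative assumptions for a problem with two independent variables and apparently periodic data.

    import math
    import operator
    from deap import gp

    def protected_div(a, b):
        """Division returning 1.0 for near-zero denominators, a common GP convention."""
        return a / b if abs(b) > 1e-6 else 1.0

    pset = gp.PrimitiveSet("MAIN", 2)            # one argument per independent variable
    pset.addPrimitive(operator.add, 2)
    pset.addPrimitive(operator.sub, 2)
    pset.addPrimitive(operator.mul, 2)
    pset.addPrimitive(protected_div, 2)
    pset.addPrimitive(math.sin, 1)               # included because the data appear periodic
    pset.addTerminal(1.0)                        # a simple numeric constant
    pset.renameArguments(ARG0="x1", ARG1="x2")   # name the independent variables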
What is common to virtually all symbolic regression problems is that
the fitness function must measure how close the outputs produced by each
program are to the values of the dependent variables, when the corresponding values of the independent ones are used as inputs for the program. So,