
12.2 Curve Fitting, Data Modelling and Symbolic Regression    115


            quantities are reported. Usually we know which variable we want to predict
            (e.g., the soft sensor value), and which other quantities we can use to make
            the prediction (e.g., the hard sensor values). If this is not known, then
            experimenters must decide which are going to be their dependent variables
            before applying GP. Sometimes, in practical situations, the data tables
            include hundreds or even thousands of variables. It is well known that
            in such cases the efficiency and effectiveness of any machine learning or
            program induction method, including GP, can drop dramatically, since most
            of the variables are typically redundant or irrelevant, forcing the system
            to waste considerable effort isolating the key features. To avoid this,
            it is necessary to perform some form of feature selection, i.e., we need to
            decide which independent variables to keep and which to leave out. There
            are many techniques to do this, but these are beyond the scope of this book.
            However, it is worth noting that GP itself can be used to do feature selection
            as shown in (Langdon and Buxton, 2004).
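The kind of feature selection mentioned above can take many forms; one of the simplest is a filter that keeps only the independent variables whose correlation with the dependent variable exceeds a threshold. The sketch below illustrates this idea only; the function and its parameters are our own, not from any system described in this book.

```python
# Minimal filter-style feature selection: keep the independent
# variables whose absolute Pearson correlation with the target
# exceeds a threshold. All names here are illustrative.
def select_features(rows, target, threshold=0.1):
    """rows: list of feature vectors; target: list of outputs.
    Returns the indices of the features worth keeping."""
    n = len(rows)
    n_feats = len(rows[0])
    mean_t = sum(target) / n
    keep = []
    for j in range(n_feats):
        col = [r[j] for r in rows]
        mean_x = sum(col) / n
        cov = sum((x - mean_x) * (t - mean_t) for x, t in zip(col, target))
        var_x = sum((x - mean_x) ** 2 for x in col)
        var_t = sum((t - mean_t) ** 2 for t in target)
        if var_x > 0 and var_t > 0:
            corr = cov / (var_x * var_t) ** 0.5
            if abs(corr) >= threshold:
                keep.append(j)
    return keep
```

Filters like this are cheap but ignore interactions between variables; wrapper methods (including the GP-based approach of Langdon and Buxton) can capture such interactions at a higher computational cost.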
               There are problems where more than one output (prediction) is required.
            For example, Table 12.1 shows a data set with four variables controlled
            during data collection (left) and six dependent variables (right). The data
            were collected for the purpose of solving an inverse kinematics problem in the
            Elvis robot (Langdon and Nordin, 2001). The robot is shown in Figure 12.1
            during the acquisition of a data sample. The roles of the independent and
            dependent variables are swapped when GP is given the task of controlling
            the arm given data from the robot’s eyes.
               There are several GP techniques which might be used to deal with ap-
            plications where multiple outputs are required: GP individuals including
            multiple trees (as in Figure 2.2, page 11), linear GP with multiple output
            registers (see Section 7.1), graph-based GP with multiple output nodes (see
            Section 7.2), a single GP tree with primitives operating on vectors, and so
            forth.
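The first of these options, individuals made of multiple trees, can be sketched as follows: each tree computes one dependent variable, and evaluating the individual means evaluating every tree on the same inputs. The tuple-based tree representation and operator set below are illustrative assumptions, not taken from any particular GP system.

```python
# Sketch of a multi-tree GP individual for multi-output regression:
# one expression tree per dependent variable.
FUNCTIONS = {'+': lambda a, b: a + b,
             '*': lambda a, b: a * b}

def eval_tree(tree, inputs):
    """A tree is ('x', i) for input variable i, ('c', v) for a
    constant v, or (op, left, right) for a function node."""
    tag = tree[0]
    if tag == 'x':
        return inputs[tree[1]]
    if tag == 'c':
        return tree[1]
    return FUNCTIONS[tag](eval_tree(tree[1], inputs),
                          eval_tree(tree[2], inputs))

def eval_individual(trees, inputs):
    # One output per tree, e.g. one motor command per tree
    # in an inverse-kinematics setting.
    return [eval_tree(t, inputs) for t in trees]
```

For example, the individual `[('+', ('x', 0), ('c', 1.0)), ('*', ('x', 0), ('x', 1))]` maps the inputs `[2.0, 3.0]` to the two outputs `[3.0, 6.0]`. Crossover and mutation can then act on one tree at a time or on whole individuals.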
               Once a suitable data set is available, its independent variables must all
            be represented in the primitive set. What other terminals and functions are
            included depends very much on the type of the data being processed (are
            they numeric? are they strings? etc.) and is often guided by the information
            available to the experimenter and the process that generated the data. If
            something is known (or strongly suspected) about the desired structure of
            the function to be evolved, it may be very beneficial to use this information
            (or to apply some constraints, like those discussed in Section 6.2). For
            example, if the data are known to be periodic, then the function set should
            probably include something like the sine function.
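Concretely, a primitive set for a numeric problem with suspected periodicity might look like the sketch below. The table layout and the protected-division convention are common choices rather than prescriptions; the specific names are our own.

```python
import math

# Illustrative primitive set for numeric symbolic regression.
# Protected division avoids runtime errors on a zero denominator.
def pdiv(a, b):
    return a / b if b != 0 else 1.0  # conventional protected value

# Each function entry records (arity, implementation).
FUNCTION_SET = {
    '+':   (2, lambda a, b: a + b),
    '-':   (2, lambda a, b: a - b),
    '*':   (2, lambda a, b: a * b),
    '/':   (2, pdiv),
    'sin': (1, math.sin),  # included because the data look periodic
}

# Terminals: all independent variables, plus at least one constant.
TERMINAL_SET = ['x0', 'x1', 1.0]
```

Note that every independent variable appears in the terminal set, while `sin` is added only because of the suspected periodic structure; with no such prior knowledge one would typically start from arithmetic primitives alone.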
               What is common to virtually all symbolic regression problems is that
            the fitness function must measure how close the outputs produced by each
            program are to the values of the dependent variables, when the correspond-
            ing values of the independent ones are used as inputs for the program. So,
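A minimal version of such a fitness measure, using the sum of absolute errors over the data set, can be sketched as follows; `run_program` stands in for whatever interpreter executes an evolved program, and the names are illustrative.

```python
# Sketch of a symbolic-regression fitness measure: the sum of
# absolute errors between a program's outputs and the recorded
# dependent-variable values over all rows of the data set.
def fitness(run_program, program, data):
    """data: list of (inputs, expected_output) pairs.
    Lower is better; 0 means a perfect fit on this data."""
    return sum(abs(run_program(program, inputs) - expected)
               for inputs, expected in data)
```

Squared errors are an equally common choice; either way, selection favours programs whose outputs track the dependent variable closely across the whole data set.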