#include <src/TwoLayerPerceptron.h>
It uses a custom transfer function I devised for this project (though of course I was not the first to think of it, see below): f(x) = x / (1 + abs(x)). This function is sigmoid-like, bounded in -1 / +1, continuous with a continuous first derivative, and much faster to compute than tanh. It is also unfortunately slower to converge, see the note below.
Many neural network packages use tanh, to the point that it has become the standard function to use. It has the very nice property that tanh' = 1 - tanh^2. Thus, backpropagation can be made faster by reusing previous computations for the derivative, and since very often the goal is to train the network as well as possible, this is a very desirable property. Others use the sigmoid function s(x) = 1 / (1+exp(-x)), where s' = s*(1-s), for the same reason.
However, in this project, the neural network is used primarily in forward mode, and backpropagation/explicit training is only used to provide a starting point for the genetic algorithm to work on.
Thus, the most important point is to have the fastest possible transfer function: we are going to apply it a lot, nhidden+noutput times per agent, and for many agents! On the other hand, it doesn't matter much whether some computations can be reused in backpropagation: the learning phase is done only once, and is fast enough as it is with the function above anyway.
As a matter of fact, with the function above, we CAN reuse the results in exactly the same way as for tanh: f'(x) = 1 / (1 + abs(x))^2 = (1 - abs(f))^2.
Note: It seems this function was first published by David Elliott, in "A Better Activation Function for Artificial Neural Networks". See http://www.dontveter.com/bpr/activate.html for a discussion of its merits and drawbacks relative to other activation functions. In short, this function converges to the data much more slowly than tanh. But numerical experiments in this project have shown that it is close to what the sigmoid does for our purpose. And since, once again, all we want here is a good starting point for the genetic algorithm, we don't care much.
The training is a very simple on-line gradient descent. All you have to do is provide an input and a target to the train function, plus a learning rate (controlling how large the gradient-descent steps are; 0.2 is already quite big). This trains the network to minimize the (half) sum of squared errors on the outputs. You can also use your own error function, in a three-step algorithm:
1. Call computeOutput to get the network outputs for your input.
2. Compute the gradient of your error function with respect to the outputs, and pass it to backPropagate together with the input and output vectors.
3. Call learn to update the weights by gradient descent.
You can repeatedly do this on all the data you want to learn, in turn.
This MLP can also be used for batch learning, where you provide a large set of input/output mappings to train on at once. To do this, use backPropagate on the first mapping, then call batchBackPropagateAccumulate to accumulate the gradients for all other mappings. In the end, call batchBackPropagateTerminate with the total number of mappings. You can then use the learn algorithm as above.
For advanced learning techniques, you may consider looking at the CheapMatrix framework I created for the occasion. You'll find a scaled conjugate gradient algorithm, which converges faster than this simple gradient descent.
Public Member Functions

TwoLayerPerceptron (int ninput, int nhidden, int noutput)
    Creates a network with the given dimensions.

TwoLayerPerceptron (const TwoLayerPerceptron &tlp)
    Copy constructor: creates this network from the other.

TwoLayerPerceptron & operator= (const TwoLayerPerceptron &tlp)
    Copies all the values from the other network. The networks must be the same size.

virtual void computeOutput (const double *input, double *output)
    Gets the output corresponding to this input vector.

virtual void backPropagate (const double *input, const double *output, const double *gradout)
    Backpropagates the error function gradients in the network. Presupposes the current internal values of the hidden units match the given input-to-output mapping.

virtual void batchBackPropagateAccumulate (const double *input, const double *output, const double *gradout)
    Batch backpropagation: accumulates the gradients of all mappings, one by one.

virtual void batchBackPropagateTerminate (int nmappings)
    Batch backpropagation: terminates the accumulation over the given number of mappings.

virtual void learn (double learningRate=defaultLearningRate)
    Very simple gradient descent, by the given amount.

virtual double train (const double *input, const double *target, double learningRate=defaultLearningRate)
    Convenience function to train the network for the given target using the very common 'half sum of squared output errors'.

virtual void getHidden (double *hidden)
virtual void setHidden (const double *hidden)
    Read-write accessors to the hidden values; allow storing the results of computeOutput for later backpropagation.

int getNInput ()
int getNOutput ()
int getNHidden ()
    Read-only accessors.

virtual void mutate (double ihwRate, double ihwJitter, double howRate, double howJitter, double hbRate, double hbJitter, double obRate, double obJitter)
    Mutates this network's weights and biases with the given parameters.
Static Public Attributes

static double(* transfer )(double) = &defaultTransfer
    Set a transfer function.

static double(* transferDerivativeAsF )(double) = &defaultTransferDerivativeAsF
    Set the derivative of the transfer function, expressed in terms of the original function.

static const double defaultLearningRate = 0.1
    Default learning rate for training by gradient descent. Default is 0.1.
Protected Attributes

int ninput
int nhidden
int noutput
int nih
int nho
double * ihw
double * how
double * ihwg
double * howg
double * hb
double * ob
double * hbg
double * obg
double * hv
Friends

std::ostream & operator<< (std::ostream &os, const TwoLayerPerceptron &tlp)
std::istream & operator>> (std::istream &is, TwoLayerPerceptron &tlp)
Creates a network with the given dimensions. The weights are initially set to random values drawn from a normal distribution, scaled by the layer dimensions. The Utility random methods are used, so you can set the seed for reproducible results.

Backpropagates the error function gradients in the network. This function presupposes that the current internal values of the hidden units match the given input-to-output mapping. This is the case if computeOutput was called previously to this function; that is usually necessary anyway to compute the error gradient, so it isn't a big requirement.

Batch backpropagation accumulates the gradients of all mappings, one by one. Use backPropagate on the first mapping, then call this function to accumulate the gradients for all other mappings. In the end, call batchBackPropagateTerminate with the total number of mappings. This function presupposes that the current internal values of the hidden units match the given input-to-output mapping. This is the case if computeOutput was called previously to this function; that is usually necessary anyway to compute the error gradient, so it isn't a big requirement.

Batch backpropagation: terminates the accumulation over the given number of mappings. Use backPropagate on the first mapping, then call batchBackPropagateAccumulate to accumulate the gradients for all other mappings. In the end, call this function with the total number of mappings. This function presupposes that the current internal values of the hidden units match the given input-to-output mapping. This is the case if computeOutput was called previously to this function; that is usually necessary anyway to compute the error gradient, so it isn't a big requirement.

Very simple gradient descent, by the given amount. Uses the current gradients to update the weights and biases.

Mutates this network's weights and biases with the given parameters. This has nothing to do in this generic class, but I'm too lazy to split it out cleanly. For each "ihw" input-to-hidden weight, each "how" hidden-to-output weight, each "hb" hidden bias, and each "ob" output bias, the corresponding rate and jitter parameters apply.

Convenience function to train the network for the given target using the very common 'half sum of squared output errors'. You may call this function repeatedly to train the network, checking the results until you're satisfied.

Set a transfer function. Default is a custom transfer function: f(x) = x / (1 + abs(x)). This function is sigmoid-like, bounded in -1 / +1, continuous with a continuous first derivative, and much faster to compute than tanh. It is also unfortunately slower to converge, but about the same as the sigmoid function for this project according to preliminary experiments.

Set the derivative of the transfer function, expressed in terms of the original function. This is the differential equation relating f' and f. Such an equation does not always exist, but when it does, it provides a big boost for backpropagation; in practice, neural networks thus use only such functions. Sorry, but this class does not handle the more generic case. Default is the derivative of the custom transfer function devised for this project: f'(x) = 1 / (1 + abs(x))^2 = (1 - abs(f))^2.