Yesterday I ran some experiments on using neuroevolution to adaptively increase the dimensionality of the feature space. The idea is simple: according to Cover's theorem (1965), the probability that a given set of patterns is linearly separable grows with the dimensionality of the space they are cast into. In other words, theoretically, increasing the dimensionality of the feature space should make recognition easier.
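To make this concrete, Cover's counting function gives the fraction of dichotomies of n points in general position that are linearly separable in d dimensions. A quick computation (my own illustration, not part of the experiments) shows how fast this fraction grows with d:

```python
from math import comb

def separable_fraction(n: int, d: int) -> float:
    """Cover (1965): the number of linearly separable dichotomies of n
    points in general position in d dimensions is
    C(n, d) = 2 * sum_{k=0}^{d-1} binom(n-1, k).
    Returns that count divided by the total 2**n dichotomies."""
    count = 2 * sum(comb(n - 1, k) for k in range(min(d, n)))
    return count / 2 ** n

# For 20 points the separable fraction grows quickly with dimensionality
# (and reaches exactly 0.5 at d = n/2, the classic capacity result):
for d in (2, 5, 10, 15):
    print(d, round(separable_fraction(20, d), 4))
```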
Since it is not known in advance what dimensionality to target, or how the original feature space should be transformed, I ran experiments in which the artificial neural network (ANN) used for recognition consists of two parts:
1. ANN-1, which is trained by an evolutionary algorithm.
2. ANN-2, which is trained with the RProp algorithm from the Encog library (see http://www.heatonresearch.com/encog).
The output signals of ANN-1 are the input signals of ANN-2, so the two can be treated as parts of one and the same neural network.
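A minimal sketch of this two-part arrangement (my own illustration in Python/NumPy, not the actual Encog-based code; `W1` and `b1` stand for the evolved ANN-1 parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ann1_forward(X, W1, b1):
    """ANN-1: the evolved feature transform. Maps the original
    n_inputs-dimensional features to n_out new features; n_out is
    chosen by the evolutionary algorithm and may exceed n_inputs."""
    return sigmoid(X @ W1 + b1)

# Example: lift 9 original features (as in cancer1) to 22 dimensions.
rng = np.random.default_rng(0)
X = rng.random((100, 9))            # 100 samples, 9 original features
W1 = rng.normal(size=(9, 22))       # weights evolved, not backpropagated
b1 = rng.normal(size=22)
features = ann1_forward(X, W1, b1)  # these become ANN-2's inputs
```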
The error criterion for the evolutionary training of ANN-1 was the largest error obtained after training three ANN-2 networks for 50 epochs each. I chose the largest error to counteract overfitting.
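In pseudocode form, the fitness evaluation of one ANN-1 candidate looked roughly like this (a sketch using `ann1_forward` from above; `candidate` and `train_ann2_and_get_error` are hypothetical names, the latter standing in for training a fresh ANN-2 with RProp and returning its error):

```python
def ann1_fitness(candidate, X, y, train_ann2_and_get_error,
                 runs=3, epochs=50):
    """Evolutionary fitness of an ANN-1 candidate: transform the data,
    train several independent ANN-2 networks on the result, and score
    the candidate by the WORST of their errors, which penalizes feature
    transforms that only occasionally train well (i.e. overfit)."""
    features = ann1_forward(X, candidate.W1, candidate.b1)
    errors = [train_ann2_and_get_error(features, y, epochs=epochs)
              for _ in range(runs)]
    return max(errors)  # evolution minimizes this value
```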
The first results were not very encouraging (see http://qai.narod.ru/Workshop/Tsoy_cai2010.pdf, in Russian): I did not get an improvement on all test problems (taken from the Proben1 suite, see http://www.filewatcher.com/b/ftp/ftp.cs.cuhk.hk/pub/proben1.0.0.html), and even where the method worked better, the average classification accuracy was not near the best results obtained so far.
I was pretty puzzled by this and decided to change ANN-1's error criterion to pursue minimal correlation between the outputs of ANN-1. By mistake, I set the evolutionary algorithm to maximize this criterion instead, and suddenly I started obtaining good results (a sketch of the criterion is given after the table notes below). For example, the table below contains classification errors on the test data set (mean (stdev)) for several problems:
| Problem | Ref. Proben1 results | Prev. best result | Cur. best (max. output correlation) | Cur. best (min. correlation, sigmoid ANN-1 outputs) | Cur. best (min. correlation, linear ANN-1 outputs) |
|---------|---------------------|-------------------|-------------------------------------|------------------------------------------------------|-----------------------------------------------------|
| Cancer1 | 1.38 (0.49) | 2.07 (0.73) | 1.26 (0.45) | 1.03 (0.53) | 2.53 (0.30) |
| Horse1 | 29.19 (2.62) | 34.40 (3.84) | 28.35 (2.25) | 28.02 (3.03) | 24.51 (0.53) |
The first results column contains the best results obtained by Lutz Prechelt using hand-tuned network structures. The second column shows the experimental results with the old fitness function (published in http://qai.narod.ru/Workshop/Tsoy_cai2010.pdf), while the last three columns contain the new results; in most of the cases considered so far, the new results are better than the old ones.
Only two problems are shown (cancer1 and horse1), because these were the ones that caused trouble for the algorithm with the old fitness function, which minimizes the ANN-2 training error directly. There are some interesting regularities in the results as the number of ANN-1 output nodes changes, but I'd better leave them for a next post, when I have more statistics.
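For reference, the correlation criterion itself is straightforward. The sketch below (my own reconstruction, not the exact experimental code) scores an ANN-1 candidate by the mean absolute pairwise correlation of its output signals over the training set:

```python
import numpy as np

def mean_output_correlation(features):
    """Mean absolute pairwise correlation between ANN-1 output signals.
    `features` has shape (n_samples, n_outputs), e.g. the matrix
    produced by ann1_forward above. The evolutionary algorithm was
    (accidentally) set to MAXIMIZE this value."""
    C = np.corrcoef(features, rowvar=False)  # n_outputs x n_outputs
    iu = np.triu_indices_from(C, k=1)        # upper triangle, no diagonal
    return float(np.mean(np.abs(C[iu])))
```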
Meanwhile, I'd like to discuss possible causes of these results. First of all, why does increasing the correlation between outputs lead to better results? My intuition is that increasing correlation tends to reduce the effective dimensionality of the representation, which can be useful according to numerous results obtained by other researchers, and can be particularly important when the number of input signals is rather large. For example, in the horse1 problem there are 58 input signals, and the best results correspond to cases where ANN-1 had only 14 outputs, i.e. the input feature space was reduced by a factor of ~4. When the number of ANN-1 outputs for this problem was increased to 29 or more, the results became worse (in some cases I obtained test errors above 60%).
On the other hand, ANN-1 having more outputs than inputs can also be useful, and the cancer1 problem is a good example: the best results were obtained with ANN-1 having 22 and 27 outputs, while the number of inputs for this problem is only 9.
It is quite telling that ANN-1 with linear output nodes did not bring an improvement on the cancer1 problem, since in that case the real dimensionality of the ANN-1 output could not exceed that of the input. The reason is simple: an ANN-1 with linear outputs performs a linear transform of the input signals, which cannot increase the intrinsic dimensionality, while an ANN-1 with sigmoid outputs can genuinely increase the number of feature space dimensions, and this indeed leads to an improvement.
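This is easy to check numerically: a linear map cannot raise the rank of the data matrix above the input dimensionality, while an elementwise sigmoid typically does (a small NumPy check of my own, not from the experiments):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((200, 9))             # 200 samples, 9 inputs (as in cancer1)
W = rng.normal(size=(9, 22))         # map 9 inputs to 22 outputs

linear_out = X @ W
sigmoid_out = 1.0 / (1.0 + np.exp(-linear_out))

print(np.linalg.matrix_rank(linear_out))   # at most 9: no new dimensions
print(np.linalg.matrix_rank(sigmoid_out))  # typically 22: genuinely new ones
```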
The obtained results show that by varying the number of ANN-1 outputs dynamically, during the evolutionary algorithm's run, it is possible to obtain quite an effective classification algorithm.
Another good outcome is that this approach demonstrates a practical need for neuroevolutionary algorithms that allow the number of outputs to vary. This pleases me because I could not previously think of a nice application for such algorithms, except for (perhaps impossible to implement) A-Life and adaptive behavior problems, where the environment changes dynamically and requires the emergence of new limbs/effectors.
A good question here is whether this method is more effective than the 'conventional' approach of toggling input features on and off using a genetic algorithm with binary encoding (see the sketch below). I think this could be the theme of a good research paper.
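For comparison, that conventional representation is just a bit mask over the original features; a sketch of such a chromosome and its decoding (hypothetical names, for illustration only):

```python
import numpy as np

def apply_feature_mask(X, mask):
    """'Conventional' GA feature selection: `mask` is the binary
    chromosome (one bit per original feature); the selected columns of X
    are passed to the classifier unchanged. Unlike ANN-1, this can only
    remove dimensions, never transform the features or add new ones."""
    return X[:, np.asarray(mask, dtype=bool)]

# Example chromosome for a 9-feature problem: keep features 0, 2, 3, 7.
mask = [1, 0, 1, 1, 0, 0, 0, 1, 0]
X = np.random.default_rng(2).random((100, 9))
X_selected = apply_feature_mask(X, mask)   # shape (100, 4)
```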