Wednesday, December 7, 2011

Gradiention Day

This is a sketch for a review of meta-heuristic tricks for learning properties of a search space. More formally, a lot of advanced (and not-so-advanced) algorithms try to extract information about the local gradient for more efficient evolutionary search. This is surely feasible, because (ideally) the gradient shows the best direction to move towards a local optimum, and most numeric optimization methods use this information. The question of how to get away from a local optimum concerns another (although related) problem and is not considered here.

It's worth noting that getting gradient info for noisy functions can be difficult and requires some statistical processing, which may result in several alternatives for the best direction (not necessarily collinear). Probably fuzzy logic could be of help here. I also assume that the objective function has a local trend, i.e. that for each point *there is* a best movement direction, which holds at least for some not-very-small neighborhood, because otherwise finding the gradient doesn't make sense. One can say this means that the function can be approximated by some smooth function with the same global optimum.
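As a minimal sketch of what such statistical processing might look like (my own illustration, not a method from any particular paper), one can fit a local linear model to a sample of noisy evaluations and take its slope as the gradient estimate:

```python
import numpy as np

def estimate_gradient(points, values):
    """Estimate the local gradient from sampled (point, value) pairs by
    fitting f(x) ~= a . x + b with least squares; the slope vector `a`
    is the estimate. Averaging over many samples makes this more robust
    to noise than a two-point finite difference."""
    X = np.column_stack([np.asarray(points, float), np.ones(len(points))])
    coef, *_ = np.linalg.lstsq(X, np.asarray(values, float), rcond=None)
    return coef[:-1]  # drop the intercept, keep the slope = gradient

# Noisy samples of f(x, y) = 3x - 2y + 1: the estimate should be close
# to the true gradient (3, -2) despite the added noise.
rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 2))
vals = 3 * pts[:, 0] - 2 * pts[:, 1] + 1 + rng.normal(scale=0.01, size=50)
grad = estimate_gradient(pts, vals)
```

With several such fits over different neighborhoods one would get exactly the situation described above: several alternative, not necessarily collinear, candidate directions.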

Two groups of gradient-learning methods can be distinguished:
1. Explicit gradient learning. There are formulae for gradient approximation, which are used during recombination or variation operations.
2. Implicit gradient learning. These methods do not state outright that they learn the gradient; they are more of the 'I'll-know-it-when-I-see-it' kind.

Examples of the first group are: EAs with direction-based crossover and pseudo-gradient, CMA-ES, many real-valued EDAs and the (h)BOA algorithms.

Examples of the second group are: swarm optimization methods and Differential Evolution.
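Differential Evolution is a nice illustration of the implicit case: its DE/rand/1 mutation never computes a gradient, yet the difference vectors between individuals tend to align with promising movement directions as the population adapts. A minimal sketch (parameter names are mine):

```python
import random

def de_rand_1(population, f=0.5, rng=random):
    """DE/rand/1 mutation: pick three distinct individuals and build
    base + f * (a - b). The difference vector (a - b) is an implicit,
    population-derived estimate of a good search direction; no gradient
    is ever computed explicitly."""
    base, a, b = rng.sample(population, 3)
    return [x + f * (y - z) for x, y, z in zip(base, a, b)]

# Toy usage: a population of 2-D points.
pop = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
mutant = de_rand_1(pop)
```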

In my opinion, a meta-heuristic method can be acknowledged as 'gradient learning' if it has some sort of memory for the best movement direction.

Where do conventional real-coded EAs belong? In my current understanding they should be placed into the 2nd group. Here is my motivation: many EA operators do not look for the gradient, but simply search for better solutions in the vicinity of parent individuals. Most operators do not 'remember' good search directions. But there is selection, which determines in which direction the population evolves and thus follows the gradient. Probably many EAs should even be treated as 'implicitly implicit' gradient learning methods :).

Up to now only continuous spaces were considered, because the gradient is natural there. What about binary spaces? Or spaces of graphs, computer programs, agent policies or neural networks? I believe that the gradient notion can still be applied there as long as there is a proper metric over the set of representations.

Recall that there's a large class of finite-difference methods, which quantize the space, making it non-continuous, but can still be successfully applied to continuous modelling problems. It can also be argued that one of the most used variants of the Taylor expansion has a linear form, from which an approximate gradient can be calculated.

So the largest problem for defining a gradient in non-continuous spaces is setting a good metric. By 'good' I mean that it should not be just anything satisfying the metric axioms; it should also reflect the problem's properties. For example, setting a metric on directed weighted graphs for the TSP problem is not the same as setting a metric on directed weighted graphs for the evolution of neural networks. That's the issue. An interesting question: does a kind of 'approximate' metric exist, which is a metric, but without the whole pack of desired problem-dependent properties?
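To make the metric idea concrete, here is a toy sketch for binary strings (entirely my illustration): under the Hamming metric, the analogue of the gradient direction is the unit-distance neighbour with the best improvement, i.e. the best single-bit flip.

```python
def hamming(a, b):
    """Hamming metric: the number of positions where two strings differ."""
    return sum(x != y for x, y in zip(a, b))

def best_direction(x, f):
    """Return the fittest neighbour at Hamming distance 1 from x,
    a discrete analogue of moving along the gradient."""
    neighbours = [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]
    return max(neighbours, key=f)

# Toy usage on OneMax (fitness = number of ones): from (0, 1, 0) the
# 'gradient step' flips one of the zero bits.
step = best_direction((0, 1, 0), sum)
```

Whether such a metric is 'good' in the sense above is exactly the problem-dependent question: Hamming distance suits OneMax, but would be a poor choice for, say, comparing neural network graphs.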

Monday, December 5, 2011

On representation of candidate-solutions

Recently I've come up with an idea that the traditional approach to programming EAs may have a strong alternative.

In the traditional way we start from the EA perspective, that is, there are individuals = candidate-solutions, they form a population, which is modified by the EA, consisting of various operators and a (possibly hardcoded) scheme for their application sequence. The problem domain is somewhat alien and is communicated with via the fitness function. If some unusual representation of the solution is used (like Gray-coded strings for a real-valued optimization problem), then a genotype->phenotype transform should be made to evaluate an individual's performance. This transform is often made somewhere inside the fitness function, which means that the fitness function can be divided into 2 sections:
  1. Transform of the solution into a representation that can be evaluated, even if only a single method is called.
  2. The evaluation itself.
Many implementations consider that the problem lives in a problem environment, which is often convenient for control, adaptive behaviour and A-Life applications, but candidate-solutions are still part of the EA.
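A toy sketch of such a two-section fitness function, for the Gray-coded real-valued case mentioned above (the bounds, the sphere objective and the function names are my own arbitrary assumptions):

```python
def gray_to_int(bits):
    """Standard Gray -> binary decode: each binary bit is the XOR
    of all Gray-code bits seen so far."""
    out, bit = 0, 0
    for g in bits:
        bit ^= g
        out = (out << 1) | bit
    return out

def fitness(genotype, lo=-5.0, hi=5.0):
    # Section 1: genotype -> phenotype transform hidden inside the
    # fitness function (Gray bits -> a real value in [lo, hi]).
    n = len(genotype)
    x = lo + (hi - lo) * gray_to_int(genotype) / (2 ** n - 1)
    # Section 2: the evaluation itself (here a toy sphere function).
    return -x * x
```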

We can say that here the fitness function is a 'friend' of the EA: the EA can work with any representation, and either the function performs all the necessary transformations itself or the individuals can convert themselves to different phenotypes (arrays, matrices, networks etc.). It's very convenient for the EA, but not very convenient for the problem; we have to track whether an individual can be transformed into the proper representation for fitness evaluation.

Nothing is wrong with this approach, it's good, but there's another decomposition variant.

Suppose that candidate-solutions are independent things. That is, they have two sides (or faces/facets): one for the EA and the other for the problem. The first side can be a binary chromosome, or a Lisp expression, or some graph representation, whatever the EA operators can work with. The second side is for the problem: a real-valued vector, a neural network, an agent's policy representation, whatever the fitness function should handle.

In other words, candidate solutions are no longer part of the EA (although they can still be named individuals, particles, and so on), nor are they part of the problem. This can be convenient for both the EA and the problem, because there's no need to worry about conversions and transforms. When the EA side is set, the problem's side can be updated at once, or on first demand (implementing the so-called 'lazy' evaluation concept). If a 1-to-1 mapping between genotype and phenotype exists, then this transform can be reversed: if the problem's side is updated, for example by some local search algorithm, the EA side can be updated as well.

There are no miracles here; it just means that all conversions are localized and should be implemented explicitly, but neither the EA nor the problem's side should worry about how it's done and by what means.
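A minimal sketch of such a two-sided candidate in Python (the class and attribute names are mine; the 'lazy' decoding happens on first access to the problem side, and changing the EA side invalidates it):

```python
class Individual:
    """A candidate solution with an EA side (genotype) and a problem
    side (phenotype), decoded lazily on first demand."""

    def __init__(self, genotype, decode):
        self._genotype = genotype
        self._decode = decode     # localized genotype -> phenotype map
        self._phenotype = None    # computed on first access

    @property
    def genotype(self):
        return self._genotype

    @genotype.setter
    def genotype(self, g):
        self._genotype = g
        self._phenotype = None    # invalidate the cached problem side

    @property
    def phenotype(self):
        if self._phenotype is None:
            self._phenotype = self._decode(self._genotype)
        return self._phenotype

# Toy usage: the 'phenotype' is just the sum of the bits here.
ind = Individual([1, 0, 1], sum)
```

The conversion is localized in the `decode` callable, so neither the EA operators nor the fitness function needs to know how it is done.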

If genotype and phenotype are the same (real-valued EAs, direct graph representations etc.), the individuals can be made either one-sided (I would call it the 'Moebius representation' ;)) or two-sided, but with no difference between the sides.

And, as is customary, some pros and cons of this approach:

Pros:
  • The same candidate can appear as two distinct objects at the same time.
  • Potentially faster (because the phenotype is recomputed only when the genotype changes).
  • More natural (genotype + phenotype).
  • The objective function does not need to support different encodings; it obtains the object it's meant to evaluate --> easier to adapt for 3rd-party applications.
  • Possibly easier to parallelize, because the problem-side representations are independent and there is very little chance of conflicts.
Cons:
  • Requires more memory, because problem-side representations are stored for each EA-side representation.
  • Adds an extra interaction layer between the EA and the problem.
  • Probably more code to write due to the more complex structure of candidate-solutions.
  • The candidate-solutions become 'heavier'.
Anyway, I think this idea of two-sided representations is interesting.

P.S. Oh, it's possible that I've just rediscovered someone else's idea. Yeah. What if Plato was right? ;)

Thursday, August 11, 2011

A few hints on scripting

Currently one of my tasks is creating a scripting engine for .NET which would satisfy the following demands:

  • Extendable syntax (the ability to add new, possibly very exotic, syntactic constructions).
  • Dynamic typing (to simplify the script-writing process, because the language is oriented towards non-programmers as well).
  • Full execution control (for external control over the script execution).
Although the best solution would be to find an existing implementation and use it, googling led either to Lua/Perl/Python interoperability (which doesn't satisfy the 2nd criterion) or to the use of the .NET classes for dynamic compilation of code.

After some time I did almost all the work by myself, and here is some advice, which is hopefully useful (unless you've had a compiler-theory course and good practice and already know much more):
  • For a quick introduction to scripting, read the corresponding chapters in books on game programming. They will give the baseline terminology and the overall scheme of what to do given a script text. Books on compiler theory are good, but if you need a hands-on approach they probably contain too much theoretical information.
  • Do not try to solve all problems at once, like: "Ok, I've got a script and now I'm simply going to process and execute the raw source. Nothing serious really, because I can easily see assignments, function calls, conditions etc., so writing a program which would do the same is a piece of cake". This is suicide, because the problem is much harder than it seems. Just think about code like: a = sqrt(abs(c - d[i + 1])).
  • A (slightly simplified) source processing procedure includes: (1) removal of comments; (1a) concatenation of the whole script into a single string; (2) tokenization (breaking the script into atomic entries like operators, brackets, identifiers etc.); (3) creation of a parse tree (http://en.wikipedia.org/wiki/Parse_tree), which defines a kind of 'execution atoms' and the order in which they should be processed; (4) optimization of the parse tree.
  • It's convenient to create a special class for tokens with fields like Contents and Type. The latter defines what kind of token it is (keyword, operator, bracket and so on). This will make life much easier at the parse tree building step.
  • BNFs (http://en.wikipedia.org/wiki/Backus-Naur_Form) are really-really-really useful things, especially for describing a formalized grammar. You can represent a BNF as a collection of arrays of tokens. A big advantage is that BNFs have an inherent recursion property and allow reuse of one BNF as part of others (so the good KISS and DRY principles hold).
  • BNFs should be applied to build a parse tree given a collection of tokens. You just start from the very beginning with the BNF for the entire script and use recursion. The complex part is that it's not known beforehand which BNF suits the current collection of tokens best, so you'll have to try every suitable BNF you have that starts from the same token. But there is a way to make the problem easier, see the advice below.
  • Handling recursively embedded BNFs is not a simple thing and your brain will boil, so be very careful with this. Use loop and class invariants (http://en.wikipedia.org/wiki/Loop_invariant) to control the process. But you'll be surprised how little code it takes to make things work.
  • Use a dynamic programming (http://en.wikipedia.org/wiki/Dynamic_programming) approach for finding the BNF which suits the current list of tokens best. This means that if you've found that some subset of tokens is described by some BNF, then you can mark the first token in this subset so as not to consider it again and again when trying other BNFs. Otherwise you'll end up redoing some work over and over, and it's very likely that you'll get a stack overflow error.
  • Probably the most amazing thing about BNFs is that when you need to extend the syntax, you just need to extend the set of BNFs used during the parse tree building procedure. And given that BNFs can be read from some file, it's even possible to extend the grammar *without* recompiling the program (in case you've written everything correctly: use loop and class invariants!).
  • The drawback of using BNFs is that arithmetic and logical expressions are parsed in their 'direct' (infix) representation, not in Polish notation (http://en.wikipedia.org/wiki/Polish_notation); hence optimization of the tree is required, because otherwise you'd have to transform to Polish notation at every script execution, which is really unwise.
  • The parse tree is enough for execution of the script.
  • It's convenient to keep additional information in parse tree nodes about their BNF (if any). This information can be used to handle the execution of conditionals etc. Moreover, you can specify special external handlers, which are called by BNF names. Hence it's possible to obtain control over script execution, which may be convenient for setting external breakpoints or specific execution rules.
  • Script execution is guided in terms of the program state (http://en.wikipedia.org/wiki/Program_state), which includes (among other things) global and local variable data. You can add any extra information you need to the program state (call stack, current parse tree node, environment variables etc.).
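As an illustration of the token class and the tokenization step above, here is a sketch in Python rather than the actual .NET engine (the token types and regular expressions are my own hypothetical choices):

```python
import re

# Hypothetical token types; a real engine would have many more
# (keywords, strings, comparison operators and so on).
TOKEN_SPEC = [
    ("NUMBER",  r"\d+(?:\.\d+)?"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("OP",      r"[+\-*/=]"),
    ("BRACKET", r"[()\[\]]"),
    ("SKIP",    r"\s+"),
]

class Token:
    """A token with Contents and Type fields, as advised above."""
    def __init__(self, contents, type_):
        self.contents = contents
        self.type = type_
    def __repr__(self):
        return f"{self.type}({self.contents!r})"

def tokenize(source):
    """Break the script into atomic entries, dropping whitespace."""
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    return [Token(m.group(), m.lastgroup)
            for m in re.finditer(pattern, source)
            if m.lastgroup != "SKIP"]

# Toy usage on a fragment of the 'piece of cake' example.
tokens = tokenize("a = sqrt(b + 1)")
```

A list of such typed tokens is exactly what the BNF-driven parse tree builder then consumes.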
The good news is that my implementation (not final, though) of the scripting engine is available as part of the MentalAlchemy project (http://code.google.com/p/mentalalchemy/). See the evooelements subproject, molecules section. It still lacks some things, like local program states and handling of some BNFs, but I'm planning to finish them in 1-2 weeks.

Monday, April 18, 2011

Two diagonals

There is a matrix bidiagonalization algorithm in the good book Golub G.H., Van Loan C.F. Matrix Computations. 2nd edition. The Johns Hopkins Univ. Press, 1989, which relies on Householder reflections to create zeroes everywhere below the main diagonal and above the first superdiagonal of any matrix, so that the resulting matrix would be something like:
This matrix is usually used for the Singular Value Decomposition, by eliminating the values above the main diagonal (so that the resulting matrix performs the required scaling). There are two interesting points:


------------------------------------
1. Algorithm 5.4.2 in the book, which performs the bidiagonalization, seems a bit wrong. Namely, two lines can be omitted without harming the result. Those are: A(j+1:m,j) = v(j+1:m) and A(j,j+1:n) = v(j+2:m)' (in Matlab notation). These lines copy the Householder vector into the matrix, but this is not required, because the affected column and row are already nullified by the Householder transform and are not changed as the algorithm goes on. So these copyings simply do something wrong. Here is an example from the book: for the input matrix

the following bidiagonalized matrix should be obtained:
But the implementation of the algorithm yields the following result:

Note that the values on the main diagonal and the first superdiagonal are almost the same, while there are a lot of "undesired" non-zeros as well (the differences are due to the use of single-precision calculations). There are also misplaced signs, but I haven't found the source of this error (?) yet.

Now, removing the copying of the Householder vectors, we obtain:


------------------------------------

2. I suppose that there is a good way to calculate the matrix rank using this bidiagonalization procedure. There are a lot of ways to calculate the rank; the simplest one is Gaussian elimination, but it's not numerically stable, while a more reliable and convenient way is to use the SVD and count the singular values larger than some threshold. One of the typical ways to calculate the SVD is the Golub-Reinsch algorithm, which first transforms the input matrix to bidiagonal form (!) and then eliminates the non-zeroes above the main diagonal. My idea is that the values on the main diagonal of the bidiagonalized matrix can tell the number of non-zero singular values. If that's true, then matrix rank calculation can be performed by bidiagonalizing and then counting the non-zero values on the main diagonal. For the example above, the rank of the initial matrix is 2, and the number of non-zeroes on the main diagonal of the bidiagonalized form is also two.
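A quick numerical check of this idea (a numpy sketch of Householder bidiagonalization assuming m >= n; the function names are mine, and since the book's example matrix is not reproduced here, the rank-2 test matrix below is my own):

```python
import numpy as np

def householder(x):
    """Return (v, beta) such that (I - beta v v^T) x = alpha e1."""
    v = x.astype(float).copy()
    alpha = np.linalg.norm(v)
    if alpha == 0.0:
        return v, 0.0
    if v[0] >= 0.0:          # sign choice avoids cancellation
        alpha = -alpha
    v[0] -= alpha
    return v, 2.0 / np.dot(v, v)

def bidiagonalize(A):
    """Golub--Van Loan style bidiagonalization for m >= n: zero out
    everything below the main diagonal and above the first superdiagonal,
    without storing the Householder vectors in the matrix."""
    A = A.astype(float).copy()
    m, n = A.shape
    for j in range(n):
        v, beta = householder(A[j:, j])           # clear column j below (j, j)
        A[j:, j:] -= beta * np.outer(v, v @ A[j:, j:])
        if j < n - 2:
            v, beta = householder(A[j, j + 1:])   # clear row j right of (j, j+1)
            A[j:, j + 1:] -= beta * np.outer(A[j:, j + 1:] @ v, v)
    return A

def rank_by_bidiag(A, tol=1e-10):
    """Count the 'non-zero' main-diagonal entries of the bidiagonal form."""
    d = np.abs(np.diag(bidiagonalize(A)))
    return int(np.sum(d > tol * max(d.max(), 1.0)))

# A rank-2 matrix: the third diagonal entry of its bidiagonal form
# should vanish (up to round-off), matching the conjecture above.
A = np.array([[1.0, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
```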
------------------------------------


I may be wrong in these two notes, so any comments and criticism are welcome :).


Wednesday, February 16, 2011

Neureka

Yesterday I was experimenting with using neuroevolution for adaptive increase of the dimension of the feature space. The idea is simple: according to Cover's theorem (1965), the probability that a randomly 'placed' separating hyperplane minimizes the number of ambiguous/incorrect recognitions increases with the growth of dimensionality. In other words, theoretically, increasing the feature space dimensionality should ease the recognition.

Since it's not known what dimensionality to target and how the original feature space should be transformed, I did experiments where the artificial neural network (ANN) used for recognition consists of two parts:
1. ANN-1, which is trained by an evolutionary algorithm.
2. ANN-2, which is trained using the RProp algorithm from the Encog library (see http://www.heatonresearch.com/encog).
The output signals of ANN-1 are the input signals of ANN-2, and thus one can treat these networks as two parts of one and the same neural network.

The error criterion for the evolutionary training of ANN-1 was the largest error obtained after training three ANN-2s for 50 epochs. I chose the largest error to avoid the overfitting effect.

The first results were not very encouraging (see here: http://qai.narod.ru/Workshop/Tsoy_cai2010.pdf (in Russian)); I didn't get an improvement for all test problems (from the Proben1 suite, see http://www.filewatcher.com/b/ftp/ftp.cs.cuhk.hk/pub/proben1.0.0.html), and when this method worked better, the average classification accuracy wasn't near the best results obtained so far.

I was pretty puzzled by this and decided to change ANN-1's error criterion to pursue minimal correlation between the outputs of ANN-1. And I mistakenly set the evolutionary algorithm to maximize this criterion. And suddenly I started obtaining good results. For example, the table below contains classification errors on a test data set (mean (stdvar)) for several problems:

Problem | Ref. Proben1 results | Prev. best result | Cur. best (corr. maximization) | Cur. best (corr. minimization, sigmoid ANN-1 outputs) | Cur. best (corr. minimization, linear ANN-1 outputs)
Cancer1 | 1.38 (0.49) | 2.07 (0.73) | 1.26 (0.45) | 1.03 (0.53) | 2.53 (0.30)
Horse1 | 29.19 (2.62) | 34.40 (3.84) | 28.35 (2.25) | 28.02 (3.03) | 24.51 (0.53)

The first column contains the best results obtained by Lutz Prechelt using hand-tuned neural structures. The second column shows experimental results with the old fitness function (published in http://qai.narod.ru/Workshop/Tsoy_cai2010.pdf), while the last 3 columns contain new results; in most of the cases considered so far, it's possible to obtain better results than the old ones.

Only two problems are shown (cancer1 and horse1), because they caused problems for the algorithm with the old fitness function, which minimizes the ANN-2 training error directly. There are some interesting regularities in the results when the number of nodes at the ANN-1 output changes, but I'd better leave them for the next post, when I'll have more statistics.

Meanwhile, I'd like to discuss possible causes of these results. First of all, why does increasing the correlation between outputs lead to better results? My intuition tells me that increasing the correlation tends to lead to dimensionality reduction, which can be useful according to numerous results obtained by other researchers. And this can be particularly important when the number of input signals is rather large. For example, in the horse1 problem there were 58 input signals, and the best obtained results correspond to cases when ANN-1 had only 14 outputs, which means the input feature space was reduced by a factor of ~4. When the number of ANN-1 outputs for this problem was increased to 29 or more, the results became worse (in some cases I obtained testing errors >60%).
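For concreteness, the kind of criterion involved might look like the following sketch (my reconstruction, not the exact fitness function used in the experiments): the mean absolute Pearson correlation over all pairs of ANN-1 output channels, which the evolutionary algorithm then maximizes or minimizes.

```python
import math

def mean_abs_correlation(outputs):
    """outputs[i] is the sequence of the i-th ANN-1 output over a data
    set; the criterion averages |Pearson r| over all output pairs."""
    def pearson(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        va = math.sqrt(sum((x - ma) ** 2 for x in a))
        vb = math.sqrt(sum((y - mb) ** 2 for y in b))
        return cov / (va * vb) if va and vb else 0.0

    k = len(outputs)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return sum(abs(pearson(outputs[i], outputs[j])) for i, j in pairs) / len(pairs)
```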

On the other hand, for ANN-1 having more outputs than inputs can be useful, and the cancer1 problem is a good example here. The best results were obtained for ANN-1 with 22 and 27 outputs, while the number of inputs for this problem is 9.

It's quite indicative that ANN-1 with linear output nodes didn't produce an improvement on the cancer1 problem, since in this case the real dimensionality of the ANN-1 output doesn't exceed that of the input. The reason is simple: an ANN-1 with linear output nodes performs a linear transform of the input signal, which can't increase the dimensionality, while an ANN-1 with sigmoid outputs can increase the number of feature space dimensions, and this indeed leads to an improvement.

The obtained results show that by varying the number of ANN-1 outputs dynamically, during the evolutionary algorithm run, it's possible to obtain quite an effective classification algorithm.

Another good result is that this approach demonstrates a practical need for neuroevolutionary algorithms which allow variation of the number of outputs. That's good, because earlier I could not figure out a nice example for such algorithms, except for the (impossible to implement?) A-Life and adaptive behaviour problems, where the environment changes dynamically and requires the emergence of new limbs/effectors.

A good question here is whether this method is more effective than the 'conventional' approach of 'toggling' input features on and off using a genetic algorithm with binary encoding. I think this may be a theme for a good research paper.

Friday, February 11, 2011

Digging for Roots

There is an interesting way of finding polynomial roots, which involves calculating the eigenvalues of the following matrix (http://mathworld.wolfram.com/PolynomialRoots.html):


After the eigenvalues have been found, the roots are simply computed as 1/lambda.

This method is rather slow, especially for large matrices, but it finds all roots, including complex ones, which is hard to do using the conventional scheme "root localization + application of a numeric method for finding the exact root value".

It's interesting that it's possible to obtain a matrix whose eigenvalues are the same as the polynomial's roots. This matrix is:


It's easy to check that AB = E, where E is the identity matrix.
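A standard choice for such a matrix (whose eigenvalues coincide with the roots directly, no reciprocals needed) is the companion matrix of the monic polynomial; a numpy sketch of the whole trick (the function name is mine):

```python
import numpy as np

def poly_roots(coeffs):
    """Roots of c0 + c1*x + ... + cn*x^n via the eigenvalues of the
    companion matrix of the monic polynomial (cn must be non-zero)."""
    c = np.asarray(coeffs, dtype=complex)
    c = c / c[-1]                     # make the polynomial monic
    n = len(c) - 1
    C = np.zeros((n, n), dtype=complex)
    C[1:, :-1] = np.eye(n - 1)        # ones on the sub-diagonal
    C[:, -1] = -c[:-1]                # last column: -c0, ..., -c(n-1)
    return np.linalg.eigvals(C)       # eigenvalues == roots

# Toy usage: x^2 - 3x + 2 has roots 1 and 2.
roots = poly_roots([2, -3, 1])
```

Incidentally, numpy's own `numpy.roots` uses essentially this companion-matrix approach.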