Wednesday, December 7, 2011

Gradiention Day

This is a sketch for a review of meta-heuristic tricks for learning properties of a search space. More formally, many advanced (and not so advanced) algorithms try to extract information about the local gradient to make evolutionary search more efficient. This is a sensible goal, because (ideally) the gradient shows the best direction to move towards a local optimum, and most numeric optimization methods use this information. The question of how to get away from a local optimum is a different (although related) problem and is not considered here.

It's worth noting that getting gradient information for noisy functions can be difficult and requires some statistical processing, which may result in several alternatives for the best direction (not necessarily collinear). Fuzzy logic could probably be of help here. I also assume that the objective function has a local trend, i.e. that for each point *there is* a best movement direction, which remains good for some not-too-small neighborhood, because otherwise finding the gradient doesn't make sense. One could say this means the function can be approximated by some smooth function with the same global optimum.

We can distinguish 2 groups of methods that learn gradient information:
1. Explicit gradient learning. There are explicit formulae for gradient approximation, which are used during recombination or variation operations.
2. Implicit gradient learning. These methods do not state directly that they are learning the gradient, but they work in an 'I'll-know-it-when-I-see-it' manner.

Examples of the first group are: EAs with direction-based crossover and Pseudo-Gradient, CMA-ES, many real-valued EDAs, and (h)BOA algorithms.
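
To make the first group concrete, here is a minimal sketch (the function name and the step rule are my own illustrative assumptions, not a quote from any of the papers above) of a direction-based crossover for a minimization problem: the vector from the worse parent to the better one serves as a crude pseudo-gradient estimate, and the offspring steps along it.

```python
import numpy as np

def direction_crossover(parent_a, parent_b, f, step=1.0):
    """Sketch of direction-based crossover (minimization): the vector
    from the worse parent to the better one acts as an explicit
    (pseudo-)gradient estimate, and the offspring steps along it."""
    if f(parent_a) < f(parent_b):
        better, worse = parent_a, parent_b
    else:
        better, worse = parent_b, parent_a
    direction = better - worse               # explicit direction estimate
    return better + step * np.random.rand() * direction

# Usage on the sphere function: the child tends to land closer to the optimum.
sphere = lambda x: float(np.sum(x ** 2))
child = direction_crossover(np.array([1.0, 2.0]), np.array([0.5, 1.0]), sphere)
```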

Examples of the second group are: Swarm Optimization methods and Differential Evolution.
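
For contrast, here is the classic DE/rand/1 mutation. Nothing in the code mentions a gradient, yet the difference vector harvested from the population implicitly carries direction information (a minimal sketch, with the usual scale factor F assumed):

```python
import numpy as np

def de_rand_1(population, i, F=0.8):
    """Classic DE/rand/1 mutation: the difference (x_r2 - x_r3) is a
    direction sampled from the population itself; the gradient is never
    computed explicitly, but good directions survive via selection."""
    candidates = [j for j in range(len(population)) if j != i]
    r1, r2, r3 = np.random.choice(candidates, 3, replace=False)
    return population[r1] + F * (population[r2] - population[r3])
```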

In my opinion, a meta-heuristic method can be acknowledged as 'gradient learning' if it has some sort of memory for the best movement direction.

Where do conventional real-coded EAs belong? In my current understanding, they should be placed in the 2nd group. Here is my reasoning: many EA operators do not look for the gradient, but simply search for better solutions in the vicinity of parent individuals. Most operators do not 'remember' good search directions. But there is selection, which determines the direction in which the population evolves and thus follows the gradient. Probably many EAs should even be treated as 'implicitly implicit' gradient learning methods :).
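
This claim is easy to check numerically. Below is a tiny self-contained experiment (my own illustration, not a standard benchmark): blind Gaussian mutation plus truncation selection on the sphere function moves the population mean towards the optimum, i.e. along the negative gradient, without ever computing it.

```python
import numpy as np

np.random.seed(0)
f = lambda x: np.sum(x ** 2, axis=1)              # sphere, to be minimized
pop = np.random.randn(50, 2) + 5.0                # start far from the optimum

for _ in range(20):
    offspring = pop + 0.3 * np.random.randn(*pop.shape)  # 'blind' mutation
    both = np.vstack([pop, offspring])
    pop = both[np.argsort(f(both))[:50]]          # truncation selection

print(pop.mean(axis=0))  # the mean has drifted towards the origin
```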

Up to now, only continuous spaces were considered, because the gradient is natural there. What about binary spaces? Or spaces of graphs, computer programs, agents' policies or neural networks? I believe the gradient notion can still be applied there as long as there is a proper metric over the set of representations.
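
For instance, over binary strings with the Hamming metric the unit neighborhood of a point is the set of single-bit flips, and the analogue of the gradient direction is the flip with the best improvement. A minimal sketch of this idea (maximization assumed, function name mine):

```python
def discrete_gradient_step(x, f):
    """Hamming-metric analogue of a gradient step on a bit string:
    among all single-bit flips (the unit Hamming neighborhood of x),
    return the one with the best objective value. Assumes maximization."""
    best, best_val = x, f(x)
    for i in range(len(x)):
        y = x[:i] + (1 - x[i],) + x[i + 1:]      # flip bit i
        if f(y) > best_val:
            best, best_val = y, f(y)
    return best

# Usage on OneMax: iterating the step climbs to the all-ones string.
x = (0, 1, 0, 0)
while True:
    nxt = discrete_gradient_step(x, sum)
    if nxt == x:
        break
    x = nxt
print(x)  # (1, 1, 1, 1)
```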

Recall that there is a large class of finite-difference methods, which discretize the space, making it non-continuous, but can still be successfully applied to continuous modelling problems. It can also be argued that the most commonly used variant of the Taylor expansion is the linear (first-order) form, from which an approximate gradient can be calculated.
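
Concretely, from the first-order Taylor expansion f(x + h·e_i) ≈ f(x) + h·∂f/∂x_i one gets the standard forward-difference estimate of the gradient. A minimal sketch:

```python
import numpy as np

def fd_gradient(f, x, h=1e-6):
    """Forward finite-difference gradient from the linear Taylor term:
    df/dx_i ≈ (f(x + h*e_i) - f(x)) / h."""
    fx = f(x)
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        grad[i] = (f(x + e) - fx) / h
    return grad

# Usage: gradient of the sphere function at (1, 2) is close to (2, 4).
print(fd_gradient(lambda v: np.sum(v ** 2), np.array([1.0, 2.0])))
```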

So the largest problem in defining a gradient for non-continuous spaces is choosing a good metric. By 'good' I mean that it should not be just anything that satisfies the metric axioms; it should also reflect the problem's properties. For example, a metric over directed weighted graphs for the TSP is not the same as a metric over directed weighted graphs for the evolution of neural networks. That's the issue. An interesting question: does a kind of 'approximate' metric exist, which is a metric, but without the whole pack of desired problem-dependent properties?

Monday, December 5, 2011

On representation of candidate-solutions

Recently I've come up with the idea that the traditional approach to programming EAs may have a strong alternative.

In the traditional way we start from the EA perspective: there are individuals = candidate-solutions, they form a population, which is modified by the EA, consisting of various operators and a (possibly hardcoded) scheme for their application sequence. The problem domain is somewhat alien and is communicated with via the fitness function. If some unusual representation of the solution is used (like Gray-coded strings for a real-valued optimization problem), then a genotype->phenotype transform must be made to evaluate an individual's performance. This transform is often done somewhere inside the fitness function, which means that the fitness function can be divided into 2 sections (a small sketch follows below):
  1. Transformation of the solution into a representation that can be evaluated, even if it is just a single method call.
  2. The evaluation itself.
Many implementations assume that the problem lives in a problem environment, which is often convenient for control, adaptive behaviour and A-Life applications, but candidate-solutions are still part of the EA.
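
Here is the promised sketch of the two-section fitness function: a Gray-coded bit string is decoded inside the fitness function before the actual evaluation (the decoding scheme and the bounds are illustrative assumptions of mine):

```python
def gray_to_real(bits, lo=-5.0, hi=5.0):
    """Section 1: the genotype -> phenotype transform (Gray-coded bits
    to a real value on [lo, hi]) buried inside the fitness function."""
    b = [bits[0]]
    for g in bits[1:]:
        b.append(b[-1] ^ g)                  # Gray -> plain binary
    n = int(''.join(map(str, b)), 2)
    return lo + (hi - lo) * n / (2 ** len(bits) - 1)

def fitness(bits):
    x = gray_to_real(bits)                   # Section 1: the transform
    return x ** 2                            # Section 2: the evaluation itself
```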

We can say that here the fitness function is a 'friend' of the EA: the EA can work with any representation, and the function performs all the necessary transformations itself, or individuals can convert themselves into different phenotypes (arrays, matrices, networks etc.). It's very convenient for the EA, but not very convenient for the problem: we have to track whether an individual can be transformed into the proper representation for fitness evaluation.

Nothing is wrong with this approach, it's good, but there's another decomposition variant.

Suppose that candidate-solutions are independent entities. That is, they have two sides (or faces/facets): one for the EA and the other for the problem. The first side can be a binary chromosome, a Lisp expression, or some graph representation, whatever the EA operators can work with. The second side is for the problem: a real-valued vector, a neural network, an agent's policy representation, whatever the fitness function should handle.

In other words, candidate-solutions are no longer parts of the EA (although they can still be called individuals, particles, and so on), nor are they parts of the problem. This can be convenient for both the EA and the problem, because there's no need to worry about conversions and transforms. When the EA side is set, the problem side can be updated at once, or on first demand (implementing the so-called 'lazy' evaluation concept). If a 1-to-1 mapping between genotype and phenotype exists, then this transform can be reversed: if the problem side is updated, for example by some local search algorithm, the EA side can be updated as well.

There are no miracles here; it just means that all conversions are localized and must be implemented explicitly, but neither the EA nor the problem side should worry about how it's done and by what means.
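
A minimal sketch of such a two-sided candidate-solution (class and attribute names are mine): the genotype side is what the EA mutates, the phenotype side is built lazily by a localized decoder, and any EA-side change drops the cached phenotype.

```python
class TwoSidedIndividual:
    """Two-sided candidate-solution: the EA sees the genotype, the
    problem sees the phenotype; the conversion lives in one place."""
    def __init__(self, genotype, decode):
        self._genotype = genotype
        self._decode = decode            # the single localized conversion
        self._phenotype = None           # cache, empty until first demand

    @property
    def genotype(self):
        return self._genotype

    @genotype.setter
    def genotype(self, value):           # EA-side update invalidates cache
        self._genotype = value
        self._phenotype = None

    @property
    def phenotype(self):                 # problem side, computed lazily
        if self._phenotype is None:
            self._phenotype = self._decode(self._genotype)
        return self._phenotype

# Usage: the fitness function receives exactly the object it evaluates.
ind = TwoSidedIndividual([0, 1, 1, 0], decode=lambda bits: [float(b) for b in bits])
print(ind.phenotype)                     # built on first demand
ind.genotype = [1, 1, 1, 1]              # cached phenotype is dropped
```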

If genotype and phenotype are the same (real-valued EAs, direct graph representations etc.), the individuals can be made either one-sided (I would call it a 'Moebius representation' ;)) or two-sided, but with no difference between the sides.

And, as is customary, some pros and cons of this approach:

Pros:
  • A candidate-solution can appear as two distinct objects at the same time.
  • Potentially faster, because the phenotype can be computed once when the genotype changes and then cached.
  • More natural (genotype + phenotype).
  • The objective function does not need to support different encodings: it receives exactly the object it's meant to evaluate, which makes it easier to adapt for 3rd-party applications.
  • Possibly easier to parallelize, because problem-side representations are independent and there is very little chance of conflicts between them.
Cons:
  • Requires more memory, because a problem-side representation is stored for each EA-side representation.
  • Adds an extra interaction layer between the EA and the problem.
  • Probably more code to write, due to the more complex structure of candidate-solutions.
  • Candidate-solutions become 'heavier'.
Anyway, I think this idea of two-sided representations is interesting.

P.S. Oh, it's possible that I've just rediscovered someone else's idea. Yeah. What if Plato was right? ;)