This is a sketch for a review on meta-heuristic tricks for learning properties of a search space. More formally, a lot of advanced (and not so much) algorithms try to figure out an information about local gradient for more efficient evolutionary search. This is surely feasible, because (ideally) the gradient shows the best direction to move towards local optimum and most numeric optimization methods use this information. The question about how to get away from a local optimum concerns another (although associated) problem and is not considered.
It's worth noting, that getting gradient info for noisy functions can be difficult and requires some statistical processing, which may result in several alternatives for the best direction (not obligatory collinear). Probably fuzzy logic could be of help here. I also assume that objective function has local trend, i.e. that for each point *there is* the best movement direction, which is at least good for some not very small neighborhood, because otherwise finding gradient doesn't make sense. One can say that this means that the function can be approximated by some smooth function with the same global optimum.
We can find 2 groups of methods, which learn gradient information:
1. Explicit gradient learning. There are formulae for gradient approximation, which are used during recombination or variation operations.
2. Implicit gradient learning. These methods simply do not tell directly that they are learning gradient, but they are sort of 'I'll-know-it-when-I-see-it'.
Examples of the first group are: EA with direction crossover and Pseudo-Gradient, CMA-ES, many real-valued EDAs and (h)BOA algorithms.
For the second group examples are: Swarm Optimization Methods, Differential Evolution.
In my opinion, the meta-heuristic method can be aknowledged as 'gradient learning' if there is sort of memory for the best movement direction.
Where are conventional real-coded EAs? In my current understanding it should be placed into the 2nd group. Here is my motivation: many EA operators do not look for gradient, but simply search for better solutions in vicinity of parent individuals. Most operators do not 'remember' good searching directions. But there is a selection, which selects in which direction the population evolves and thus follows the gradient. Probably many EAs should even be treated as 'implicitly implicit' gradient learning methods :).
Up to now only continuous spaces were considered, because gradient is natural there. What about binary spaces? Or spaces of graphs, computer programs, agents policies or neural networks? I believe that the gradient notion can still be applied there as long as there is a proper metric over the set of representations.
Recall that there's a large class of finite-difference methods, which quintify the space making it non-continous, but can still be succesfully applied for continous modelling problems. It can be also argued that one of the most used variant of the Taylor expansion has linear form from which the approximate gradient can be calculated.
So the largest problem for defining gradient in non-continous spaces is setting a good metric. By this 'good' word I mean that it should not be just everything, which satisfies metric properties, but it also should reflect the problem's properties. For example, setting metric for directed weighted graphs for TSP problem is not the same as setting metric for directed weighted graphs for evolution of neural networks. That's the issue. An interesting question: if a kind of 'approximate' metrics exists, which is metrics, but without the whole pack of desired problem dependent properties?
It's worth noting, that getting gradient info for noisy functions can be difficult and requires some statistical processing, which may result in several alternatives for the best direction (not obligatory collinear). Probably fuzzy logic could be of help here. I also assume that objective function has local trend, i.e. that for each point *there is* the best movement direction, which is at least good for some not very small neighborhood, because otherwise finding gradient doesn't make sense. One can say that this means that the function can be approximated by some smooth function with the same global optimum.
We can find 2 groups of methods, which learn gradient information:
1. Explicit gradient learning. There are formulae for gradient approximation, which are used during recombination or variation operations.
2. Implicit gradient learning. These methods simply do not tell directly that they are learning gradient, but they are sort of 'I'll-know-it-when-I-see-it'.
Examples of the first group are: EA with direction crossover and Pseudo-Gradient, CMA-ES, many real-valued EDAs and (h)BOA algorithms.
For the second group examples are: Swarm Optimization Methods, Differential Evolution.
In my opinion, the meta-heuristic method can be aknowledged as 'gradient learning' if there is sort of memory for the best movement direction.
Where are conventional real-coded EAs? In my current understanding it should be placed into the 2nd group. Here is my motivation: many EA operators do not look for gradient, but simply search for better solutions in vicinity of parent individuals. Most operators do not 'remember' good searching directions. But there is a selection, which selects in which direction the population evolves and thus follows the gradient. Probably many EAs should even be treated as 'implicitly implicit' gradient learning methods :).
Up to now only continuous spaces were considered, because gradient is natural there. What about binary spaces? Or spaces of graphs, computer programs, agents policies or neural networks? I believe that the gradient notion can still be applied there as long as there is a proper metric over the set of representations.
Recall that there's a large class of finite-difference methods, which quintify the space making it non-continous, but can still be succesfully applied for continous modelling problems. It can be also argued that one of the most used variant of the Taylor expansion has linear form from which the approximate gradient can be calculated.
So the largest problem for defining gradient in non-continous spaces is setting a good metric. By this 'good' word I mean that it should not be just everything, which satisfies metric properties, but it also should reflect the problem's properties. For example, setting metric for directed weighted graphs for TSP problem is not the same as setting metric for directed weighted graphs for evolution of neural networks. That's the issue. An interesting question: if a kind of 'approximate' metrics exists, which is metrics, but without the whole pack of desired problem dependent properties?
No comments:
Post a Comment