Hey Habr!
We rarely dare to post translations of two-year-old texts here, especially ones with no code and a clearly academic focus, but today we make an exception. We hope that the dilemma posed in the title of the article concerns many of our readers, and that you have already read, or will now read, the seminal work on evolutionary strategies that this post argues with. Details are under the cut.
In March 2017, OpenAI created a buzz in the deep learning community by publishing the article "Evolution Strategies as a Scalable Alternative to Reinforcement Learning".
The main thesis of the OpenAI article was that, instead of using reinforcement learning with traditional backpropagation, they successfully trained a neural network to solve complex problems using a so-called "evolution strategy" (ES). The ES approach maintains a distribution over the network's weights; many agents work in parallel, each using parameters sampled from this distribution. Each agent acts in its own copy of the environment, and after a given number of episodes (or steps within an episode) the algorithm returns the cumulative reward as a fitness score. Given these scores, the distribution of parameters is shifted toward the more successful agents and away from the less successful ones. Repeating this operation millions of times with hundreds of agents moves the weight distribution into a region that yields a high-quality policy for the task. The results presented in the article are indeed impressive: it is shown that with a thousand agents running in parallel, anthropomorphic bipedal locomotion can be learned in less than half an hour (whereas even the most advanced RL methods take more than an hour). For more details, I recommend reading the excellent post
Different Strategies for Teaching Anthropomorphic Upright Walking Learned from OpenAI's ES Method.
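The update rule described above fits in a few lines of NumPy. The following is a minimal single-process sketch, not the OpenAI implementation: the quadratic toy fitness function, population size, and learning rates here are illustrative assumptions, and a real setup would collect episode rewards from many parallel environments instead.

```python
import numpy as np

def fitness(w):
    # Toy stand-in for a cumulative episode reward: higher is better,
    # maximal when w equals the (hypothetical) target parameter vector.
    target = np.array([0.5, -0.3, 0.8])
    return -np.sum((w - target) ** 2)

def evolution_strategy(n_params=3, pop_size=50, sigma=0.1, alpha=0.01,
                       iters=300, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_params)  # mean of the distribution over weights
    for _ in range(iters):
        # One random perturbation per "agent" in the population.
        eps = rng.standard_normal((pop_size, n_params))
        rewards = np.array([fitness(theta + sigma * e) for e in eps])
        # Normalize rewards so the scale of the update is stable.
        advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        # Shift the distribution toward the perturbations of the more
        # successful agents and away from the less successful ones.
        theta = theta + alpha / (pop_size * sigma) * eps.T @ advantage
    return theta

theta = evolution_strategy()
```

Note that each iteration only needs the scalar fitness values from the agents, which is why the method parallelizes with so little communication.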
Black box
The great benefit of this method is that it is easy to parallelize. Whereas RL methods such as A3C require worker threads to exchange information with a parameter server, ES needs only fitness scores and generalized information about the parameter distribution. It is precisely this simplicity that lets the method far outscale modern RL methods. However, this does not come for free: you have to optimize the network as a black box. "Black box" here means that during training the internal structure of the network is completely ignored; only the overall result (the reward per episode) is used, and it alone determines whether the weights of a particular network will be inherited by subsequent generations. In situations where we do not get much feedback from the environment (and in many traditional RL tasks the reward stream is very sparse), the problem goes from being a "partial black box" to a "complete black box." In that case the gain in performance can be substantial, so the compromise is, of course, justified. "Who needs gradients if they're hopelessly noisy anyway?" is the general opinion.
However, in situations where feedback is more plentiful, things start to go wrong for ES. The OpenAI team describes how a simple MNIST classification network was trained with ES, and this time training was 1000 times slower than with backpropagation. The fact is that the gradient signal in image classification is extremely informative about how to make the network classify better. So the problem is not so much with the RL technique itself as with sparse rewards in environments that produce noisy gradients.
The solution found by nature
If we look to nature for ideas on how to develop AI, it is instructive to consider how learning works in animals.
Considering the intelligent behavior of mammals, we see that it arises from the complex interplay of two closely interconnected processes: learning from experience and learning by doing. The former is often identified with evolution by natural selection, but here I use a broader term that also covers epigenetics, microbiomes, and other mechanisms that let genetically unrelated organisms exchange experience. The second process, learning by doing, covers everything an animal manages to learn over its lifetime, and this knowledge comes directly from that animal's interaction with the outside world. This category includes everything from learning to recognize objects to mastering the communication involved in the learning process.
Roughly speaking, these two natural processes can be matched with the two options for optimizing neural networks. Evolutionary strategies, where fitness information is used to update a distribution over organisms, are close to learning from experience. Gradient methods, where a particular experience leads to particular changes in the agent's behavior, are comparable to learning by doing. Thinking about what kinds of intelligent behavior or abilities each of the two approaches develops in animals makes the comparison more vivid. Evolutionary methods promote reactive behaviors that provide a certain level of fitness (enough to stay alive). Learning to walk or to escape from captivity is in many cases equivalent to the more "instinctive" behaviors "hard-wired" into many animals at the genetic level. In addition, this example confirms that evolutionary methods are applicable when the reward signal is extremely rare (for example, the fact of successfully rearing offspring): in such a case it is impossible to relate the reward to any specific set of actions that may have been performed many years before. On the other hand, if we consider the case in which ES fails, namely image classification, the results are remarkably comparable to the results of animal learning achieved in countless behavioral-psychology experiments conducted over more than 100 years.
Animal learning
The methods used in reinforcement learning are in many cases taken directly from the psychological literature on animal conditioning.
The central role of predictive learning changes the dynamics described above in the most significant way. The signal previously considered very sparse (the episodic reward) turns out to be very dense. Theoretically, the situation is something like this: at every moment the mammalian brain computes predictions from a complex stream of sensory stimuli and actions, while the animal is simply immersed in that stream. The animal's actual behavior then provides a dense error signal, which is used to correct predictions and shape behavior. The brain uses all these signals to optimize its predictions (and, accordingly, the quality of its actions) in the future. An overview of this approach is given in an excellent book on the subject.
Richer Training of Neural Networks
Building on these principles of higher nervous activity in the mammalian brain, which is constantly engaged in prediction, some recent progress has been made in reinforcement learning methods that take the importance of such predictions into account. Right off the bat, I can recommend two such papers:
In both of these papers, the authors supplement the typical default policy of their neural networks with predictions about the future state of the environment. In the first article, prediction is applied to a variety of measurement variables; in the second, to changes in the environment and to the agent's own behavior. In both cases, the sparse signal associated with positive reinforcement becomes much richer and more informative, enabling both faster learning and the acquisition of more complex behaviors. Such enhancements are available only with methods that use a gradient signal, not with black-box methods such as ES.
In addition, learning by doing and gradient methods are much more sample-efficient. Even in cases where a problem could be solved faster with ES than with reinforcement learning, the gain was achieved because ES consumed many times more data than RL. Reflecting on how animals learn, note that the results of learning from someone else's example show up only after many generations, while sometimes a single event experienced first-hand is enough for an animal to learn a lesson forever.
So why not combine them?
Probably much of this article could leave the impression that I am advocating RL methods. In fact, I believe that in the long run the best solution is to combine the two, using each where it is best suited. Obviously, in the case of many reactive policies, or in situations with very sparse positive-reinforcement signals, ES wins, especially if you have the computing power to run massively parallel training. On the other hand, gradient methods using reinforcement learning or supervised learning will be useful when rich feedback is available and the problem must be learned quickly and with less data.
Turning to nature, we find that the first method essentially lays the foundation for the second. That is why, over the course of evolution, mammals have developed brains that let them learn extremely effectively from the complex signals coming from the environment. So the question remains open. Perhaps evolutionary strategies will help us discover effective learning architectures that are also useful for gradient-based learning methods. After all, the solution found by nature is indeed very successful.
Source: habr.com