Reinforcement learning or evolutionary strategies? - Both

Hey Habr!

We rarely dare to post translations of two-year-old texts, with no code and a clearly academic slant, but today we are making an exception. We hope that the dilemma posed in the title concerns many of our readers, and that you have already read in the original, or will read now, the seminal work on evolutionary strategies that this post argues with. Welcome under the cut!


In March 2017, OpenAI created a buzz in the deep learning community by publishing the paper "Evolution Strategies as a Scalable Alternative to Reinforcement Learning." The paper presented impressive results suggesting that reinforcement learning (RL) is far from the only option, and that when training complex neural networks it makes sense to try other methods. A discussion then broke out about the importance of reinforcement learning and whether it deserves the status of a "must-have" technology for teaching systems to solve problems. Here I want to argue that these two technologies should not be seen as competitors, one of which is clearly better than the other; on the contrary, they ultimately complement each other. Indeed, if you think a little about what it takes to build general AI, that is, systems capable throughout their existence of learning, judging and planning, you will almost certainly conclude that some kind of combined solution will be required. Incidentally, it was a combined solution that nature itself arrived at, endowing mammals and other higher animals with complex intelligence over the course of evolution.

Evolutionary Strategies

The main thesis of the OpenAI article was that instead of using reinforcement learning in combination with traditional backpropagation, they successfully trained a neural network to solve complex problems using what they call an "evolution strategy" (ES). The ES approach is to maintain a distribution over the network's weights, with many agents working in parallel, each using parameters sampled from this distribution. Each agent acts in its own copy of the environment, and after a given number of episodes, or of steps within an episode, the algorithm returns the cumulative reward as a fitness score. Given these scores, the parameter distribution can be shifted toward the more successful agents and away from the less successful ones. By repeating this operation millions of times with hundreds of agents, it is possible to move the weight distribution into a region that yields a high-quality policy for the task at hand. Indeed, the results presented in the article are impressive: they show that if you run a thousand agents in parallel, anthropomorphic locomotion on two legs can be learned in less than half an hour (whereas even the most advanced RL methods require more than an hour). For more details, I recommend the excellent blog post by the authors of the experiment, as well as the paper itself.

[Figure: different strategies for anthropomorphic upright walking, learned with OpenAI's ES method.]
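
To make the loop described above concrete, here is a minimal sketch of this kind of evolution strategy on a toy fitness function instead of a real RL environment. The names (`fitness`, `npop`, `sigma`, `alpha`) and the toy objective are purely illustrative and not taken from the OpenAI implementation.

```python
import numpy as np

# Toy stand-in for "cumulative reward of an agent whose network weights are theta".
TARGET = np.array([0.5, 0.1, -0.3])

def fitness(theta):
    return -np.sum((theta - TARGET) ** 2)

npop = 50          # number of parallel "agents" per generation
sigma = 0.1        # spread of the weight distribution around its mean
alpha = 0.03       # how far the mean is shifted each generation
theta = np.random.randn(3)   # mean of the distribution over weights

for generation in range(300):
    eps = np.random.randn(npop, theta.size)              # one perturbation per agent
    rewards = np.array([fitness(theta + sigma * e) for e in eps])
    # Normalize so above-average agents pull the distribution toward themselves
    # and below-average agents push it away.
    advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    theta = theta + alpha / (npop * sigma) * (eps.T @ advantage)
```

After a few hundred generations `theta` converges to the target; in the locomotion experiments the same kind of update is applied to all the weights of a policy network, with the rewards coming from a physics simulator.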

Black box

The great benefit of this method is that it is easy to parallelize. While RL methods such as A3C require information to be exchanged between worker threads and a parameter server, ES only needs fitness scores and high-level information about the parameter distribution. It is precisely because of this simplicity that the method far outperforms modern RL methods in terms of scaling. However, all of this comes at a cost: the network has to be optimized as a black box. Here "black box" means that during training the internal structure of the network is completely ignored and only the overall result (the reward for an episode) is used; it alone determines whether the weights of a particular network will be inherited by the next generations. In situations where we do not get much feedback from the environment, and in many traditional RL tasks the reward stream is very sparse, the problem goes from being a "partial black box" to a "complete black box" anyway. In that case ES can deliver a serious gain in performance, so the trade-off is certainly justified. "Who needs gradients if they are hopelessly noisy anyway?" is the general sentiment.
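
One reason the communication stays so cheap is that workers never have to ship weight vectors at all. A trick described in the OpenAI work, as I understand it, is to let each worker regenerate its own perturbation from a known random seed and send back only a scalar score. The sketch below, with hypothetical `run_worker`/`master_update` helpers, illustrates the idea.

```python
import numpy as np

def run_worker(seed, theta, sigma, episode_reward):
    """A worker samples its perturbation from a shared seed and returns a single
    float; the perturbation itself never crosses the network."""
    eps = np.random.RandomState(seed).randn(theta.size)
    return episode_reward(theta + sigma * eps)

def master_update(theta, seeds, rewards, sigma, alpha):
    """The master regenerates each perturbation from the seed it handed out
    and combines it with the scalar reward that came back."""
    rewards = np.asarray(rewards, dtype=float)
    advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    update = np.zeros_like(theta)
    for seed, adv in zip(seeds, advantage):
        update += adv * np.random.RandomState(seed).randn(theta.size)
    return theta + alpha / (len(seeds) * sigma) * update
```

In a real setup `run_worker` would execute on a separate machine, and everything the master needs from it fits into a few bytes per episode.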

However, in situations where the feedback is richer, things start to go wrong for ES. The OpenAI team describes training a simple MNIST classification network with ES, and this time training was 1000 times slower. The reason is that the gradient signal in image classification is extremely informative about how to make the network a better classifier. So the problem is not so much with RL as a technique as with sparse rewards in environments that produce noisy gradients.
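
The contrast is easy to see on any smooth loss: backpropagation returns the exact gradient in one backward pass, while an ES-style estimator needs many extra evaluations for a noisy approximation of the same vector. The snippet below is only an illustration on a toy quadratic loss, not the MNIST experiment from the post.

```python
import numpy as np

def loss(theta):                      # stand-in for a differentiable classification loss
    return np.sum(theta ** 2)

def exact_grad(theta):                # what one backward pass would give us
    return 2.0 * theta

def es_grad(theta, sigma=0.05, n=200):
    """Black-box estimate of the same gradient from n perturbed evaluations."""
    eps = np.random.randn(n, theta.size)
    losses = np.array([loss(theta + sigma * e) for e in eps])
    return eps.T @ (losses - losses.mean()) / (n * sigma)

theta = np.random.randn(10)
# The gap shrinks only as n grows: every extra digit of accuracy costs more rollouts.
print(np.linalg.norm(es_grad(theta) - exact_grad(theta)))
```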

The solution found by nature

If you try to learn from nature's example when thinking about how to develop AI, then in some cases this can be seen as a problematic approach. After all, nature operates under constraints that computer scientists simply do not have. There is a view that a purely theoretical approach to a given problem can yield more effective solutions than empirical alternatives. Still, I believe it is worth examining how a dynamic system operating under certain constraints (the Earth) produced agents (animals, mammals in particular) capable of flexible and complex behavior. While some of these constraints do not apply in the simulated worlds of data science, others hold up just fine.

Looking at the intelligent behavior of mammals, we see that it arises from the complex interplay of two closely intertwined processes: learning from the experience of others and learning from one's own experience. The former is often equated with evolution by natural selection, but here I use a broader term that also covers epigenetics, microbiomes, and other mechanisms that allow experience to be shared between organisms that are not genetically related. The second process, learning from one's own experience, is all the information an animal manages to acquire over its lifetime, driven directly by the animal's interaction with the outside world. This category covers everything from learning to recognize objects to mastering the kind of communication that learning itself involves.

Roughly speaking, these two processes in nature can be compared with the two options for optimizing neural networks. Evolutionary strategies, where fitness information is used to update the parameters of an organism, come close to learning from the experience of others. Gradient methods, where one particular experience or another leads to specific changes in the agent's behavior, are comparable to learning from one's own experience. If you think about the kinds of intelligent behavior or abilities that each of these two approaches develops in animals, the comparison becomes more pronounced. In both the natural and the artificial setting, "evolutionary" methods promote the learning of reactive behaviors that provide a certain level of fitness (enough to stay alive). Learning to walk or to escape captivity is in many cases equivalent to the more "instinctive" behaviors that are "hard-wired" into many animals at the genetic level. In addition, this comparison shows that evolutionary methods are applicable when the reward signal is extremely rare (for example, the fact of successfully raising offspring): in such a case the reward cannot be tied to any specific set of actions that may have been performed many years before that fact occurred. On the other hand, if we consider the case where ES fails, namely image classification, the results are remarkably comparable to the results of animal learning obtained in countless behavioral psychology experiments over more than 100 years.

Animal learning

The methods used in reinforcement learning are in many cases taken directly from the psychological literature on operant conditioning, which itself was studied on animals. Incidentally, Richard Sutton, one of the two founders of reinforcement learning, holds a bachelor's degree in psychology. In operant conditioning, animals learn to associate reward or punishment with specific behavioral patterns. Trainers and researchers can manipulate this reward association in various ways, prompting animals to demonstrate intelligence or particular behaviors. However, the operant conditioning used in animal research is nothing more than a more explicit form of the conditioning through which animals learn throughout their lives: we constantly receive signals of positive reinforcement from the environment and adjust our behavior accordingly. Indeed, many neuroscientists and cognitive scientists believe that humans and other animals actually operate at an even higher level and are constantly learning to predict the outcomes of their behavior in future situations, counting on potential rewards.

The central role of prediction in learning from one's own experience changes the dynamics described above in a significant way. The signal that previously seemed very sparse (the episodic reward) turns out to be very dense. Theoretically, the picture is roughly this: at every moment the mammalian brain is predicting outcomes from the complex stream of sensory stimuli and actions in which the animal is immersed, and the outcomes that actually follow provide a dense signal to guide the correction of those predictions and the shaping of behavior. The brain uses all these signals to optimize its predictions (and, accordingly, the quality of its actions) in the future. An overview of this approach is given in the excellent book "Surfing Uncertainty" by the cognitive scientist and philosopher Andy Clark. If we extrapolate this reasoning to the training of artificial agents, reinforcement learning reveals a fundamental flaw: the signal used in this paradigm is hopelessly weak compared to what it could be (or should be). In cases where it is impossible to make the signal richer (perhaps because it is weak by definition, or tied to low-level reactivity), it is probably better to prefer a training method that parallelizes well, such as ES.

Richer Training of Neural Networks

Building on the principles of higher nervous activity in the mammalian brain, which is constantly busy making predictions, some recent progress has been made in reinforcement learning methods that take the importance of such predictions into account. Two works along these lines come to mind right away.

In both of these papers, the authors supplement the usual, default policy of their neural networks with predictions about the future state of the environment. In the first paper, the prediction targets are a set of measurement variables; in the second, they are changes in the environment and in the agent's own behavior. In both cases, the sparse signal associated with positive reinforcement becomes much richer and more informative, enabling both faster learning and the acquisition of more complex behaviors. Such enhancements are only available with methods that use a gradient signal, not with black-box methods such as ES.
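
Schematically, the idea amounts to giving the policy network a second head whose targets are dense and supervised. The sketch below is a generic illustration of that pattern in PyTorch, not the architecture of either paper; the prediction target (`future_targets`) could be future measurements, the next observation, or any other summary of what happens next.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyWithPrediction(nn.Module):
    """Shared trunk, one head for actions, one head for predicting the future."""
    def __init__(self, obs_dim, n_actions, pred_dim):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.policy_head = nn.Linear(128, n_actions)     # action logits
        self.prediction_head = nn.Linear(128, pred_dim)  # dense auxiliary targets

    def forward(self, obs):
        h = self.trunk(obs)
        return self.policy_head(h), self.prediction_head(h)

def combined_loss(logits, pred, actions, returns, future_targets, beta=0.5):
    # Sparse RL term (a plain policy-gradient loss here, for simplicity)...
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(chosen * returns).mean()
    # ...densified by a supervised prediction loss that is available at every step.
    aux_loss = F.mse_loss(pred, future_targets)
    return pg_loss + beta * aux_loss
```

Because both heads share the trunk, the gradient from the dense auxiliary loss also shapes the representation the policy relies on.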

In addition, learning from one's own experience and gradient methods are much more data-efficient. Even in the cases where a problem could be solved faster with ES than with reinforcement learning, the gain was achieved because the ES strategy consumed many times more data than RL. Reflecting on the principles of animal learning here, note that the results of learning from someone else's example show up only after many generations, whereas a single event experienced firsthand is sometimes enough for an animal to learn its lesson forever. While this kind of one-shot learning does not yet fit neatly into traditional gradient methods, it is far more attainable there than with ES. There are, for example, approaches such as neural episodic control, where Q-values are stored during training and the program consults them before taking actions. The result is a gradient method that learns to solve problems much faster than before. In the paper on neural episodic control, the authors mention the human hippocampus, which can retain information about an event even after a single experience and therefore plays a critical role in remembering. Such mechanisms require access to the agent's internal organization, which is by definition impossible in the ES paradigm.
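
As a rough illustration of the episodic idea (not the actual neural episodic control algorithm, which looks up learned embeddings in a differentiable memory with nearest-neighbour search), here is a toy tabular memory that remembers the best return ever obtained from a state-action pair and consults it before acting:

```python
import numpy as np

class EpisodicMemory:
    """Toy stand-in for an episodic Q-table: one good episode is enough to remember."""
    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.q = {}                                   # state key -> per-action best return

    def _key(self, state):
        return tuple(np.round(state, 1))              # crude discretisation of the state

    def write(self, state, action, episode_return):
        row = self.q.setdefault(self._key(state), np.full(self.n_actions, -np.inf))
        row[action] = max(row[action], episode_return)

    def act(self, state, epsilon=0.1):
        key = self._key(state)
        if key not in self.q or np.random.rand() < epsilon:
            return np.random.randint(self.n_actions)  # explore / unseen state
        return int(np.argmax(self.q[key]))            # otherwise reuse what worked before
```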

So why not combine them?

Much of this article has probably left the impression that I am advocating RL methods. In fact, I believe that in the long run the best solution will be a combination of both methods, with each used in the situations it suits best. Obviously, for many reactive policies, or in situations with very sparse positive reinforcement signals, ES wins, especially if you have the computing power to run massively parallel training. On the other hand, gradient methods using reinforcement learning or supervised learning will be useful when a lot of feedback is available and the problem must be learned quickly and with less data.

Turning to nature, we find that the first method essentially lays the foundation for the second. That is why, over the course of evolution, mammals developed brains that let them learn extremely effectively from the complex signals coming from the environment. So the question remains open. Perhaps evolutionary strategies will help us discover effective learning architectures that will also be useful for gradient-based learning methods. After all, the solution nature found is indeed very successful.

Source: habr.com
