We have seen agents playing Atari games, AlphaGo beating world champions, agents doing financial trading and modelling natural language. After watching reinforcement learning agents do so well in some domains, let me tell you the real story here.
In this episode I want to shine some light on reinforcement learning (RL) and some of the limitations that every practitioner should consider before taking certain directions.
Reinforcement learning seems to work so well. What is wrong with reinforcement learning?
In this episode I speak about sample inefficiency, reward functions, the number of actions and states, and optimisation.
Sample inefficiency
Reinforcement learning needs a ton of data and training epochs, which translates into thousands of computing hours in a simulator. All that time is needed to learn what a human can usually grasp in a few hours.
For example, the Rainbow DQN results compare a number of algorithms on Atari games running on the same engine, with Rainbow coming out as the best one. Even that algorithm requires 44 million frames to play with superhuman capabilities. Rainbow DQN passes the 100% threshold (just above human-level performance) at about 18 million frames, which is roughly 83 hours of play experience. To this number one should add the time it takes to actually train the model.
That’s a lot of time! Especially when one considers that a teenager can pick up an Atari game within a few minutes.
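As a back-of-the-envelope check, here is the arithmetic behind that 83-hour figure, assuming the standard Atari frame rate of 60 frames per second (the function name is just illustrative).

```python
# Back-of-the-envelope: convert Atari emulator frames into hours of
# play experience, assuming the standard 60 frames per second.
ATARI_FPS = 60

def frames_to_hours(frames: int, fps: int = ATARI_FPS) -> float:
    """Convert a number of emulator frames into hours of gameplay."""
    return frames / fps / 3600

print(f"18M frames ~ {frames_to_hours(18_000_000):.0f} hours")  # ~ 83 hours
print(f"44M frames ~ {frames_to_hours(44_000_000):.0f} hours")  # ~ 204 hours
```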
The necessity of a reward function
For reinforcement learning to do the right thing, one must design a proper reward function, a function that captures exactly what the designer wants the reinforcement learning agent to solve.
In simulated environments like Atari video games it is relatively easy to design a reward function that captures what the agent is supposed to do, and such a clear signal helps the agent find the optimal policy in a straightforward way.
In realistic cases, however, designing a reward function is not as easy. For instance, the agent often faces multiple objectives at once, and many times the algorithm designer is not even aware of all the objectives the agent will have to deal with during the game.
Reward functions that are too simple will hardly capture all the possible scenarios of the game or problem to solve, which makes it very hard (if not impossible) for the agent to learn anything useful. Other times the agent learns to maximise its reward by performing actions that were never considered legitimate in the first place.
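To make this concrete, here is a minimal sketch of the problem for a hypothetical racing-style environment (the state fields and coefficients are illustrative assumptions, not taken from any specific library or paper): a naive reward based on raw speed can be gamed, while a hand-tuned shaped reward needs many arbitrary design decisions.

```python
from dataclasses import dataclass

# Hypothetical racing-style environment state; field names are
# illustrative, not from any specific library.
@dataclass
class CarState:
    speed: float
    track_progress_delta: float
    crashed: bool
    lap_completed: bool

def naive_reward(state: CarState) -> float:
    # Rewarding raw speed alone is easy to game: the agent can spin
    # in tight circles at full speed without ever finishing a lap.
    return state.speed

def shaped_reward(state: CarState) -> float:
    # A hand-tuned reward mixing several objectives is harder to game,
    # but every coefficient below is a design decision by a human.
    reward = 1.0 * state.track_progress_delta        # reward real progress
    reward -= 10.0 if state.crashed else 0.0         # discourage crashes
    reward += 100.0 if state.lap_completed else 0.0  # bonus for finishing
    return reward

# Example: a state where the car is fast but making no progress.
spinning = CarState(speed=90.0, track_progress_delta=0.0,
                    crashed=False, lap_completed=False)
print(naive_reward(spinning))   # 90.0 -> the naive reward looks great
print(shaped_reward(spinning))  # 0.0  -> the shaped reward does not
```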
Number of possible actions is usually finite
In the most common formulation of reinforcement learning, action sets are discrete. In many realistic use cases, however, agents perform actions in a continuous space. Making a continuous action space discrete is not only non-trivial, it also increases the number of (discrete) actions the agent has to deal with during policy optimisation. This, in turn, affects training time and performance.
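As an illustration, here is a minimal sketch of discretising a continuous action space, assuming a hypothetical control problem with a steering and a throttle dimension; note how the number of discrete actions grows exponentially with the number of action dimensions.

```python
import itertools
import numpy as np

# Hypothetical continuous control problem: steering in [-1, 1] and
# throttle in [0, 1]. Discretise each dimension into a fixed number of
# bins and take the Cartesian product as the discrete action set.
bins_per_dim = 11
steering = np.linspace(-1.0, 1.0, bins_per_dim)
throttle = np.linspace(0.0, 1.0, bins_per_dim)

discrete_actions = list(itertools.product(steering, throttle))
print(len(discrete_actions))  # 121 actions for just 2 dimensions

# With d action dimensions the count grows as bins_per_dim ** d,
# which quickly blows up the policy's output space.
for d in range(1, 6):
    print(d, bins_per_dim ** d)  # 11, 121, 1331, 14641, 161051
```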
Local optima are harder to escape than in deep learning
If one thinks that it can be hard for Stochastic Gradient Descent to get out of local optima, one should think twice: the optimisation problem in reinforcement learning is considerably harder than in supervised deep learning, and it is much more difficult for a reinforcement learning agent to escape local optima. This is due both to the design of the reward function and to the state-action value estimator itself (which is usually a deep neural network).
Do you agree about what is wrong with reinforcement learning?
Don’t forget to join the conversation on our new Discord channel. See you there!