Kurt Driessens and Saso Dzeroski (2004), "Integrating Guidance into Relational Reinforcement Learning", Machine Learning, 57, pp. 271-304.
==================================================================================================

Reinforcement learning, and Q-learning in particular, encounters two major problems when dealing with large state spaces. First, learning the Q-function in tabular form may be infeasible because of the excessive amount of memory needed to store the table, and because the Q-function only converges after each state has been visited multiple times. Second, rewards in the state space may be so sparse that with random exploration they will only be discovered extremely slowly.

...

--------------------------------------------------------------------------------------------------

"In reinforcement learning, an agent tries to learn a policy, i.e., how to select an action in a given state of the environment, so that it maximizes the total amount of reward it receives when interacting with the environment."

---

"Q-learning (Watkins, 1989) is a form of reinforcement learning where the optimal policy is learned implicitly in the form of a Q-function, which takes a state-action pair as input and outputs the quality of the action in that state. The optimal action in a given state is then the action with the largest Q-value."

---

One of the main limitations of standard Q-learning is related to the number of different state-action pairs that may exist. The Q-function can in principle be represented as a table with one entry for each state-action pair.

---

"Using random exploration through the search space, rewards may simply never be encountered."

---

"Thus a mix between the classical unsupervised Q-learning and (supervised) behavioral cloning is obtained."
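
To make the mechanics behind these quotes concrete, below is a minimal, generic sketch of tabular Q-learning with epsilon-greedy exploration on a toy chain MDP with a single sparse reward, plus an optional replay of teacher traces through the same Q-update as a crude stand-in for the guidance idea. The ChainEnv interface, the parameter names, and the replay scheme are illustrative assumptions, not from the paper; Driessens and Dzeroski learn a relational Q-function (via relational regression) rather than a table, and use the guidance traces as additional learning examples for that function.

```python
# Minimal sketch, not the paper's method: plain tabular Q-learning on a toy
# chain MDP, with an optional "guidance" step that replays teacher traces
# through the same Q-update before ordinary exploration starts.
import random
from collections import defaultdict


class ChainEnv:
    """Toy chain MDP with states 0..n and a single reward at state n.

    Hypothetical environment used only for illustration; it mimics the
    sparse-reward situation described above, where random exploration
    finds the reward very slowly.
    """

    def __init__(self, n=10):
        self.n = n

    def reset(self):
        return 0

    def done(self, s):
        return s == self.n

    def actions(self, s):
        return [] if self.done(s) else ["left", "right"]

    def step(self, s, a):
        s_next = min(s + 1, self.n) if a == "right" else max(s - 1, 0)
        return s_next, (1.0 if s_next == self.n else 0.0)


def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1,
               max_steps=500, guidance_episodes=None):
    """Tabular Q-learning with epsilon-greedy exploration.

    `guidance_episodes` is an optional list of teacher traces, each a list
    of (state, action, reward, next_state) tuples.  Replaying them through
    the standard Q-update is one simple way of mixing guidance into
    otherwise unsupervised Q-learning.
    """
    Q = defaultdict(float)  # (state, action) -> estimated Q-value

    def update(s, a, r, s_next):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max((Q[(s_next, a2)] for a2 in env.actions(s_next)),
                        default=0.0)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    # Replay teacher traces first, so that the sparse reward reached by the
    # teacher is already reflected in the Q-table when exploration starts.
    for trace in (guidance_episodes or []):
        for (s, a, r, s_next) in trace:
            update(s, a, r, s_next)

    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            if env.done(s):
                break
            actions = env.actions(s)
            if random.random() < epsilon:
                a = random.choice(actions)  # explore
            else:
                # exploit, breaking ties between equal Q-values at random
                a = max(actions, key=lambda act: (Q[(s, act)], random.random()))
            s_next, r = env.step(s, a)
            update(s, a, r, s_next)
            s = s_next
    return Q


if __name__ == "__main__":
    env = ChainEnv(n=10)
    # A hand-written teacher trace that walks straight to the goal;
    # replaying it several times backs the reward up along the chain.
    teacher = [(s, "right", 1.0 if s + 1 == env.n else 0.0, s + 1)
               for s in range(env.n)]
    Q = q_learning(env, guidance_episodes=[teacher] * 20)
    print(max(["left", "right"], key=lambda a: Q[(0, a)]))  # expect "right"
```

Without the teacher traces, the agent starts from an all-zero table and stumbles onto the end-of-chain reward only by chance, which is exactly the sparse-reward problem described in the summary; replaying the traces first backs the reward up along the chain before exploration begins, which is the tabular analogue of the Q-learning/behavioral-cloning mix quoted above.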