ohp
briefly explain what RL is
example
situation => points + fuel action =>
situation => membrane voltage, action => fire or not

subtitle
toward a more intelligent AI with NN

<a possible brief explanation of RL in OHP>
"In reinforcement learning, an agent tries to learn a policy, i.e., how to select an action in a given state of the environment, so that it maximizes the total amount of reward it receives when interacting with the environment."
---
"Q-learning (Watkins, 1989) is a form of reinforcement learning where the optimal policy is learned implicitly in the form of a Q-function, which takes a state-action pair as input and outputs the quality of the action in that state. The optimal action in a given state is then the action with the largest Q-value."
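A minimal sketch of the idea in the quote above, assuming a small tabular Q-function stored as a NumPy array. The names `greedy_action` and `q_update`, the table size, and the values of `alpha` (learning rate) and `gamma` (discount factor) are illustrative assumptions, not taken from the quoted source.

```python
import numpy as np

def greedy_action(q_table, state):
    """Return the action with the largest Q-value in the given state."""
    return int(np.argmax(q_table[state]))

def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one standard Q-learning update to the table, in place."""
    # Target: immediate reward plus discounted value of the best next action.
    target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (target - q_table[state, action])

# Hypothetical toy problem: 3 states, 2 actions, all Q-values start at zero.
q = np.zeros((3, 2))
q_update(q, state=0, action=1, reward=1.0, next_state=1)
best = greedy_action(q, 0)  # action 1 now has the largest Q-value in state 0
```

The policy is never stored explicitly: it is read off the table by taking the argmax over actions.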
---
One of the main limitations of standard Q-learning is related to the number of different
state-action pairs that may exist. The Q-function can in principle be represented as a table
with one entry for each state-action pair.
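To make the size problem concrete, here is a rough sketch that counts table entries for a hypothetical environment whose state consists of `num_cells` cells, each taking one of `values_per_cell` values. The environment and the function name are assumptions for illustration only.

```python
def q_table_entries(num_cells, values_per_cell, num_actions):
    """Count the entries of a tabular Q-function: one per state-action pair."""
    num_states = values_per_cell ** num_cells  # every combination of cell values
    return num_states * num_actions

small = q_table_entries(num_cells=4, values_per_cell=2, num_actions=4)
large = q_table_entries(num_cells=100, values_per_cell=2, num_actions=4)
```

With 4 binary cells the table has 64 entries; with 100 binary cells it has 4 * 2**100 entries, far too many to store or to visit during learning.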
---
"Using random exploration through the search space, rewards may simply never be encountered."
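A small sketch of why this happens, assuming a hypothetical one-dimensional corridor in which the only reward sits at position `goal` and the agent moves left or right at random. The corridor, the step limit, and the function names are illustrative assumptions.

```python
import random

def random_episode(goal, max_steps, rng):
    """Random walk from position 0; return True if the rewarded goal is reached."""
    pos = 0
    for _ in range(max_steps):
        pos += rng.choice((-1, 1))
        pos = max(pos, 0)  # wall at position 0
        if pos == goal:
            return True
    return False

def success_rate(goal, max_steps, episodes=1000, seed=0):
    """Fraction of purely random episodes that ever see the reward."""
    rng = random.Random(seed)
    hits = sum(random_episode(goal, max_steps, rng) for _ in range(episodes))
    return hits / episodes

near = success_rate(goal=2, max_steps=20)   # reward close to the start
far = success_rate(goal=15, max_steps=20)   # reward far from the start
```

When the reward is far from the start, almost no random episode reaches it, so the learner has nothing to propagate back through the Q-values.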
---
"Thus a mix between the classical unsupervised Q-learning and (supervised) behavioral cloning is obtained." <ideally unsupervised learning is wanted, but in practice some supervision is more or less needed>


GA
The agent's chromosome decides its actions (which way to move at each step)
RL
The agent's policy decides its actions (which way to move at each step)


in the sense that
the agent follows its policy with probability epsilon and acts at random with probability (1-epsilon),
resulting in non-deterministic movement of the agent
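A minimal sketch of this action selection, using the convention of these notes: epsilon is the probability of following the policy. Many texts use the opposite convention, where epsilon is the probability of acting at random. The function name `choose_action` and the action labels are assumptions for illustration.

```python
import random

def choose_action(policy_action, actions, epsilon, rng=random):
    """Follow the policy with probability epsilon, else pick an action uniformly at random."""
    if rng.random() < epsilon:
        return policy_action
    return rng.choice(actions)

# Hypothetical grid-world moves; the policy always says "up" here.
actions = ["up", "down", "left", "right"]
rng = random.Random(0)
moves = [choose_action("up", actions, epsilon=0.8, rng=rng) for _ in range(10)]
```

Because some steps are random, two runs from the same start can take different paths even though the policy itself is fixed.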


abstract
key references