ohp

briefly explain what RL is
example
  situation => points + fuel
  action => membrane-voltage
  action fire or not

subtitle: toward a more intelligent AI with NN

"In reinforcement learning, an agent tries to learn a policy, i.e., how to select an action in a given state of the environment, so that it maximizes the total amount of reward it receives when interacting with the environment."
---
"Q-learning (Watkins, 1989) is a form of reinforcement learning where the optimal policy is learned implicitly in the form of a Q-function, which takes a state-action pair as input and outputs the quality of the action in that state. The optimal action in a given state is then the action with the largest Q-value."
---
One of the main limitations of standard Q-learning is related to the number of different state-action pairs that may exist. The Q-function can in principle be represented as a table with one entry for each state-action pair.
---
"Using random exploration through the search space, rewards may simply never be encountered."
---
"Thus a mix between the classical unsupervised Q-learning and (supervised) behavioral cloning is obtained."

GA
  Agent with its chromosome decides actions (which way to go one step)
RL
  Agent with its policy decides actions (which way to go one step),
  in the sense that the agent follows its policy with probability epsilon
  and moves at random with probability (1-epsilon).
  This results in non-deterministic movement of the agent.

abstract
key references
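
The Q-table and the epsilon rule from the notes above can be sketched as follows. This is only an illustrative sketch, not the method of any of the quoted papers: the 5-cell corridor, the reward of 1 at the last cell, and the names select_action, q_update, step, train, alpha and gamma are all assumptions made for the example. It keeps the convention of these notes, where epsilon is the probability of following the policy (many textbooks define epsilon the other way round, as the probability of a random move).

```python
import random
from collections import defaultdict

ACTIONS = [-1, +1]   # hypothetical action set: one step left or one step right
GOAL = 4             # hypothetical corridor with cells 0..4, reward at cell 4


def greedy_action(q, state, actions):
    """Action with the largest Q-value in this state (first one on ties)."""
    return max(actions, key=lambda a: q[(state, a)])


def select_action(q, state, actions, epsilon, rng):
    """Follow the policy with probability epsilon, move at random otherwise.

    Convention of the notes: epsilon = probability of following the policy.
    """
    if rng.random() < epsilon:
        best = max(q[(state, a)] for a in actions)
        # Break ties at random so an untrained agent is not biased to one side.
        return rng.choice([a for a in actions if q[(state, a)] == best])
    return rng.choice(actions)


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """One tabular Q-learning update for a single (state, action) entry."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


def step(state, action):
    """Toy environment: move inside the corridor, reward 1.0 on reaching GOAL."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def train(episodes=200, epsilon=0.8, alpha=0.1, gamma=0.9, seed=0):
    """Learn a Q-table: one entry per state-action pair, 0.0 until visited."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = select_action(q, state, ACTIONS, epsilon, rng)
            next_state, reward, done = step(state, action)
            q_update(q, state, action, reward, next_state, ACTIONS, alpha, gamma)
            state = next_state
    return q


if __name__ == "__main__":
    table = train()
    for s in range(GOAL):
        print(s, greedy_action(table, s, ACTIONS))
```

The table has one entry per state-action pair (here 5 x 2), which is exactly what becomes impractical when the number of pairs grows. Because of the random branch in select_action, two runs with different seeds can move the agent along different paths even with the same Q-table.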