One of the most important breakthroughs in reinforcement learning was the
development of an off-policy TD control algorithm known as Q-learning
(Watkins, 1989). Its simplest form, one-step Q-learning, is defined by
    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \bigr]        (6.6)
In this case, the
learned action-value function, Q, directly approximates
Q^*, the optimal action-value function, independent of the policy being
followed. This dramatically simplifies the analysis of the algorithm and enabled
early convergence proofs. The policy still has an effect in that it determines
which state-action pairs are visited and updated. However, all that is required
for correct convergence is that all pairs continue to be updated. As we observed
in Chapter 5, this is a minimal requirement in the sense that any method
guaranteed to find optimal behavior in the general case must require it. Under
this assumption and a variant of the usual stochastic approximation conditions
on the sequence of step-size parameters, Q has been shown to converge with
probability 1 to Q^*. The Q-learning algorithm is shown in procedural form in
Figure
6.12.
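For reference, the usual stochastic approximation conditions mentioned above require, roughly, that the step sizes applied to each state-action pair satisfy

    \sum_{t=1}^{\infty} \alpha_t = \infty   \qquad \text{and} \qquad   \sum_{t=1}^{\infty} \alpha_t^2 < \infty,

that is, the steps are large enough to eventually overcome initial conditions and random fluctuations, yet become small enough to assure convergence.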
Figure 6.12: Q-learning: An off-policy
TD control algorithm.
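As a concrete illustration of the algorithm in Figure 6.12, here is a minimal sketch of tabular one-step Q-learning implementing update (6.6). It assumes a hypothetical environment object exposing an action list, a reset() method returning the start state, and a step(action) method returning (next_state, reward, done); these names and the default parameter values are illustrative, not part of the book's pseudocode.

    import random
    from collections import defaultdict

    def q_learning(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
        # Q is a table of action values, initialized to zero for every
        # state-action pair as it is first encountered.
        Q = defaultdict(lambda: {a: 0.0 for a in env.actions})

        def epsilon_greedy(state):
            # Behavior policy: explore with probability epsilon, else act greedily.
            if random.random() < epsilon:
                return random.choice(env.actions)
            return max(Q[state], key=Q[state].get)

        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                action = epsilon_greedy(state)
                next_state, reward, done = env.step(action)
                # Off-policy target from (6.6): max over next actions,
                # regardless of which action the behavior policy takes next.
                best_next = 0.0 if done else max(Q[next_state].values())
                Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
                state = next_state
        return Q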
What is the backup diagram for Q-learning? The rule (6.6)
updates a state-action pair, so the top node, the root of the backup, must be a
small, filled action node. The backup is also from action nodes,
maximizing over all those actions possible in the next state. Thus the bottom
nodes of the backup diagram should be all these action nodes. Finally, remember
that we indicate taking the maximum of these "next action" nodes with an arc
across them (Figure 3.7). Can you guess now what the diagram is? If so,
please do make a guess before turning to the answer in Figure
6.14.
Figure 6.13: The cliff-walking task.
The results are from a single run, but smoothed.
Figure 6.14: The backup diagram for
Q-learning.
Example 6.6: Cliff Walking This gridworld example
compares Sarsa and Q-learning, highlighting the difference between on-policy
(Sarsa) and off-policy (Q-learning) methods. Consider the gridworld shown in the
upper part of Figure
6.13. This is a standard undiscounted, episodic task, with start and goal
states, and the usual actions causing movement up, down, right, and left. Reward
is -1 on all transitions except those into the region marked "The
Cliff." Stepping into this region incurs a reward of -100
and sends the agent instantly back to the start. The lower part of the
figure shows the performance of the Sarsa and Q-learning methods with
ε-greedy action selection, ε = 0.1. After an initial transient, Q-learning learns values for the optimal
policy, that which travels right along the edge of the cliff. Unfortunately,
this results in its occasionally falling off the cliff because of the
ε-greedy action selection. Sarsa, on the other hand, takes the action
selection into account and learns the longer but safer path through the upper
part of the grid. Although Q-learning actually learns the values of the optimal
policy, its on-line performance is worse than that of Sarsa, which learns the
roundabout policy. Of course, if
ε were gradually reduced, then both methods would asymptotically
converge to the optimal policy.
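For readers who want to reproduce this comparison, here is a possible sketch of the cliff-walking gridworld, compatible with the q_learning sketch given earlier. The reward of -1 per step and the -100 cliff penalty follow the example; the 4x12 layout, class name, and method names are assumptions chosen to match the figure.

    class CliffWalk:
        # 4x12 gridworld of Example 6.6: start at the bottom-left corner,
        # goal at the bottom-right, with "The Cliff" in between.
        actions = ['up', 'down', 'left', 'right']

        def __init__(self, height=4, width=12):
            self.height, self.width = height, width
            self.start = (height - 1, 0)
            self.goal = (height - 1, width - 1)

        def reset(self):
            self.pos = self.start
            return self.pos

        def step(self, action):
            moves = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
            dr, dc = moves[action]
            r = min(max(self.pos[0] + dr, 0), self.height - 1)
            c = min(max(self.pos[1] + dc, 0), self.width - 1)
            if r == self.height - 1 and 0 < c < self.width - 1:
                # Stepped into the cliff: reward -100, back to the start.
                self.pos = self.start
                return self.pos, -100.0, False
            self.pos = (r, c)
            # Reward -1 on all other transitions; the episode ends at the goal.
            return self.pos, -1.0, self.pos == self.goal

Running q_learning(CliffWalk(), 500) alongside an analogous Sarsa loop, and plotting smoothed per-episode rewards, should reproduce the qualitative difference shown in the lower part of Figure 6.13.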
Exercise 6.9 Why is Q-learning considered an
off-policy control method?
Exercise 6.10 Consider the learning algorithm
that is just like Q-learning except that instead of the maximum over next
state-action pairs it uses the expected value, taking into account how likely
each action is under the current policy. That is, consider the algorithm
otherwise like Q-learning except with the update rule

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[ r_{t+1} + \gamma \sum_{a} \pi(s_{t+1}, a) \, Q(s_{t+1}, a) - Q(s_t, a_t) \bigr]
Is this new method an on-policy or
off-policy method? What is the backup diagram for this algorithm? Given the same
amount of experience, would you expect this method to work better or worse than
Sarsa? What other considerations might impact the comparison of this method with
Sarsa?
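As a purely illustrative hint (not an answer to the questions above), the target in this update replaces Q-learning's maximum with an expectation over next actions under the current policy. A sketch of just that target computation, assuming an epsilon-greedy policy and the same Q-table layout as the earlier sketches, might look like this:

    def expected_target(Q, next_state, reward, gamma, epsilon, actions, done):
        # Expected value of Q(next_state, .) under the epsilon-greedy policy
        # derived from Q, in place of Q-learning's max over next actions.
        if done:
            return reward
        q_next = Q[next_state]
        greedy = max(q_next, key=q_next.get)
        base = epsilon / len(actions)
        expected = sum((base + (1 - epsilon if a == greedy else 0.0)) * q_next[a]
                       for a in actions)
        return reward + gamma * expected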