Related update rules have been proposed in the past. For instance, the updates used in the adaptive search elements (ASEs) described in [4, 2, 1, 3] are of a similar form (see also [25]). However, it is not known in what sense these update rules optimize performance. The update rule we present here is based on foundations similar to those of the REINFORCE class of algorithms introduced by Williams [27]. However, when applied to spiking neurons such as those described here, REINFORCE leads to parameter updates in the steepest ascent direction in only two limited situations: when the reward depends only on the current input to the neuron and the neuron's outputs do not affect the statistical properties of the inputs, and when the reward depends only on the sequence of inputs since the arrival of the last reward value. Furthermore, in both cases the parameter updates must be carefully synchronized with the timing of the reward values, which is especially problematic for networks with more than one layer of neurons.
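As a point of reference, the basic REINFORCE update of [27] has the form
\[
\Delta w_{ij} = \alpha_{ij}\,(r - b_{ij})\,\frac{\partial \ln g_i}{\partial w_{ij}},
\]
where $r$ is the reward, $b_{ij}$ is a reinforcement baseline, and $g_i$ is the probability mass function of unit $i$'s output. Williams showed that, in the immediate-reinforcement setting, the expected update is positively correlated with the gradient of the expected reward (and, when all the $\alpha_{ij}$ are equal, lies exactly along it); the limitations noted above concern how far this guarantee extends to spiking neurons with temporally extended dynamics.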
In Section 2, we describe reinforcement learning problems, in which an agent aims to maximize the long-term average of a reward signal.
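Concretely, if $r_t$ denotes the reward received at time $t$, the performance measure is a long-run average of the form
\[
\rho = \lim_{T \to \infty} \frac{1}{T}\, E\!\left[\sum_{t=1}^{T} r_t\right]
\]
(the notation here is illustrative; Section 2 gives the formal definitions), and the agent adjusts its parameters so as to increase $\rho$.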
Reinforcement learning is a useful abstraction that encompasses many diverse learning problems, such as supervised learning for pattern classification or predictive modelling, time series prediction, adaptive control, and game playing. We review the direct reinforcement learning algorithm we proposed in [5] and show in Section 3 that, in the case of multiple independent agents cooperating to optimize performance, the algorithm conveniently decomposes in such a way that the agents are able to learn independently, with no need for explicit communication.
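One way to see why such a decomposition is possible: if the agents choose their actions independently given their observations, the joint action distribution factorizes, $\mu(u \mid \theta, y) = \prod_i \mu^i(u^i \mid \theta^i, y^i)$, so that
\[
\nabla_{\theta^i} \ln \mu(u \mid \theta, y) = \nabla_{\theta^i} \ln \mu^i(u^i \mid \theta^i, y^i).
\]
A gradient estimate for agent $i$'s parameters therefore involves only that agent's own observations and actions, together with the common reward signal. (Again, the notation is illustrative; the precise statement appears in Section 3.)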
In Section 4, we consider a network of model neurons as a collection of agents cooperating to solve a reinforcement learning problem, and show that the direct reinforcement learning algorithm leads to a simple synaptic update rule, and that the decomposition property implies that only local information is needed for the updates.
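As a rough illustration of the kind of locality this implies, the following sketch updates one model neuron's weights using a per-synapse eligibility trace modulated by a broadcast reward. The Bernoulli output model, the trace parameter beta, and all variable names are illustrative assumptions, not the rule derived in Section 4.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n_inputs = 10
w = np.zeros(n_inputs)   # synaptic weights of one neuron
z = np.zeros(n_inputs)   # per-synapse eligibility traces
beta, lr = 0.9, 0.01     # trace discount and step size (assumed)

def step(x, reward):
    """One update, using only locally available quantities:
    presynaptic input x, this neuron's own spike, global reward."""
    global w, z
    p = 1.0 / (1.0 + np.exp(-w @ x))   # firing probability
    spike = rng.random() < p           # stochastic binary output
    # Characteristic eligibility of a Bernoulli unit:
    #   d/dw log P(spike | x, w) = (spike - p) * x
    z = beta * z + (spike - p) * x
    w = w + lr * reward * z            # reward-modulated update
    return spike
\end{verbatim}

The point of the sketch is that each synapse sees only its own presynaptic activity, the neuron's own output, and the globally broadcast reward; no information about other neurons' parameters is required.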
Section 5 discusses possible mechanisms for the synaptic update rule in biological neural networks.
The parsimony of requiring only one simple mechanism to optimize parameters for many diverse learning problems is appealing (cf. [26]). In Section 6, we present results of simulation experiments illustrating the performance of this update rule for pattern recognition and adaptive control problems.