R. V. Florian, "Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity."
==================================================================================================

"In a previous preliminary study, we have derived analytically learning rules involving modulated STDP for networks of probabilistic integrate-and-fire neurons and tested them and some generalizations of them in simulations, in a biologically-inspired context (Florian, 2005). The derivations in the previous sections show that reinforcement learning algorithms that involve reward-modulated STDP can be justified analytically. We have also previously tested in simulation, in a biologically-inspired experiment, one of the derived algorithms, to demonstrate practically its efficacy (Florian, 2005)."

"More biologically-plausible simulations of reward-modulated STDP were presented elsewhere (Florian, 2005)."

Florian, R. V. (2005), 'A reinforcement learning algorithm for spiking neural networks', in D. Zaharie, D. Petcu, V. Negru, T. Jebelean, G. Ciobanu, A. Cicortaş, A. Abraham and M. Paprzycki, eds, 'Proceedings of the Seventh International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC 2005)', IEEE Computer Society, Los Alamitos, CA, pp. 299-306.
http://www.coneural.org/florian/papers/05_RL_for_spiking_NNs.php

--------------------------------------------------------------------------------------------------

The algorithm is derived as an application of the OLPOMDP reinforcement learning algorithm (Baxter et al., 1999, 2001), an online variant of the GPOMDP algorithm (Bartlett and Baxter, 1999a; Baxter and Bartlett, 2001).

Baxter, J., Weaver, L. and Bartlett, P. L. (1999), 'Direct gradient-based reinforcement learning: II. Gradient ascent algorithms and experiments', Technical report, Australian National University, Research School of Information Sciences and Engineering.
http://cs.anu.edu.au/~Lex.Weaver/pub_sem/publications/drlexp_99.pdf => done

Baxter, J., Bartlett, P. L. and Weaver, L. (2001), 'Experiments with infinite-horizon, policy-gradient estimation', Journal of Artificial Intelligence Research 15, 351-381.
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/jair/OldFiles/OldFiles/pub/volume15/baxter01b.pdf

Results related to the convergence of OLPOMDP to local maxima have been obtained (Bartlett and Baxter, 2000a).

Bartlett, P. and Baxter, J. (2000a), 'Stochastic optimization of controlled partially observable Markov decision processes', in 'Proceedings of the 39th IEEE Conference on Decision and Control'.

It was shown that applying the algorithm to a system of interacting agents that seek to maximize the same reward signal r(t) is equivalent to applying the algorithm independently to each agent i (Bartlett and Baxter, 1999b, 2000b).

Bartlett, P. L. and Baxter, J. (1999b), 'Hebbian synaptic modifications in spiking neurons that learn', Technical report, Australian National University, Research School of Information Sciences and Engineering.

Bartlett, P. L. and Baxter, J. (2000b), 'A biologically plausible and locally optimal learning algorithm for spiking neurons', http://arp.anu.edu.au/ftp/papers/jon/brains.pdf.gz.

"z0 is an eligibility trace (Sutton and Barto, 1998)"

Sutton, R. S. and Barto, A. G. (1998), Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA.
http://www.cs.ualberta.ca/ml
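The core OLPOMDP mechanism referenced in these notes — an eligibility trace z accumulating the discounted log-likelihood gradient of the policy, with parameters nudged in the direction of z scaled by the instantaneous reward — can be sketched as follows. This is a minimal illustration, not the spiking-neuron version from Florian (2005) or Bartlett and Baxter (2000b); it assumes a softmax policy over discrete actions with linear features, and all names (olpomdp_step, beta, alpha) are illustrative.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def olpomdp_step(theta, features, action, reward, z, beta=0.9, alpha=0.01):
    """One online OLPOMDP update:
        z      <- beta * z + grad_theta log pi(action | features; theta)
        theta  <- theta + alpha * reward * z
    theta: (n_actions, n_features) policy parameters
    z:     (n_actions, n_features) eligibility trace
    """
    # Softmax policy: pi(a) proportional to exp(theta[a] . features)
    pi = softmax(theta @ features)
    # grad log pi for a softmax-linear policy: (one_hot(action) - pi) x features
    one_hot = np.eye(len(pi))[action]
    grad_log_pi = (one_hot - pi)[:, None] * features[None, :]
    z = beta * z + grad_log_pi
    theta = theta + alpha * reward * z
    return theta, z
```

The discount factor beta trades bias against variance in the gradient estimate, which is the parameter analyzed in the Baxter, Bartlett and Weaver papers cited above; reward-modulated STDP arises when the same trace-then-modulate structure is written per synapse, with the spike-timing terms supplying grad log pi.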