R. V. Florian, "Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity."
==================================================================================================

"In a previous preliminary study, we have derived analytically learning rules involving modulated STDP for networks of probabilistic integrate-and-fire neurons and tested them and some generalizations of them in simulations, in a biologically-inspired context (Florian, 2005). The derivations in the previous sections show that reinforcement learning algorithms that involve reward-modulated STDP can be justified analytically. We have also previously tested in simulation, in a biologically-inspired experiment, one of the derived algorithms, to demonstrate practically its efficacy (Florian, 2005)."

"More biologically-plausible simulations of reward-modulated STDP were presented elsewhere (Florian, 2005)."

Florian, R. V. (2005), 'A reinforcement learning algorithm for spiking neural networks', in D. Zaharie, D. Petcu, V. Negru, T. Jebelean, G. Ciobanu, A. Cicortaş, A. Abraham and M. Paprzycki, eds, 'Proceedings of the Seventh International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC 2005)', IEEE Computer Society, Los Alamitos, CA, pp. 299-306.
http://www.coneural.org/florian/papers/05_RL_for_spiking_NNs.php

--------------------------------------------------------------------------------------------------

The algorithm is derived as an application of the OLPOMDP reinforcement learning algorithm (Baxter et al., 1999, 2001), an online variant of the GPOMDP algorithm (Bartlett and Baxter, 1999a; Baxter and Bartlett, 2001).

Baxter, J., Weaver, L. and Bartlett, P. L. (1999), 'Direct gradient-based reinforcement learning: II. Gradient ascent algorithms and experiments', Technical report, Australian National University, Research School of Information Sciences and Engineering.
http://cs.anu.edu.au/~Lex.Weaver/pub_sem/publications/drlexp_99.pdf => done

Baxter, J., Bartlett, P. L. and Weaver, L. (2001), 'Experiments with infinite-horizon, policy-gradient estimation', Journal of Artificial Intelligence Research 15, 351-381.
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/jair/OldFiles/OldFiles/pub/volume15/baxter01b.pdf

Results related to the convergence of OLPOMDP to local maxima have been obtained (Bartlett and Baxter, 2000a).

Bartlett, P. and Baxter, J. (2000a), 'Stochastic optimization of controlled partially observable Markov decision processes', in 'Proceedings of the 39th IEEE Conference on Decision and Control'.

It was shown that applying the algorithm to a system of interacting agents that seek to maximize the same reward signal r(t) is equivalent to applying the algorithm independently to each agent i (Bartlett and Baxter, 1999b, 2000b).

Bartlett, P. L. and Baxter, J. (1999b), 'Hebbian synaptic modifications in spiking neurons that learn', Technical report, Australian National University, Research School of Information Sciences and Engineering.

Bartlett, P. L. and Baxter, J. (2000b), 'A biologically plausible and locally optimal learning algorithm for spiking neurons', http://arp.anu.edu.au/ftp/papers/jon/brains.pdf.gz.

"z0 is an eligibility trace (Sutton and Barto, 1998)"

Sutton, R. S. and Barto, A. G. (1998), Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA.
http://www.cs.ualberta.ca/ml
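The core OLPOMDP mechanism referenced in these notes — an eligibility trace z accumulating the discounted log-likelihood gradient of the policy, with parameters nudged in the direction of z scaled by the instantaneous reward — can be sketched as follows. This is a minimal illustration, not the spiking-neuron version from Florian (2005) or Bartlett and Baxter (2000b); it assumes a softmax policy over discrete actions with linear features, and all names (olpomdp_step, beta, alpha) are illustrative.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def olpomdp_step(theta, features, action, reward, z, beta=0.9, alpha=0.01):
    """One online OLPOMDP update:
        z      <- beta * z + grad_theta log pi(action | features; theta)
        theta  <- theta + alpha * reward * z
    theta: (n_actions, n_features) policy parameters
    z:     (n_actions, n_features) eligibility trace
    """
    # Softmax policy: pi(a) proportional to exp(theta[a] . features)
    pi = softmax(theta @ features)
    # grad log pi for a softmax-linear policy: (one_hot(action) - pi) x features
    one_hot = np.eye(len(pi))[action]
    grad_log_pi = (one_hot - pi)[:, None] * features[None, :]
    z = beta * z + grad_log_pi
    theta = theta + alpha * reward * z
    return theta, z
```

The discount factor beta trades bias against variance in the gradient estimate, which is the parameter analyzed in the Baxter, Bartlett and Weaver papers cited above; reward-modulated STDP arises when the same trace-then-modulate structure is written per synapse, with the spike-timing terms supplying grad log pi.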