Article title

Model-building adaptive critics for semi-Markov control

Full text / Content
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
Adaptive (or actor) critics are a class of reinforcement learning algorithms. Generally, in adaptive critics, one starts with randomized policies and gradually updates the probabilities of selecting actions until a deterministic policy is obtained. Classically, these algorithms have been studied for Markov decision processes under model-free updates. Algorithms that build the model are often more stable and require less training than their model-free counterparts. We propose a new model-building adaptive critic, which builds the model during learning, for a discounted-reward semi-Markov decision process under some assumptions on the structure of the process. We illustrate the use of our algorithm with numerical results on a system with 10 states and a real-world case study from management science.
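
To make the idea in the abstract concrete, the following is a minimal, hypothetical Python sketch of a model-building adaptive critic for a discounted-reward semi-Markov decision process. It is not the authors' algorithm: it assumes exponential discounting exp(-gamma*t) over sojourn times, a Boltzmann (softmax) actor over action preferences, and a user-supplied simulator smdp_step(i, a) returning the next state, the reward earned over the transition, and the sojourn time; all names and parameter values are illustrative.

# Illustrative sketch only; not the algorithm proposed in the paper.
import math
import random
from collections import defaultdict

def run_adaptive_critic(smdp_step, states, actions, gamma=0.1,
                        alpha=0.1, beta=0.01, max_steps=10000, start_state=0):
    """smdp_step(i, a) -> (next_state, reward, sojourn_time): user-supplied simulator."""
    J = defaultdict(float)   # critic: state-value estimates
    h = defaultdict(float)   # actor: action preferences per (state, action)

    # Empirical model built during learning.
    count = defaultdict(int)          # visits to (i, a)
    trans = defaultdict(int)          # observed (i, a, j) transitions
    sum_reward = defaultdict(float)   # accumulated reward for (i, a, j)
    sum_disc = defaultdict(float)     # accumulated exp(-gamma * t) for (i, a, j)

    def policy(i):
        # Boltzmann (softmax) action selection over the actor's preferences.
        prefs = [math.exp(h[(i, a)]) for a in actions]
        z = sum(prefs)
        r, acc = random.random() * z, 0.0
        for a, p in zip(actions, prefs):
            acc += p
            if r <= acc:
                return a
        return actions[-1]

    i = start_state
    for _ in range(max_steps):
        a = policy(i)
        j, reward, t = smdp_step(i, a)

        # Update the empirical model with the new sample.
        count[(i, a)] += 1
        trans[(i, a, j)] += 1
        sum_reward[(i, a, j)] += reward
        sum_disc[(i, a, j)] += math.exp(-gamma * t)

        # Critic target computed from the estimated model, not the raw sample alone.
        q = 0.0
        for k in states:
            n = trans[(i, a, k)]
            if n == 0:
                continue
            p = n / count[(i, a)]
            q += p * (sum_reward[(i, a, k)] / n + (sum_disc[(i, a, k)] / n) * J[k])
        delta = q - J[i]

        J[i] += alpha * delta      # critic update
        h[(i, a)] += beta * delta  # actor update: reinforce better-than-expected actions
        i = j

    return J, h

The critic bootstraps its targets from the model built during learning (transition counts, average rewards, and average discount factors over sojourn times) rather than from the raw sample alone, which is the model-building element; the actor raises the preference of actions whose model-based value estimate exceeds the current state value.
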
Year
Pages
43–58
Physical description
Bibliography: 50 items, figures
Authors
author
  • Department of Engineering Management and Systems Engineering, Missouri University of Science and Technology, Rolla, MO 65401
author
  • Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, New York 11794-3600
author
  • Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, New York 11794-3600
author
  • Department of Engineering Management and Systems Engineering, Missouri University of Science and Technology, Rolla, MO 65401
Bibliography
  • [1] P. Abbeel, A. Coates, T. Hunter, and A.Y. Ng. Autonomous autorotation of an RC helicopter. In International Symposium on Robotics, 2008.
  • [2] A.G. Barto, S.J. Bradtke, and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995.
  • [3] A.G. Barto, R.S. Sutton, and C.W. Anderson. Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:835–846, 1983.
  • [4] R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
  • [5] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, Massachusetts, 1995.
  • [6] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
  • [7] S. Bhatnagar and J. R. Panigrahi. Actor-critic algorithms for hierarchical Markov decision processes. Automatica, 42:637–644, 2006.
  • [8] V. S. Borkar. Stochastic approximation with two-time scales. Systems and Control Letters, 29:291–294, 1997.
  • [9] S.J. Bradtke and M. Duff. Reinforcement learning methods for continuous-time Markov decision problems. In Advances in Neural Information Processing Systems 7. MIT Press, Cambridge, MA, 1995.
  • [10] R.I. Brafman and M. Tennenholtz. R-max: A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.
  • [11] G.K. Chan and S. Asgarpoor. Optimal maintenance policy with Markov processes. Electric Power Systems Research, 76:452–456, 2006.
  • [12] H.S. Chang, M.C. Fu, J. Hu, and S.I. Marcus. Simulation-based algorithms for Markov decision processes. Springer, NY, 2007.
  • [13] D. Chen and K.S. Trivedi. Optimization for condition-based maintenance with semi-Markov decision process. Reliability Engineering and System Safety, 90:25–29, 2005.
  • [14] T. K. Das and S. Sarkar. Optimal preventive maintenance in a production inventory system. IIE Transactions on Quality and Reliability, 31 (in press).
  • [15] T.K. Das, A. Gosavi, S. Mahadevan, and N. Marchalleck. Solving semi-Markov decision problems using average reward reinforcement learning. Management Science, 45(4):560–574, 1999.
  • [16] C. Diuk, L. Li, and B.R. Leffler. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In Proceedings of the 26th Annual International Conference on Machine Learning, 2009.
  • [17] S. Ferrari and R. Stengel. Model-based adaptive critic designs. In Learning and Approximate Dynamic Programming (edited by J. Si, A. Barto, W. Powell, and D. Wunsch, Chapter 3). John Wiley and Sons, New York, NY, USA, 2005.
  • [18] A. Gosavi. Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning. Kluwer Academic Publishers, Boston, MA, 2003.
  • [19] A. Gosavi. A risk-sensitive approach to total productive maintenance. Automatica, 42:1321–1330, 2006.
  • [20] A. Gosavi. Adaptive critics for airline revenue management. In Conference Proceedings of the Production and Operations Management Society, Dallas, TX, 2007.
  • [21] A. Gosavi. Reinforcement learning for model building and variance-penalized control. In Proceedings of the Winter Simulation Conference, Austin, TX. IEEE, 2009.
  • [22] A. Gosavi. Model building for robust reinforcement learning. In Conference Proceedings of ANNIE. ASME Press, 2010.
  • [23] A. Gosavi. Target-sensitive control of Markov and semi-Markov processes. To appear in International Journal of Control, Automation, and Systems, 2011.
  • [24] A. Gosavi, S. Murray, and J. Hu. Model-building semi-Markov adaptive critics. In Proceedings of the IEEE Symposium on Computational Intelligence: ADPRL, Paris, France, 2011.
  • [25] R. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960.
  • [26] S. Ishii, W. Yoshida, and J. Yoshimoto. Control of exploitation-exploration meta-parameter in reinforcement learning. Neural Networks, 15:665–687, 2002.
  • [27] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2):209–232, 2002.
  • [28] V.R. Konda and V. S. Borkar. Actor-critic type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94–123, 1999.
  • [29] R. Koppejan and S. Whiteson. Neuroevolutionary reinforcement learning for generalized helicopter control. In GECCO: Proceedings of the Genetic and Evolutionary Computation Conference, pages 145–152, 2009.
  • [30] K. Kulkarni, A. Gosavi, S. Murray, and K. Grantham. Semi-Markov adaptive critic heuristics with application to airline revenue management. Journal of Control Theory and Applications (Special issue on Reinforcement Learning and Approximate Dynamic Programming), 9:421–430, 2011.
  • [31] K. McKone, R.G. Schroeder, and K.O. Cua. Total productive maintenance: A contextual review. Journal of Operations Management, 17:123–144, 1999.
  • [32] J. Michels, A. Saxena, and A.Y. Ng. High speed obstacle avoidance using monocular vision and reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005.
  • [33] A.W. Moore and C.G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13:103–130, 1993.
  • [34] A. Y. Ng, H. J. Kim, M. I. Jordan, and S. Sastry. Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems 17. MIT Press, 2004.
  • [35] W. Powell. Approximate Dynamic Programming: Solving the curses of dimensionality. Wiley-Interscience, NJ, USA, 2007.
  • [36] J.R. Ramirez-Hernández and E. Fernandez. Control of a re-entrant line manufacturing model with a reinforcement learning approach. In Sixth International Conference on Machine Learning, pages 330–335. IEEE, 2007.
  • [37] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., 22:400–407, 1951.
  • [38] C-R. Robelin and S.M. Madanat. History-dependent bridge deck maintenance and replacement optimization with Markov decision processes. Journal of Infrastructure Systems, 13(3):195–201, 2007.
  • [39] J. Si, A. Barto, W. Powell, and D. Wunsch. Learning and Approximate Dynamic Programming (Edited). John Wiley and Sons, New York, NY, USA, 2005.
  • [40] A.L. Strehl and M.L. Littman. A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd International Conference on Machine Learning, pages 856–863, 2005.
  • [41] R. Sutton and A. G. Barto. Reinforcement Learning. The MIT Press, Cambridge, Massachusetts, 1998.
  • [42] R.S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Workshop on Machine Learning, pages 216–224. Morgan Kaufmann, San Mateo, CA, 1990.
  • [43] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
  • [44] P. Tadepalli and D. Ok. Model-based average reward reinforcement learning algorithms. Artificial Intelligence, 100:177–224, 1998.
  • [45] H. van Seijen, S. Whiteson, H. van Hasselt, and M. Wiering. Exploiting best-match equations for efficient reinforcement learning. Journal of Machine Learning Research, 12:2045–2094, 2011.
  • [46] C.J. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, England, May 1989.
  • [47] P. J. Werbos. Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, 17:7–20, 1987.
  • [48] P.J. Werbos. A menu of designs for reinforcement learning over time. In Neural Networks for Control, pages 67–95. MIT Press, MA, 1990.
  • [49] M. A. Wiering, R. P. Salustowicz, and J. Schmidhuber. Model-based reinforcement learning for evolving soccer strategies. In Computational Intelligence in Games. Springer Verlag, 2001.
  • [50] W. Yoshida and S. Ishii. Model-based reinforcement learning: A computational model and an fMRI study. Neurocomputing, 63:253–269, 2005.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-5ed543c8-bf84-4dbd-a4e6-17a74540e47c