Article title

An active exploration method for data efficient reinforcement learning

Content
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
Reinforcement learning (RL) constitutes an effective method of controlling dynamic systems without prior knowledge. One of the most important and difficult problems in RL is the improvement of data efficiency. Probabilistic inference for learning control (PILCO) is a state-of-the-art data-efficient framework that uses a Gaussian process to model dynamic systems. However, it focuses only on optimizing cumulative rewards and does not consider the accuracy of the dynamic model, which is an important factor for controller learning. To further improve the data efficiency of PILCO, we propose its active exploration version (AEPILCO), which utilizes information entropy to describe samples. In the policy evaluation stage, we incorporate an information entropy criterion into long-term sample prediction. Through the informative policy evaluation function, our algorithm obtains informative policy parameters in the policy improvement stage. Executing the resulting policy on the actual system produces an informative sample set, which helps in learning an accurate dynamic model. Thus, the AEPILCO algorithm improves data efficiency by learning an accurate dynamic model through actively selecting informative samples based on the information entropy criterion. We demonstrate the validity and efficiency of the proposed algorithm on several challenging control problems involving a cart pole, a pendubot, a double pendulum, and a cart double pendulum. The AEPILCO algorithm learns a controller in fewer trials than PILCO, as verified through theoretical analysis and experimental results.
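The mechanism described in the abstract can be made concrete with a small sketch. The following Python/NumPy fragment is not the authors' implementation; it is a minimal illustration, under assumed conventions, of a PILCO-style policy evaluation in which the predicted long-term cost is augmented with a Gaussian differential-entropy term so that policy search also favours predicted states that are informative for the dynamics model. The function names, the entropy weight lambda_entropy, the saturating-cost width, and the sign of the entropy bonus are assumptions made for illustration; in a real PILCO rollout the per-step state means and covariances would come from moment-matched Gaussian-process predictions rather than being given explicitly.

import numpy as np

def gaussian_entropy(cov):
    # Differential entropy of N(mean, cov): 0.5 * log((2*pi*e)^d * det(cov)).
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

def expected_saturating_cost(mean, cov, target, width=1.0):
    # Closed-form E[1 - exp(-0.5 * (x - t)^T W (x - t))] for x ~ N(mean, cov),
    # with W = I / width^2 (the saturating cost commonly used with PILCO).
    d = mean.shape[0]
    W = np.eye(d) / width ** 2
    S1 = np.eye(d) + cov @ W
    diff = mean - target
    quad = diff @ W @ np.linalg.solve(S1, diff)
    return 1.0 - np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(S1))

def informative_policy_evaluation(pred_means, pred_covs, target, lambda_entropy=0.1):
    # Sum expected costs over the prediction horizon and subtract an entropy bonus,
    # so trajectories that pass through uncertain (informative) regions score better.
    total = 0.0
    for mean, cov in zip(pred_means, pred_covs):
        total += expected_saturating_cost(mean, cov, target)
        total -= lambda_entropy * gaussian_entropy(cov)
    return total

# Toy usage: three predicted 2-D state distributions along a rollout.
if __name__ == "__main__":
    means = [np.array([0.5, 0.0]), np.array([0.2, 0.1]), np.array([0.05, 0.02])]
    covs = [0.2 * np.eye(2), 0.1 * np.eye(2), 0.05 * np.eye(2)]
    print(informative_policy_evaluation(means, covs, target=np.zeros(2)))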
Year
Pages
351–362
Physical description
Bibliography: 20 items, figures, tables, charts
Authors
  • School of Computer Science and Technology, Harbin Institute of Technology, West Dazhi Street #92, Harbin 150001, China
author
  • School of Computer Science and Technology, Harbin Institute of Technology, West Dazhi Street #92, Harbin 150001, China
author
  • School of Computer Science and Technology, Harbin Institute of Technology, West Dazhi Street #92, Harbin 150001, China
  • School of Computer Science and Technology, Harbin Institute of Technology, West Dazhi Street #92, Harbin 150001, China
  • School of Computer Science and Technology, Harbin Institute of Technology, West Dazhi Street #92, Harbin 150001, China
Bibliography
  • [1] Ahmed, N.A. and Gokhale, D. (1989). Entropy expressions and their estimators for multivariate distributions, IEEE Transactions on Information Theory 35(3): 688–692.
  • [2] Bagnell, J.A. and Schneider, J.G. (2001). Autonomous helicopter control using reinforcement learning policy search methods, IEEE International Conference on Robotics and Automation, Seoul, South Korea, Vol. 2, pp. 1615–1620.
  • [3] Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S. and Levine, S. (2017). Combining model-based and model-free updates for trajectory-centric reinforcement learning, arXiv:1703.03078.
  • [4] Deisenroth, M.P., Fox, D. and Rasmussen, C.E. (2015). Gaussian processes for data-efficient learning in robotics and control, IEEE Transactions on Pattern Analysis and Machine Intelligence 37(2): 408–423.
  • [5] Deisenroth, M. and Rasmussen, C.E. (2011). PILCO: A model-based and data-efficient approach to policy search, Proceedings of the 28th International Conference on Machine Learning (ICML-11), Bellevue, WA, USA, pp. 465–472.
  • [6] Ebert, F., Finn, C., Lee, A.X. and Levine, S. (2017). Self-supervised visual planning with temporal skip connections, arXiv:1710.05268.
  • [7] Fabisch, A. and Metzen, J.H. (2014). Active contextual policy search, Journal of Machine Learning Research 15(1): 3371–3399.
  • [8] Finn, C. and Levine, S. (2016). Deep visual foresight for planning robot motion, arXiv:1610.00696.
  • [9] Finn, C., Tan, X.Y., Duan, Y., Darrell, T., Levine, S. and Abbeel, P. (2015). Deep spatial autoencoders for visuomotor learning, arXiv:1509.06113.
  • [10] Gruslys, A., Azar, M.G., Bellemare, M.G. and Munos, R. (2017). The reactor: A sample-efficient actor-critic architecture, arXiv:1704.04651.
  • [11] Hayes, G. and Demiris, J. (1994). A robot controller using learning by imitation, International Symposium on Intelligent Robotic Systems 676(5): 1257–1274.
  • [12] Levine, S., Finn, C., Darrell, T. and Abbeel, P. (2016). End-to-end training of deep visuomotor policies, Journal of Machine Learning Research 17(1): 1334–1373.
  • [13] Nagabandi, A., Kahn, G., Fearing, R.S. and Levine, S. (2017). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning, arXiv:1708.02596.
  • [14] Ng, A., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E. and Liang, E. (2006). Autonomous inverted helicopter flight via reinforcement learning, in M.H. Ang Jr. and O. Khatib (Eds.), Experimental Robotics IX, Springer, Berlin/Heidelberg, pp. 363–372.
  • [15] Pan, Y. and Theodorou, E.A. (2014). Probabilistic differential dynamic programming, Advances in Neural Information Processing Systems 3: 1907–1915.
  • [16] Pan, Y., Theodorou, E.A. and Kontitsis, M. (2015). Sample efficient path integral control under uncertainty, Advances in Neural Information Processing Systems 2015: 2314–2322.
  • [17] Price, B. and Boutilier, C. (2003). Accelerating reinforcement learning through implicit imitation, Journal of Artificial Intelligence Research 19: 569–629.
  • [18] Silver, D., Sutton, R.S. and Müller, M. (2008). Sample-based learning and search with permanent and transient memories, International Conference on Machine Learning, Helsinki, Finland, pp. 968–975.
  • [19] Sutton, R.S. (1988). Learning to predict by the methods of temporal differences, Machine Learning 3(1): 9–44.
  • [20] Sutton, R.S. (1991). Dyna, an integrated architecture for learning, planning, and reacting, ACM Sigart Bulletin 2(4): 160–163.
Notes
Record compiled under agreement 509/P-DUN/2018 from the funds of the Polish Ministry of Science and Higher Education (MNiSW) allocated to science-dissemination activities (2019).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-393c13f5-f173-4048-8e16-b74dd336f2c9