Article title

An active exploration method for data efficient reinforcement learning

Content
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
Reinforcement learning (RL) constitutes an effective method of controlling dynamic systems without prior knowledge. One of the most important and difficult problems in RL is the improvement of data efficiency. Probabilistic inference for learning control (PILCO) is a state-of-the-art data-efficient framework that uses a Gaussian process to model dynamic systems. However, it focuses only on optimizing cumulative rewards and does not consider the accuracy of the dynamic model, which is an important factor for controller learning. To further improve the data efficiency of PILCO, we propose its active exploration version (AEPILCO), which utilizes information entropy to describe samples. In the policy evaluation stage, we incorporate an information entropy criterion into long-term sample prediction. Through the informative policy evaluation function, our algorithm obtains informative policy parameters in the policy improvement stage. Executing the resulting policy on the actual system produces an informative sample set, which helps in learning an accurate dynamic model. Thus, the AEPILCO algorithm improves data efficiency by learning an accurate dynamic model through actively selecting informative samples based on the information entropy criterion. We demonstrate the validity and efficiency of the proposed algorithm on several challenging control problems involving a cart pole, a pendubot, a double pendulum, and a cart double pendulum. The AEPILCO algorithm learns a controller in fewer trials than PILCO, as verified through theoretical analysis and experimental results.
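The mechanism described in the abstract can be made concrete with a small sketch. The following Python/NumPy fragment is not the authors' implementation; it is a minimal illustration, under assumed conventions, of a PILCO-style policy evaluation in which the predicted long-term cost is augmented with a Gaussian differential-entropy term so that policy search also favours predicted states that are informative for the dynamics model. The function names, the entropy weight lambda_entropy, the saturating-cost width, and the sign of the entropy bonus are assumptions made for illustration; in a real PILCO rollout the per-step state means and covariances would come from moment-matched Gaussian-process predictions rather than being given explicitly.

import numpy as np

def gaussian_entropy(cov):
    # Differential entropy of N(mean, cov): 0.5 * log((2*pi*e)^d * det(cov)).
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

def expected_saturating_cost(mean, cov, target, width=1.0):
    # Closed-form E[1 - exp(-0.5 * (x - t)^T W (x - t))] for x ~ N(mean, cov),
    # with W = I / width^2 (the saturating cost commonly used with PILCO).
    d = mean.shape[0]
    W = np.eye(d) / width ** 2
    S1 = np.eye(d) + cov @ W
    diff = mean - target
    quad = diff @ W @ np.linalg.solve(S1, diff)
    return 1.0 - np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(S1))

def informative_policy_evaluation(pred_means, pred_covs, target, lambda_entropy=0.1):
    # Sum expected costs over the prediction horizon and subtract an entropy bonus,
    # so trajectories that pass through uncertain (informative) regions score better.
    total = 0.0
    for mean, cov in zip(pred_means, pred_covs):
        total += expected_saturating_cost(mean, cov, target)
        total -= lambda_entropy * gaussian_entropy(cov)
    return total

# Toy usage: three predicted 2-D state distributions along a rollout.
if __name__ == "__main__":
    means = [np.array([0.5, 0.0]), np.array([0.2, 0.1]), np.array([0.05, 0.02])]
    covs = [0.2 * np.eye(2), 0.1 * np.eye(2), 0.05 * np.eye(2)]
    print(informative_policy_evaluation(means, covs, target=np.zeros(2)))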
Year
Pages
351–362
Physical description
Bibliography: 20 items, figures, tables, charts
Authors
  • School of Computer Science and Technology, Harbin Institute of Technology, West Dazhi Street #92, Harbin 150001, China
author
  • School of Computer Science and Technology, Harbin Institute of Technology, West Dazhi Street #92, Harbin 150001, China
author
  • School of Computer Science and Technology, Harbin Institute of Technology, West Dazhi Street #92, Harbin 150001, China
  • School of Computer Science and Technology, Harbin Institute of Technology, West Dazhi Street #92, Harbin 150001, China
  • School of Computer Science and Technology, Harbin Institute of Technology, West Dazhi Street #92, Harbin 150001, China
Bibliography
  • [1] Ahmed, N.A. and Gokhale, D. (1989). Entropy expressions and their estimators for multivariate distributions, IEEE Transactions on Information Theory 35(3): 688–692.
  • [2] Bagnell, J.A. and Schneider, J.G. (2001). Autonomous helicopter control using reinforcement learning policy search methods, IEEE International Conference on Robotics and Automation, Seoul, South Korea, Vol. 2, pp. 1615–1620.
  • [3] Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S. and Levine, S. (2017). Combining model-based and model-free updates for trajectory-centric reinforcement learning, arXiv:1703.03078.
  • [4] Deisenroth, M.P., Fox, D. and Rasmussen, C.E. (2015). Gaussian processes for data-efficient learning in robotics and control, IEEE Transactions on Pattern Analysis and Machine Intelligence 37(2): 408–423.
  • [5] Deisenroth, M. and Rasmussen, C.E. (2011). PILCO: A model-based and data-efficient approach to policy search, Proceedings of the 28th International Conference on Machine Learning (ICML-11), Bellevue, WA, USA, pp. 465–472.
  • [6] Ebert, F., Finn, C., Lee, A.X. and Levine, S. (2017). Self-supervised visual planning with temporal skip connections, arXiv:1710.05268.
  • [7] Fabisch, A. and Metzen, J.H. (2014). Active contextual policy search, Journal of Machine Learning Research 15(1): 3371–3399.
  • [8] Finn, C. and Levine, S. (2016). Deep visual foresight for planning robot motion, arXiv:1610.00696.
  • [9] Finn, C., Tan, X.Y., Duan, Y., Darrell, T., Levine, S. and Abbeel, P. (2015). Deep spatial autoencoders for visuomotor learning, arXiv:1509.06113.
  • [10] Gruslys, A., Azar, M.G., Bellemare, M.G. and Munos, R. (2017). The reactor: A sample-efficient actor-critic architecture, arXiv:1704.04651.
  • [11] Hayes, G. and Demiris, J. (1994). A robot controller using learning by imitation, International Symposium on Intelligent Robotic Systems 676(5): 1257–1274.
  • [12] Levine, S., Finn, C., Darrell, T. and Abbeel, P. (2016). End-to-end training of deep visuomotor policies, Journal of Machine Learning Research 17(1): 1334–1373.
  • [13] Nagabandi, A., Kahn, G., Fearing, R.S. and Levine, S. (2017). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning, arXiv:1708.02596.
  • [14] Ng, A., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E. and Liang, E. (2006). Autonomous inverted helicopter flight via reinforcement learning, in M.H. Ang Jr. and O. Khatib (Eds.), Experimental Robotics IX, Springer, Berlin/Heidelberg, pp. 363–372.
  • [15] Pan, Y. and Theodorou, E.A. (2014). Probabilistic differential dynamic programming, Advances in Neural Information Processing Systems 3: 1907–1915.
  • [16] Pan, Y., Theodorou, E.A. and Kontitsis, M. (2015). Sample efficient path integral control under uncertainty, Advances in Neural Information Processing Systems 2015: 2314–2322.
  • [17] Price, B. and Boutilier, C. (2003). Accelerating reinforcement learning through implicit imitation, Journal of Artificial Intelligence Research 19: 569–629.
  • [18] Silver, D., Sutton, R.S. and Müller, M. (2008). Sample-based learning and search with permanent and transient memories, International Conference on Machine Learning, Helsinki, Finland, pp. 968–975.
  • [19] Sutton, R.S. (1988). Learning to predict by the methods of temporal differences, Machine Learning 3(1): 9–44.
  • [20] Sutton, R.S. (1991). Dyna, an integrated architecture for learning, planning, and reacting, ACM Sigart Bulletin 2(4): 160–163.
Notes
Record compiled under agreement 509/P-DUN/2018 from the funds of the Polish Ministry of Science and Higher Education (MNiSW) allocated to science-dissemination activities (2019).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-393c13f5-f173-4048-8e16-b74dd336f2c9