Article title

Epokowo-inkrementacyjny algorytm uczenia się ze wzmocnieniem wykorzystujący kryterium średniego wzmocnienia

Authors
Content / Full text
Identifiers
Title variants
EN
The epoch-incremental reinforcement learning algorithm based on the average reward
Publication languages
PL
Abstracts
PL
A new epoch-incremental reinforcement learning algorithm is proposed in this paper. The main idea of the algorithm is to perform, in the epoch mode, additional updates of the control policy based on the distances of previously active states from the terminal state. The proposed algorithm, together with the R(0)-learning, R(λ)-learning, Dyna-R and prioritized sweeping-R algorithms, was applied to the control of a mountain car model and of a model of a ball on a balancing beam.
EN
The application of average reward reinforcement learning algorithms to control problems is described in this paper. Moreover, a new epoch-incremental reinforcement learning algorithm (EIR(0)-learning for short) is proposed. In this algorithm, the basic R(0)-learning algorithm is executed in the incremental mode while a model of the environment is built. In the epoch mode, this model is used to determine the distances of previously active states from the terminal state, and these distances then drive additional updates of the policy. The proposed algorithm was applied to the mountain car (Fig. 4) and ball-beam (Fig. 5) models. EIR(0)-learning was empirically compared to R(0)-learning [4, 6], R(λ)-learning, and the model-based algorithms Dyna-R and prioritized sweeping-R [11]. In the case of the ball-beam system, the EIR(0)-learning algorithm reached a stable control strategy after the smallest number of trials (Tab. 1, column 2). For the mountain car system, the number of trials was smaller than for the R(0)-learning and R(λ)-learning algorithms, but greater than for Dyna-R and prioritized sweeping-R. It is worth noting that the execution times of the Dyna-R and prioritized sweeping-R algorithms in the incremental mode were, respectively, 5 and 50 times longer than that of the proposed EIR(0)-learning algorithm (Tab. 2, column 3). The main conclusion of this work is that the epoch-incremental learning algorithm provides a stable control strategy in a relatively small number of trials and with a short execution time per iteration.
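
The abstracts above describe the two modes of EIR(0)-learning: an incremental mode that runs the basic R(0)-learning update while building a model of the environment, and an epoch mode that uses that model to update the policy according to the distances of previously active states from the terminal state. The following Python sketch is a minimal illustration of this scheme for a discrete, tabular task. The R(0)-learning update follows Schwartz [4]; the function names, the step sizes ALPHA and BETA, and the 1/distance weighting of the epoch correction are illustrative assumptions of this sketch, not the exact rule from the paper.

from collections import defaultdict, deque

ALPHA = 0.1   # step size for the action values (assumed value)
BETA = 0.01   # step size for the average-reward estimate (assumed value)

Q = defaultdict(float)           # R(s, a): relative action values
rho = 0.0                        # running estimate of the average reward
predecessors = defaultdict(set)  # learned model: state -> states seen to lead to it

def greedy(state, actions):
    return max(actions, key=lambda a: Q[(state, a)])

def incremental_step(state, action, reward, next_state, actions):
    """Incremental mode: one R(0)-learning update (Schwartz [4]) plus
    recording the observed transition in the environment model."""
    global rho
    was_greedy = action == greedy(state, actions)
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward - rho + best_next - Q[(state, action)])
    if was_greedy:
        # rho is adjusted only on greedy steps, as in R-learning
        rho += BETA * (reward + best_next
                       - max(Q[(state, a)] for a in actions) - rho)
    predecessors[next_state].add(state)

def epoch_update(terminal_state, actions):
    """Epoch mode: breadth-first search over the learned model assigns each
    previously active state its distance to the terminal state; the value of
    the greedy action is then reinforced with a correction that decays with
    that distance (the 1/d weighting is an assumption of this sketch)."""
    dist = {terminal_state: 0}
    queue = deque([terminal_state])
    while queue:
        s = queue.popleft()
        for p in predecessors[s]:
            if p not in dist:
                dist[p] = dist[s] + 1
                queue.append(p)
    for s, d in dist.items():
        if d > 0:
            Q[(s, greedy(s, actions))] += ALPHA / d

In a complete trial, incremental_step would run at every control step and epoch_update once after the terminal state is reached; the extra epoch pass is what lets states far from the goal profit from the terminal reward without the per-step planning cost that, per the abstract, makes Dyna-R and prioritized sweeping-R slower in the incremental mode.
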
Publisher
Year
Pages
700-703
Physical description
Bibliography: 14 items, figures, tables, formulas
Creators
author
  • Politechnika Rzeszowska, Al. Powstańców Warszawy 12, 35-959 Rzeszów
Bibliography
  • [1] Watkins C. J. C. H.: Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England, 1989.
  • [2] Barto A., Sutton R., Anderson C.: Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. SMC, 13, pp. 834-847, 1983.
  • [3] Rummery G., Niranjan M.: On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.
  • [4] Schwartz A.: A reinforcement learning method for maximizing undiscounted rewards. Proc. of the Tenth International Conference on Machine Learning, Amherst, Massachusetts, Morgan Kaufmann, pp. 298-305, 1993.
  • [5] Tadepalli P., Ok D.: Model-Based Average Reward Reinforcement Learning. Artificial Intelligence, 100, pp. 177-224, 1998.
  • [6] Sutton R., Barto A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
  • [7] Cichosz P.: Systemy uczące się. WNT, Warszawa, 2000.
  • [8] Sutton R.: Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. Proc. of the Seventh Int. Conf. on Machine Learning, pp. 216-224, 1990.
  • [9] Moore A., Atkeson C.: Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13, pp. 103-130, 1993.
  • [10] Peng J., Williams R.: Efficient learning and planning within the Dyna framework. Proc. of the 2nd International Conference on Simulation of Adaptive Behavior, pp. 281-290, 1993.
  • [11] Zajdel R.: Algorytmy uczenia ze wzmocnieniem Dyna-R i prioritized sweeping-R. In: Inżynieria wiedzy i systemy ekspertowe, Akademicka Oficyna Wydawnicza EXIT, Warszawa, pp. 161-169, 2009.
  • [12] Zajdel R.: Epoch-Incremental Queue-Dyna Algorithm. The Ninth International Conference on Artificial Intelligence and Soft Computing, Zakopane, Lecture Notes in Artificial Intelligence 5097, pp. 1160-1170, 2008.
  • [13] Wellstead P. E.: Introduction to Physical System Modelling. Control Systems Principles, 2000.
  • [14] Kaelbling L. P., Littman M. L., Moore A. W.: Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4, pp. 237-285, 1996.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-6761d4a6-ca68-4166-8dca-391927df5c1d