Article title

Looking for the Right Time to Shift Strategy in the Exploration-exploitation Dilemma

Authors
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
Balancing exploratory and exploitative behavior is an essential dilemma faced by adaptive agents. The challenge of finding a good trade-off between exploration (learning new things) and exploitation (acting optimally based on what is already known) has been widely studied for decision-making problems where the agent must learn a policy of actions. In this paper we propose the engaged climber method, designed to solve the exploration-exploitation dilemma. The solution consists in explicitly creating two different policies (one for exploring, one for exploiting) and in determining the right moments to shift from one to the other using notions such as engagement and curiosity.
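
The abstract only outlines the mechanism, so the sketch below is an illustrative reading of it rather than the paper's actual engaged climber algorithm: it assumes a simple multi-armed bandit setting, and the names (TwoPolicyBandit, curiosity, engage_threshold, bore_threshold) and the particular curiosity signal (fraction of rarely tried arms) are hypothetical choices made here to mirror the idea of two explicit policies and a timed shift between them.

import random

# Illustrative sketch only: not the paper's algorithm. It keeps two explicit
# policies (explore / exploit) and shifts between them when a hypothetical
# curiosity signal crosses a threshold.

class TwoPolicyBandit:
    def __init__(self, n_arms, engage_threshold=0.8, bore_threshold=0.2):
        self.n_arms = n_arms
        self.counts = [0] * n_arms        # how often each arm was tried
        self.values = [0.0] * n_arms      # running mean reward per arm
        self.engage_threshold = engage_threshold
        self.bore_threshold = bore_threshold
        self.exploring = True             # start under the exploration policy

    def curiosity(self):
        # Fraction of arms that are still rarely tried: high curiosity means
        # there is still something potentially worth exploring.
        rarely_tried = sum(1 for c in self.counts if c < 3)
        return rarely_tried / self.n_arms

    def explore_policy(self):
        # Exploration policy: prefer the least-tried arm.
        return min(range(self.n_arms), key=lambda a: self.counts[a])

    def exploit_policy(self):
        # Exploitation policy: act greedily on the estimated values.
        return max(range(self.n_arms), key=lambda a: self.values[a])

    def select_arm(self):
        # Shift between the two policies when curiosity crosses a threshold
        # (in this stationary toy setting the shift back to exploration can
        # only happen if curiosity rises again, e.g. after a reset).
        c = self.curiosity()
        if self.exploring and c < self.bore_threshold:
            self.exploring = False        # nothing new left: start exploiting
        elif not self.exploring and c > self.engage_threshold:
            self.exploring = True         # environment looks unknown again
        return self.explore_policy() if self.exploring else self.exploit_policy()

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


if __name__ == "__main__":
    random.seed(0)
    true_means = [0.2, 0.5, 0.8]
    agent = TwoPolicyBandit(n_arms=3)
    for _ in range(200):
        arm = agent.select_arm()
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        agent.update(arm, reward)
    print("estimated values:", [round(v, 2) for v in agent.values])

The design point the sketch tries to capture is that exploration and exploitation are separate, fully formed policies, and the interesting question is when to hand control from one to the other, rather than mixing them on every step as epsilon-greedy does.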
Year
Volume
Pages
73–82
Physical description
Bibliography: 12 items, figures.
Creators
  • IRIT, University of Toulouse 1 Capitole, Toulouse, France
Bibliography
  • [1] Sutton R., Barto A., Introduction to Reinforcement Learning. 1st edn. MIT Press, Cambridge, MA, USA, 1998.
  • [2] Sigaud O., Buffet O., eds. Markov Decision Processes in Artificial Intelligence. ISTE – Wiley, 2010.
  • [3] Wiering M., Otterlo M., Reinforcement learning and Markov decision processes. In: Reinforcement Learning: State-of-the-Art. Springer, Berlin/Heidelberg, 2012, pp. 3–42.
  • [4] Auer P., Cesa-Bianchi N., Fischer P., Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002, 47(2–3), pp. 235–256.
  • [5] Tokic M., Adaptive epsilon-greedy exploration in reinforcement learning based on value difference. In: KI 2010, Berlin, Heidelberg, Springer-Verlag, 2010, pp. 203–210.
  • [6] Tokic M., Palm G., Value-difference based exploration: Adaptive control between epsilon-greedy and softmax. In: KI 2011, Berlin, Heidelberg, Springer-Verlag, 2011, pp. 335–346.
  • [7] Meuleau N., Bourgine P., Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, May 1999, 35(2), pp. 117–154.
  • [8] Kearns M., Singh S., Near-optimal reinforcement learning in polynomial time. Machine Learning, November 2002, 49(2–3), pp. 209–232.
  • [9] Brafman R., Tennenholtz M., R-max – a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 2002, 3, pp. 213–231.
  • [10] Poupart P., Vlassis N., Hoey J., Regan K., An analytic solution to discrete Bayesian reinforcement learning. In: Proc. of the 23rd ICML. ICML ’06, New York, ACM, 2006, pp. 697–704.
  • [11] Kaelbling L., Littman M., Moore A., Reinforcement learning: A survey. J. Artif. Intell. Res., May 1996, 4, pp. 237–285.
  • [12] Guez A., Silver D., Dayan P., Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search. J. Artif. Intell. Res., 2013, 48, pp. 841–883.
Document type
YADDA identifier
bwmeta1.element.baztech-b2a02fc8-5e2f-492b-ac3b-9afe5f19bf27