Identifiers
Title variants
Przegląd wybranych rozwiązań opartych na uczeniu ze wzmocnieniem dla problemów z teorii gier
Publication languages
Abstracts
This paper collects several applications of reinforcement learning to problems related to game theory. The methods were selected to illustrate the variety of both the problems and the approaches to solving them. The selection includes Thompson sampling, Q-learning, DQN, and AlphaGo Zero with its Monte Carlo Tree Search algorithm. The paper focuses on the intuition behind the presented algorithms and only briefly touches on technical details. This approach aims at giving an overview of the topic without assuming deep knowledge of statistics or artificial intelligence.
The article collects selected approaches to solving game theory problems with reinforcement learning. The applications were chosen to give as broad a cross-section as possible of the problem classes and of the approaches to solving them. The set of selected algorithms comprises: Thompson sampling, Q-learning, DQN, and AlphaGo Zero. The article emphasizes the intuition behind how the algorithms work, concentrating on a survey of the techniques rather than on technical details.
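As a quick illustration of the first two algorithm families named in the abstract, the sketches below are minimal, self-contained examples. They are not taken from the surveyed paper; all parameter values (arm probabilities, learning rate, number of steps) are arbitrary assumptions for demonstration.

A Bernoulli multi-armed bandit solved with Thompson sampling [4, 7]: each arm keeps a Beta posterior over its success probability, and at every step the arm with the highest posterior sample is played.

```python
import random

def thompson_sampling(true_probs, steps=1000):
    """Bernoulli Thompson sampling; true_probs are illustrative arm payoffs."""
    n_arms = len(true_probs)
    successes = [1] * n_arms  # Beta(1, 1) uniform prior on each arm
    failures = [1] * n_arms
    total_reward = 0
    for _ in range(steps):
        # Sample a plausible success rate from each arm's posterior
        # and play the arm whose sample is highest.
        arm = max(range(n_arms),
                  key=lambda a: random.betavariate(successes[a], failures[a]))
        reward = 1 if random.random() < true_probs[arm] else 0
        total_reward += reward
        # Conjugate Bayesian update of the chosen arm's Beta posterior.
        successes[arm] += reward
        failures[arm] += 1 - reward
    return total_reward

print(thompson_sampling([0.3, 0.5, 0.7]))  # play converges toward the 0.7 arm
```

Tabular Q-learning [13, 14] in the same spirit; the `env` interface (`reset`, `step`, `actions`) is a hypothetical stand-in for any episodic environment:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Watkins-style tabular Q-learning with epsilon-greedy exploration."""
    q = defaultdict(float)  # (state, action) -> estimated discounted return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Explore with probability epsilon, otherwise act greedily.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)
            # Q-learning target: r for terminal steps, otherwise
            # r + gamma * max over a' of Q(s', a').
            best_next = 0.0 if done else max(q[(next_state, a)]
                                             for a in env.actions)
            q[(state, action)] += alpha * (reward + gamma * best_next
                                           - q[(state, action)])
            state = next_state
    return q
```

DQN [17, 18] replaces the table with a neural network and adds experience replay, while AlphaGo Zero [23] combines a policy/value network with Monte Carlo Tree Search [24]; both are too involved for a short sketch here.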
Journal
Year
Volume
Pages
13--22
Physical description
Bibliography: 27 items
Authors
author
- Military University of Technology, Faculty of Cybernetics, Kaliskiego Str. 2, 00-908 Warsaw, Poland
Bibliography
- [1] Binmore K., Game theory: a very short introduction, OUP Oxford, 2007.
- [2] Ameljańczyk A., „Teoria gier”, Vol. 690, p. 78, WAT, 1978.
- [3] Lattimore T., Szepesvári C., Bandit algorithms, Cambridge University Press, 2020.
- [4] Thompson W. R., „On the likelihood that one unknown probability exceeds another in view of the evidence of two samples”, Biometrika, Vol. 25, No. 3-4, 285-294 (1933).
- [5] Thompson W. R., „On the theory of apportionment”, American Journal of Mathematics, Vol. 57, No. 2, 450-456 (1935).
- [6] Agrawal S., Goyal N., „Further optimal regret bounds for Thompson sampling”, Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 31, pp. 99-107, USA, 2013.
- [7] Chapelle O., Li L., „An empirical evaluation of Thompson sampling”, in: Advances in Neural Information Processing Systems 24 (NIPS 2011), Vol. 24, 1-9, 2011.
- [8] Bolstad W. M., Curran J. M., Introduction to Bayesian statistics, John Wiley & Sons, 2016.
- [9] Agrawal S., Goyal N., „Analysis of Thompson sampling for the multi-armed bandit problem”, Proceedings of the Conference on Learning Theory, Edinburgh, UK, 25-27 June 2012, pp. 31-39.
- [10] Kaufmann E., Korda N., Munos R., „Thompson sampling: An asymptotically optimal finite-time analysis”, in: International Conference on Algorithmic Learning Theory, LNCS 7568, pp. 199-213, Springer 2012.
- [11] Auer P., Cesa-Bianchi N., Fischer P., „Finite-time analysis of the multiarmed bandit problem”, Machine Learning, Vol. 47, No. 2, 235-256 (2002).
- [12] Audibert J.-Y., Bubeck S., „Minimax policies for adversarial and stochastic bandits”, in: COLT - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009.
- [13] Watkins C. J. C. H., Learning from delayed rewards, PhD thesis, University of London, 1989.
- [14] Sutton R. S., Barto A. G., Reinforcement learning: An introduction, MIT Press, 2018.
- [15] Bellman R., Kalaba R. E., Dynamic programming and modern control theory, Academic Press, NY 1965.
- [16] Melo F. S., „Convergence of Q-learning: A simple proof”, Institute of Systems and Robotics Tech. Rep., pp. 1-4, 2001.
- [17] Mnih V. and others, „Playing Atari with deep reinforcement learning”, arXiv preprint arXiv:1312.5602, 2013.
- [18] Mnih V. and others, „Human-level control through deep reinforcement learning”, Nature, Vol. 518, No. 7540, 529-533 (2015).
- [19] Li M., Zhang T., Chen Y., Smola A. J., „Efficient mini-batch training for stochastic optimization”, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, NY, 2014, pp. 661-670.
- [20] Bellemare M. G., Naddaf Y., Veness J., Bowling M., „The arcade learning environment: An evaluation platform for general agents”, Journal of Artificial Intelligence Research, Vol. 47, 253-279 (2013).
- [21] Bellemare M. G., Veness J., Bowling M., „Investigating contingency awareness using Atari 2600 games”, Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, Vol. 26, No. 1, pp. 864-871, 2012.
- [22] Tromp J., Farnebäck G., „Combinatorics of Go”, in: Computers and Games, LNCS 4630, pp. 84-99, Springer 2006.
- [23] Silver D. and others, „Mastering the game of Go without human knowledge”, Nature, Vol. 550, No. 7676, 354-359 (2017).
- [24] Browne C. B. and others, „A survey of Monte Carlo tree search methods”, IEEE Transactions on Computational Intelligence and AI in Games, Vol. 4, No. 1, 1-43 (2012).
- [25] Silver D. and others, „Mastering the game of Go with deep neural networks and tree search”, Nature, Vol. 529, No. 7587, 484-489 (2016).
- [26] Rosin C.D., „Multi-armed bandits with episode context”, Annals of Mathematics and Artificial Intelligence, Vol. 61, No. 3, 203-230 (2011).
- [27] Zhang P. and others, „Reinforcement learning-based end-to-end parking for automatic parking system”, Sensors, Vol. 19, No. 18, 3996 (2019).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-1b7dd8d5-97e6-4d52-b062-13211a914a6e