Adaptive (or actor) critics are a class of reinforcement learning algorithms. Generally, in adaptive critics, one starts with a randomized policy and gradually updates the probabilities of selecting actions until a deterministic policy is obtained. Classically, these algorithms have been studied for Markov decision processes with model-free updates. Algorithms that build a model of the process are often more stable and require less training than their model-free counterparts. We propose a new adaptive critic that builds the model during learning, for a discounted-reward semi-Markov decision process, under some assumptions on the structure of the process. We illustrate the use of the algorithm with numerical results on a system with 10 states and a real-world case study from management science.
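To make the general actor-critic idea described above concrete, the following is a minimal illustrative sketch of a generic (model-free) actor-critic loop for a discounted-reward semi-Markov decision process, with Boltzmann action selection and sojourn-time-dependent discounting. It is not the model-building algorithm proposed here; the toy transition probabilities, rewards, sojourn times, and step sizes are all hypothetical placeholders.

```python
# Generic actor-critic sketch for a discounted-reward SMDP (illustrative only).
# All model parameters below are hypothetical, not taken from the paper.
import math
import numpy as np

N_STATES, N_ACTIONS = 2, 2
GAMMA = 0.1                # continuous-time discount rate (assumed)
ALPHA, BETA = 0.01, 0.01   # critic / actor step sizes (assumed)

# Hypothetical SMDP simulator: transition probabilities, rewards, mean sojourn times.
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.5, 0.5], [0.2, 0.8]]])   # P[s, a, s']
R = np.array([[6.0, 10.0], [-5.0, 12.0]])  # lump-sum reward R[s, a]
TAU = np.array([[1.0, 2.0], [1.5, 1.0]])   # mean sojourn time TAU[s, a]

values = np.zeros(N_STATES)                # critic: state-value estimates
prefs = np.zeros((N_STATES, N_ACTIONS))    # actor: action preferences

def boltzmann(s):
    """Softmax (Boltzmann) action selection over the actor's preferences."""
    z = np.exp(prefs[s] - prefs[s].max())
    p = z / z.sum()
    return np.random.choice(N_ACTIONS, p=p)

s = 0
for _ in range(50_000):
    a = boltzmann(s)
    s_next = np.random.choice(N_STATES, p=P[s, a])
    tau = np.random.exponential(TAU[s, a])     # random sojourn time in state s
    discount = math.exp(-GAMMA * tau)          # continuous-time discounting
    # TD error with sojourn-time-dependent discounting.
    delta = R[s, a] + discount * values[s_next] - values[s]
    values[s] += ALPHA * delta                 # critic update
    prefs[s, a] += BETA * delta                # actor update (policy improvement)
    s = s_next

print("state values:", values)
print("greedy policy:", prefs.argmax(axis=1))
```

As the action preferences separate, the Boltzmann selection becomes increasingly greedy, which mirrors the gradual convergence from a randomized policy toward a deterministic one described in the abstract.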