Dialogue systems rely on a careful reinforcement learning design: the learning algorithm and its state-space representation. In the absence of more rigorous knowledge, the designer falls back on practical experience to choose the best option. To automate and improve this process, this article formalises the problem of online off-policy reinforcement learning algorithm selection. A meta-algorithm receives as input a portfolio of several off-policy reinforcement learning algorithms. At the beginning of each new trajectory, it determines which algorithm in the portfolio controls the behaviour for the entirety of that trajectory, with the goal of maximising the return. The article presents a novel meta-algorithm called Epochal Stochastic Bandit Algorithm Selection (ESBAS). Its principle is to freeze the policy updates during each epoch and to leave a rebooted stochastic bandit in charge of the algorithm selection. Under some assumptions, a thorough theoretical analysis demonstrates its near-optimality with respect to the structural sampling-budget limitations. ESBAS is then put to the test in a set of experiments with various portfolios on a negotiation dialogue game. The results show the practical benefits of algorithm selection for dialogue systems, in most cases even outperforming the best algorithm in the portfolio, even when the aforementioned assumptions are violated.
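To make the stated principle concrete, below is a minimal Python sketch of one plausible reading of the ESBAS loop: within each epoch the portfolio policies are frozen and a freshly rebooted stochastic bandit (UCB1 here, as an assumption) picks which algorithm controls each full trajectory; at the end of the epoch, every algorithm updates off-policy from all collected data. The interface names (`OffPolicyAlgo`, `run_trajectory`, `env.step`) and the doubling epoch schedule are illustrative assumptions, not details given in the abstract.

```python
# Hedged sketch of the ESBAS principle; not the paper's exact specification.
import math
from typing import List, Protocol


class OffPolicyAlgo(Protocol):
    def act(self, state): ...            # frozen policy used within an epoch
    def update(self, trajectories): ...  # off-policy update from any collected data


def run_trajectory(env, algo):
    """Roll out one full trajectory with `algo` in control; return (trajectory, return)."""
    state, done, traj, ret = env.reset(), False, [], 0.0
    while not done:
        action = algo.act(state)
        next_state, reward, done = env.step(action)
        traj.append((state, action, reward, next_state, done))
        ret += reward
        state = next_state
    return traj, ret


def esbas(env, portfolio: List[OffPolicyAlgo], n_epochs: int = 10):
    data = []                                  # shared off-policy dataset
    for beta in range(n_epochs):
        epoch_len = 2 ** beta                  # assumed doubling epoch schedule
        counts = [0] * len(portfolio)          # bandit statistics, rebooted each epoch
        sums = [0.0] * len(portfolio)
        for t in range(1, epoch_len + 1):
            # UCB1 selection: each portfolio algorithm is an arm, the trajectory
            # return is the bandit reward. Policies stay frozen during the epoch.
            if t <= len(portfolio):
                k = t - 1                      # play each arm once first
            else:
                k = max(range(len(portfolio)),
                        key=lambda i: sums[i] / counts[i]
                        + math.sqrt(2 * math.log(t) / counts[i]))
            traj, ret = run_trajectory(env, portfolio[k])
            counts[k] += 1
            sums[k] += ret
            data.append(traj)
        # End of epoch: every algorithm learns off-policy from all data,
        # regardless of which algorithm generated it.
        for algo in portfolio:
            algo.update(data)
```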