Evaluating a policy by deploying it in the real world can be risky and costly. Off-policy evaluation (OPE) algorithms use historical data collected from running a previous policy to evaluate a new policy, which provides a means for evaluating a policy without requiring it to ever be deployed. Importance sampling is a popular OPE method because it is robust to partial observability and works with continuous states and actions. However, we show that the amount of historical data required by importance sampling can scale exponentially with the horizon of the problem: the number of sequential decisions that are made. We propose using policies over temporally extended actions, called options, to address this long-horizon problem. We show theoretically and experimentally that combining importance sampling with options-based policies can significantly improve performance for long-horizon problems.
from cs.AI updates on arXiv.org http://ift.tt/2lRBaGy
via IFTTT
No comments:
Post a Comment