Average-Reward Off-Policy Policy Evaluation with Function Approximation
Shangtong Zhang¹*, Yi Wan²*, Richard S. Sutton², Shimon Whiteson¹
Abstract
We consider off-policy policy evaluation with function approximation (FA) in av ...

... which aim to generate a policy that maximizes the reward rate by iteratively improving the policy using its estimated differential value function (see, e.g., Howard (1960); Konda ...
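The terms *reward rate* and *differential value function* can be illustrated with a small example. This is a minimal sketch under stated assumptions, not the off-policy FA method the paper proposes. The assumptions are:

- A fixed policy has already been folded into a hypothetical two-state Markov chain `P` with expected per-state rewards `r`.
- The chain is irreducible and aperiodic.
- The differential values are normalized so that their stationary average is zero, which is one common convention.

The sketch computes the exact on-policy quantities by simple iteration. The reward rate is the stationary average of `r`. The differential values satisfy `v = r - r_bar + P v`.

```python
# Minimal sketch (hypothetical toy example, not the paper's algorithm):
# reward rate and differential value function of a FIXED policy, assumed to be
# already folded into a small ergodic, aperiodic Markov chain P with rewards r.
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def mat_vec(P: Matrix, v: Vector) -> Vector:
    """Return P @ v (expected next-state value for each state)."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]


def vec_mat(d: Vector, P: Matrix) -> Vector:
    """Return d @ P (state distribution after one step)."""
    n = len(P)
    return [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary_distribution(P: Matrix, iters: int = 1000) -> Vector:
    """Power iteration d <- d P from the uniform distribution."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = vec_mat(d, P)
    return d


def reward_rate_and_differential_values(
    P: Matrix, r: Vector, horizon: int = 1000
) -> Tuple[float, Vector]:
    """Return (r_bar, v) where r_bar = sum_s d(s) r(s) and
    v = sum_{t>=0} (P^t r - r_bar), truncated at `horizon`.
    This v satisfies v = r - r_bar + P v and sum_s d(s) v(s) = 0."""
    d = stationary_distribution(P)
    r_bar = sum(ds * rs for ds, rs in zip(d, r))
    v = [0.0] * len(r)
    pt_r = list(r)  # holds P^t r, starting at t = 0
    for _ in range(horizon):
        v = [vi + x - r_bar for vi, x in zip(v, pt_r)]
        pt_r = mat_vec(P, pt_r)
    return r_bar, v


if __name__ == "__main__":
    P = [[0.9, 0.1], [0.5, 0.5]]  # hypothetical transition matrix under the policy
    r = [1.0, 0.0]                # hypothetical expected reward in each state
    r_bar, v = reward_rate_and_differential_values(P, r)
    print(r_bar, v)
```

For this toy chain, solving the equations by hand gives the following values, and the printed numbers should be close to them:

- stationary distribution (5/6, 1/6)
- reward rate r̄ = 5/6
- differential values (5/18, -25/18)

Truncating the series is only practical because the chain mixes quickly. Off-policy learning with function approximation, the setting the abstract refers to, cannot assume direct access to `P`.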

