Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality
Tengyu Xu¹, Zhuoran Yang², Zhaoran Wang³, Yingbin Liang¹
Abstract
Designing off-policy reinforcement learning algorithms is typically a very challenging task, because a desirable itera ...

(Haarnoja et al., 2018), etc. However, these successes usually rely on the access to on-policy samples, i.e., samples collected online from on-policy visitation (or stationary)









