Beyond Variance Reduction: Understanding the
True Impact of Baselines on Policy Optimization
Wesley Chung * 1 Valentin Thomas * 2 Marlos C. Machado 3 4 5 Nicolas Le Roux 6 1 2
Abstract
Bandit and reinforcement learning (RL) problems ca ...

& Cesa-Bianchi, 2012). While the former measure is often used in the context of bandits,1 Eiπ i is more common in the context of Markov Decision Processes (MDPs), which













