[Classic reinforcement learning text] Insights in Reinforcement Learning [pdf] - Quantitative Investing

Posted by yujun1214 on 2018-5-25 13:45:29

Contents
1 Introduction
1.1 The Aim of this Dissertation
1.2 Previous Work
1.3 Agents
1.4 Reinforcement Learning
1.5 Overview
2 Reinforcement Learning
2.1 Markov Decision Processes
2.2 Dynamic Programming
2.3 Model-Free Value Learning
2.4 Learning Action Values
2.5 Conclusion
3 Estimation Biases in Maximization
3.1 Introduction
3.2 Preliminaries
3.3 The Single Estimator
3.4 The Double Estimator
3.5 Comparing the Single and Double Estimator
3.6 A Comparison on Uniform Variables
3.7 The Effect of More Samples
3.8 The Effect of More Variables
3.9 Conclusion
3.10 Proofs
4 The Overestimation of Q-learning
4.1 Context and Contributions
4.2 Overestimations in Bandit Problems
4.3 Convergence Rates of Q-learning
4.4 Double Q-learning
4.5 Experiments
4.6 Conclusion
4.7 Proofs
5 Action Value Algorithms
5.1 Introduction
5.2 Gradients and Norms
5.3 Expected Sarsa
5.4 General Q-learning
5.5 QV-learning
5.6 Actor Critic
5.7 Actor Critic Learning Automata
5.8 Experiments
5.9 Conclusion
5.10 Proofs
6 Ensemble Algorithms in Reinforcement Learning
6.1 Ensemble Methods
6.2 Voting Schemes
6.3 Policy Based Ensembles
6.4 Summary of Ensemble Methods
6.5 Experiments
6.6 Discussion and Future Research
6.7 Conclusion
7 Continuous State and Action Spaces
7.1 Introduction
7.2 Markov Decision Processes in Continuous Spaces
7.3 Function Approximation
7.4 Approximate Reinforcement Learning
7.5 Continuous Actions
7.6 Experiments
7.7 Conclusion
8 Discussion
8.1 Summary
8.2 Conclusions
8.3 Rules of Thumb
8.4 Conclusion
Publications by the Author
Dutch Summary
Acknowledgments
Bibliography