Translated abstract:
In reinforcement learning, gradient-based approaches to direct policy search have recently received much attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. This paper introduces GPOMDP, a simulation-based algorithm for generating a {\em biased} estimate of the gradient of the {\em average reward} in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses a single free parameter $\beta\in[0,1)$ (which has a natural interpretation in terms of the bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the correct choice of the parameter $\beta$ is related to the {\em mixing time} of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains; continuous state, observation, and control spaces; multiple agents; higher-order derivatives; and a version for training stochastic policies with internal state. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
---
English title:
Infinite-Horizon Policy-Gradient Estimation
---
Authors:
Jonathan Baxter and Peter L. Bartlett
---
Year of latest submission:
2019
---
Classification:
Primary: Computer Science
Secondary: Artificial Intelligence
Category description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
---
English abstract:
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a {\em biased} estimate of the gradient of the {\em average reward} in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter $\beta\in [0,1)$ (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter $\beta$ is related to the {\em mixing time} of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
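The estimator the abstract describes keeps a discounted eligibility trace of $\nabla\log$-policy terms and a running average of its product with the reward, so storage is only twice the number of policy parameters. Below is a minimal sketch of such a GPOMDP-style estimator on a toy two-state, two-action Markov chain; the dynamics, rewards, and one-parameter-per-action softmax policy are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gpomdp_estimate(theta, beta=0.9, T=200_000, seed=0):
    """Sketch of a GPOMDP-style estimator of the average-reward gradient.

    Toy 2-state, 2-action Markov chain (hypothetical dynamics for
    illustration). Returns a biased gradient estimate; larger beta means
    less bias but more variance, per the trade-off noted in the abstract.
    """
    rng = np.random.default_rng(seed)
    # P0[a, s] = probability of landing in state 0 after action a in state s
    P0 = np.array([[0.9, 0.2],   # action 0 steers toward state 0 (reward +1)
                   [0.1, 0.8]])  # action 1 steers toward state 1 (reward -1)
    r = np.array([1.0, -1.0])    # reward attached to each state
    s = 0
    z = np.zeros_like(theta)      # eligibility trace z_t
    delta = np.zeros_like(theta)  # running gradient estimate Delta_t
    for t in range(T):
        # Softmax policy: one preference parameter per action.
        p = np.exp(theta - theta.max())
        p /= p.sum()
        a = 0 if rng.random() < p[0] else 1
        grad_log = -p.copy()
        grad_log[a] += 1.0            # gradient of log softmax at chosen action
        z = beta * z + grad_log       # discounted eligibility trace
        s = 0 if rng.random() < P0[a, s] else 1
        # Running average of r(x_{t+1}) * z_{t+1}.
        delta += (r[s] * z - delta) / (t + 1)
    return delta

est = gpomdp_estimate(np.zeros(2))
```

With `theta = [0, 0]` both actions are equiprobable; since action 0 steers toward the +1-reward state, the estimate's first component comes out positive and the second negative, indicating that raising the preference for action 0 increases the average reward.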
---
PDF link:
https://arxiv.org/pdf/1106.0665