OP: mingdashike22

[Computer Science] Lambda Policy Iteration and Application to the Game of Tetris


OP: mingdashike22 (employment verified), posted 2022-3-4 20:26:30 from mobile

---
English title:
《Performance Bounds for Lambda Policy Iteration and Application to the
  Game of Tetris》
---
Author:
Bruno Scherrer (INRIA Lorraine - LORIA)
---
Latest submission year:
2011
---
Classification:

Primary category: Computer Science
Secondary category: Artificial Intelligence
Category description: Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
--
Primary category: Computer Science
Secondary category: Robotics
Category description: Roughly includes material in ACM Subject Class I.2.9.
--

---
English abstract:
  We consider the discrete-time infinite-horizon optimal control problem formalized by Markov Decision Processes. We revisit the work of Bertsekas and Ioffe, that introduced $\lambda$ Policy Iteration, a family of algorithms parameterized by $\lambda$ that generalizes the standard algorithms Value Iteration and Policy Iteration, and has some deep connections with the Temporal Differences algorithm TD($\lambda$) described by Sutton and Barto. We deepen the original theory developed by the authors by providing convergence rate bounds which generalize standard bounds for Value Iteration described for instance by Puterman. Then, the main contribution of this paper is to develop the theory of this algorithm when it is used in an approximate form and show that this is sound. Doing so, we extend and unify the separate analyses developed by Munos for Approximate Value Iteration and Approximate Policy Iteration. Eventually, we revisit the use of this algorithm in the training of a Tetris playing controller as originally done by Bertsekas and Ioffe. We provide an original performance bound that can be applied to such an undiscounted control problem. Our empirical results are different from those of Bertsekas and Ioffe (which were originally qualified as "paradoxical" and "intriguing"), and much more conform to what one would expect from a learning experiment. We discuss the possible reason for such a difference.
---
PDF link:
https://arxiv.org/pdf/0711.0694
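To make the algorithm family concrete, here is a minimal NumPy sketch of the exact $\lambda$ Policy Iteration update on a toy random MDP. This is an illustrative assumption, not the paper's implementation: the function name, the toy transition/reward data, and the iteration count are all made up for demonstration. As the abstract states, $\lambda=0$ recovers Value Iteration and $\lambda=1$ recovers Policy Iteration; each step solves the fixed-point equation $v = r_\pi + (1-\lambda)\gamma P_\pi v_k + \lambda\gamma P_\pi v$ for the greedy policy $\pi$.

```python
import numpy as np

def lambda_policy_iteration(P, r, gamma, lam, n_iter=200):
    """Sketch of exact lambda-Policy Iteration (hypothetical helper).

    P: (A, S, S) transition matrices P[a, s, s'], rows sum to 1.
    r: (S, A) rewards.
    lam=0 reduces to Value Iteration, lam=1 to Policy Iteration.
    """
    A, S, _ = P.shape
    v = np.zeros(S)
    for _ in range(n_iter):
        # Greedy policy with respect to the current value estimate v_k.
        q = r + gamma * np.einsum("ast,t->sa", P, v)   # q[s, a]
        pi = q.argmax(axis=1)
        P_pi = P[pi, np.arange(S), :]                  # (S, S) under pi
        r_pi = r[np.arange(S), pi]
        # v_{k+1} solves v = r_pi + (1-lam)*gamma*P_pi v_k + lam*gamma*P_pi v,
        # a linear system in v (exactly solvable since 0 <= lam*gamma < 1).
        rhs = r_pi + (1.0 - lam) * gamma * (P_pi @ v)
        v = np.linalg.solve(np.eye(S) - lam * gamma * P_pi, rhs)
    return v

# Toy random MDP (illustrative data only).
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

v_vi = lambda_policy_iteration(P, r, gamma, lam=0.0)  # behaves as Value Iteration
v_pi = lambda_policy_iteration(P, r, gamma, lam=1.0)  # behaves as Policy Iteration
v_mid = lambda_policy_iteration(P, r, gamma, lam=0.5)
```

In the exact (non-approximate) setting all three runs converge to the same optimal value function; the interest of intermediate $\lambda$, per the abstract, lies in the convergence-rate and approximation-error bounds.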

Keywords: Lambda, Tetris, Bertsekas, Ioffe, iteration, time horizon, discrete
