[Figure: Risk-sensitive RL policy execution returns — distribution of execution returns under the policies learned with β = 0 (all passive), β = 0.25, β = 0.5, β = 0.75, and β = 1 (all aggressive); x-axis: execution return, y-axis: probability density.]

The RL execution agent is represented by a decision tree, which is obtained from a tabular execution policy learned over a discretized state representation. X1, X2, X3, and X4 are the environment state variables. Empirically, we discretize the state variables into the following bins: X1 ∈ [0, A] ∪ (A, +∞); X2 ∈ (−∞, B] ∪ (B, +∞); X3 ∈ (−∞, C1] ∪ (C1, C2] ∪ (C2, +∞); X4 ∈ (−∞, D1] ∪ (D1, D2] ∪ (D2, D3] ∪ (D3, +∞), where A, B, C1, C2, D1, D2, D3 ∈ ℝ, in the same order. Figure 4 shows a fragment of the tabular policy, which is subsequently processed into the decision tree given in Figure 5 by feeding its rows into the classic Hunt's algorithm (Hunt et al., 1966), which determines the exact topology of the decision tree.

| State variable 1 | State variable 2 | State variable 3 | State variable 4 | Action |
| --- | --- | --- | --- | --- |
| … | … | … | … | … |
| 0 ≤ X1 ≤ A | X2 ≤ B | X3 > C2 | X4 > D3 | Passive |
| X1 > A | X2 ≤ B | X3 > C2 | X4 > D3 | Passive |
| X1 > A | X2 > B | X3 > C2 | X4 > D3 | Passive |
| X1 > A | X2 > B | X3 > C2 | D2 < X4 ≤ D3 | Aggressive |
| X1 > A | X2 > B | X3 > C2 | D1 < X4 ≤ D2 | Aggressive |
| X1 > A | X2 > B | X3 > C2 | X4 ≤ D1 | Aggressive |
| … | … | … | … | … |

Figure 4. A fragment of the tabular optimal execution policy.
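Since the policy is tabular, each raw state must first be mapped to its bin index before a row lookup. A minimal sketch of this discretization, assuming numpy; the concrete boundary values assigned to A, B, C1, C2, D1, D2, D3 below are hypothetical, since the text only specifies the bin structure:

```python
import numpy as np

# Hypothetical boundary values; the paper does not specify them.
A, B = 0.5, 0.0
C1, C2 = -1.0, 1.0
D1, D2, D3 = -2.0, 0.0, 2.0

# Right-closed bin edges per state variable, matching
# X1 in [0, A] u (A, +inf), X2 in (-inf, B] u (B, +inf), etc.
EDGES = {
    "X1": [A],
    "X2": [B],
    "X3": [C1, C2],
    "X4": [D1, D2, D3],
}

def discretize(state):
    """Map a raw state {"X1": ..., ..., "X4": ...} to bin indices.

    np.digitize with right=True returns the index of the right-closed
    interval containing the value: 0 for (-inf, e0], 1 for (e0, e1], ...
    """
    return {k: int(np.digitize(state[k], EDGES[k], right=True)) for k in EDGES}

print(discretize({"X1": 0.3, "X2": 0.1, "X3": 1.5, "X4": -0.5}))
# -> {'X1': 0, 'X2': 1, 'X3': 2, 'X4': 1}
```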
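Hunt's algorithm grows the tree recursively: a node whose rows all share one action becomes a leaf; otherwise the rows are partitioned on a splitting attribute and the procedure recurses on each partition. Below is a sketch over rows of the discretized tabular policy; the row encoding and the attribute-selection rule used here (the first attribute that still varies across rows) are simplifying assumptions, since Hunt et al. (1966) admits different selection heuristics.

```python
from collections import defaultdict

def hunts_algorithm(rows, attributes):
    """Build a decision tree from (state_bins, action) rows.

    rows: list of (dict attr -> bin index, action) pairs.
    Returns an action (leaf) or a tuple (attr, {bin index: subtree}).
    """
    actions = {action for _, action in rows}
    if len(actions) == 1:            # pure node -> leaf
        return actions.pop()
    # Pick the first attribute whose value still varies across rows
    # (a stand-in for the impurity-based choices used in practice).
    for attr in attributes:
        values = {bins[attr] for bins, _ in rows}
        if len(values) > 1:
            partitions = defaultdict(list)
            for bins, action in rows:
                partitions[bins[attr]].append((bins, action))
            rest = [a for a in attributes if a != attr]
            return (attr, {v: hunts_algorithm(p, rest) for v, p in partitions.items()})
    # No attribute separates the rows: fall back to the majority action.
    return max(actions, key=lambda a: sum(action == a for _, action in rows))

# The six Figure-4 rows, with bins encoded as indices (e.g. X4 > D3 -> 3).
rows = [
    ({"X1": 0, "X2": 0, "X3": 2, "X4": 3}, "Passive"),
    ({"X1": 1, "X2": 0, "X3": 2, "X4": 3}, "Passive"),
    ({"X1": 1, "X2": 1, "X3": 2, "X4": 3}, "Passive"),
    ({"X1": 1, "X2": 1, "X3": 2, "X4": 2}, "Aggressive"),
    ({"X1": 1, "X2": 1, "X3": 2, "X4": 1}, "Aggressive"),
    ({"X1": 1, "X2": 1, "X3": 2, "X4": 0}, "Aggressive"),
]
tree = hunts_algorithm(rows, ["X1", "X2", "X3", "X4"])
print(tree)
```

Applied to these six rows, the sketch splits on X1, then X2, then X4 (X3 never varies here), yielding leaves that separate the passive rows (X4 > D3) from the aggressive ones.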
4.3.