Strategy platform used: Ricequant, the strategy research platform from RiceQuant (www.ricequant.com). P.S. I personally think it is quite good and recommend giving it a try.
Building a Machine Learning Prediction Model Based on Historical Ups and Downs
The previous parts introduced the basic concepts of machine learning, the use of scikit-learn, and the characteristics and distribution of our data, the HS300 index. Now we get into the hands-on machine learning work. The questions I discuss come down to three points:
- Choice of machine learning estimator: which algorithm we use to make our predictions.
- Choice of the number of training samples: how many training samples we use before each prediction.
- Choice of the up/down time window: the number of features in each sample, i.e. how many consecutive trading days of ups and downs each training unit covers (see the sketch after this list).
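To make the windowing concrete, here is a minimal sketch of how a sliding window turns the daily up/down series into feature/label pairs. It is my own illustration with a made-up toy series, not part of the original code; only the name `window` matches the code used later.

```python
# Toy illustration (not from the original post): build training samples from an
# up/down series using the previous `window` days as features.
up_and_down = [True, False, True, True, False, True, False]  # made-up daily up/down labels
window = 2  # number of past trading days used as features for each sample

X, y = [], []
for i in range(window, len(up_and_down)):
    X.append(list(up_and_down[i - window:i]))  # features: the previous `window` days
    y.append(up_and_down[i])                   # label: the current day's direction

# X[0] == [True, False] is paired with label y[0] == True, and so on.
```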
Choosing a Machine Learning Estimator

In practice, one of the trickier problems in applied machine learning is finding an estimator that suits your particular problem: different estimators fit different kinds of data and different objects of study. The chart below, scikit-learn's estimator selection cheat-sheet, gives a rough guide:
Following the map and focusing on the upper-left corner of the chart, the final candidates are EnsembleClassifiers, LinearSVC, and KNeighborsClassifier. There is also a remarkable JMLR paper, "Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?", which tests 179 classification models on all 121 UCI datasets and finds that Random Forests and SVMs perform best. Accordingly, for the ensemble classifier we use RandomForestClassifier, so in the end the comparison is between RandomForestClassifier, LinearSVC, and KNeighborsClassifier.
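For reference, here are the three candidates as scikit-learn estimators. This is just a summary sketch of the shortlist; as in the comparison code below, all three are left at their default settings.

```python
# The three candidate estimators, all at scikit-learn default settings.
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

candidates = {
    'RandomForest': RandomForestClassifier(),
    'LinearSVC': LinearSVC(),
    'KNN': KNeighborsClassifier(),
}
```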
Taking the CSI300 index data plus 15 other randomly chosen stocks, and keeping the number of training samples and the time window the same throughout, we train each of RandomForestClassifier, LinearSVC, and KNeighborsClassifier and compare their performance by computing the prediction win ratio, i.e. the fraction of next-day directions predicted correctly. Here is the code:
```python
from __future__ import division  # must come before the other imports in Python 2

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn import neighbors
from collections import deque
import matplotlib.pyplot as plt

stock_list = ['CSI300.INDX', '000001.XSHE', '000623.XSHE', '000002.XSHE', '000063.XSHE', '000568.XSHE',
              '600036.XSHG', '600690.XSHG', '600030.XSHG', '600674.XSHG', '600535.XSHG', '600010.XSHG',
              '600383.XSHG', '600104.XSHG', '600519.XSHG', '600276.XSHG']
win_ratio1 = []
win_ratio2 = []
win_ratio3 = []
window = 2  # number of past trading days used as features

for stock in stock_list:
    # rd is the data API available in the Ricequant research environment
    df = rd.get_price(stock, '2005-01-01', '2015-07-25').reset_index()[['OpeningPx', 'ClosingPx']]
    up_and_down = df['ClosingPx'] - df['OpeningPx'] > 0  # True if the day closed above its open
    X = deque(maxlen=100)  # rolling training set: only the most recent 100 samples are kept
    y = deque(maxlen=100)
    clf1 = RandomForestClassifier()
    clf2 = svm.LinearSVC()
    clf3 = neighbors.KNeighborsClassifier()
    prediction1 = 0
    prediction2 = 0
    prediction3 = 0
    test_num = 0
    win_num1 = 0
    win_num2 = 0
    win_num3 = 0
    current_index = 400
    # walk forward one trading day at a time, predicting the next day's direction
    for current_index in range(current_index, len(up_and_down) - 1, 1):
        fact = up_and_down[current_index + 1]  # the actual next-day direction
        X.append(list(up_and_down[(current_index - window): current_index]))
        y.append(up_and_down[current_index])
        if len(y) >= 100:  # start testing once the rolling training set is full
            test_num += 1
            clf1.fit(X, y)
            clf2.fit(X, y)
            clf3.fit(X, y)
            prediction1 = clf1.predict(list(up_and_down[(current_index - window + 1): current_index + 1]))
            prediction2 = clf2.predict(list(up_and_down[(current_index - window + 1): current_index + 1]))
            prediction3 = clf3.predict(list(up_and_down[(current_index - window + 1): current_index + 1]))
            if prediction1[0] == fact:
                win_num1 += 1
            if prediction2[0] == fact:
                win_num2 += 1
            if prediction3[0] == fact:
                win_num3 += 1
    #print win_num1/test_num
    #print win_num2/test_num
    #print win_num3/test_num
    #print "____________________"
    win_ratio1.append(win_num1 / test_num)
    win_ratio2.append(win_num2 / test_num)
    win_ratio3.append(win_num3 / test_num)

win_ratio123 = pd.DataFrame({'RandomForest': win_ratio1, 'LinearSVC': win_ratio2, 'KNN': win_ratio3})
win_ratio123.plot(figsize=(15, 10), title='Win Ratio Of Different Stocks')
```

The results are shown in the figure below:
As the figure shows, KNeighborsClassifier is clearly inferior to RandomForestClassifier and LinearSVC: its win ratio fluctuates more and is also lower than the other two. Since I am a bit more familiar with RandomForestClassifier, I will use it as the estimator from here on; the blog post linked here covers the principles and advantages of RandomForestClassifier.

Choosing the Number of Training Samples

The number of training samples constrains, to some degree, how accurate the predictions can be. Ideally I would of course like as many samples as possible for every prediction, but reality is less generous: the training set size is limited on one hand by the total amount of available data, and on the other by computing resources and time. We end up with a compromise of sorts: pick the smallest training set size that still delivers a reasonable level of prediction performance. So we compute the win ratio for training set sizes from 1 to 300, which gives the following code:
```python
from __future__ import division  # must come before the other imports in Python 2

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from collections import deque
import matplotlib.pyplot as plt

# up_and_down for CSI300.INDX is reused from the previous cell:
#df = rd.get_price('CSI300.INDX', '2005-01-01', '2015-07-25').reset_index()[['OpeningPx', 'ClosingPx']]
#up_and_down = df['ClosingPx'] - df['OpeningPx'] > 0
win_ratio2 = []
samples_list = [x for x in range(300) if x != 0]  # training set sizes 1, 2, ..., 299
window = 2

for samples in samples_list:
    clf = RandomForestClassifier()
    X = deque(maxlen=samples)  # rolling training set of the most recent `samples` samples
    y = deque(maxlen=samples)
    prediction = 0
    test_num = 0
    win_num = 0
    current_index = 600
    for current_index in range(current_index, len(up_and_down) - 1, 1):
        fact = up_and_down[current_index + 1]
        X.append(list(up_and_down[(current_index - window): current_index]))
        y.append(up_and_down[current_index])
        if len(y) >= samples:  # start testing once the rolling training set is full
            test_num += 1
            clf.fit(X, y)
            prediction = clf.predict(list(up_and_down[(current_index - window + 1): current_index + 1]))
            if prediction[0] == fact:
                win_num += 1
    #print win_num/test_num
    win_ratio2.append(win_num / test_num)

fig = plt.figure(figsize=(12, 10))
plt.plot(samples_list, win_ratio2, 'ro--')
plt.title('Win Ratio Of Different Number Of Training Samples')
plt.show()
```

The result:
As the figure shows, with other conditions held fixed, the win ratio improves and becomes more stable as the number of training samples grows, eventually fluctuating around 0.52 to 0.53. To save computing resources, and given the total amount of historical data available, a training set of 100 samples is a reasonable choice.

Choosing the Up/Down Time Window

The size of the up/down time window reflects how the ups and downs of past trading days influence the next trading day. Does that influence really exist? I think trading psychology gives it some basis: as a trader, if I saw the past ten trading days all close up, I would lean towards being cautiously bearish on the next day and step aside to watch. That is an extreme example, of course; how it actually plays out is, in the end, up to the data.
```python
from __future__ import division  # must come before the other imports in Python 2

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from collections import deque
import matplotlib.pyplot as plt

# up_and_down for CSI300.INDX is reused from the previous cell:
#df = rd.get_price('CSI300.INDX', '2005-01-01', '2015-07-25').reset_index()[['OpeningPx', 'ClosingPx']]
#up_and_down = df['ClosingPx'] - df['OpeningPx'] > 0
win_ratio = []
window_list = [x for x in range(200) if x != 0]  # time windows 1, 2, ..., 199

for window in window_list:
    clf = RandomForestClassifier()
    X = deque(maxlen=100)  # rolling training set of the most recent 100 samples
    y = deque(maxlen=100)
    prediction = 0
    test_num = 0
    win_num = 0
    current_index = 400
    for current_index in range(current_index, len(up_and_down) - 1, 1):
        fact = up_and_down[current_index + 1]
        X.append(list(up_and_down[(current_index - window): current_index]))
        y.append(up_and_down[current_index])
        if len(y) >= 100:
            test_num += 1
            clf.fit(X, y)
            prediction = clf.predict(list(up_and_down[(current_index - window + 1): current_index + 1]))
            if prediction[0] == fact:
                win_num += 1
    #print win_num/test_num
    win_ratio.append(win_num / test_num)

fig = plt.figure(figsize=(12, 10))
plt.plot(window_list, win_ratio, 'm')
plt.show()
# Win Ratio Of Different Time Window
```

The result:
A few final remarks:
This last figure rather puzzled me; it felt like staring at noise. Later I tried different values of `current_index` and found that each run looked roughly the same, with one common feature: at the start of every curve there is a cliff-like drop, after which the win ratio oscillates steadily around 0.5, the coin-flip probability. In other words, momentum does exist, but it is small (combining the results of the previous two experiments, its expected win ratio sits around 0.53) and its time window is very short; beyond that window, the prediction problem degenerates into coin flipping. Python is great; I hope backtesting support arrives soon, since I would rather not keep writing these loops...
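As a rough, back-of-the-envelope check on whether a win ratio around 0.53 is actually distinguishable from coin flipping, one can use the normal approximation to the binomial distribution. This is my own addition rather than part of the original experiments, and the number of test days below is only an assumed placeholder; plug in the actual test_num from the runs above.

```python
# Rough significance check (my own addition): how far is an observed win ratio
# from the 0.5 coin-flip baseline, in standard errors, given n_tests predictions?
from math import sqrt

def win_ratio_z_score(win_ratio, n_tests, p0=0.5):
    """z-score of the observed win ratio against a coin-flip null."""
    se = sqrt(p0 * (1 - p0) / n_tests)  # standard error of the win ratio under the null
    return (win_ratio - p0) / se

# n_tests=2000 is an assumed placeholder, not the actual test count from the experiments
print(win_ratio_z_score(0.53, 2000))  # about 2.7 standard errors above a coin flip
```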

