Original poster: sssh307

How to split data into train and test sets, and predict the latter?


Original post
sssh307, posted 2014-11-24 23:27:53


Hi! I am a junior SAS analyst.

I want to split my data into train and test sets, then use a model built on the train set to predict the observations in the test set. The dataset has 50,000 or more observations.

The easiest way I can think of is to use PROC SURVEYSELECT to draw a random sample from the whole dataset. For example, I can ask SAS to sample 30% of the observations as the test set (leaving the remaining 70% as the train set):

PROC SURVEYSELECT DATA=whole.data OUT=test.set METHOD=srs SAMPRATE=0.3;
RUN;

Now I have a test set in the dataset 'test.set'. However:

1. How can I create a dataset (e.g. 'train.set') that holds the remaining 70% of the data?
2. After using 'train.set' to build a predictive model (e.g. a linear model), how can I use that model to predict the data in 'test.set', with output showing the predicted value and residual for every observation?

Thanks for your patience!

David
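
One possible approach to question 1, sketched under assumptions: adding the OUTALL option makes PROC SURVEYSELECT keep every observation and add a 0/1 variable named Selected, which a DATA step can then route into two datasets. The dataset names work.all_flagged, work.train and work.test and the SEED= value are placeholders, not part of the original post.

```sas
/* Flag a 30% simple random sample instead of extracting it.       */
/* OUTALL keeps all rows and adds Selected (1 = sampled, 0 = not). */
/* SEED= is an arbitrary value chosen here for reproducibility.    */
proc surveyselect data=whole.data out=work.all_flagged
                  method=srs samprate=0.3 seed=12345 outall;
run;

/* Route each row by its flag: sampled rows become the test set,   */
/* all remaining rows become the train set.                        */
data work.train work.test;
    set work.all_flagged;
    if Selected = 1 then output work.test;
    else output work.train;
run;
```

Fixing the seed means the same rows land in the test set on every run, which helps when comparing models.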




Reply 1
fjrong, posted 2014-11-25 08:08:57 from mobile
Quoting sssh307, posted 2014-11-24 23:27:
Hi! I am a junior SAS analyst.

I intend to split data into train and test sets, and use the model ...
I don't understand.

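Reply 2
sssh307, posted 2014-11-25 18:42:59
Let me try typing this out again in Chinese.

I have a data file with more than 50,000 records, and I am using SAS. I want to split it into train and test sets, build a model on the train set, and then use that model to predict the test set.

First, I split the data with PROC SURVEYSELECT so that the train and test sets are in a 7:3 ratio. I then filtered for the train-set rows and fit a linear model. Here is the SAS code:

PROC SURVEYSELECT DATA=WORK.MERGED OUTALL OUT=all METHOD=SRS SAMPRATE=0.3;
/* Simple random sampling to separate the train and test sets; SAMPRATE is the fraction of observations assigned to the test set */
RUN;
PROC PRINT DATA=all (obs=100);
RUN;
PROC FREQ DATA=all;
TABLES selected;   /* Count the observations in the train (coded as 0) and test (coded as 1) sets to confirm the split is correct */
RUN;
PROC REG DATA=all;
WHERE selected=0;  /* Use the rows with selected=0 as the train set to fit the model */
MODEL y = x1-x1000 / stb;
RUN;
QUIT;

Finally, I would like to know how to use the fitted model to predict the data in the test set (producing a predicted value and residual for each observation).

Thank you very much!!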
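
One common way to get test-set predictions out of PROC REG, sketched under assumptions: observations whose response is missing are left out of the fit, but they still receive a predicted value in the OUTPUT dataset as long as their regressors are complete. The names all_masked, scored, test_scored, y_train, y_hat and resid are hypothetical; all, selected, y and x1-x1000 come from the code above.

```sas
/* Copy the response, but blank it out for test rows (selected=1) */
/* so that PROC REG estimates the model from the train rows only. */
data all_masked;
    set all;
    y_train = y;
    if selected = 1 then y_train = .;
run;

/* Fit on rows with non-missing y_train. The OUTPUT statement     */
/* still writes a predicted value for every row whose regressors  */
/* x1-x1000 are all non-missing, including the test rows.         */
proc reg data=all_masked;
    model y_train = x1-x1000 / stb;
    output out=scored p=y_hat;
run;
quit;

/* Keep the test rows and compute each residual from the true y,  */
/* since the masked response cannot supply one.                   */
data test_scored;
    set scored;
    where selected = 1;
    resid = y - y_hat;
run;
```

An alternative that keeps the two sets in separate datasets is to save the coefficients with the OUTEST= option of PROC REG and apply them to the test set with PROC SCORE.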
