SAS grouped summaries - SAS board - 经管之家 (formerly 人大经济论坛)

dream_2016 posted on 2014-7-25 22:05:49

Bounty: 300 forum coins

Data set A has three variables: stock code (code), trade date (date), and trade amount (volume). For example:

000001 200203 5200
000001 200204 8100
000001 200205 3600
...
000002 200203 3300
000002 200204 4500
000002 200205 6700
...

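I want to compute two percentile cutoffs (the 10th and 90th) from the trade amounts of all stocks in data set A, use them to split A into three groups, and then, for each stock, get the number of observations and the total trade amount in each of the three groups (low, middle, high). This is the program I wrote: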

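/* Step 1: 10th and 90th percentiles of volume over all stocks */
proc univariate data=a;
   var volume;
   output out=a_pct pctlpre=p pctlpts=10 90;
run;

/* Step 2: attach the two cutoffs to every row and assign a group */
data b;
   set a;
   if _n_=1 then set a_pct;
   if volume<=p10 then group1=1;
   else if volume>p90 then group1=3;
   else group1=2;
run;

/* Step 3: one row per stock and group, with count and total volume */
proc sql;
   create table result as
   select code, group1, count(group1) as num, sum(volume) as sumv
   from b
   group by code, group1;
quit;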

The example above is monthly, but my real data are daily and very large, and the program above is too slow. Could someone help me write a faster version?

bobguy posted on 2014-7-25 22:05:50

In the following simulation the data set contains about 1,000 stocks with daily transactions from Jan 1, 1980 to Jan 1, 2014. The total number of observations is over 12,000,000.

Real time ~2.40 + 6.21 seconds
CPU time ~ 8.59  + 14.13 seconds

*****************log**************;
200
201  proc means data=stock p10 p90 noprint;
202  var volume;
203  output out=pctl p10=p10 p90=p90;
204  run;

NOTE: There were 12432420 observations read from the data set WORK.STOCK.
NOTE: The data set WORK.PCTL has 1 observations and 4 variables.
NOTE: PROCEDURE MEANS used (Total process time):
   real time          2.40 seconds
   cpu time          8.59 seconds

205
206  data pctl_fmt;
207 set pctl;
208 length start $20.;
209 fmtname='group';
210 start='low' ; end=put(p10,best.); label=1;output;
211 start=put(p10,best.); ; end=put(p90,best.); label=2;output;
212 start=put(p90,best.); ; end='high'; label=3;output;
213
214 run;

NOTE: There were 1 observations read from the data set WORK.PCTL.
NOTE: The data set WORK.PCTL_FMT has 3 observations and 8 variables.
NOTE: DATA statement used (Total process time):
   real time          0.01 seconds
   cpu time          0.01 seconds

215
216 proc format cntlin=pctl_fmt;
NOTE: Format GROUP is already on the library WORK.FORMATS.
NOTE: Format GROUP has been output.
217 run;

NOTE: PROCEDURE FORMAT used (Total process time):
   real time          0.00 seconds
   cpu time          0.00 seconds

NOTE: There were 3 observations read from the data set WORK.PCTL_FMT.

218
219 data stock_view/view=stock_view;
220    set stock;
221    group=put(volume,group.);
222    run;

NOTE: DATA STEP view saved on file WORK.STOCK_VIEW.
NOTE: A stored DATA STEP view cannot run under a different operating system.
NOTE: DATA statement used (Total process time):
   real time          0.01 seconds
   cpu time          0.00 seconds

223
224 proc means data=stock_view n sum noprint;
225 class code group ;
226 var volume;
227 output out=sum n=count  sum=sum_volume;
228 run;

NOTE: View WORK.STOCK_VIEW.VIEW used (Total process time):
   real time          6.21 seconds
   cpu time          14.13 seconds

NOTE: There were 12432420 observations read from the data set WORK.STOCK.
NOTE: There were 12432420 observations read from the data set WORK.STOCK_VIEW.
NOTE: The data set WORK.SUM has 4008 observations and 6 variables.
NOTE: PROCEDURE MEANS used (Total process time):
   real time          6.22 seconds
   cpu time          14.13 seconds
***********************************************************;

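/* Simulate daily data: codes 1000 to 2000, one row per code per day */
data stock;
   do date = '1jan1980'd to '1jan2014'd;
      do code = 1000 to 2000;
         volume = ceil(ranuni(123) * 10000);
         output;
      end;
   end;
run;

proc print data=stock(obs=10);
run;

/* 10th and 90th percentiles of volume over all stocks */
proc means data=stock p10 p90 noprint;
   var volume;
   output out=pctl p10=p10 p90=p90;
run;

/* Build a CNTLIN data set describing the three ranges of format GROUP */
data pctl_fmt;
   set pctl;
   length start $20.;
   fmtname = 'group';
   start = 'low';           end = put(p10, best.); label = 1; output;
   start = put(p10, best.); end = put(p90, best.); label = 2; output;
   start = put(p90, best.); end = 'high';          label = 3; output;
run;

proc format cntlin=pctl_fmt;
run;

/* DATA step view: the group is computed on the fly, no copy of STOCK is written */
data stock_view / view=stock_view;
   set stock;
   group = put(volume, group.);
run;

/* Count and total volume by code and group */
proc means data=stock_view n sum noprint;
   class code group;
   var volume;
   output out=sum n=count sum=sum_volume;
run;

proc print data=sum;
run;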
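
The log above reports 4008 observations in WORK.SUM. With a CLASS statement and no NWAY option, PROC MEANS also writes the overall total and the one-way totals by code and by group, in addition to the code-by-group rows. A minimal sketch, assuming only the code-by-group rows are wanted (the output name sum_nway is made up for illustration):

```sas
/* Sketch: NWAY keeps only the rows for the full code*group combination */
proc means data=stock_view n sum noprint nway;
   class code group;
   var volume;
   /* _TYPE_ and _FREQ_ are added automatically; drop them if not needed */
   output out=sum_nway(drop=_type_ _freq_) n=count sum=sum_volume;
run;
```

Without NWAY, the same rows can be picked out afterwards by filtering on the highest _TYPE_ value.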

zhanglianbo35 posted on 2014-7-25 22:26:56

First find out which of the three steps above is the slowest.

dream_2016 posted on 2014-7-25 22:38:21

zhanglianbo35 wrote on 2014-7-25 22:26:
First find out which of the three steps above is the slowest.

All of them are slow. My data are at least several GB.

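zhanglianbo35 posted on 2014-7-25 22:44:56

A few GB is not that big. You can build an index, then do the last two steps in a single SQL step with create view result as, and afterwards query the view to preview part of the results.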

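dream_2016 posted on 2014-7-25 22:50:49

zhanglianbo35 wrote on 2014-7-25 22:44:
A few GB is not that big. You can build an index, then do the last two steps in a single SQL step with create view result as, and afterwards query the view to preview part of the ...

Could you write it out for me? I'm still fairly new and only know the basics.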
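
Nobody in the thread wrote out zhanglianbo35's suggestion. The following is only a rough sketch of what it might look like, assuming the one-row data set a_pct (variables p10 and p90) from the first post already exists. The view name result_v is made up for illustration, and whether the index helps at all depends on the data and the queries.

```sas
proc sql;
   /* Optional: an index on code mainly helps queries that subset by code */
   create index code on a(code);

   /* Grouping and aggregation in one view; nothing is computed yet */
   create view result_v as
   select a.code,
          case
             when a.volume <= p.p10 then 1
             when a.volume >  p.p90 then 3
             else 2
          end as group1,
          count(*)      as num,
          sum(a.volume) as sumv
   from a, a_pct as p   /* a_pct has a single row, so no rows of a are duplicated */
   group by a.code, calculated group1;

   /* Preview part of the result; the view is evaluated when it is queried */
   reset outobs=20;
   select * from result_v;
quit;
```

The view still has to read all of A every time it is queried, so it saves disk space for the intermediate table rather than the scan itself.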

yongyitian posted on 2014-7-26 00:24:01

/* Put the two cutoffs from A_PCT into macro variables */
data _null_;
   set a_pct;
   call symputx('p10', p10);
   call symputx('p90', p90);
run;

/* <- excludes the lower endpoint, so the three ranges do not overlap */
proc format;
   value pgroup low  -  &p10 = "1"
                &p10 <- &p90 = "2"
                &p90 <- high = "3";
run;

data b1;
   set a;
   group1 = put(volume, pgroup.);
run;

proc sql;
   create table result1 as
   select code, group1, count(group1) as num, sum(volume) as sumv
   from b1
   group by code, group1;
quit;

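dream_2016 posted on 2014-7-27 21:58:11

yongyitian wrote on 2014-7-26 00:24:

Thank you very much for your answer. My final solution borrows your PROC FORMAT approach, but for this step
data b1;
set a;
group1= put(volume, pgroup.);
run;
I followed bobguy and used a DATA step view instead, which saves quite a lot of time on my large data.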
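
The combination described above was not posted in full. A minimal sketch of how it might look, assuming the pgroup format from yongyitian's post has already been created (the view name b1_v is made up for illustration):

```sas
/* DATA step view: group1 is computed while A is read, no copy is stored */
data b1_v / view=b1_v;
   set a;
   group1 = put(volume, pgroup.);
run;

/* The summary reads A once, through the view */
proc sql;
   create table result1 as
   select code, group1, count(*) as num, sum(volume) as sumv
   from b1_v
   group by code, group1;
quit;
```

Compared with writing b1 as a table, this avoids one full write and one full re-read of the large data set.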

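dream_2016 posted on 2014-7-27 22:26:03

bobguy wrote on 2014-7-27 09:13:
In the following simulation the data set contains 1000 stocks with transactions from Jan 1, 1980 to ...

Thanks for your answer; the view approach in particular is very useful to me. I also have a question about sample selection:
Take data set A from before (three variables: stock code (code), trade date (date), trade amount (volume)). It may contain 1,000 stocks, but I only need 600 of them (listed in data set sapcd, which has a single column, the stock code code). I tried the following two ways to get the sample I want:

1. MERGE approach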
data a_1;
merge a(in=ina) sapcd(in=insap);
by code;
if ina and insap;
run;

2. Hash approach
data a_1;
   if _n_=0 then set sapcd;
   if _n_=1 then do;
      declare hash h(dataset:'sapcd');
         h.definekey('code');
         h.definedata('code');
         h.definedone();
   end;
   set a;
   if h.find()=0 then output;
run;

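Both give the same result. In theory the hash approach should be more efficient, but I found the two methods take the same time to run. Is my hash program wrong? Also, do you have a better method?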
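
This question got no reply in the thread. As far as can be told from the post, the hash step is functionally fine; the following is only a slightly leaner sketch of the same lookup, assuming code has the same type and length in a and sapcd. A plausible reason for the similar run times is that both steps read all of A once, which likely dominates the cost, while the MERGE version additionally needs both inputs sorted by code.

```sas
/* Sketch: keep the rows of A whose code is listed in SAPCD */
data a_1;
   if _n_ = 1 then do;
      /* Load the wanted codes into memory once */
      declare hash h(dataset: 'sapcd');
      h.defineKey('code');
      h.defineDone();
   end;
   set a;
   /* check() only tests whether the key exists; it retrieves no data */
   if h.check() = 0;
run;
```

Unlike the MERGE step, this does not require A or sapcd to be sorted by code.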
