Asking the SAS experts: a data processing problem!


Original poster

peijiamei posted on 2010-3-26 19:05:29

Bounty: 1000 forum coins

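I've run into a data processing problem and would like to ask the experts!

Dataset 1:

company year price

1 1 3
1 2 5
2 1 2
2 2 3
3 1 3
3 2 7

Dataset 2:

company

2
3

I want to keep every row of dataset 1 whose company appears in dataset 2 and drop all the others. The data volume is large.
How do I program this?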


Floor 2

gzjb posted on 2010-3-26 19:05:30

data a;
input company year price;
datalines;
1 1 3
1 2 5
2 1 2
2 2 3
3 1 3
3 2 7
;

data b;
input company;
datalines;
2
3
;

proc sql;
create table abcom AS
select *
from a
where company in (select company from b)
;
quit;

proc print noobs; run;

SAS Output:

                              company year price

                                       2       1    2
                                       2       2    3
                                       3       1    3
                                       3       2    7

Or use SQL like this:

* implicit join (select * logs a warning because company exists in both tables);
proc sql;
create table abcom AS
select * from a,b
where a.company=b.company
;
quit;

* or, listing the columns from a explicitly;
proc sql;
create table abcom AS
select a.company,a.year,a.price from a,b
where a.company=b.company
;
quit;

* or, with an explicit inner join;
proc sql;
create table abcom AS
select a.company,a.year,a.price from a inner join b
on  a.company=b.company
;
quit;
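
One caveat for the join variants above: if the key table lists a company more than once, a join returns each matching row of a once per duplicate, while the IN subquery does not. A minimal sketch of one way to guard against that, assuming a hypothetical key table b_dup with a repeated key (the names b_dup and abcom_dedup are made up for illustration):

```sas
data a;
input company year price;
datalines;
1 1 3
1 2 5
2 1 2
2 2 3
3 1 3
3 2 7
;

/* hypothetical key table in which company 2 is listed twice */
data b_dup;
input company;
datalines;
2
2
3
;

proc sql;
/* de-duplicate the keys in an inline view before joining,
   so each row of a is returned at most once */
create table abcom_dedup as
select a.*
from a inner join (select distinct company from b_dup) as k
on a.company = k.company;
quit;
```

Joining b_dup directly would return the two company 2 rows twice each. If the key table is already known to be unique, the extra DISTINCT is not needed.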

Floor 3

gzjb posted on 2010-3-26 19:47:48

*** method 1: the SQL listed on floor 2 ***;

*** method 2 follows: sort + merge ***;

data a;
input company year price;
datalines;
1 1 3
1 2 5
2 1 2
2 2 3
3 1 3
3 2 7
;

data b;
input company;
datalines;
2
3
;

proc sort data=a out=asort;
by company;
run;

proc sort data=b out=bsort;
by company;
run;

data newab;
merge asort (in=ina) bsort(in=inb);
by company;
if ina and inb;
run;

proc print noobs; run;

SAS Output:

                              company year price

                                       2       1    2
                                       2       2    3
                                       3       1    3
                                       3       2    7


Floor 4

jingju11 posted on 2010-3-26 20:45:56

The answers above are good. Here is another method:

proc sql;
create table newAB as
select a.* from a
where exists (select *
from b
where a.company = b.company);
quit;


Floor 5

醉_清风 posted on 2010-3-26 21:05:57

The key question is which one is more efficient...


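Floor 6

jingju11 posted on 2010-3-26 21:10:55

醉_清风 posted on 2010-3-26 21:05
The key question is which one is more efficient...

Ha. Mine, of course. 1. Mine needs no sort; 2. Mine needs no Cartesian matching (sorry, this claim may not be right). So all I can say is...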

Floor 7

soporaeternus posted on 2010-3-26 23:59:39

The test data is as follows:

data a;
do i=1 to 10000000;
x=ceil(ranuni(0)*1000);
y=ranuni(123456);
z=y/2;
output;
end;
drop i;
run;
data b;
do x = 1 to 200;
output;
end;
run;


proc sql;
create table abcom AS
select a.* from  a inner join b
on  a.x=b.x;
NOTE: Table WORK.ABCOM created, with 2001228 rows and 3 columns.
;
quit;
NOTE: PROCEDURE SQL used (Total process time):
   real time          4.56 seconds
   cpu time           4.61 seconds

data d;
  if _N_=1 then do;
     declare hash h1(dataset:"work.b");
     declare hiter i1("h1");
     h1.definekey("x");
     h1.definedata("x");
     h1.definedone();
  end;

  set a end=EOF;

  rc=h1.find();
  if rc=0 then output;

  keep x y z;
run;
NOTE: There were 200 observations read from the data set WORK.B.
NOTE: There were 10000000 observations read from the data set WORK.A.
NOTE: The data set WORK.D has 2001228 observations and 3 variables.
NOTE: DATA statement used (Total process time):
   real time          6.34 seconds
   cpu time           4.00 seconds

proc sql;
create table newAB as
select a.* from a
   where exists (select *
            from b
            where a.x = b.x);
NOTE: Table WORK.NEWAB created, with 2001228 rows and 3 columns.
quit;
NOTE: PROCEDURE SQL used (Total process time):
   real time          2:52.17
   cpu time           2:52.01

proc sql;
create table newAB as
select a.* from a
   where x in (select x
            from b
            );
NOTE: Table WORK.NEWAB created, with 2001228 rows and 3 columns.
quit;
NOTE: PROCEDURE SQL used (Total process time):
   real time          14.56 seconds
   cpu time           14.53 seconds

proc sort data=a;by x;run;
NOTE: There were 10000000 observations read from the data set WORK.A.
NOTE: The data set WORK.A has 10000000 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
   real time          13.93 seconds
   cpu time           14.71 seconds

quit;
proc sort data=b;by x;run;
NOTE: There were 200 observations read from the data set WORK.B.
NOTE: The data set WORK.B has 200 observations and 1 variables.
NOTE: PROCEDURE SORT used (Total process time):
   real time          0.01 seconds
   cpu time           0.01 seconds

quit;

data ccc;
  merge a(in=a1) b(in=b1);
  by x;
  if a1 and b1;
run;
NOTE: There were 10000000 observations read from the data set WORK.A.
NOTE: There were 200 observations read from the data set WORK.B.
NOTE: The data set WORK.CCC has 2001228 observations and 3 variables.
NOTE: DATA statement used (Total process time):
   real time          2.98 seconds
   cpu time           2.96 seconds

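Tentative efficiency grades (real time from the log above)
A   inner join (4.56 s)
A-  hash (6.34 s)
B   in subquery (14.56 s)
B+  sort+merge (13.93 s + 0.01 s for the sorts, 2.98 s for the merge)
??  where exists (2:52.17)

For massive data I personally prefer the various joins.
Hash needs to be modified accordingly when handling one-to-many matches.
I am not keen on sort+merge... personal preference.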
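
On the one-to-many remark: here is a minimal sketch of what such a modification might look like, assuming the lookup table can hold several rows per key and every match should be output. The table b_multi, its tag column, and the output name matched are hypothetical and not from the thread. It relies on the hash object's multidata option and find_next method, available from SAS 9.2.

```sas
data a;
input company year price;
datalines;
1 1 3
1 2 5
2 1 2
2 2 3
3 1 3
3 2 7
;

/* hypothetical lookup table with two rows for company 2 */
data b_multi;
input company tag $;
datalines;
2 x
2 y
3 z
;

data matched;
  if _n_ = 1 then do;
    if 0 then set b_multi;  /* puts TAG into the PDV without reading any row */
    declare hash h(dataset: "work.b_multi", multidata: "y");
    h.definekey("company");
    h.definedata("tag");
    h.definedone();
  end;

  set a;

  /* walk through every lookup row stored under the current key */
  rc = h.find();
  do while (rc = 0);
    output;
    rc = h.find_next();
  end;

  drop rc;
run;
```

For the pure filtering task in this thread no data item is needed, so h.check() in place of h.find() would probably be enough.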


Floor 8

gzjb posted on 2010-3-27 08:10:33

7# soporaeternus

Thanks for letting me know. I learned about hash from you. Bless.

Floor 9

bobguy posted on 2010-3-27 09:57:15


Many good answers have been posted for this problem. IMO there is another good way to solve it: SAS formats. Here is the result.
As long as there are not too many keys, compiling a SAS format takes almost no time. At the bottom you can see that the format approach beats all the others in this run when the keys can be expressed as a range.

proc sql noprint;
select x into: keepid separated by ','
from b;
quit;
NOTE: PROCEDURE SQL used (Total process time):
   real time          0.01 seconds
   cpu time           0.01 seconds

proc format;
value keys
&keepid='1'
other ='0'
;
NOTE: Format KEYS is already on the library.
NOTE: Format KEYS has been output.
run;

NOTE: PROCEDURE FORMAT used (Total process time):
   real time          0.00 seconds
   cpu time           0.00 seconds

data d;
do until(end);
   set a end=end;
   if put(x,keys.)='1' then output;
end;
run;

NOTE: There were 10000000 observations read from the data set WORK.A.
NOTE: The data set WORK.D has 2000612 observations and 3 variables.
NOTE: DATA statement used (Total process time):
   real time          6.48 seconds
   cpu time           6.46 seconds

proc sql;
create table d AS
select a.* from  b inner join a
on  a.x=b.x;
NOTE: Table WORK.D created, with 2000612 rows and 3 columns.

;
quit;
NOTE: PROCEDURE SQL used (Total process time):
   real time          6.25 seconds
   cpu time           5.98 seconds

data d;
  if _N_=1 then do;
     declare hash h1(dataset:"work.b");
     declare hiter i1("h1");
     h1.definekey("x");
     h1.definedata("x");
     h1.definedone();
  end;

  set a end=EOF;

  rc=h1.find();
  if rc=0 then output;

  keep x y z;
run;

NOTE: There were 200 observations read from the data set WORK.B.
NOTE: There were 10000000 observations read from the data set WORK.A.
NOTE: The data set WORK.D has 2000612 observations and 3 variables.
NOTE: DATA statement used (Total process time):
   real time          5.93 seconds
   cpu time           5.93 seconds

proc format;
value keys
1-200='1'
other ='0'
;
NOTE: Format KEYS is already on the library.
NOTE: Format KEYS has been output.
run;

NOTE: PROCEDURE FORMAT used (Total process time):
   real time          0.00 seconds
   cpu time           0.00 seconds

data d;
do until(end);
   set a end=end;
   if put(x,keys.)='1' then output;
end;
run;

NOTE: There were 10000000 observations read from the data set WORK.A.
NOTE: The data set WORK.D has 2000612 observations and 3 variables.
NOTE: DATA statement used (Total process time):
   real time          5.43 seconds
   cpu time           4.39 seconds
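
The format in the runs above is built from a macro variable holding the key list. When there are too many keys for that to be comfortable, the same kind of format can probably be built straight from the key table with PROC FORMAT's CNTLIN= option. This is a minimal sketch on the thread's small example data; the names b_keys, keyfmt, keepco, and d_fmt are made up for illustration, and nothing here was timed.

```sas
data a;
input company year price;
datalines;
1 1 3
1 2 5
2 1 2
2 2 3
3 1 3
3 2 7
;

data b;
input company;
datalines;
2
3
;

/* a format cannot contain the same start value twice, so de-duplicate the keys */
proc sort data=b out=b_keys nodupkey;
  by company;
run;

/* control data set: one row per key mapped to '1', plus an OTHER row mapped to '0' */
data keyfmt;
  set b_keys(rename=(company=start)) end=last;
  retain fmtname "keepco" type "n";
  label = "1";
  output;
  if last then do;
    hlo = "O";     /* marks this row as the OTHER range */
    label = "0";
    output;
  end;
  keep fmtname type start label hlo;
run;

proc format cntlin=keyfmt;
run;

/* keep only the rows whose key maps to '1' */
data d_fmt;
  set a;
  if put(company, keepco.) = "1";
run;
```

On the example data this keeps the four rows for companies 2 and 3. Whether it performs like the range-based format on 10 million rows would need to be measured.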


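Floor 10

soporaeternus posted on 2010-3-27 10:19:20

9# bobguy
I have learned the format technique.
When people from SAS came to work on a project, they did say that for this kind of replace-style lookup a format is the most efficient...
Learned something.

Speaking of methods, expressing table b as a chain of if/else conditions and applying it while doing set a might also work well. I will test it when I have time...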
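
One possible reading of that idea, sketched on the thread's small example data: generate the condition from table b and apply it while reading a. This uses a generated IN list rather than a literal if/else chain, which may not be exactly what was meant, and it was not timed. The names keylist and d_in are made up for illustration.

```sas
data a;
input company year price;
datalines;
1 1 3
1 2 5
2 1 2
2 2 3
3 1 3
3 2 7
;

data b;
input company;
datalines;
2
3
;

/* put the distinct keys of b into a macro variable, e.g. 2 3 */
proc sql noprint;
  select distinct company into :keylist separated by " "
  from b;
quit;

/* subsetting IF with the generated list, evaluated while reading a */
data d_in;
  set a;
  if company in (&keylist);
run;
```

A macro variable value is limited to 65,534 characters, so this only suits a modest number of keys.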
