How to match two datasets in SAS

#1 (original poster)

nuomaniya posted on 2013-1-17 00:55:30

I have two datasets, A and B, as follows:

Dataset A
date	group	no
2012-1-3	a	2
2012-1-11	a	4
2012-1-12	a	5
2012-1-6	b	3
2012-1-21	b	7
2012-2-8	c	11
2012-3-9	c	45
2012-5-10	c	2
2012-1-11	d	113

Dataset B
date	group	no
2012-1-8	a	4
2012-1-13	a	8
2012-1-18	a	4
2012-2-5	b	3
2012-3-8	c	2
2012-3-17	c	10
2012-1-17	d	4

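For every row in B I want to find one corresponding row in A and attach B's "date" and "no" to dataset A. The rule: among all rows in A whose "date" <= the B row's "date" and whose "group" = the B row's "group", take the one with the largest "no". Matching starts from the first row of B, and once a row in A has been matched it is excluded from all later matches, somewhat like sampling without replacement. Any help from the experts here would be much appreciated.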

#2

tangliang0905 posted on 2013-1-17 01:36:21

Match by which column?

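#3

jingju11 posted on 2013-1-17 02:51:39

Assume group is the only character variable and there are no missing values. jingju
I think the question is stated backwards: as I understand it, the rows of A are matched into B.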

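* Store the number of observations in A in macro variable _ab;
data _null_;
  call symputx('_ab', nobs);
  set a nobs=nobs;
  stop;
run;

data ab;
  * Hold A in temporary arrays: t has date and no, tg has group (default length 8);
  array t[&_ab,2] _temporary_;
  array tg[&_ab] $ _temporary_;
  if _n_ = 1 then do p = 1 to nobs;
    set a point=p nobs=nobs;
    t[p,1] = date; t[p,2] = no; tg[p] = group;
  end;
  set b;
  * Among unused A rows with the same group and date <= the B date, keep the largest no;
  do i = 1 to dim1(t);
    if group = tg[i] and date >= t[i,1] and t[i,2] > no_a then do;
      no_a = t[i,2]; date_a = t[i,1]; k = i;
    end;
  end;
  * Blank the group of the matched A row so it cannot be matched again;
  if ^missing(k) then call missing(tg[k]);
  * Assumes date is a numeric SAS date value;
  format date_a yymmdd10.;
  drop i k p;
run;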
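
As a cross-check of the matching rule, here is a minimal Python sketch (not SAS, and not part of the original reply) that applies the rule to the sample data from the question. The function name match_without_replacement and the tuple layout (date, group, no) are illustrative assumptions. Ties on the largest no go to the earliest row of A, which is one possible reading since the question does not say.

```python
from datetime import date

# Sample data from the question, as (date, group, no) tuples.
A = [
    (date(2012, 1, 3), "a", 2), (date(2012, 1, 11), "a", 4),
    (date(2012, 1, 12), "a", 5), (date(2012, 1, 6), "b", 3),
    (date(2012, 1, 21), "b", 7), (date(2012, 2, 8), "c", 11),
    (date(2012, 3, 9), "c", 45), (date(2012, 5, 10), "c", 2),
    (date(2012, 1, 11), "d", 113),
]
B = [
    (date(2012, 1, 8), "a", 4), (date(2012, 1, 13), "a", 8),
    (date(2012, 1, 18), "a", 4), (date(2012, 2, 5), "b", 3),
    (date(2012, 3, 8), "c", 2), (date(2012, 3, 17), "c", 10),
    (date(2012, 1, 17), "d", 4),
]


def match_without_replacement(a_rows, b_rows):
    """Pair each B row, in order, with at most one unused A row.

    Candidates are A rows with the same group and date <= the B date;
    the candidate with the largest no wins and is then marked as used.
    """
    used = [False] * len(a_rows)
    result = []
    for b_row in b_rows:
        b_date, b_group, _ = b_row
        best = None
        for i, (a_date, a_group, a_no) in enumerate(a_rows):
            if used[i] or a_group != b_group or a_date > b_date:
                continue
            # Strict ">" keeps the earliest A row when the largest no is tied.
            if best is None or a_no > a_rows[best][2]:
                best = i
        if best is not None:
            used[best] = True
        result.append((b_row, a_rows[best] if best is not None else None))
    return result


matches = match_without_replacement(A, B)
for b_row, a_row in matches:
    print(b_row, "->", a_row)
```

On the sample data, the seven rows of B are paired with rows of A whose no values are 2, 5, 4, 7, 11, 45 and 113, in that order.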


#4

nuomaniya posted on 2013-1-17 09:04:27

.....

#5

nuomaniya posted on 2013-1-17 15:39:25

What if there is also a SUBGROUP under GROUP (also character, e.g. a1-a3)? And what error would result if GROUP contains missing values? Please advise, thanks!

#6

jingju11 posted on 2013-1-17 23:26:16

nuomaniya posted on 2013-1-17 15:39
What if there is also a SUBGROUP under GROUP (also character, e.g. a1-a3)? And what error would result if GROUP contains missing values ...

Add one more condition for the subgroup, as we did for group.
Missing group values in B are ambiguous here because we cannot tell which groups in A they should be matched to. I would delete the rows with a missing group from B before matching, or run the matching code only when group is not missing. For example,

if ^missing(group) then do;
...
end;


The consequence of including missing groups in our code is this: because we are sampling without replacement, we reset the group of a matched row in A to missing after the match so that it cannot be matched again, and these reset values would then be indistinguishable from groups that were missing to begin with.
Jingju
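
The point about reset values mixing with original missing values can be seen in a small Python sketch (not SAS). It marks a used A row by blanking its group, as described above, and compares the result with and without the guard on missing groups in B. The data, the integer day numbers standing in for dates, and the name match_by_blanking are all made up for illustration.

```python
def match_by_blanking(a_rows, b_rows, skip_missing_group):
    """Greedy matching that marks a used A row by blanking its group.

    a_rows: (day, group, no) tuples; b_rows: (day, group) tuples.
    Returns, for each B row, the index of the matched A row or None.
    """
    groups = [g for _, g, _ in a_rows]  # working copy that gets blanked
    picks = []
    for b_day, b_group in b_rows:
        best = None
        if not (skip_missing_group and b_group is None):
            for i, (a_day, _, a_no) in enumerate(a_rows):
                if groups[i] == b_group and a_day <= b_day:
                    if best is None or a_no > a_rows[best][2]:
                        best = i
        if best is not None:
            groups[best] = None  # "used" now looks the same as "missing"
        picks.append(best)
    return picks


# Hypothetical data: two A rows in group "a"; the second B row has no group.
A = [(1, "a", 5), (2, "a", 3)]
B = [(3, "a"), (4, None)]

unguarded = match_by_blanking(A, B, skip_missing_group=False)
guarded = match_by_blanking(A, B, skip_missing_group=True)
print(unguarded)  # [0, 0]
print(guarded)    # [0, None]
```

Without the guard, the B row with a missing group picks up A row 0 a second time, because the blanked group compares equal to the missing one. With the guard it is left unmatched. Keeping a separate "used" flag per A row instead of blanking the group would avoid the collision altogether.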

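#7

nuomaniya posted on 2013-1-17 23:57:29

Thank you very much. One more question: suppose a column "NEW_NO" (unique within each GROUP) is added to both A and B, and the matching rule becomes: among all rows in A whose "date" <= the B row's "date" and whose "group" = the B row's "group", take the one with the largest "NO"; if "NO" is tied, take the one with the smallest "NEW_NO". How should this be handled?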

#8

nuomaniya posted on 2013-1-18 23:42:34

I would appreciate any further guidance from the experts, thank you.

#9

jingju11 posted on 2013-1-19 01:49:03

nuomaniya posted on 2013-1-18 23:42
I would appreciate any further guidance from the experts, thank you.

* Macro variables: _ab holds the row count of A, no_new_a holds max(no_new) + 1 as a sentinel;
data _null_;
  call symputx('_ab', nobs);
  set a nobs=nobs end=Eof;
  retain max_no_new .;
  if max_no_new <= no_new then max_no_new = no_new;
  if Eof then call symputx('no_new_a', max_no_new + 1);
run;
%put &no_new_a;

data ab;
  * no_new is the column the question calls NEW_NO;
  array t[&_ab,4] _temporary_;
  array tg[&_ab,2] $ _temporary_;
  if _n_ = 1 then do p = 1 to nobs;
    set a point=p nobs=nobs;
    * t[p,4] keeps the observation number of the A row;
    t[p,1] = date; t[p,2] = no; t[p,3] = no_new; t[p,4] = p;
    tg[p,1] = group; tg[p,2] = subGroup;
  end;
  set b;
  no_new_a = &no_new_a;
  * Skip B rows with a missing group so they cannot match used or blank A rows;
  if ^missing(group) then do i = 1 to dim1(t);
    * Larger no wins, and on equal no the smaller no_new wins;
    if group = tg[i,1] and subGroup = tg[i,2] and date >= t[i,1]
       and (t[i,2] > no_a or (t[i,2] = no_a and t[i,3] < no_new_a)) then do;
      date_a = t[i,1]; no_a = t[i,2]; no_new_a = t[i,3]; _obs_a = t[i,4];
      k = i;
    end;
  end;
  * Blank the group of the matched A row so it cannot be matched again;
  if ^missing(k) then call missing(tg[k,1]);
  * No match found, so clear the sentinel value;
  if missing(k) then call missing(no_new_a);
  drop i k p no_new;
run;
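
The tie-break rule from #7 (largest NO first, then smallest NEW_NO) is easy to get subtly wrong, so here is a small Python sketch (not SAS) of the comparison on its own. The (no, new_no) pairs and the function names are hypothetical. The second function shows, for contrast, what happens if "no >= best" and "new_no < best" are both required at once.

```python
def pick_best(candidates):
    """Return the (no, new_no) pair with the largest no.

    When no is tied, the pair with the smaller new_no wins.
    """
    best = None
    for no, new_no in candidates:
        if best is None or no > best[0] or (no == best[0] and new_no < best[1]):
            best = (no, new_no)
    return best


def pick_chained(candidates, sentinel):
    """Contrast case: requires both conditions to hold together.

    sentinel should be larger than every new_no, e.g. max(new_no) + 1.
    """
    best_no, best_new, best = float("-inf"), sentinel, None
    for no, new_no in candidates:
        if no >= best_no and new_no < best_new:
            best_no, best_new, best = no, new_no, (no, new_no)
    return best


# Hypothetical candidates that already satisfy the date and group conditions.
rows = [(4, 1), (5, 3), (5, 2)]
print(pick_best(rows))                 # (5, 2)
print(pick_chained(rows, sentinel=4))  # (4, 1)
```

The chained form keeps (4, 1) because every later pair has a larger new_no, even though its no is larger. The two-part comparison in pick_best avoids this by checking new_no only when no is tied.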