A Comparison of Programs (revisit)

#1 (original poster)

jingju11 posted on 2014-9-9 02:41:19

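Comparing programs is often less clear-cut than it looks. Correct results, efficient execution, simple syntax, and a degree of generality can all serve as informal criteria. Some of these are subjective, but the first two, correctness and efficiency, tend to be more objective and more important.

Here I take a question posted earlier on this forum, together with the solutions offered for it. Because the task may involve fairly large data and many passes over the data, I ran a simple comparison. The results are not quite what they appear to be, or what I had expected.

Correctness: the three programs give exactly the same results.
Efficiency: judged by run time. Because the differences in run time are very pronounced, I did not repeat the runs. The results may still vary somewhat across PCs and environments.
Simplicity and generality: a matter of taste, so not compared in depth.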

Source of the problem: https://bbs.pinggu.org/thread-3174189-1-1.html

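The data look like this, with three columns: Title, Authors (names separated by |), and Number_authors.

Title      Authors                      Number_authors
Title 1    Name A | Name B              2
Title 2    Name A | Name B | Name C     3
...

There are about 20,000 observations, where
1. Title is unique.
2. Number_authors ranges from 1 to 200.
The goal is to create, for each observation, a series of five variables at_least_x_authors_repeat, with x taking the integer values 1 to 5: at_least_1_authors_repeat, at_least_2_authors_repeat, at_least_3_authors_repeat, at_least_4_authors_repeat, at_least_5_authors_repeat. Each variable takes the value 0 or 1.
The variables describe how many of a title's authors are repeated elsewhere in the data: at_least_x_authors_repeat is 1 when at least x of the title's authors appear together in some other title.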
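To make the definition concrete, here is a small hand-made data set in the same layout. The table name toy and its three rows are made up for illustration and are not part of the original question. Title 1 and Title 2 share two authors (Name A and Name B), so for both of them at_least_1_authors_repeat and at_least_2_authors_repeat should be 1 and the other three flags 0. Title 3 shares no author with any other title, so all five of its flags should be 0.

```sas
/* Toy data, made up for illustration; same three columns as in the question */
data toy;
   length Title $20 Authors $200;
   /* fields are separated by commas, so blanks and '|' stay inside Authors */
   infile datalines dlm=',' truncover;
   input Title $ Authors $ Number_authors;
   datalines;
Title 1,Name A | Name B,2
Title 2,Name A | Name B | Name C,3
Title 3,Name D,1
;
run;

proc print data=toy noobs;
run;
```

A data set this small can be checked by hand, which makes it a convenient first test for any of the three methods before moving to the simulated data.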

Method 1: a single DATA step

Method 2: several SQL steps

Method 3: a mix of several DATA steps and SQL

/*simulate data*/
%let n =50000;
data test;
   call streaminit(0);
   length Title $20. Authors $200.;
   array LD[26] $6. _temporary_;
   do i =1 to &n;
      title =cats('title', put(i, z10.));
      /* first and last letter index of this title's author range */
      start =min(rand('poisson',2),10) +1;
      stop = max(start,min(rand('poisson',6),26));
      call missing(of ld[*]);
      do j =start to stop;
         ld[j] ='NAME '||substrn('ABCDEFGHIJKLMNOPQRSTUVWXYZ', j, 1);
      end;
      Authors =catx('|', of ld[*]);
      Number_authors =stop -start +1;
      output;
   end;
   drop start stop i j;
run;
/*-----------------------------------------------------------------------------------------------*/
%macro method1;
   proc sql noprint;
      select distinct cats(max(number_authors)) into: maxn from test;
   quit;
   data method1;
      set test;
      /* split this title's authors into a temporary array */
      array t[&maxn] $10. _temporary_;
      call missing(of t[*]);
      do i =1 to Number_authors;
         t[i] =scan(authors, i, '|');
      end;
      /* k = largest number of authors shared with any other title */
      k =0;
      do i =1 to nobs;
         set test(keep=authors rename=(authors=authors1)) point=i nobs=nobs;
         if _n_ ^=i then do j =1 to Number_authors;
            if not missing(t[j]) then if find(authors1, cats(t[j])) then s ++1;
         end;
         k =max(k, s);
         s =0;
      end;
      array f[5] at_least_1_authors_repeat at_least_2_authors_repeat at_least_3_authors_repeat at_least_4_authors_repeat at_least_5_authors_repeat;
      do i =1 to dim(f);
         f[i] =0;
         if i <=k then f[i] =1;
      end;
      drop k j s authors1;
   run;
%mend method1;
/*-----------------------------------------------------------------------------------------------*/
%macro method2;
   /* one row per (title, author) */
   data ex;
      length author $10;
      set test;
      do i=1 to Number_authors;
         author=strip(scan(authors,i,'|'));
         output;
      end;
      keep Title author;
   run;
   proc sql;
      /* pair each author row with the other titles that have the same author */
      create table ex1 as
      select t.title, t.author, t1.title as title1
      from ex t
      left join
      ex t1
      on t.author=t1.author
      and t.title ^=t1.title
      order by t.title,t.author,t1.title
      ;
      /* cnt = number of authors shared by each pair of titles */
      create table ex2 as
      select t.title,t.title1,sum(case when t.title1 is null then 0 else 1 end) as cnt
      from ex1 t group by t.title,t.title1
      order by t.title,t.title1
      ;
      /* flags from the largest shared count per title */
      create table method2 as
      select title,
         max(cnt) as max_cnt,
         case when max(cnt)>=1 then 1 else 0 end as at_least_1_authors_repeat,
         case when max(cnt)>=2 then 1 else 0 end as at_least_2_authors_repeat,
         case when max(cnt)>=3 then 1 else 0 end as at_least_3_authors_repeat,
         case when max(cnt)>=4 then 1 else 0 end as at_least_4_authors_repeat,
         case when max(cnt)>=5 then 1 else 0 end as at_least_5_authors_repeat
      from ex2
      group by title;
   quit;
%mend method2;
/*-----------------------------------------------------------------------------------------------*/
%macro method3;
   proc sql noprint;
      select distinct max(number_authors) into: maxn
      from test;
   quit;
   %let maxn=&maxn; %put *&maxn*;
   /* one row per non-empty subset of each title's authors, chosen by a binary mask */
   data test1;
      array author(&maxn) $10;
      set test;
      comb=2**number_authors-1;
      fmt="Binary"||cats(number_authors)||".";
      do i=1 to comb;
         k=0;
         binary=reverse(putn(i,fmt));
         call missing(of author1- author&maxn);
         do j=1 to number_authors;
            if substr(binary,j,1)="1" then do;
               k+1;
               author(k)=left(scan(authors,j,"|"));
            end;
         end;
         call sortc(of author&maxn-author1);
         output;
      end;
      keep author1-author5 title;
   run;
   /* titlenum = number of distinct titles that contain each author combination */
   proc sql;
      create table test2 as
      select distinct author1,author2, author3,author4,author5,title,count(distinct title) as titlenum
      from test1
      group by author1,author2,author3,author4,author5
      order by author1,author2,author3,author4,author5;
   quit;
   /* varn = number of non-blank names among author1-author5 */
   data test3;
      set test2;
      by author1-author5;
      array repeat_(5);
      tmp=catx("*", of author1-author5);
      varn=count(tmp,"*")+1;
      repeat_(varn)=(titlenum>1);
   run;
   proc sql;
      create table author_repeat as
      select distinct title, max(repeat_1) as at_least_1_authors_repeat
         , max(repeat_2) as at_least_2_authors_repeat
         , max(repeat_3) as at_least_3_authors_repeat
         , max(repeat_4) as at_least_4_authors_repeat
         , max(repeat_5) as at_least_5_authors_repeat
      from test3
      group by title;
   quit;
   data method3;
      merge test author_repeat;
      by title;
   run;
%mend method3;

Results

Run time in seconds (rounded to three decimals)

N          Method 1     Method 2          Method 3
1,000         0.891        8.247             3.281
5,000        21.744      499.839            17.080
10,000       88.730    2,230.138            33.899
20,000      342.348    not available        71.160
50,000    2,177.961    not available       192.802
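The post does not show how these times were taken. One common way, sketched here as an assumption rather than as the author's actual harness, is to read the clock before and after each macro call. The macro names time_it and demo are hypothetical; demo is only a stand-in workload so that the sketch runs on its own.

```sas
/* Stand-in workload (hypothetical) so the sketch runs on its own */
%macro demo;
   data _null_;
      call sleep(1, 1);   /* pause for 1 second */
   run;
%mend demo;

/* Run the macro named in &name and write its elapsed wall-clock time to the log */
%macro time_it(name);
   %local t0 t1;
   %let t0 = %sysfunc(datetime());
   %&name
   %let t1 = %sysfunc(datetime());
   %put &name=%sysevalf(&t1 - &t0) seconds;
%mend time_it;

%time_it(demo)
```

With the macros from the post compiled, the same helper could be called as %time_it(method1), %time_it(method2) and %time_it(method3) after setting n and rebuilding the test data set.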

To compare the results, a program like the following can be used:

proc compare base =method1 compare =method2; run;

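From these results, when there are few records (N = 1,000, say), all three programs finish within seconds. At N = 5,000, Method 2 becomes slow. At N = 10,000, Method 2 takes more than 20 times as long as either of the others, and Method 3 is the fastest at roughly 2/5 of Method 1's time. When the data grow to 20,000 records, Method 2 was not tested because it takes too long, while Method 3 shows a clear advantage at about 1/5 of Method 1's time. At N = 50,000 Method 3's advantage is even more pronounced, at less than 1/10 of Method 1's time.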

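A brief summary:

Method 1 is my program. Given the way it was conceived and structured, there seems to be little room for optimization. Once the data exceed 10,000 records it runs rather slowly, and I would guess that beyond 100,000 records it becomes almost unusable. My thinking at the time was that the simpler the program, the better.
Method 2 uses SQL, which many people evidently find easier to understand. But because it involves a LEFT JOIN with an inequality condition, this step is very slow on large data. Among SQL implementations, SAS SQL is probably about the least efficient, although newer versions seem to have improved. In other words, this program might run much faster outside SAS. Beyond 20,000 records the program is almost unusable.
Method 3 looks the most complex but is the most efficient, especially as the data grow. Whether a program is complex is not really the point: once it is settled and stable, you do not need to keep rereading the source, and efficiency becomes what matters most.
For the requirements of this problem, Method 3 is clearly the best choice, because its run time is far shorter than that of the other two.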

by JingJu(my blog)

Also a relevant link in my blog: http://blog.sina.com.cn/s/blog_a3a926360102v0w6.html

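#2

pobel posted on 2014-9-9 08:18:56

Brother JingJu, you put real care into this. Respect!

In my view, the highlight of Method 2 is its clever idea. Had I thought of it at the time, I would not have used Method 3.
It is just that on large data Method 2 is dragged down by that LEFT JOIN step.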

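#3

playmore posted on 2014-9-9 09:22:34

Brother JingJu is awesome.
I did not take part in the earlier question.
Of these three methods I can only roughly follow the second; the other two are beyond me.
I think this may be a major problem with SAS-style syntax: it is complicated, and many simple problems become complicated, the DATA step especially.

This problem can also be solved with matrix operations. Take the following matrix:
     A    B    C    D    E
1    1    1
2    1    1    1
3    ...
4    ...
Rows (1, 2, 3, ...) stand for titles and columns (A, B, C, ...) for authors; the element at (m, n) equals 1 when Title m has the author Name n.
Take the dot product of each row with every other row. For example, row 1 times row 2 is 1*1+1*1+0*1=2, meaning that two authors appear in both Title 1 and Title 2; the rest follow the same way, and a little tidying up gives the final result. The advantage of this approach is that the logic is simple, and the code should not be complicated either. It may not be efficient, though: with m titles the number of row pairs is m*(m-1)/2, about the same as the left join in Method 2. But the matrix can be stored and processed as a sparse matrix, and writing the matrix multiplication in C would make it much faster. I will not write out the code; I am not much good with IML any more and have switched to R, haha.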
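The sparse storage mentioned above can be pictured as a long table with one row per nonzero matrix element, that is, one row per (title, author) pair. The sketch below is one way to take the row dot products directly on such a table with a self-join; it is an illustration under that reading, not code from any of the posters. The table ex_toy and its rows are made up, and no claim is made about its speed on large data.

```sas
/* Toy (title, author) pairs: the nonzero elements of the title-by-author matrix */
data ex_toy;
   length title $20 author $10;
   infile datalines dlm=',' truncover;
   input title $ author $;
   datalines;
Title 1,Name A
Title 1,Name B
Title 2,Name A
Title 2,Name B
Title 2,Name C
Title 3,Name D
;
run;

proc sql;
   /* cnt = dot product of two different rows = number of shared authors */
   create table pair_cnt as
   select a.title, b.title as title1, count(*) as cnt
   from ex_toy as a inner join ex_toy as b
     on a.author = b.author and a.title ne b.title
   group by a.title, b.title;

   /* largest shared count per title */
   create table max_common as
   select title, max(cnt) as max_cnt
   from pair_cnt
   group by title;
quit;
```

Because this is an inner join, a title that shares no author with any other title (Title 3 here) does not appear in max_common and would have to be given a count of 0 when the flags are built.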

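#4

jingju11 posted on 2014-9-9 10:14:32

pobel posted on 2014-9-9 08:18
Brother JingJu, you put real care into this. Respect!

In my view, the highlight of Method 2 is its clever idea. Had I thought of it at the time, I would not have used Method ...

Method 2 is even less efficient when run in SAS. The original question called for about 20,000 records. I estimate that it would take at least four hours on my PC, which makes it almost unusable.
JingJu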

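#5

jingju11 posted on 2014-9-9 10:29:20

playmore posted on 2014-9-9 09:22
Brother JingJu is awesome.
I did not take part in the earlier question.
Of these three methods I can only roughly follow the second; the other two are beyond me.

The matrix you describe should have dimensions [number of records] * [number of distinct participating authors]. My simulated data have only 26 distinct authors, but in practice that number should be large, quite possibly larger than the number of records. Still, your suggestion helps explain how my program does its computation.
It is like your dot product: take the dot product of the current row with each of the other rows, then pick the largest sum, sum(1*1, 1*1, 1*0, ...), which is the largest number of authors shared with another title.
JingJu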

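#6

ziyenano posted on 2014-9-9 14:08:42

Haha, I did not expect to see a program I wrote show up here.
If you are after perfection, then in terms of efficiency Method 3 is certainly the best.
PS: on this forum pobel's command of the DATA step is top class. Kudos.
When I wrote up this problem I was really just treating it as a game to exercise my thinking, and did not consider efficiency at all.
The good thing about SQL is that it is simple and structured: whatever you have in mind translates easily into SQL. There are pros and cons, of course, and often the efficiency problem cannot be avoided.
I write SQL in SAS largely to save effort, since it can be handed to a database and run unchanged.
Actually SAS SQL runs reasonably well and is no worse than databases such as Oracle; it is just that databases are usually installed on servers while SAS is usually the PC version, which creates this impression.
I used to joke with friends that code optimization is for the poor: give me a minicomputer and I would happily write NOT IN directly.
Also, the problem itself is quite complicated; I can hardly follow what I wrote myself.
Finally, thanks to Brother JingJu for carefully putting all this together for comparison!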

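#7

jingju11 posted on 2014-9-9 19:51:04

Actually SAS SQL runs reasonably well and is no worse than databases such as Oracle; it is just that databases are usually installed on servers while SAS is usually the PC version, which creates this impression.

Thanks for the explanation. Some of these points I can test:
(1) whether SAS on a server is faster than on a PC
(2) the performance of some database SQL compared with SAS SQL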

Jingju

#8

jingju11 posted on 2014-9-10 01:23:21

jingju11 posted on 2014-9-9 19:51
Thanks for the explanation. Some of these points I can test:
(1) whether SAS on a server is faster than on a PC
(2) performanc ...

According to my test, some database SQL, such as Teradata SQL, can be more efficient than SAS SQL in some particular cases. The test was based on the left-join SQL in Method 2. The parameters were as follows:
n obs = 20,000
Records in the resulting dataset = 970,484,186
Size of the resulting data file = 59.254 GB
Time consumed:
SAS SQL = 4564 seconds (76'04'')
Teradata SQL = 688 seconds (11'28'')
This test is very expensive, so I did not run any more. To conclude, assuming the server keeps an even, normal speed, the SAS SQL code running on a SAS server took almost 7 times as long as the same query run in Teradata SQL through SAS SQL pass-through. In my experience, the efficiency gain from Teradata SQL should be significant.
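For readers unfamiliar with SQL pass-through, the general shape of such a run is sketched below. This is not the code used in the test above, which was not posted. It assumes SAS/ACCESS Interface to Teradata is licensed and that the (title, author) table has already been loaded into the database; the user, password and server values and the table name mydb.ex are hypothetical placeholders.

```sas
/* Sketch only: myuser, mypass, mytd and mydb.ex are hypothetical placeholders */
proc sql;
   connect to teradata (user=myuser password=mypass server=mytd);

   /* the query in parentheses is sent to Teradata and executed there */
   create table ex1 as
   select * from connection to teradata
   (
      select t.title, t.author, t1.title as title1
      from mydb.ex t
      left join mydb.ex t1
        on t.author = t1.author
       and t.title <> t1.title
   );

   disconnect from teradata;
quit;
```

In this form only the result set travels back to SAS, so for a result of the size reported above the transfer itself could still take considerable time.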

JingJu

#9

jingju11 posted on 2014-9-10 09:28:45

playmore posted on 2014-9-9 09:22
Brother JingJu is awesome.
I did not take part in the earlier question.
Of these three methods I can only roughly follow the second; the other two are beyond me.

There is a lot in what you said that I do not really understand. Today I tried it with FCMP, and your idea is a good one. In summary:
A = {r[i,j]} is a matrix with i = 1, ..., n and j = 1, ..., c, where n is the number of rows (titles) and c is the number of distinct author names,
and r[i,j] = 1 if author j is on the author list of title i, otherwise 0.
For example, if there are 10 distinct authors in all and title 1 has authors 1, 3 and 4,
then the first row = [1 0 1 1 0 0 0 0 0 0], and so on. The computation is

M = A(A^T), where A^T is the transpose of A
U = elementwise product of M and S, where S is a matrix of all 1's except for 0's on the diagonal
L = U^T, where U^T is the transpose of U
The maximum number of common authors for the ith row = the maximum of the ith row of L
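The same computation can be written almost line for line in SAS/IML. The sketch below is an illustration on a made-up 3 x 5 matrix, not the FCMP code mentioned above, and it assumes SAS/IML is available. Because M and S are both symmetric, U is symmetric as well, so the transpose step is skipped and the row maximum is taken from U directly.

```sas
proc iml;
   /* Toy matrix, made up: rows are titles, columns are distinct authors */
   A = {1 1 0 0 0,
        1 1 1 0 0,
        0 0 0 1 0};
   n = nrow(A);

   M = A * t(A);               /* M[r,c] = number of authors shared by titles r and c */
   S = j(n, n, 1) - i(n);      /* all 1's with 0's on the diagonal */
   U = M # S;                  /* elementwise product: drop each title's match with itself */
   maxCommon = U[ , <>];       /* row-wise maximum */

   /* column x of flags is 1 when the row maximum is at least x, for x = 1 to 5 */
   flags = ( repeat(maxCommon, 1, 5) >= repeat(1:5, n, 1) );
   print maxCommon flags;
quit;
```

For this toy matrix the first two titles share two authors and the third shares none, so maxCommon is 2, 2, 0 and only the first two flag columns are set for the first two rows. A dense n x n matrix M grows quickly with the number of titles, which is the memory concern raised elsewhere in the thread.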

JingJu

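#10

playmore posted on 2014-9-10 14:39:17

jingju11 posted on 2014-9-10 09:28
There is a lot in what you said that I do not really understand. Today I tried it with FCMP, and your idea is a good one. In summary:
A = {r} matrix i =1 ...

Right, that is basically the algorithm.
The trouble with my algorithm is that it needs a sparse matrix plus matrix multiplication.
Done directly in R, 10000×10000 is about the limit for a 32-bit CPU with 4 GB of memory,
unless you write it yourself with a matrix library such as OpenBLAS.

The advantage of handling this problem in SAS is that each observation is a single row of data,
with no redundant data,
and doing the work in a DATA step is then also quite efficient.
It is just that a DATA step is *** to write and even more *** to read.
Add array, call and the like, and understanding the code becomes more tiring than writing it yourself.

I think the consequence is that macros that can be reused and shared are very rare.
A cleverly designed piece of DATA step code is not only limited to one specific task but also tied to a specific table structure.
It cannot be encapsulated and cannot be inherited.
You write a piece and throw it away; others can hardly understand it, and so it is even harder for them to modify.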
