wdxmahone 发表于 2011-7-25 10:33 
您好,有没有比较快的方法啊,我的数据集有几百万行,而且是取每一行前200多行的数,计算相当慢,即使在服务器上计算也得几十分钟呢!谢哈~~bobguy 发表于 2011-7-23 11:08 
wdxmahone 发表于 2011-7-22 16:13 
各位大侠,我想取前100行的最小值,怎么实现。可以考虑以下测试数据。x1 x2
01 1
01 2
01 1
01 4
01 5
02 1
02 -3
02 3
02 4
02 5
;
X1是分组变量,我想取X2的每个分组变量前三行的最小值。
我用sql的group by做,但是由于数据量很大,要循环几千次,相当慢,不知道各位大侠有什么好的办法。谢哈~~
IT WORKS ON 3 ROWS. IT SHOULD WORK FOR 100 ROWS. BUT I AM NOT SURE ABOUT ITS PEFORMANCE.
ALSO THE LOGIC FOR THE FIRST AND THE SECOND OBS IN A GROUP IS LIGHTLY DIFFERENT. IT RETURNS THE MINIMUN UPTO 3 ROWS.
data test;
RETAIN G ' ';
input x1 $ x2;
IF G NE X1 THEN DO;
G=X1;
ID=0;
END;
ID+1;
datalines;
01 1
01 2
01 1
01 4
01 5
02 1
02 -3
02 3
02 4
02 5
;
PROC SQL;
SELECT DISTINCT A.X1, A.ID ,A.X2, MIN(B.X2) AS MIN
FROM TEST AS A
JOIN TEST AS B
ON
A.X1=B.X1
AND
(A.ID-B.ID= 0 OR A.ID-B.ID= 1 OR A.ID-B.ID= 2)
GROUP BY A.X1, A.ID
ORDER BY A.X1, A.ID
;
QUIT;
x1 ID x2 MIN
??????????????????????????????????????
01 1 1 1
01 2 2 1
01 3 1 1
01 4 4 1
01 5 5 1
02 1 1 1
02 2 -3 -3
02 3 3 -3
02 4 4 -3
02 5 5 3
As I said before, SQL may have performance issues because it involves a kind of cartisian joint.
Below is a data step approach. With 20000 group , maximum 1000 obs in each group and total obs of 10 millions . It takes 75 seconds on a I5 chip think pad. BTY, the cpu time is linear in number of K. That is a good thing.
879 %let group=20000;
880 %let maxn=1000;
881
882 %let k=100;
883
884 data t1;
885 do g=1 to &group;
886 n=ceil(ranuni(123)*&maxn);
887 do i=1 to n;
888 x=ceil(rannor(123)*&maxn);
889 output;
890 end;
891 end;
892 keep g x;
893 run;
NOTE: The data set WORK.T1 has 10053753 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 2.04 seconds
cpu time 2.04 seconds
894
895 data t2;
896 array tmpx(10000) _temporary_ ;
897 array tmpy(&k) _temporary_ ;
898
899 do until(last.g);
900 set t1;
901 by g;
902 if first.g then cnt=0;
903 cnt+1 ;
904 tmpx[cnt]=x;
905 k=0;
906 do i=1 to &k;
907 tmpy
=1e308;
908 end;
909 do i=max(cnt-&k+1,1) to cnt;
910 k+1;
911 tmpy[k]=tmpx;
912 *put tmpy[k]=;
913 end;
914 call sortn(of tmpy );
915 min=tmpy[1];
916 output;
917 end;
918 keep x g min;
919 run;
NOTE: There were 10053753 observations read from the data set WORK.T1.
NOTE: The data set WORK.T2 has 10053753 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 1:15.28
cpu time 1:15.14