Original poster: bayes

[Original post] The table is too large. How should it be handled?

11
soporaeternus posted on 2011-6-17 14:12:50
If proc sort can run on the data, sort it first; a data step can then do the job as well.

Is this a limitation of proc freq itself?
Let them be hard, but never unjust

12
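soporaeternus posted on 2011-6-17 14:21:59
I looked at the Computational Resources section for FREQ in the SAS help.
If you insist on using freq,
you can encode fields a and b, that is, map each value to an auto-increment sequence number.
That can reduce memory consumption when freq runs...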
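
The encoding idea above can be sketched as follows. This is only an illustration under assumptions: the table name t1 is borrowed from the simulation elsewhere in this thread, and a_map, b_map, a_id, b_id and t1_coded are hypothetical names. Whether the recoding saves enough memory for PROC FREQ on a given data set is not established here.

```sas
/* Hypothetical names: a_map/b_map are lookup tables, a_id/b_id are the new codes. */

/* Distinct values of a, numbered 1..n in sorted order */
proc sort data=t1(keep=a) out=a_map nodupkey;
  by a;
run;

data a_map;
  set a_map;
  a_id = _n_;   /* sequential code for each distinct value of a */
run;

/* Same for b */
proc sort data=t1(keep=b) out=b_map nodupkey;
  by b;
run;

data b_map;
  set b_map;
  b_id = _n_;   /* sequential code for each distinct value of b */
run;

/* Attach the codes to every row of the original table */
proc sql;
  create table t1_coded as
  select ma.a_id, mb.b_id
  from t1
       inner join a_map as ma on t1.a = ma.a
       inner join b_map as mb on t1.b = mb.b;
quit;
```

Recoding like this shortens the values PROC FREQ has to hold (for example when the originals are long character strings), but it does not reduce the number of distinct (a, b) combinations.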
Let them be hard, but never unjust

13
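bayes posted on 2011-6-17 22:50:18
12# soporaeternus This should be the method described in post 5, right? Merge a and b to form a new key.
I also sorted by that key first, so that it became an auto-increment sequence.
But freq still reports insufficient memory...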

14
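bobguy posted on 2011-6-18 11:44:22
bayes posted on 2011-6-17 11:42
6# bobguy
The crux of the problem may not be the 8 million observations, but rather that a and b have too many distinct values.
I read your program. These two statements:
a=ceil(ranuni(123)*1e3);
b=ceil(ranuni(123)*1e3);
give a and b only 1,000 distinct values each, whereas my actual data has more than 1 million each. That is why your simulation runs while my case is too large to process.
I wonder whether the simulation still runs after changing those two statements to:
a=ceil(ranuni(123)*1e6);
b=ceil(ranuni(123)*1e6);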
The same result can be achieved with a data step plus a sort. BTW, if a and b each have 1 million possible values, the combined key is very likely to be unique for almost every row, given that there are only 8 million observations in total.
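
A rough back-of-the-envelope check of that uniqueness claim, assuming (as an idealization that a real pseudo-random generator only approximates) that a and b are independent and uniform over 10^6 values each. Here N is the number of possible (a, b) cells and n is the number of observations:

```latex
\[
N = 10^{6} \times 10^{6} = 10^{12}, \qquad n = 8 \times 10^{6}
\]
\[
\mathbb{E}[\text{pairs of rows sharing a cell}] \approx \binom{n}{2}\frac{1}{N}
= \frac{n(n-1)}{2N} \approx \frac{(8 \times 10^{6})^{2}}{2 \times 10^{12}} = 32
\]
```

Under that assumption only a few dozen repeated combinations would be expected among 8 million rows, so the frequency table has nearly as many rows as the input. The exact number in any particular run depends on the generator.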

options FULLSTIMER;

data t1;
   do i=1 to 8e6;
      a=ceil(ranuni(123)*1e6);
      b=ceil(ranuni(123)*1e6);
      output;
   end;
   drop i;
run;
NOTE: The data set WORK.T1 has 8000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           1.43 seconds
      user cpu time       1.24 seconds
      system cpu time     0.18 seconds
      Memory              180k
      OS Memory           6520k
      Timestamp           6/17/2011 11:38:10 PM

proc sort data=t1 out=t2 nodupkey;
   by a b;
run;
NOTE: There were 8000000 observations read from the data set WORK.T1.
NOTE: 0 observations with duplicate key values were deleted.
NOTE: The data set WORK.T2 has 8000000 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           4.21 seconds
      user cpu time       6.83 seconds
      system cpu time     0.67 seconds
      Memory              66535k
      OS Memory           71996k
      Timestamp           6/17/2011 11:38:14 PM

data t3;
   set t2;
   by a b;
   if first.b then cnt=0;
   cnt+1;
   if last.b then output;
run;
NOTE: There were 8000000 observations read from the data set WORK.T2.
NOTE: The data set WORK.T3 has 8000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           1.90 seconds
      user cpu time       1.35 seconds
      system cpu time     0.53 seconds
      Memory              215k
      OS Memory           6520k
      Timestamp           6/17/2011 11:40:07 PM
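
One caveat about the sort step in this log: NODUPKEY deletes observations with duplicate BY values, so any repeated (a, b) combination would be removed before the counting data step, and cnt would then always be 1. In this run the log reports 0 deleted observations, so the result is unaffected, but for a general frequency count the option would have to go. A minimal sketch without it, reusing the same table names:

```sas
/* Sort without NODUPKEY so that repeated (a, b) rows survive for counting */
proc sort data=t1 out=t2;
  by a b;
run;

/* One output row per (a, b) combination, with its frequency in cnt */
data t3;
  set t2;
  by a b;
  if first.b then cnt = 0;   /* reset at the start of each (a, b) group */
  cnt + 1;                   /* sum statement: cnt is retained across rows */
  if last.b then output;     /* write the total once per group */
run;
```

Neither step needs to hold all distinct (a, b) levels in memory the way a PROC FREQ cross-tabulation does, which matches the small memory figures the data steps report in the log.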
