[Original post] The table is too large. How should it be handled?

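Reply #11

soporaeternus posted on 2011-6-17 14:12:50

If PROC SORT can handle the data, sort it first; after that a DATA step can do the counting just as well.

Could this be a limitation of PROC FREQ itself?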

Let them be hard, but never unjust

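Reply #12

soporaeternus posted on 2011-6-17 14:21:59

I looked at the Computational Resources section for PROC FREQ in the SAS help.
If you insist on using PROC FREQ,
you can recode the a and b fields, that is, map each distinct value to an auto-incrementing sequence number.
That should reduce the memory PROC FREQ consumes...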
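
To make the recoding idea concrete, here is a minimal sketch for variable a. It assumes the input data set is named t1 and contains a and b; the names a_levels, a_map, t1_sorted, t1_coded and a_id are hypothetical and chosen only for illustration.

```sas
/* Assumed input: data set t1 with key variables a and b. */

/* 1. One row per distinct value of a, in sorted order. */
proc sort data=t1(keep=a) out=a_levels nodupkey;
  by a;
run;

/* 2. Number the levels 1, 2, 3, ... */
data a_map;
  set a_levels;
  a_id = _n_;
run;

/* 3. Attach the sequence number to every observation. */
proc sort data=t1 out=t1_sorted;
  by a;
run;

data t1_coded;
  merge t1_sorted a_map;
  by a;
run;

/* Repeat steps 1-3 with b to build b_id, then tabulate a_id and b_id. */
```

Recoding can shrink the storage needed per level, but it does not change the number of levels, so it may not be enough when a and b each have a very large number of distinct values.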


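Reply #13

bayes posted on 2011-6-17 22:50:18

12# soporaeternus: That is probably the method suggested in reply #5, isn't it? Merge a and b to form a new key.
I also sorted by that key first and turned it into an auto-incrementing sequence.
But PROC FREQ still reports that there is not enough memory...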

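Reply #14

bobguy posted on 2011-6-18 11:44:22

bayes posted on 2011-6-17 11:42
6# bobguy
The crux of the problem is probably not the 8 million observations but the sheer number of distinct values of a and b.
I looked at your program. These two statements:
a=ceil(ranuni(123)*1e3);
b=ceil(ranuni(123)*1e3);
give a and b only 1,000 distinct values each, whereas my real data has more than 1 million of each. That is why your simulation runs while my table is too large to process.
I wonder what happens if you change those two statements to the following:
a=ceil(ranuni(123)*1e6);
b=ceil(ranuni(123)*1e6);
and then check whether the simulation still runs.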

The same result can be achieved with a sort followed by a DATA step. By the way, if a and b each have 1 million possible values, the combined (a, b) key is very likely to be unique, given that there are only 8 million observations in total.
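
A back-of-the-envelope estimate supports that uniqueness claim. It assumes a and b are independent and uniform over $10^6$ values each, which is an idealization; the simulated data need not match it exactly, since both RANUNI calls draw from one pseudo-random stream. With $N$ possible (a, b) combinations and $n$ observations:

$$
N = 10^{6} \times 10^{6} = 10^{12}, \qquad n = 8 \times 10^{6}, \qquad
E[\text{pairs of observations sharing a key}] \approx \frac{n(n-1)}{2N} \approx \frac{(8 \times 10^{6})^{2}}{2 \times 10^{12}} = 32 .
$$

Under that assumption only a few dozen of the 8 million observations would be expected to share a key, so nearly every observation is its own (a, b) level.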

options FULLSTIMER;

data t1;
  do i=1 to 8e6;
    a=ceil(ranuni(123)*1e6);
    b=ceil(ranuni(123)*1e6);
    output;
  end;
  drop i;
run;
NOTE: The data set WORK.T1 has 8000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           1.43 seconds
      user cpu time       1.24 seconds
      system cpu time     0.18 seconds
      Memory              180k
      OS Memory           6520k
      Timestamp           6/17/2011 11:38:10 PM

proc sort data=t1 out=t2 nodupkey;
  by a b;
run;
NOTE: There were 8000000 observations read from the data set WORK.T1.
NOTE: 0 observations with duplicate key values were deleted.
NOTE: The data set WORK.T2 has 8000000 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           4.21 seconds
      user cpu time       6.83 seconds
      system cpu time     0.67 seconds
      Memory              66535k
      OS Memory           71996k
      Timestamp           6/17/2011 11:38:14 PM

data t3;
  set t2;
  by a b;
  if first.b then cnt=0;
  cnt+1;
  if last.b then output;
run;
NOTE: There were 8000000 observations read from the data set WORK.T2.
NOTE: The data set WORK.T3 has 8000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           1.90 seconds
      user cpu time       1.35 seconds
      system cpu time     0.53 seconds
      Memory              215k
      OS Memory           6520k
      Timestamp           6/17/2011 11:40:07 PM
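
One caveat about the run above: NODUPKEY drops repeated (a, b) pairs during the sort, so cnt in t3 can only ever be 1. That is harmless here because the log reports 0 duplicates, but on data with repeated pairs the counts would be lost. A minimal sketch of the same sort + DATA step approach that keeps duplicates so they can be counted, assuming the input is t1 as above (the output names t2_all and t3_cnt are hypothetical):

```sas
/* Sort WITHOUT nodupkey so repeated (a, b) pairs are kept. */
proc sort data=t1 out=t2_all;
  by a b;
run;

/* One output row per (a, b) pair, with its frequency in cnt. */
data t3_cnt;
  set t2_all;
  by a b;
  if first.b then cnt = 0;   /* start a new (a, b) group      */
  cnt + 1;                   /* sum statement: cnt is retained */
  if last.b then output;     /* write the group total          */
run;
```

Because the DATA step holds only one observation at a time, its memory use should stay small regardless of how many distinct (a, b) pairs exist; the sort is the step that needs the resources.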