Let's discuss: how can SAS process data as it is read in, so that it can handle arbitrarily large data?

Post #121

kaizhang posted on 2010-10-13 17:51:34

I once processed 80 GB of data. I used SAS to split the data into small CSV files and then processed those files with SQL.
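To illustrate the "small pieces plus SQL" idea in general terms, here is a minimal Python sketch. It is not kaizhang's actual workflow (which used SAS for the split): it streams CSV rows into an in-memory SQLite table a few rows at a time and then aggregates with SQL. The table name `sales`, the columns `region` and `amount`, and the chunk size are all hypothetical.

```python
import csv
import io
import sqlite3


def load_in_chunks(lines, conn, chunk_rows=2):
    """Stream CSV rows into SQLite one small chunk at a time."""
    reader = csv.reader(lines)
    next(reader)  # skip the header; assumed columns: region, amount
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    chunk = []
    for row in reader:
        chunk.append((row[0], float(row[1])))
        if len(chunk) == chunk_rows:
            conn.executemany("INSERT INTO sales VALUES (?, ?)", chunk)
            chunk = []
    if chunk:  # flush the final, possibly shorter chunk
        conn.executemany("INSERT INTO sales VALUES (?, ?)", chunk)
    conn.commit()


def totals_by_region(conn):
    """Aggregate with SQL once the rows are loaded."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    return conn.execute(query).fetchall()


# Small in-memory stand-in for a large CSV file.
sample = io.StringIO(
    "region,amount\nnorth,1.5\nsouth,2.0\nnorth,3.5\nsouth,1.0\nnorth,2.0\n"
)
conn = sqlite3.connect(":memory:")
load_in_chunks(sample, conn)
print(totals_by_region(conn))
```

Only one chunk of rows is held in Python memory at any time, so the size of the input file is limited by the database's storage rather than by RAM.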

Post #122

一诺9257 posted on 2010-10-27 19:53:13

I have never thought about this! SAS copes fine with the data I normally work with.

Post #123

yangchiran posted on 2010-10-28 10:31:18

You can split the data file into several parts and process them in parallel, one SAS session per part:
/* First parallel SAS session: records 1 to 10000 */
LIBNAME test oracle user=your_username password=your_password path=your_database;
%let f1 = d:\test\20101028.dat;
filename filec "&f1";
data testda;
      infile filec recfm=f lrecl=2000 firstobs=1 obs=10000;
      input @1  id   $10.
            @11 name $30.
            @41 add  $40.
            ;
run;

/* Second parallel SAS session: records 10001 to 20000 */
LIBNAME test oracle user=your_username password=your_password path=your_database;
%let f1 = d:\test\20101028.dat;
filename filec "&f1";
data testda;
      infile filec recfm=f lrecl=2000 firstobs=10001 obs=20000;
      input @1  id   $10.
            @11 name $30.
            @41 add  $40.
            ;
run;
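The same idea, reading only a given range of fixed-length records so that several workers can each take a different range, can be sketched in Python as follows. This is only an illustration under assumptions: it treats each record as exactly 80 bytes (the three fields above, with no padding and no line terminators), whereas the SAS code declares `lrecl=2000`; the helper names `read_slice` and `write_demo_file` are hypothetical.

```python
import os
import tempfile

RECORD_LEN = 80  # assumed layout: id (10) + name (30) + add (40) bytes


def read_slice(path, first_obs, last_obs, record_len=RECORD_LEN):
    """Return records first_obs..last_obs (1-based, inclusive), like FIRSTOBS=/OBS=."""
    rows = []
    with open(path, "rb") as fh:
        fh.seek((first_obs - 1) * record_len)  # jump straight to the first record
        for _ in range(last_obs - first_obs + 1):
            raw = fh.read(record_len)
            if len(raw) < record_len:
                break  # ran past the end of the file
            text = raw.decode("ascii")
            rows.append({
                "id": text[0:10].strip(),
                "name": text[10:40].strip(),
                "add": text[40:80].strip(),
            })
    return rows


def write_demo_file(path, n):
    """Write n synthetic fixed-width records for the demonstration."""
    with open(path, "wb") as fh:
        for i in range(1, n + 1):
            rec = f"{i:<10}{'name' + str(i):<30}{'addr' + str(i):<40}"
            fh.write(rec.encode("ascii"))


demo_path = os.path.join(tempfile.mkdtemp(), "demo.dat")
write_demo_file(demo_path, 5)
print(read_slice(demo_path, 3, 4))
```

Because the records have a fixed length, each worker can seek directly to its own starting offset without scanning the records that belong to other workers.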

Post #124

onlyshenlinbo posted on 2010-11-9 21:59:08

SAS is important. I am still a beginner and need guidance from the more experienced members here.

Post #125

laga posted on 2010-11-15 17:57:10

Here to learn.

Post #126

winddance posted on 2010-12-1 10:10:46

Just use SAS SQL statements and you are done.

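Post #127

numman posted on 2010-12-5 14:05:42

Quoting marloneusa (2009-6-30 02:46), who quoted 爱萌 (2009-6-30 00:06), who quoted abelus (2009-6-27 15:14):

abelus: Tell us the requirements of the data processing task, such as exactly what results you need. The more detail, the better. I have processed data with tens of millions of records, and SAS is fully up to the job.

爱萌: I would like to ask how you handled it. Could you send a similar program to my address, wjw84221@yahoo.com.cn, for study purposes? Thank you. Please also write up your experience while you are at it. Many thanks.

marloneusa: I would like to learn too. We often deal with data larger than a gigabyte, and it takes a long time to run even a few simple statements. What should one do with millions of variables and tens of millions of records? Thank you. My address is marlone.zj@gmail.com. Thanks.

I agree with abelus.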

Post #128

sophiafinn posted on 2010-12-23 15:18:03

I have a large dataset and working out how to import it is giving me a headache.

Post #129

oloolo posted on 2010-12-24 13:42:49

Some simple algorithms can be run this way. Simple OLS regression, which comes down to solving the normal equations, is a typical case. Suppose you have an extremely large table, say 10^15 records and 500 variables. You can build the cross-product matrix of X and Y for the normal equations while reading the data sequentially, because the normal equations involve nothing but sums and sums of cross products. Once all records have passed through, you hold the sufficient statistics, and all that remains is to apply a sweep operator to the 502-by-502 normal-equation matrix.
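A minimal Python sketch of this one-pass accumulation, under stated assumptions: it keeps X'X and X'y as two separate accumulators rather than a single augmented matrix, and it solves the final small system with plain Gaussian elimination instead of the sweep operator mentioned above. The function names and the toy data generator are hypothetical.

```python
def accumulate(rows, n_features):
    """One pass over (x, y) rows: build X'X and X'y with an intercept column."""
    p = n_features + 1
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for x, y in rows:
        z = [1.0] + list(x)  # prepend the intercept term
        for i in range(p):
            xty[i] += z[i] * y
            for j in range(p):
                xtx[i][j] += z[i] * z[j]
    return xtx, xty


def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta


def stream():
    """Stand-in for a sequential read; y = 1 + 2*x1 - 3*x2 with no noise."""
    for x1 in range(5):
        for x2 in range(4):
            yield (x1, x2), 1.0 + 2.0 * x1 - 3.0 * x2


xtx, xty = accumulate(stream(), n_features=2)
print([round(b, 6) for b in solve(xtx, xty)])
```

Memory use depends only on the number of variables, not on the number of records, which is why the approach works for tables that never fit in memory.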

Of course, with data this large you would have to distribute it over thousands of computers and compute the sums in parallel.
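The reason this distributes well is that the sums from separate partitions simply add up. The following Python sketch illustrates that for a one-variable regression; the partitions are processed one after another here as stand-ins for separate machines, and all function names are hypothetical.

```python
def partial_sums(pairs):
    """Sufficient statistics for y = a + b*x computed over one partition."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def merge(a, b):
    """Combine two partitions' statistics by elementwise addition."""
    return tuple(u + v for u, v in zip(a, b))


def fit(stats):
    """Recover intercept and slope from the merged statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


data = [(float(x), 4.0 + 0.5 * x) for x in range(10)]
parts = [data[:3], data[3:7], data[7:]]  # stand-ins for separate machines

combined = partial_sums([])  # all-zero starting point
for part in parts:
    combined = merge(combined, partial_sums(part))

print(fit(combined))
```

Each machine only has to send back five numbers in this toy case (or one small cross-product matrix in the general case), so the communication cost is independent of how many records each machine reads.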

So-called stream algorithms, originally designed for extremely long tables, also look promising for your problem. With such an algorithm you may not even need to read all of the data to obtain a strongly consistent result. However, the algorithm has to be custom-designed for each case to exploit the particular stochastic characteristics of the data, and it also depends on what you want to do. There is no universal solution.

1# 爱萌