
How to generate a simulated dataset with SAS

Original poster: 小宝爱波1314

Post #11

小宝爱波1314 posted on 2014-3-19 21:27:18

yongyitian wrote on 2014-3-19 21:07:
1. the warning is ok.
2. delete the ALL dataset before running the macro,
3. search proc append ...

Do you mean deleting the base=all option from this statement? That is, changing
proc append base=all data=sample; run; %put loop=&simu_num;
%end;
to
proc append data=sample; run; %put loop=&simu_num;
%end;

Post #12

小宝爱波1314 posted on 2014-3-19 21:35:57

yongyitian wrote on 2014-3-19 21:07:
1. the warning is ok.
2. delete the ALL dataset before running the macro,
3. search proc append ...

Hello, after I added the FORCE option I can find the results in the final dataset. The program you gave me ran ten times in total, from 1001 to 1010. If I need to run it 1000 times, I only need to change %do simu_num = 1001 %to 1010; to %do simu_num = 1 %to 1000;, right?

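Post #13

yongyitian posted on 2014-3-19 21:43:00

小宝爱波1314 wrote on 2014-3-19 21:35:
Hello, after I added the FORCE option I can find the results in the final dataset. The program you gave me ran from 1001 to ...

The "the" in front of ALL means that ALL is a noun here, the name of a dataset: the ALL dataset = the dataset ALL.

Delete the dataset ALL before running this macro.
Either do i = 1 to 1000; or do i = 1001 to 2000; works.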
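
A minimal sketch of the cleanup step described above. PROC APPEND adds rows to whatever the base dataset already contains, so rows left over from an earlier run would be mixed into the new results unless ALL is removed first. The sketch assumes ALL lives in the WORK library, which the unqualified name suggests but the thread does not state.

```sas
/* Assumption: ALL is a WORK dataset. NOWARN keeps the step quiet
   when ALL does not exist yet (for example, on the first run). */
proc datasets library=work nolist nowarn;
  delete all;
quit;
```

Running this once before each call of the macro gives PROC APPEND an empty starting point.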


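Post #14

小宝爱波1314 posted on 2014-3-19 22:27:58

yongyitian wrote on 2014-3-19 21:43:
The "the" in front of ALL means that ALL is a noun here: the ALL dataset = the dataset ALL

Delete the dataset ALL before running ...

Thank you so much. I learn a lot every time I ask you something. I hope I can become as good as you one day.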

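Post #15

小宝爱波1314 posted on 2014-3-23 10:17:53

In reply to yongyitian's post of 2014-3-19 11:20:

Hello, suppose I want to randomly generate 0.05*N simulated values between the maximum and the median of age, and another 0.05*N simulated values between the median and the minimum, then combine the two sets and run the steps above. Can this be done in SAS?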

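Post #16

yongyitian posted on 2014-3-25 11:33:28

小宝爱波1314 wrote on 2014-3-23 10:17:
Hello, suppose I want to randomly generate 0.05*N simulated values between the maximum and the median of age, and another 0.05* ...

Yes, this can be done in SAS, although the results will probably not differ much from the earlier ones. You can use PROC MEANS to compute the median and then store it in a macro variable using an approach similar to the one in post #6. The remaining steps are the same as in post #6.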

https://s.pinggu.org/search.php?mod=forum&searchid=3995&orderby=lastpost&ascdesc=desc&searchsubmit=yes&
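
One possible way to do what this reply describes, as a hedged sketch rather than the code from post #6 (which is not shown here). The dataset name HAVE, the output names AGE_STATS and AGE_SIM, the macro variable names, and the choice of a uniform distribution within each interval are all assumptions made for illustration; N is taken to be the number of non-missing AGE values.

```sas
/* Assumption: the source data is WORK.HAVE with a numeric variable AGE */
proc means data=have noprint;
  var age;
  output out=age_stats n=n min=age_min median=age_med max=age_max;
run;

/* Copy the statistics into macro variables for the later steps */
data _null_;
  set age_stats;
  call symputx('n_sim',   ceil(0.05 * n));   /* 0.05*N values per interval */
  call symputx('age_min', age_min);
  call symputx('age_med', age_med);
  call symputx('age_max', age_max);
run;

/* Assumption: values are drawn uniformly within each interval */
data age_sim;
  call streaminit(12345);
  do i = 1 to &n_sim;
    age = &age_med + (&age_max - &age_med) * rand('uniform');  /* median to max */
    output;
    age = &age_min + (&age_med - &age_min) * rand('uniform');  /* min to median */
    output;
  end;
  drop i;
run;
```

AGE_SIM then holds both sets of simulated values in one dataset, ready for the steps used earlier in the thread.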


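Post #17

小宝爱波1314 posted on 2014-3-25 15:26:23

yongyitian wrote on 2014-3-25 11:33:
Yes, this can be done in SAS, although the results will probably not differ much from the earlier ones. You can use PROC MEANS to compute the median and then ...

Yes, you are right. I tried it and the difference was very small. Now I would like to use SAS to generate a set of 100000 simulated values (x) in which values with x>mean+3std (no upper limit) make up 5%, values with 0<x<mean-3std make up 5%, values with mean-3std<x<mean-2std make up 45%, and values with mean+2std<x<mean+3std make up 45%. Can simulated values be generated to meet these requirements?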

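Post #18

小宝爱波1314 posted on 2014-3-30 20:13:24

In reply to yongyitian's post of 2014-3-19 11:20:

Hello, you did not have time to answer my last question, so I wrote some code myself, but I still have a problem I would like to ask you about.
My program generates an outlier dataset with 100000 observations. I want to randomly draw a portion (rate) of this dataset and use it to randomly replace part of the original dataset, then compute the mean and std of the data under that replacement rate. My program is below.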

/* create outlier dataset */
%macro cond(cond1, cond2);
  when (c[&cond1] > 0 and &cond2) do;
    c[&cond1] +-1;   /* one fewer value still needed in this interval */
    sampSize +-1;    /* one fewer value still needed overall */
    output;
  end;
%mend cond;

data B1841039.outlier_weight;
  call streaminit(12345);
  sampSize = 100000;
  array p[4] _temporary_ (5 5 45 45);   /* target percentage per interval */
  array c[4] _temporary_;               /* remaining count per interval */
  do i = 1 to dim(p);
    c[i] = ceil(sampSize * p[i] / 100);
  end;
  c[4] = c[4] - (sum(of c[*]) - sampSize);
  mean = 3283.95;
  std = 563.1736630;
  do until (sampSize <= 0);
    x = rand('normal', mean, std);
    select;
      %cond(1, %str(x>mean+3*std))
      %cond(2, %str(x>0 and x<mean-3*std))
      %cond(3, %str(x>mean-3*std and x<mean-2*std))
      %cond(4, %str(x>mean+2*std and x<mean+3*std))
      otherwise;
    end;
  end;
  stop;
run;
/*create random number and "total"  */

proc sql;
create table temp as
select *,
      ranuni(123) as key,
      count(*) as total
from B1841039.outlier_weight
order by key;
quit;
/*sample portion of data from outlier dataset, store in dataset sample1*/
%macro sample1(rate1=0.003987);
data sample1 sample2;
set temp;
if _n_<=int(total*&rate1) then output sample1;
else output sample2;
drop key total;
run;
%mend;
%sample1;
/*sample portion of data from weight dataset, store in dataset sample_1*/
proc sql;
create table temp_ori as
select *,
      ranuni(123) as key,
      count(*) as total
from b1841039.birth_data
order by key;
quit;
%macro sample2(rate2=0.1);
data sample_1 sample_2;
set temp_ori;
if _n_<=int(total*&rate2) then output sample_1;
else output sample_2;
drop key total;
run;
%mend;
%sample2;
/*replace portion of birth_weight by x*/

data replace (rename=(x=birth_weight));
merge sample_1 sample1;
drop sampsize mean std i birth_weight;
run;

data rep_weight;
set replace sample_2;
run;
/*calculate mean and std of weight after replace*/

proc means data=rep_weight mean std;
var birth_weight;
run;
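However, for each rate I want to draw randomly from the outlier dataset 1000 times and randomly replace values in the source dataset 1000 times, ending up with data in the layout below. I cannot get this loop right. Could you give me some guidance?

Simulation degree   Simulation dataset order   mean   std
0.1                 1                          …      …
0.1                 2                          …      …
0.1                 3                          …      …
0.1                 …                          …      …
0.1                 1000                       …      …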

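Post #19

yongyitian posted on 2014-3-31 10:34:25

小宝爱波1314 wrote on 2014-3-30 20:13:
Hello, you did not have time to answer my last question, so I wrote some code myself, but I still have a problem I would like to ask you about.
My ...

I tried the first part. Using Jingju11's code with the c[4]=... line removed, it generates random numbers in the specified intervals in the given proportions. For example, when 100 random numbers are generated in total, 5 (5%) fall in the first interval, 5 (5%) in the second, 45 (45%) in the third, and 45 (45%) in the fourth. The result can be verified with PROC SQL.

I did not try the rest of the code because some of the data is missing. I suggest first testing the program on a smaller dataset with known results, so that you can locate the problem,
or using simple data to illustrate the result you want.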

%macro cond(cond1, cond2);
  when (c[&cond1] > 0 and &cond2) do;
    c[&cond1] +-1;
    sampSize +-1;
    output;
  end;
%mend cond;

data have;
  call streaminit(12345);
  sampSize = 100;
  array p[4] _temporary_ (5 5 45 45);
  array c[4] _temporary_;
  do i = 1 to dim(p);
    c[i] = ceil(sampSize * p[i] / 100);
  end;
  * c[4] = c[4] - (sum(of c[*]) - sampSize);
  mean = 100; std = 15;
  do until (sampSize <= 0);
    x = rand('normal', mean, std);
    select;
      %cond(1, %str(x>mean+3*std))
      %cond(2, %str(x>0 and x<mean-3*std))
      %cond(3, %str(x>mean-3*std and x<mean-2*std))
      %cond(4, %str(x>mean+2*std and x<mean+3*std))
      otherwise;
    end;
  end;
  stop;
run;

proc sql;
  select count(*) as n1 from have where x > 145;
  select count(*) as n2 from have where 0 < x < 55;
  select count(*) as n3 from have where 55 < x < 70;
  select count(*) as n4 from have where 130 < x < 145;
quit;
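
As a possible alternative to the four separate queries, the same check can be sketched as a single pass over HAVE. It relies on SAS evaluating a comparison to 1 (true) or 0 (false), so summing a comparison counts the rows that satisfy it. The cut points 55, 70, 130 and 145 follow from mean = 100 and std = 15 in the code above; the extra column total is an addition for illustration.

```sas
/* Count all four intervals in one query; each comparison yields 1 or 0 */
proc sql;
  select sum(x > 145)       as n1,
         sum(0 < x < 55)    as n2,
         sum(55 < x < 70)   as n3,
         sum(130 < x < 145) as n4,
         count(*)           as total
  from have;
quit;
```

If the generator behaves as described in this post, n1 through n4 should add up to total.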


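Post #20

小宝爱波1314 posted on 2014-3-31 13:30:03

yongyitian wrote on 2014-3-31 10:34:
I tried the first part. Using Jingju11's code with the c[4]=... line removed, it generates random numbers in the specified intervals in the given proportions ...

The program I want to write randomly draws a portion (rate1) of the large outlier dataset generated above (N1 observations) and uses it to replace a portion (rate2) of the weight values in birth_data. The weight variable in birth_data has N2 observations, and the two rates are related by rate1*N1 = rate2*N2. I then compute the mean and standard deviation of weight after the replacement. I want rate2 to take the values 0.05, 0.10, 0.15, ..., 0.95, with each case repeated 1000 times. For example, with rate2 = 0.05, repeating 1000 times gives 1000 datasets with replaced weight values. Once I have these 1000 simulated datasets, I compute the mean and variance of age in each of them and combine these means and variances into a dataset with the layout below. I can only write the replacement for one case at a time and do not know how to write the loop. Could you take a look? I have put the data below.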
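
One way the requested loop could be sketched is with two nested macro %do loops, appending one row of statistics per replicate. This is an illustration under assumptions, not a tested answer from the thread: it assumes B1841039.BIRTH_DATA has a variable BIRTH_WEIGHT and B1841039.OUTLIER_WEIGHT has a variable X, as in the code of post #18, and it summarizes BIRTH_WEIGHT as that code does. The macro name simulate, its parameters, and the dataset names RESULTS, ORI_KEY, OUT_KEY, REP_WEIGHT and STAT are hypothetical.

```sas
%macro simulate(n_rep=1000, seed=20140331);
  /* Start from an empty results table so PROC APPEND does not add to old runs */
  proc datasets library=work nolist nowarn;
    delete results;
  quit;

  /* N2 = number of observations in the original data */
  proc sql noprint;
    select count(*) into :n2 from b1841039.birth_data;
  quit;

  %do r = 5 %to 95 %by 5;                        /* rate2 = 0.05, 0.10, ..., 0.95 */
    %let n_swap = %sysevalf(&n2 * &r / 100, integer);   /* rate2*N2 = rate1*N1 */
    %do k = 1 %to &n_rep;
      %let s = %eval(&seed + &r * &n_rep + &k);  /* a different seed per replicate */

      /* Attach a random sort key to each dataset, then shuffle by sorting */
      data ori_key;
        call streaminit(&s);
        set b1841039.birth_data(keep=birth_weight);
        key = rand('uniform');
      run;
      data out_key;
        call streaminit(%eval(&s + 1000000));
        set b1841039.outlier_weight(keep=x);
        key = rand('uniform');
      run;
      proc sort data=ori_key; by key; run;
      proc sort data=out_key; by key; run;

      /* One-to-one merge: the first n_swap shuffled rows take an outlier value */
      data rep_weight;
        merge ori_key out_key(keep=x obs=&n_swap);
        if _n_ <= &n_swap then birth_weight = x;
        keep birth_weight;
      run;

      proc means data=rep_weight noprint;
        var birth_weight;
        output out=stat(drop=_type_ _freq_) mean=mean std=std;
      run;

      /* Label the row with the rate and the replicate number */
      data stat;
        length degree order 8;
        set stat;
        degree = &r / 100;
        order  = &k;
      run;

      proc append base=results data=stat;
      run;
    %end;
  %end;
%mend simulate;

%simulate(n_rep=10)   /* try a small run first, then n_rep=1000 */
```

RESULTS then has one row per rate and replicate with the columns degree, order, mean and std, matching the layout requested in post #18. Sorting the full outlier dataset in every replicate is slow, so following the earlier advice and testing with a small n_rep and small data first seems sensible.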