Original poster: ReneeBK

[Q&A] Random sampling & matrix of histograms problem

11
Lisrelchen posted on 2014-5-7 00:21:26
For some things, SPSS is great. But in this case, do yourself a favor and install the R plugin.

BEGIN PROGRAM R.

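# Draw 100 samples of size 100, with replacement, from x.
numsamples = 100
samplesize = 100
x = rnorm(1000)
# One column per sample.
NewDataSet = matrix(nrow=samplesize,ncol=numsamples)
for (i in 1:numsamples)
   {  sampled.x = sample(x, size=samplesize, replace = TRUE)
      NewDataSet[,i] = sampled.x
   }

# Scatterplot of two of the samples against each other.
plot(NewDataSet[,23], NewDataSet[,57])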

END PROGRAM.

12
Lisrelchen posted on 2014-5-7 00:22:01
Here is an incomplete thought which has a tidy relationship to my other code
but is a little bit oblique.
I will leave it to those who actually need to do this to contemplate its
essence and adapt it to their pain.
Hint: multiply X by a factor (of your choice) and select a critical cumulative
cutoff (CW) for inclusion.
The weighted X will be the bootstrap sampling frequencies for each case for
the number of samples.
* This might actually be one of my rare brain farts, but I think it is on
the right track *.
* No time to verify if it's a gem or a train wreck waiting to happen *.
--
MATRIX.
SAVE UNIFORM(1000,100)/ OUTFILE * /VAR X001 TO X100.
END MATRIX.
VARSTOCASES  /ID=id  /MAKE X FROM X001 TO X100 /INDEX=Index1(X).
SORT CASES BY Index1 (A) X (D).
COMPUTE X=X * 'your call here....Must be > 2 since E (uniform)=.5'
SPLIT FILE BY Index1.
CREATE CW=CSUM(X).
SELECT IF CW LE 'your call here....' (probably 1000 for this example).

David Marso

13
Lisrelchen posted on 2014-5-7 00:22:38
This is the last set of syntax with four tweaks.
1) I created a new variable CasePicked.
2) I added CasePicked to the list of variables written out.
3) I split the file by sample_id.
4) There is a frequencies on CasePicked for each sample.


new file.
set seed 20130407.
input program.
    vector PopX(1000,f6.3).
    loop #i = 1 to 1000.
       compute PopX(#i) = rv.normal(0,1).
    end loop.
    end case.
    end file.
end input program.
execute.
dataset name madeup.
dataset activate madeup.
* from pop of 1000 draw 100 samples of size 50 with replacement.
vector PopX = PopX1 to PopX1000.
display vector.
numeric SampledX (f6.3).
loop sample_id = 1 to 100.
    loop draw = 1 to 50.

       compute CasePicked = rnd(rv.uniform(.5,1000.5)).
       compute SampledX = PopX(CasePicked).
       xsave outfile = 'c:\project\long1.sav' /keep =sample_id draw CasePicked SampledX.
    end loop.
end loop.
execute.
get file= 'c:\project\long1.sav'.
dataset name longy.
descriptives variables = SampledX.
split file by sample_id.
frequencies vars = casepicked.

It is easier to see that some cases are picked more than once if you change the frequencies command in the last syntax I posted.

frequencies vars = casepicked /format=dfreq.

That pops the cases that are picked more than once for a sample to the top of the frequency table.

Art Kendall
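
For readers who want to check the idea outside SPSS, here is a minimal Python sketch of the same draw-by-case-number approach. It is an illustration only, not Art's code: the helper name draw_case_ids and the dictionary layout are assumptions made for this example, and only the standard library is used. Each draw rounds a uniform value on (0.5, N + 0.5) to a case number, as rnd(rv.uniform(.5,1000.5)) does, and the final tabulation sorts by descending count in the spirit of /format=dfreq.

```python
import math
import random
from collections import Counter

def draw_case_ids(pop_size, sample_size, rng):
    """Pick case numbers 1..pop_size with replacement (hypothetical helper)."""
    ids = []
    for _ in range(sample_size):
        u = rng.uniform(0.5, pop_size + 0.5)
        case = math.floor(u + 0.5)             # round to the nearest case number
        ids.append(min(max(case, 1), pop_size))  # guard the interval endpoints
    return ids

rng = random.Random(20130407)
population = [rng.gauss(0, 1) for _ in range(1000)]

# 100 samples of size 50, keyed by sample id, like the split file by sample_id.
samples = {}
for sample_id in range(1, 101):
    samples[sample_id] = draw_case_ids(len(population), 50, rng)

# Frequencies for the first sample, most frequently picked cases first.
freq = Counter(samples[1]).most_common()
repeated = [(case, n) for case, n in freq if n > 1]
sampled_x = [population[case - 1] for case in samples[1]]
```

Any case listed in repeated was drawn more than once in that sample, which is the sign that the sampling really is with replacement.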

14
Lisrelchen posted on 2014-5-7 00:24:41
Yes Art, I see!

I walked through the code and this time it certainly is sampling with
replacement. I tried to amend it to work with data in long format (instead
of having the data being sampled in wide column format), but I was
unsuccessful.

In general though I don't see why I would prefer this to the approach I
posted at the onset of this series of emails (feel free to enlighten me). To make
your approach work you would need to flip the original data, which is an
expensive procedure. You also need to externally write a file with XSAVE.

While you are right that these things aren't a big deal for small datasets,
this is more code, making it intrinsically more complicated. So again, why
exactly would your approach be preferable?

David,

I liked your prior MATRIX bootstrap code better than the new snippet (and
the code I provided at the beginning of the post, which is almost an exact
duplicate of what you wrote in 1996, holy poopers
<https://groups.google.com/group/sci.stat.consult/msg/710ea4ab83ddf24a?dmode=source&pli=1>
!).

Mainly I'm concerned about the VARSTOCASES when either the number of
original cases is larger or the number of samples needed is larger. I
wouldn't want to stack the dataset and then sample if the OP's
request was with a population of 40,000 cases and he wanted 1,000 samples
(i.e. a stacked dataset of 40 million). The problem grows with the size of
the original population even if the number or size of samples needed does
not. It does plug away though like a charm even with 40,000 cases and 1,000
samples!

Of course, whatever procedures individuals utilize will be dependent on the
nature of the task and size of the data. I believe your MATRIX procedure
could be modified to work in a lot of situations, either by calculating the
stats right within a MATRIX loop, or by piping out to a new dataset,
calculating the stats, and iterating for the number of repetitions one
wants.
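
To make the "calculate the stats right within the loop" option concrete, here is a small Python sketch (not MATRIX code, and not anyone's posted solution). The function name bootstrap_means and the choice of the mean as the statistic are assumptions for illustration, and the sizes are scaled down from the 40,000 by 1,000 case so it runs quickly. The point is that only one resample is held in memory at a time and only one number per replication is kept, so nothing the size of the stacked dataset is ever built.

```python
import random
import statistics

def bootstrap_means(population, n_samples, sample_size, seed=None):
    """Return one mean per bootstrap replication (hypothetical helper)."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        # Draw one sample with replacement; it is discarded after this pass.
        sample = rng.choices(population, k=sample_size)
        means.append(statistics.fmean(sample))
    return means

# Scaled-down stand-in for the large population discussed above.
pop_rng = random.Random(1)
population = [pop_rng.gauss(0, 1) for _ in range(4000)]

means = bootstrap_means(population, n_samples=200,
                        sample_size=len(population), seed=2)
boot_se = statistics.stdev(means)  # bootstrap standard error of the mean
```

Memory use here grows with the size of one sample plus the number of replications, rather than with their product.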

I'm thinking here of problems that are too big to practically stack the data
and use split file. Otherwise, I'm personally pretty cool with the solution
you posted over 16 years ago!


Andy

15
Lisrelchen posted on 2014-5-7 00:25:24
But:  How often does one really want to bootstrap with huge samples?
I think of it mainly as a small sample technique.

VARSTOCASES is only one way to do the followup on the recent post.

MATRIX.
SAVE UNIFORM(100000,1).
END MATRIX.

COMPUTE samplenumber=TRUNC(($CASENUM-1)/1000)+1.


David Marso
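
The sample-numbering step above labels each block of 1,000 consecutive rows with one sample id, which requires the quotient to be truncated to an integer. A quick Python sketch of that arithmetic follows; the function name sample_number is an assumption for this example, and floor division stands in for the truncation.

```python
from collections import Counter

def sample_number(case_num, sample_size):
    """Map a 1-based row number to a 1-based sample id (hypothetical helper)."""
    return (case_num - 1) // sample_size + 1

# 100,000 rows in blocks of 1,000, as in the SAVE UNIFORM(100000,1) example.
sample_numbers = [sample_number(c, 1000) for c in range(1, 100001)]
block_sizes = Counter(sample_numbers)
```

Rows 1 to 1000 map to sample 1, rows 1001 to 2000 to sample 2, and so on, with no stacking step involved.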

16
Lisrelchen posted on 2014-5-7 00:26:33
Yes the code I initially produced was 9 samples of size 100.

It is difficult to objectively evaluate your own code. I believe all of the
examples given in the thread (and the new split-off thread
<http://spssx-discussion.1045642.n5.nabble.com/Sampling-WITH-replacement-used-in-bootstrapping-complex-sampling-etc-Demo-syntax-td5718495.html>
) are not intuitive to the uninitiated. IMO David's most recent MATRIX
examples are pretty simple (MATRIX isn't inherently more difficult to
understand). Both mine (or I feel I should call it David 16 years ago), and
yours rely on what I would consider idiosyncratic aspects of SPSS code;
input program for mine to build an empty dataset, and XSAVE for yours.
MATRIX code requires many loops and indices, but I don't believe it is any
more difficult to follow along.

The concept isn't simple, and I'm not sure it can be made that simple. The
fact that the first several rounds of your syntax did not produce random
sampling with replacement I believe is evidence of that. All of the examples
are pretty concise, and so it is IMO a bit tit-for-tat to argue that one is
obviously superior in terms of readability. I've already stated why I would
prefer the approach I initially wrote over the one you produced, and already
admitted the grievances were minor in many situations.


Andy
