I vaguely recall from grad school that the following is a valid approach to weighted sampling without replacement:
1. Start with an initially empty "sampled set".
2. Draw a (single) weighted sample with replacement with whatever method you have.
3. Check whether you have already picked it.
4. If you did, ignore it and move to the next sample. If you didn't, add it to your sampled set.
5. Repeat steps 2-4 until the sampled set is of the desired size.
Will this give me a valid weighted sample?
(Context: for very large samples, R's sample(x,n,replace=FALSE,p) takes a really long time, much longer than implementing the strategy above.)
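The steps above can be sketched as follows. This is a minimal Python illustration (stdlib only, since the question mentions driving R from Python via RPy); the function name and the batched drawing are my own, not part of any library:

```python
import random

def weighted_sample_without_replacement(items, weights, n):
    """Rejection approach from the steps above: draw weighted samples
    WITH replacement (random.choices) and discard duplicates until n
    distinct items have been collected."""
    if n > len(items):
        raise ValueError("cannot draw more unique items than exist")
    chosen = set()
    while len(chosen) < n:
        # Draw a batch at a time; duplicates are silently absorbed by the set.
        for item in random.choices(items, weights=weights, k=n - len(chosen)):
            chosen.add(item)
            if len(chosen) == n:
                break
    return list(chosen)
```

Note that when n approaches the population size, or a few weights dominate, the rejection rate climbs and the loop can iterate many times; the approach shines when n is small relative to the population, as in the 500-out-of-10^7 benchmark below.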
UPDATE 2012-01-05
Thank you everyone for your comments and responses! Regarding code, here's an example that faithfully reproduces what I'm trying to do. (I'm actually calling R from Python with RPy, but the runtime characteristics seem to be unchanged.)
First, define the following function, which assumes a sufficiently long sequence of with-replacement samples as input:
    get_rejection_sample <- function(replace_sample, n) {
      top <- length(replace_sample)
      bottom <- 1
      mid <- (top + bottom) %/% 2
      u <- unique(replace_sample[1:mid])
      while (length(u) != n) {
        if (length(u) < n) {
          bottom <- mid
        } else {
          top <- mid
        }
        mid <- (top + bottom) %/% 2
        u <- unique(replace_sample[1:mid])
        if (mid == top || mid == bottom) {
          break
        }
      }
      u
    }

It simply does a binary search for the number of with-replacement samples needed to yield n unique samples. Now, in the interpreter:
    > x = 1:10**7
    > w = rexp(length(x))
    > system.time(y <- sample(x, 500, FALSE, w))
       user  system elapsed
     12.820   0.048  12.902
    > system.time(y <- get_rejection_sample(sample(x, 1000, TRUE, w), 500))
       user  system elapsed
      0.311   0.066   0.378
    Warning message:
    In sample(x, 1000, TRUE, w) :
      Walker's alias method used: results are different from R < 2.2.0

As you can see, even for just 500 elements the difference is dramatic. I'm actually trying to sample 10^6 indices from a list of about 1.2×10^7, which takes hours using replace=FALSE.
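For what it's worth, the rejection scheme should be distributionally equivalent to successive sampling: each accepted draw is a weighted draw conditioned on not having been picked yet, which is exactly what removing the chosen item and renormalising does. A quick Monte Carlo sanity check in Python (stdlib only; both function names are mine, and `sequential_draw` is a naive stand-in for what `sample(..., replace=FALSE)` computes):

```python
import random
from collections import Counter

def sequential_draw(items, weights, n):
    # Successive sampling: draw one item with prob proportional to its
    # weight, remove it, and repeat on the remaining pool.
    pool, w, out = list(items), list(weights), []
    for _ in range(n):
        i = random.choices(range(len(pool)), weights=w, k=1)[0]
        out.append(pool.pop(i))
        w.pop(i)
    return frozenset(out)

def rejection_draw(items, weights, n):
    # Rejection approach from the question: with-replacement draws,
    # duplicates discarded.
    chosen = set()
    while len(chosen) < n:
        chosen.add(random.choices(items, weights=weights, k=1)[0])
    return frozenset(chosen)

random.seed(0)
items, weights, n, trials = [1, 2, 3, 4], [1, 2, 3, 4], 2, 20000
seq = Counter(sequential_draw(items, weights, n) for _ in range(trials))
rej = Counter(rejection_draw(items, weights, n) for _ in range(trials))
for pair in sorted(seq, key=sorted):
    print(sorted(pair), seq[pair] / trials, rej[pair] / trials)
```

The per-pair frequencies from the two methods should agree up to Monte Carlo noise; this is only an empirical check on a toy population, not a proof.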