Original poster: Lisrelchen

How do I sample without replacement using a sampling-with-replacement function?


I vaguely recall from grad school that the following is a valid approach to weighted sampling without replacement:

  1. Start with an initially empty "sampled set".
  2. Draw a (single) weighted sample with replacement with whatever method you have.
  3. Check whether you have already picked it.
  4. If you did, ignore it and move to the next sample. If you didn't, add it to your sampled set.
  5. Repeat 2-4 until the sampled set is of the desired size.
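
The steps above can be sketched directly in R. This is only a minimal illustration, not tuned for speed; the function name rejection_sample_one_by_one is made up for this post, and it assumes w holds non-negative weights with at least n of them positive.

```r
# Hypothetical helper: weighted sampling without replacement by rejecting repeats.
rejection_sample_one_by_one <- function(x, n, w) {
  stopifnot(n <= sum(w > 0))        # otherwise the loop could never finish
  picked <- integer(0)              # step 1: indices chosen so far
  while (length(picked) < n) {      # step 5: repeat until the set is large enough
    # step 2: one weighted draw with replacement (over indices of x)
    i <- sample.int(length(x), 1, replace = TRUE, prob = w)
    # steps 3-4: keep the draw only if it has not been picked before
    if (!(i %in% picked)) {
      picked <- c(picked, i)
    }
  }
  x[picked]
}

set.seed(1)
s <- rejection_sample_one_by_one(1:100, 10, rexp(100))
```

Drawing indices with sample.int rather than values with sample avoids the special case where x is a single number.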

Will this give me a valid weighted sample?

(Context: for very large samples, R's sample(x,n,replace=FALSE,p) takes a really long time, much longer than implementing the strategy above.)

UPDATE 2012-01-05

Thank you everyone for your comments and responses! Regarding code, here's an example that faithfully reproduces what I'm trying to do (I'm actually calling R from Python with RPy, but the runtime characteristics seem to be unchanged).

First, define the following function, which assumes a sufficiently long sequence of with-replacement samples as input:

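    get_rejection_sample <- function(replace_sample, n) {
        # Binary search for a prefix of replace_sample
        # that contains exactly n unique values.
        top = length(replace_sample)
        bottom = 1
        mid = (top + bottom) %/% 2    # integer midpoint, usable as an index
        u = unique(replace_sample[1:mid])
        while (length(u) != n) {
            if (length(u) < n) {
                bottom = mid
            } else {
                top = mid
            }
            mid = (top + bottom) %/% 2
            u = unique(replace_sample[1:mid])
            if (mid == top || mid == bottom) {
                break    # the interval cannot shrink any further
            }
        }
        u
    }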

It simply does a binary search for the number of with-replacement draws needed to get enough unique samples. Now, in the interpreter:
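
For comparison, here is a shorter sketch of the same idea that skips the binary search. It relies on unique() keeping values in order of first appearance, so the first n unique values are exactly the ones found in the prefix that holds n distinct draws. The name first_n_unique is hypothetical, and unlike the function above it stops with an error when the input has too few distinct values.

```r
# Assumed alternative: take the first n distinct values, in draw order.
first_n_unique <- function(replace_sample, n) {
  u <- unique(replace_sample)   # drops repeats, keeps order of first appearance
  if (length(u) < n) {
    stop("not enough distinct values; draw a longer with-replacement sample")
  }
  u[seq_len(n)]
}

first_n_unique(c(3, 1, 3, 2, 1, 5), 3)   # returns 3 1 2
```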

    > x = 1:10**7
    > w = rexp(length(x))
    > system.time(y <- sample(x, 500, FALSE, w))
       user  system elapsed
     12.820   0.048  12.902
    > system.time(y <- get_rejection_sample(sample(x, 1000, TRUE, w), 500))
       user  system elapsed
      0.311   0.066   0.378
    Warning message:
    In sample(x, 1000, TRUE, w) :
      Walker's alias method used: results are different from R < 2.2.0

As you can see, even for just 500 elements the difference is dramatic. I'm actually trying to sample 10^6 indices from a list of about 1.2×10^7, which takes hours using replace=FALSE.
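
For larger targets, a single with-replacement draw of fixed length may not contain enough distinct values. One way to handle that, sketched here under assumptions, is to keep drawing batches until enough distinct indices have been collected. The function name batched_rejection_sample and the default batch size of 2 * n are illustrative choices, not tuned values, and the example uses a much smaller x than the real problem.

```r
# Hypothetical sketch: draw with-replacement batches until n distinct indices exist.
batched_rejection_sample <- function(x, n, w, batch = 2 * n) {
  stopifnot(n <= sum(w > 0))          # otherwise the loop could never finish
  picked <- integer(0)
  while (length(picked) < n) {
    draws <- sample.int(length(x), batch, replace = TRUE, prob = w)
    # earlier picks stay in front; repeated indices are dropped
    picked <- unique(c(picked, draws))
  }
  x[picked[seq_len(n)]]               # keep only the first n distinct picks
}

set.seed(42)
x <- 1:10000
w <- rexp(length(x))
y <- batched_rejection_sample(x, 500, w)
```

Because the batches are concatenated in draw order, keeping the first n distinct indices gives the same result as rejecting repeats one draw at a time.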

