Original poster: Shazam

[Survey Sampling Question] Algorithm for sampling without replacement?

I am trying to test how likely it is that a particular clustering of data occurred by chance. A robust way to do this is Monte Carlo simulation: the associations between data and groups are randomly reassigned a large number of times (e.g. 10,000), and a clustering metric computed on the actual data is compared with the same metric on the simulated data sets to obtain a p-value.

I've got most of this working, with pointers mapping the grouping to the data elements, so I plan to randomly reassign pointers to data. THE QUESTION: what is a fast way to sample without replacement, so that every pointer is randomly reassigned in the replicate data sets?

For example (these data are just a simplified example):

Data (n=12 values) - Group A: 0.1, 0.2, 0.4 / Group B: 0.5, 0.6, 0.8 / Group C: 0.4, 0.5 / Group D: 0.2, 0.2, 0.3, 0.5

For each replicate data set, I would have the same cluster sizes (A=3, B=3, C=2, D=4) and data values, but would reassign the values to the clusters.

To do this, I could generate a random number in the range 1-12 and assign the chosen value to the first element of group A, then generate a random number in the range 1-11 (over the values not yet used) and assign the second element of group A, and so on. The pointer reassignment is fast, and I will have pre-allocated all data structures, but sampling without replacement seems like a problem that has been solved many times before.

Logic or pseudocode preferred.

Thanks!
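
The scheme described in the question (draw from 1-12, then from the remaining 11, and so on) amounts to building a random permutation of the 12 values. A minimal sketch of that idea, assuming C++11 and using std::shuffle rather than hand-rolled draws; the name ReassignGroups and its signature are hypothetical, not from the thread:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: randomly assigns value indices 0..total-1 to groups.
// groupSizes = {3, 3, 2, 4} matches the example (groups A, B, C, D).
// result[g] holds the zero-based indices of the values given to group g.
std::vector<std::vector<int>> ReassignGroups(const std::vector<int>& groupSizes,
                                             std::mt19937& rng)
{
    const int total = std::accumulate(groupSizes.begin(), groupSizes.end(), 0);

    // Start from the identity order 0..total-1 and shuffle it, so every
    // index appears exactly once (sampling without replacement).
    std::vector<int> order(total);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), rng);

    // Cut the shuffled order into consecutive blocks, one block per group.
    std::vector<std::vector<int>> groups;
    groups.reserve(groupSizes.size());
    auto next = order.begin();
    for (int size : groupSizes)
    {
        assert(size >= 0);
        groups.emplace_back(next, next + size);
        next += size;
    }
    return groups;
}
```

For each replicate, one would call ReassignGroups once (reusing the same rng) and recompute the clustering metric on the returned grouping. With pre-allocated structures, the same effect can be had by shuffling the pointer array in place.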



Reply #1
Shazam posted on 2014-4-30 07:44:56
Here's some code for sampling without replacement, based on Algorithm S in Section 3.4.2 of Knuth's book Seminumerical Algorithms.

#include <vector>
using std::vector;

double GetUniform(); // user-supplied uniform(0,1) random number generator

void SampleWithoutReplacement
(
    int populationSize,    // size of the set being sampled from
    int sampleSize,        // size of each sample
    vector<int> & samples  // output, zero-offset indices of the selected items
)
{
    // Use Knuth's variable names
    const int n = sampleSize;
    const int N = populationSize;

    samples.resize(n); // make sure the output can hold n indices

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far
    double u;

    while (m < n)
    {
        u = GetUniform(); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++; // skip record t
        }
        else
        {
            samples[m] = t; // select record t
            t++; m++;
        }
    }
}
There is a more efficient but more complex method by Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67.
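
The routine above depends on an external GetUniform(). As a rough sketch of how it could be made self-contained, the variant below draws its uniform numbers from the C++11 <random> header. SelectionSample is a hypothetical name, and the extra guard for the last remaining records is an addition that is not in the original code:

```cpp
#include <cassert>
#include <random>
#include <vector>

// Selection sampling: visit records 0..N-1 in order and select record t
// with probability (n - m) / (N - t), where m records are already chosen.
// Returns the selected zero-offset indices in increasing order.
std::vector<int> SelectionSample(int populationSize, int sampleSize,
                                 std::mt19937& rng)
{
    assert(0 <= sampleSize && sampleSize <= populationSize);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);

    std::vector<int> samples;
    samples.reserve(sampleSize);

    for (int t = 0; static_cast<int>(samples.size()) < sampleSize; ++t)
    {
        const int remaining = populationSize - t;  // records not yet visited
        const int needed = sampleSize - static_cast<int>(samples.size());

        // Guard (an addition): if every remaining record is needed, take it
        // without consulting the generator, so t can never run past N-1.
        if (remaining <= needed || remaining * uniform(rng) < needed)
        {
            samples.push_back(t);
        }
    }
    return samples;
}
```

Each call scans at most populationSize records and returns a sorted set of indices, so it selects which items are sampled but does not put them in random order.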

Reply #2
Shazam posted on 2014-4-30 07:46:10
Task: Knuth's algorithm S
This is a method of randomly sampling n items from a set of M items with equal probability, where M >= n and M, the number of items, is unknown until the end. This means that equal-probability sampling should be maintained for all successive items > n as they become available (although the content of successive samples can change).
The algorithm
1. Select the first n items as the sample as they become available.
2. For the i-th item, where i > n, give it a random chance of n/i of being kept. If it fails this chance, the sample remains the same. Otherwise, it replaces one of the n previously selected items of the sample, chosen at random (each with probability 1/n).
3. Repeat #2 for any subsequent items.
The Task
Create a function s_of_n_creator that, given the maximum sample size n, returns a function s_of_n that takes one parameter, item.
The function s_of_n, when called with successive items, returns on each call an equally weighted random sample of up to n of the items seen so far, calculated using Knuth's Algorithm S.
Test your functions by printing the frequency of occurrence of the selected digits from 100,000 repetitions of:
1. Use s_of_n_creator with n == 3 to generate an s_of_n.
2. Call s_of_n with each of the digits 0 to 9 in order, keeping the three digits returned by its last call (argument item=9).
Note: A class taking n and generating a callable instance/function might also be used.
Reference
The Art of Computer Programming, Vol 2, 3.4.2 p.142
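
The task above could be approached in C++ roughly as follows. This is only a sketch under stated assumptions: items are taken to be int, the seed parameter is an addition for reproducibility, and the keep/replace logic follows the numbered steps of the task description (a scheme often called reservoir sampling):

```cpp
#include <cassert>
#include <functional>
#include <random>
#include <vector>

// Returns a callable that takes one item per call and returns the current
// sample of up to n items. The seed parameter is an assumption added here.
std::function<std::vector<int>(int)> s_of_n_creator(int n, unsigned seed)
{
    assert(n > 0);
    std::mt19937 rng(seed);
    std::vector<int> sample;
    int seen = 0;  // number of items offered so far

    // mutable: the lambda owns and updates its copies of rng, sample, seen.
    return [n, rng, sample, seen](int item) mutable {
        ++seen;
        if (seen <= n)
        {
            sample.push_back(item);  // step 1: keep the first n items
        }
        else
        {
            // Step 2: keep the i-th item with probability n/i ...
            std::uniform_int_distribution<int> pick(1, seen);
            if (pick(rng) <= n)
            {
                // ... and let it replace a uniformly chosen sample slot.
                std::uniform_int_distribution<int> slot(0, n - 1);
                sample[slot(rng)] = item;
            }
        }
        return sample;
    };
}
```

For the 100,000-repetition test, each repetition would need its own seed (for example, the repetition counter), since two samplers created with the same seed replay the same random choices.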
