Original poster: xmwise

[Other] Checking whether the observations in two datasets are identical

Original post
xmwise, posted 2017-3-7 21:50:55
Bounty: 15 forum coins
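Dear forum members,

I have two datasets saved in CSV format. They have the same variable names, the same number of observations, and the same value for every observation, so the two datasets should be exactly identical.

However, after importing each of them into Stata and comparing the two with the cf command, Stata reports that some of the variables differ (a screenshot was attached to the original post).

Yet the observations flagged as "mismatched" still look numerically the same.

Why does this happen? Why does Stata treat data that look exactly identical as different?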



Reply 1 (best answer)
Newkoarla, posted 2017-3-7 21:50:56
First, did you sort both datasets? It is best practice to sort the data before running the comparison.

You can refer to the article I have copied below:
How do I check that the same data input by two people are consistently entered? | Stata FAQ
When two people enter the same data (double data entry), a concern is whether discrepancies exist between the two datasets (the rationale of double data entry), and if so, where. We start by reading in the two datasets, one entered by person1 and the second by person2. After we read in the data, we sort the datasets by the id variable id and then save the data.
clear
input id str8 name  age ht wt income
11 john    23 68 145 23000
12 charlie 25 72 178 45000
13 sally   21 64 135 12000
4  mike    34 70 156  5600
43 paul    30 73 189 15600
end

sort id
save person1, replace

clear
input id str8 name age ht wt income
11 john    23.5 68 145 23000
12 charles   25 52 178 45000
13 sally     21 64  .  12000
4  michael   34 70 156  5600
43 Paul      30 73 189  5600
end

sort id
save person2, replace
We compare the two datasets with the cf command to see if any discrepancies exist between the two datasets.
use person1, clear
cf _all using person2, verbose

              id:  match
            name:  3 mismatches
             age:  1 mismatches
              ht:  1 mismatches
              wt:  1 mismatches
          income:  1 mismatches
r(9);
The cf command revealed that differences exist, but in this output it did not show which observations the mismatches occurred in, which is our main objective. To find out where the errors occurred, we start by creating one large dataset that combines the two. In the combined dataset we must be able to distinguish the data entered by person1 from the data entered by person2, so we rename every variable from person1 except id (which is needed for matching) by adding the suffix "_person1" with the rename command. We use the foreach command to make the renaming more efficient. Once the variables are renamed, person2 is merged with person1 on id, and the merged dataset is listed.
use person1, clear

foreach var of varlist name-income{
  rename `var' `var'_person1
}

* Stata 11 and later: merge 1:1 replaces the older "merge id using person2" form
merge 1:1 id using person2
list

     +---------------------------------------------------------------------------------------------------------+
     | id   name_p~1   age_pe~1   ht_per~1   wt_per~1   income~1      name    age   ht    wt   income   _merge |
     |---------------------------------------------------------------------------------------------------------|
  1. |  4       mike         34         70        156       5600   michael     34   70   156     5600        3 |
  2. | 11       john         23         68        145      23000      john   23.5   68   145    23000        3 |
  3. | 12    charlie         25         72        178      45000   charles     25   52   178    45000        3 |
  4. | 13      sally         21         64        135      12000     sally     21   64     .    12000        3 |
  5. | 43       paul         30         73        189      15600      Paul     30   73   189     5600        3 |
     +---------------------------------------------------------------------------------------------------------+
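A merge-based comparison only lines rows up correctly if id uniquely identifies each observation and every id appears in both files. A minimal sketch of those two checks, using small made-up data (the values and the tempfile name entry1 are hypothetical, and merge 1:1 assumes Stata 11 or later):

```stata
clear
input id age
11 23
12 25
end
* isid stops with an error if id does not uniquely identify the observations
isid id
rename age age_person1
tempfile entry1
save `entry1'

clear
input id age
11 23.5
12 25
end
isid id
merge 1:1 id using `entry1'
* _merge == 3 means the id was found in both files; assert stops otherwise
assert _merge == 3
list id age age_person1 if age != age_person1
```

If either check fails, the discrepancy listings that follow would compare rows that do not belong together, so it is probably worth resolving that first.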
In exploring the discrepancies, we can display them either by variable or by observation. We start by listing them by variable. The foreach command loops over the variables from person2 (name-income, the ones without the suffix). The if clause, `var' != `var'_person1, restricts the listing for each variable, referenced by `var' inside the loop, to the observations where the value entered by person2 (`var') differs from the value entered by person1 (`var'_person1). For those observations we list id, the value entered by person2 (`var'), and the value entered by person1 (`var'_person1).
Note that when we list the variables, the variables with no suffix correspond to the entries made by person2.
*Discrepancies listed by variables.

foreach var of varlist name-income{
  list id `var' `var'_person1 if `var' != `var'_person1, abbreviate(15)
}
     +-----------------------------+
     | id      name   name_person1 |
     |-----------------------------|
  1. |  4   michael           mike |
  3. | 12   charles        charlie |
  5. | 43      Paul           paul |
     +-----------------------------+

     +-------------------------+
     | id    age   age_person1 |
     |-------------------------|
  2. | 11   23.5            23 |
     +-------------------------+

     +----------------------+
     | id   ht   ht_person1 |
     |----------------------|
  3. | 12   52           72 |
     +----------------------+

     +----------------------+
     | id   wt   wt_person1 |
     |----------------------|
  4. | 13    .          135 |
     +----------------------+

     +------------------------------+
     | id   income   income_person1 |
     |------------------------------|
  5. | 43     5600            15600 |
     +------------------------------+
When we list discrepancies by observation, we need to modify the previous program to evaluate the variables case by case: for observation 1, we check the entries across all variables given in the foreach loop. Once observation 1 is checked and its discrepancies are listed, we move on to observation 2, and so on until the last observation. First, we find the number of observations in the data with the count command and use that value as the upper limit of the forvalues loop, which lets us evaluate discrepancies one observation at a time. We add _n == `i' to the if clause of the list command so that all variables in the foreach loop are evaluated for a given observation before moving to the next one.
*Discrepancies listed by id variable.

count
    5

forvalues i = 1/5 {
   foreach var of varlist name-income{
   list id `var' `var'_person1 if (`var' != `var'_person1) & _n == `i', abbreviate(15)
   }
}

     +-----------------------------+
     | id      name   name_person1 |
     |-----------------------------|
  1. |  4   michael           mike |
     +-----------------------------+

     +-------------------------+
     | id    age   age_person1 |
     |-------------------------|
  2. | 11   23.5            23 |
     +-------------------------+

     +-----------------------------+
     | id      name   name_person1 |
     |-----------------------------|
  3. | 12   charles        charlie |
     +-----------------------------+

     +----------------------+
     | id   ht   ht_person1 |
     |----------------------|
  3. | 12   52           72 |
     +----------------------+

     +----------------------+
     | id   wt   wt_person1 |
     |----------------------|
  4. | 13    .          135 |
     +----------------------+

     +--------------------------+
     | id   name   name_person1 |
     |--------------------------|
  5. | 43   Paul           paul |
     +--------------------------+

     +------------------------------+
     | id   income   income_person1 |
     |------------------------------|
  5. | 43     5600            15600 |
     +------------------------------+
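The observation count does not have to be typed in by hand: Stata's built-in _N holds the number of observations in the dataset in memory. A minimal sketch of the same by-observation listing with _N as the loop limit, using small made-up data (the values and the tempfile name entry1 are hypothetical, and merge 1:1 assumes Stata 11 or later):

```stata
clear
input id str8 name age
11 "john" 23
12 "charlie" 25
end
rename name name_person1
rename age age_person1
tempfile entry1
save `entry1'

clear
input id str8 name age
11 "john" 23.5
12 "charles" 25
end
merge 1:1 id using `entry1'

* _N is the number of observations currently in memory
local n = _N
forvalues i = 1/`n' {
    foreach var of varlist name age {
        list id `var' `var'_person1 if (`var' != `var'_person1) & _n == `i', abbreviate(15)
    }
}
```

This way the loop keeps working if the datasets grow or shrink, without editing the upper limit.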


If you can post the two datasets and your code, I might be able to help you further.

Good luck!

Reply 2
血浪星空, posted 2017-3-8 14:44:49
You would need to ask a computing professional about this.

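Reply 3
lile23, posted 2017-3-8 18:58:51
This is probably related to the computer's numerical precision. For the details, you should look at a book on numerical computation.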
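One way precision could produce this symptom, offered as a possibility rather than a diagnosis of the original poster's files, is that the same number ends up stored as float in one dataset and as double in the other. A minimal sketch with made-up variable names (x_float and x_double are hypothetical):

```stata
clear
set obs 1
* 0.1 has no exact binary representation, so float and double
* store slightly different approximations of it
generate float  x_float  = 0.1
generate double x_double = 0.1
* Compare how the two values display under the default formats
list x_float x_double
* The stored values are not equal, so this counts the observation
count if x_float != x_double
* Rounding the double to float precision removes the difference
count if x_float == float(x_double)
```

If this turns out to be the cause, running describe on both datasets to compare storage types, or importing both CSV files with the same type (import delimited has asdouble and asfloat options), may make the comparison come out clean.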

Reply 4
Newkoarla, posted 2017-3-10 00:22:07
Here is the syntax of cf (provided as an attachment to the original post).
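The attachment is not reproduced here. As a rough sketch based on Stata's built-in help (help cf), the general form is cf varlist using filename [, all verbose]. In recent releases, all reports every variable, including those that match, and verbose lists each observation that differs. The data and the tempfile name second below are made up for illustration:

```stata
clear
input id age
11 23
12 25
end
tempfile second
save `second'

clear
input id age
11 23.5
12 25
end
* cf exits with return code 9 when it finds a mismatch;
* capture noisily shows the report and lets a do-file continue
capture noisily cf _all using `second', all verbose
display _rc
```

Note that cf compares the datasets observation by observation in their current order, which is why sorting both files the same way beforehand matters.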

Reply 5
xmwise, posted 2017-4-23 17:11:21
Quoting Newkoarla (2017-3-10 00:22):
Here is the syntax of CF
Thank you very much!
