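The problem with this group is that, in the raw data, the ivid06 information does not match the ivid08 on the same row; within a single household the records appear to be shuffled. I would like to realign them using ivid08 as the reference. Take the first row as an example: ivid06 and ivid08 there are not the same person, but the information for ivid06==60519130240803 should match the fourth row, ivid08==605191302400804, so I set ivid06_revised on the fourth row to 60519130240803 (highlighted in blue). The ivid06_revised column in the table below is the result of this correction by visual inspection.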
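Because the dataset is very large, visual inspection is very time-consuming. The raw census data also contains occasional errors, so even where a match can be judged the fields do not agree exactly: name spellings differ (highlighted in yellow), year or month of birth may be off, and relation can change (a child in 2006 may be the head in 2008). Is there an algorithm that can produce reasonably accurate matches? This is my first time posting a question, so please point out anything I have overlooked. Thank you.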

| hhid08 | hhid06 | ivid08 | ivid06 | ivid06_revised | name08 | name06 |
|---|---|---|---|---|---|---|
| 6051913024008 | 605191302408 | 605191302400801 | 60519130240803 | 60519130240801 | h nghüa by¨ | BY¡ Y §¸P |
| 6051913024008 | 605191302408 | 605191302400802 | 60519130240801 | 60519130240807 | y mYp nia | BY¡ H NGHüA |
| 6051913024008 | 605191302408 | 605191302400803 | 60519130240802 | 60519130240808 | h hi¨n by¨ | BY¡ Y S\|¥NG |
| 6051913024008 | 605191302408 | 605191302400804 | 60519130240805 | 60519130240803 | BY¡ Y §¸P | NI£ Y DA N¤ |
| 6051913024008 | 605191302408 | 605191302400805 | 60519130240806 | 60519130240804 | y min by¨. | BY¡ H KYNH |
| 6051913024008 | 605191302408 | 605191302400806 | 60519130240807 | 60519130240805 | NI£ Y DA N¤ | NI£ Y MIP |
| 6051913024008 | 605191302408 | 605191302400807 | 60519130240808 | 60519130240806 | BY¡ H KYNH | BY¡ HIAN |
| 6051913024008 | 605191302408 | 605191302400808 | | | BY¡ H trim | |
| 6051913024008 | 605191302408 | | 60519130240804 | | | BY¡ Y MIN |

Remaining columns for the same rows, in the same order:

| gender08 | gender06 | yob08 | yob06 | mob08 | mob06 | relation08 | relation06 |
|---|---|---|---|---|---|---|---|
| Female | Male | 1985 | 1989 | 10 | 7 | Head | Other |
| Male | Female | 1935 | 1985 | 6 | 10 | Parent | Head |
| Female | Male | 1949 | 1987 | 7 | 8 | Parent | Other |
| Male | Male | 1989 | 1980 | 7 | 8 | Other | Spouse |
| Male | Female | 1991 | 2004 | 5 | 11 | Other | Child |
| Male | Male | 1980 | 1935 | 8 | 6 | Other | Parent |
| Female | Female | 2004 | 1949 | 11 | 7 | Child | Parent |
| Female | | 2006 | | 4 | | Child | |
| | Male | | 1991 | | 5 | | Other |
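
To make the question concrete, here is a minimal sketch of how the visual matching described above might be automated within one household. It is only an illustration under stated assumptions, not a recommended or validated method: the function names (`name_key`, `pair_score`, `match_household`, `person`), the field weights, and the 0.6 threshold are all hypothetical choices, and the greedy pairing is a simple stand-in for a proper assignment algorithm. Names are compared with the standard-library `difflib.SequenceMatcher` after lower-casing and sorting the words, since the two years write the words in a different order. Relation is deliberately ignored because it can change between 2006 and 2008. The records are the household from the tables above.

```python
from difflib import SequenceMatcher


def name_key(name):
    """Lower-case, strip trailing punctuation, and sort the words so that
    'h nghüa by¨' and 'BY¡ H NGHüA' are compared in the same word order."""
    tokens = [t.strip(".,") for t in name.casefold().split()]
    return " ".join(sorted(t for t in tokens if t))


def pair_score(p08, p06):
    """Similarity in [0, 1]. The weights are illustrative guesses, not tuned."""
    name_sim = SequenceMatcher(None, name_key(p08["name"]), name_key(p06["name"])).ratio()
    same_gender = p08["gender"] == p06["gender"]
    yob_gap = abs(p08["yob"] - p06["yob"])
    yob_sim = 1.0 if yob_gap == 0 else 0.5 if yob_gap == 1 else 0.0
    same_mob = p08["mob"] == p06["mob"]
    return 0.4 * name_sim + 0.2 * same_gender + 0.25 * yob_sim + 0.15 * same_mob


def match_household(people08, people06, threshold=0.6):
    """Greedy one-to-one matching: accept the best-scoring pair first, then the
    best pair among people still unmatched, until scores fall below threshold."""
    scored = sorted(
        ((pair_score(a, b), a["ivid"], b["ivid"]) for a in people08 for b in people06),
        reverse=True,
    )
    matched, used06 = {}, set()
    for score, id08, id06 in scored:
        if score < threshold:
            break
        if id08 not in matched and id06 not in used06:
            matched[id08] = id06
            used06.add(id06)
    return matched  # maps ivid08 -> proposed ivid06_revised


def person(ivid, name, gender, yob, mob):
    # IDs are kept as strings so that long identifiers are never rounded.
    return {"ivid": ivid, "name": name, "gender": gender, "yob": yob, "mob": mob}


# 2008 records (ivid08, name08, gender08, yob08, mob08) from the tables above.
people08 = [
    person("605191302400801", "h nghüa by¨", "Female", 1985, 10),
    person("605191302400802", "y mYp nia", "Male", 1935, 6),
    person("605191302400803", "h hi¨n by¨", "Female", 1949, 7),
    person("605191302400804", "BY¡ Y §¸P", "Male", 1989, 7),
    person("605191302400805", "y min by¨.", "Male", 1991, 5),
    person("605191302400806", "NI£ Y DA N¤", "Male", 1980, 8),
    person("605191302400807", "BY¡ H KYNH", "Female", 2004, 11),
    person("605191302400808", "BY¡ H trim", "Female", 2006, 4),
]

# 2006 records (ivid06, name06, gender06, yob06, mob06) from the tables above.
people06 = [
    person("60519130240803", "BY¡ Y §¸P", "Male", 1989, 7),
    person("60519130240801", "BY¡ H NGHüA", "Female", 1985, 10),
    person("60519130240802", "BY¡ Y S|¥NG", "Male", 1987, 8),
    person("60519130240805", "NI£ Y DA N¤", "Male", 1980, 8),
    person("60519130240806", "BY¡ H KYNH", "Female", 2004, 11),
    person("60519130240807", "NI£ Y MIP", "Male", 1935, 6),
    person("60519130240808", "BY¡ HIAN", "Female", 1949, 7),
    person("60519130240804", "BY¡ Y MIN", "Male", 1991, 5),
]

matches = match_household(people08, people06)
for id08, id06 in sorted(matches.items()):
    print(id08, "->", id06)
```

On this one household the proposed pairs agree with the ivid06_revised column, and ivid08 605191302400808 (born 2006) and ivid06 60519130240802 are left unmatched. Whether the same weights and threshold hold up across the full dataset is untested; pairs scoring near the threshold would still need visual inspection, and an optimal assignment (for example `scipy.optimize.linear_sum_assignment`) could replace the greedy step where greedy choices conflict.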









