Database matching problem - Page 2 - Stata forum

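Post #11

sungmoo posted on 2010-8-26 06:34:50

voodoo posted on 2010-8-25 11:25: The mapping rules may differ from one database to another, but what I am after is a general approach and algorithm for handling this kind of problem.

The problem has two parts. First, work out the relationship between the short names and the full names (in other words, the "naming rule" behind the short names); this has nothing to do with Stata. Second, find the commands that express that relationship; this is where Stata comes in.

The "mapping rule" above is really just the naming rule for the short names. (Strictly speaking, in mathematical terms, if fndnmefull in database B and fndnme in database A are each viewed as a set of observations, the correspondence from fndnmefull to fndnme is not actually a mapping, because some elements of fndnmefull have no image.)

If the short names were not assigned in a systematic, consistent way, there can hardly be a "general approach and algorithm".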
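
The "no image" point can be checked directly: apply one candidate naming rule and list the full names that end up without a counterpart. The following is only a minimal sketch on hand-typed toy data; the three fund names and the single rule (remove "证券投资基金") are illustrative assumptions, not the contents of the real A and B databases.

```stata
* Toy version of database B: full names plus a key built by one candidate rule
clear
input str40 fndnmefull
"华夏成长证券投资基金"
"博时价值增长证券投资基金"
"安信证券投资基金"
end
gen new = subinstr(fndnmefull, "证券投资基金", "", .)
tempfile b
save `b'

* Toy version of database A: short names, used as the key unchanged
clear
input str20 fndnme
"华夏成长"
"基金安信"
end
gen new = fndnme

merge 1:1 new using `b'
list fndnmefull if _merge == 2   // full names with no image under this rule
list fndnme     if _merge == 1   // short names this rule cannot reach
```

Whatever remains under `_merge == 2` after every systematic rule has been tried is what has to be handled by hand.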

Post #12

ctx5518 posted on 2010-8-27 13:58:29

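clear
set more off

* Build a matching key from the short names in A
use a.dta, clear
gen new = fndnme
* Strip a leading "基金" (closed-end funds). substr() counts bytes: 4 bytes are two
* characters in a double-byte encoding such as GBK, but not in UTF-8 (Stata 14+)
replace new = substr(new, 5, 20) if strpos(fndnme, "基金") == 1

levelsof new, local(new)

tempfile a
save `a'

* Tag each full name in B with the key it contains
use b.dta, clear
gen new = ""
foreach x of local new {
    replace new = "`x'" if strpos(fndnmefull, "`x'") > 0
}

* Old-style merge syntax (Stata 10 and earlier); later versions still accept it
merge new using `a', uniqusing sort

keep if _merge == 3
keep fndcd fndnme fndnmefull

order fndcd fndnme fndnmefull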
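
A note for readers on Stata 14 or later, where strings are stored as UTF-8 and a Chinese character occupies three bytes: the byte-based `substr(new, 5, 20)` in the code above would then cut in the middle of a character. A character-based variant might look like the sketch below. It assumes that the prefix is always exactly "基金" at the start of the name, and the single observation is made up for illustration.

```stata
clear
set obs 1
gen fndnme = "基金安信"          // illustrative short name of a closed-end fund
gen new = fndnme
* usubstr() and ustrpos() count Unicode characters rather than bytes
replace new = usubstr(new, 3, .) if ustrpos(fndnme, "基金") == 1
list
```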


Post #14

yjknmg posted on 2010-8-30 11:33:14

Here to learn.

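Post #15

voodoo posted on 2010-8-30 22:07:38

Here is the solution I put together from the suggestions of sungmoo and ctx5518 (it does look a bit complicated ;-)). Corrections from the experts are welcome.

// Overall approach: 1. find a matching rule → 2. implement it in Stata → 3. save the matched and still-unmatched records → 4. repeat with the next rule ... → finally, verify and handle the remainder "by hand"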

set more off

// Back up the original files so that they are not overwritten
copy A.dta _A.dta
copy B.dta _B.dta

capture program drop findrule
program findrule
      use A, clear
      merge 1:1 _n using B, nogen
      br
end

capture program drop statamatch
program statamatch
      args rule             // 1 = strict rule, 2 = slack rule
      // 2. Stata implementation: collect the keys built from the full names in B
      use B, clear
      quietly levelsof new, local(strnew)

      tempfile b
      save `b'

      use A.dta, clear
      gen new = ""
      if `rule' == 1 {             // strict rule: the short name occurs verbatim in the key
            foreach x of local strnew {
                     replace new = "`x'" if strpos("`x'", fndnme)>0
            }
      }
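      else if `rule' == 2 {             // slack rule
            // every byte of the short name occurs somewhere in the key, its last 4 bytes
            // occur there as a block, and only rows that are still unmatched are filled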
            foreach x of local strnew {
                     replace new = "`x'" if indexnot(fndnme, "`x'")==0 &       ///
                              strpos("`x'", substr(fndnme, -4, .))>0 & missing(new)
            }
      }

      merge m:1 new using `b'

      // 3. Save the matched and still-unmatched records
      preserve
      keep if _merge == 3
      keep fndcd fndnme fndnmefull
      append using matched
      save matched, replace

      restore, preserve
      keep if _merge == 2
      keep fndnmefull
      sort fndnmefull
      save B, replace

      restore
      keep if _merge == 1
      keep fndcd fndnme
      sort fndnme
      save A, replace

      // 4. Browse the matches so far
      use matched, clear
      br
end

copy _A.dta A.dta, replace
copy _B.dta B.dta, replace

* 0. Create an empty matched.dta
clear
save matched, replace emptyok

* 1. Loop #1
findrule

use B, clear
gen new = subinstr(fndnmefull, "证券投资基金", "", .)       // remove the phrase "证券投资基金"
save B, replace

statamatch 1             // strict rule

* 2. Loop #2
findrule

use B, clear
gen new = "基金" + fndnmefull
save B, replace

statamatch 1             // match closed-end funds, strict rule

* 3. Loop #3
findrule

use A, clear
drop if strmatch(fndnme, "基金*")       // drop closed-end funds that could cause confusion
save A, replace

use B, clear
gen new = subinstr(fndnmefull, "摩根士丹利华鑫", "大摩", .)             // "摩根士丹利华鑫" -> "大摩"
replace new = subinstr(new, "宝康", "华宝兴业宝康", .)
replace new = subinstr(new, "德盛", "国联安", .)
replace new = subinstr(new, "普天", "鹏华普天", .)
replace new = subinstr(new, "华泰柏瑞", "友邦华泰", .)
save B, replace

statamatch 2             // slack rule

* 4. manual handling
use B, clear
ren fndnmefull fndnme
append using A, gen(A)
sort fndnme
br
// some of the fund codes below had to be looked up with Google, Baidu, or another search engine
replace fndcd = "050001.OF" if fndnme == "博时价值增长证券投资基金"
replace fndcd = "020008.OF" if fndnme == "国泰金鹿保本增值混合证券投资基金"
replace fndcd = "020006.OF" if fndnme == "国泰金象保本增值混合证券投资基金"
replace fndcd = "150001.OF" if fndnme == "国投瑞银瑞福分级股票型证券投资基金"
replace fndcd = "519011.OF" if fndnme == "海富通精选证券投资基金"
replace fndcd = "040002.OF" if fndnme == "华安MSCI中国A股指数增强型证券投资基金"
replace fndcd = "240012.OF" if fndnme == "华宝兴业增强收益债券型证券投资基金"
// ...... ......

keep if A == 0
keep fndnme fndcd
ren fndnme fndnmefull
append using matched

sort fndnmefull fndnme
duplicates tag fndnmefull, gen(tag)
br if tag
drop tag
bysort fndnmefull (fndnme): keep if _n == _N

duplicates tag fndcd, gen(tag)
br if tag
// handle accordingly ......
drop tag             // do not carry the helper variable into matched.dta

save matched, replace

use matched, clear
merge 1:1 fndnmefull using _B, nogen assert(matched)

// DONE!
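
To see how the strict and slack rules in statamatch differ, here is a small side-by-side sketch. The variable `full` and the ASCII strings are hypothetical stand-ins for the short names and cleaned full names, chosen so that the byte-based functions `strpos()`, `indexnot()` and `substr()` behave the same under any string encoding; this is an illustration, not part of the procedure above.

```stata
clear
input str10 fndnme str20 full
"HBZQSY" "HBXYZQSYZQ"
"HBZQSY" "HBXYBKZQ"
"GTJL"   "GTJLBB"
end

* strict rule: the short name appears verbatim inside the full name
gen byte strict = strpos(full, fndnme) > 0

* slack rule: every character of the short name appears somewhere in the full
* name, and the last 4 bytes of the short name appear there as a block
gen byte slack = indexnot(fndnme, full) == 0 & strpos(full, substr(fndnme, -4, .)) > 0

list
```

The first row is matched by the slack rule only, the second by neither rule, and the third by both. The slack rule is loose enough to produce false positives, which is why it is run last and followed by a manual check.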


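Post #16

dxystata posted on 2011-7-2 08:22:03

voodoo posted on 2010-8-30 22:07
Here is the solution I put together from the suggestions of sungmoo and ctx5518 (it does look a bit complicated ;-)). Corrections from the experts are welcome.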
