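This section builds the balanced and unbalanced panels for 1998-2006. When the do-file finishes running, it produces two files: balanced.1998-2006.dta and unbalanced.1998-2006.dta. The code is as follows:
*Build the 9-year balanced and unbalanced panels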
cls
clear
set more off
cd "/Users/youwang/Desktop/FIRM"
qui do "my preprocess"
**------------------------------------------------------------------------------
* PART1: match on the legal-entity code (ID)
**------------------------------------------------------------------------------
*Generate the variable used for matching
forval i = 1998/2006{
use "m`i'.dta",clear
gen match_id=id`i'
save "m`i'.ID.dta",replace //save the sample used for matching
}
*Match across all 9 years
forval i = 1999/2006{
use "m1998.ID.dta"
merge 1:1 match_id using "m`i'.ID.dta"
gen status_ID1998_`i' = _merge //record each sample's match status
drop _merge
save "m1998.ID.dta",replace
}
use "m1998.ID.dta",clear
save "merge.1998-2006.ID.dta",replace
use "merge.1998-2006.ID.dta",clear
keep if status_ID1998_2006==3 & status_ID1998_2005==3 & status_ID1998_2004==3 & ////
status_ID1998_2003==3 & status_ID1998_2002==3 & status_ID1998_2001==3 & ////
status_ID1998_2000==3 & status_ID1998_1999==3 //keep only the samples matched in every year of 1998-2006
save "matched.1998-2006.ID.dta",replace
use "merge.1998-2006.ID.dta",clear
drop if status_ID1998_2006==3 & status_ID1998_2005==3 & status_ID1998_2004==3 & ////
status_ID1998_2003==3 & status_ID1998_2002==3 & status_ID1998_2001==3 & ////
status_ID1998_2000==3 & status_ID1998_1999==3 //drop the samples matched in every year, leaving those not matched continuously over 1998-2006
save "unmatched.1998-2006.ID.dta",replace
**------------------------------------------------------------------------------
* PART2: match on the legal-entity name (NAME)
**------------------------------------------------------------------------------
*Select the 1998-2006 samples that could not be matched by ID, for use in this round
forval i = 1998/2006{
use "unmatched.1998-2006.ID.dta",clear
keep *`i'
drop if name`i'==""
gen match_name=name`i'
save "m`i'.NAME.dta",replace
}
*Match across all 9 years
forval i = 1999/2006{
use "m1998.NAME.dta"
merge 1:1 match_name using "m`i'.NAME.dta"
gen status_NAME1998_`i' = _merge //record each sample's match status
drop _merge
save "m1998.NAME.dta",replace
}
use "m1998.NAME.dta",clear
save "merge.1998-2006.NAME.dta",replace
use "merge.1998-2006.NAME.dta",clear
keep if status_NAME1998_2006==3 & status_NAME1998_2005==3 & status_NAME1998_2004==3 & ////
status_NAME1998_2003==3 & status_NAME1998_2002==3 & status_NAME1998_2001==3 & ////
status_NAME1998_2000==3 & status_NAME1998_1999==3
save "matched.1998-2006.NAME.dta",replace
use "merge.1998-2006.NAME.dta",clear
drop if status_NAME1998_2006==3 & status_NAME1998_2005==3 & status_NAME1998_2004==3 & ////
status_NAME1998_2003==3 & status_NAME1998_2002==3 & status_NAME1998_2001==3 & ////
status_NAME1998_2000==3 & status_NAME1998_1999==3
save "unmatched.1998-2006.NAME.dta",replace
**------------------------------------------------------------------------------
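* PART3: match on the phone number (PHONE)
**------------------------------------------------------------------------------
*Match the datasets using the phone number as the matching variable
*Note: the phone-number format differs across years. Some years record the phone number
*together with the long-distance area code, while others record the two separately. The
*number of digits also varies: some firms report a mobile number instead of a landline,
*and some landline numbers (excluding the area code) have 7 digits while others have 8.
*To make matching workable while keeping each key unique to one firm, we build a new
*matching key as "first four digits of the region code + first three digits of the
*industry code + last seven digits of the phone number". For 2000-2003 the region code
*is the province-prefecture-county code, with only six digits; for 2004-2012 it is the
*administrative-division code, with twelve digits. Administrative-division code =
*province-prefecture-county code (six digits) + township-village code (six digits).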
forval i = 1998/2006{
use "unmatched.1998-2006.NAME.dta",clear
keep *`i'
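drop if phone`i'=="" //assumes the phone variable is named phone`i', following the dq`i'/nic`i' naming used in PART4
gen match_phone=substr(dq`i',1,4)+substr(nic`i',1,3)+substr(phone`i',-7,7)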
bysort match_phone : drop if _N>1
save "m`i'.PHONE.dta",replace
}
*Match across all 9 years
forval i = 1999/2006{
use "m1998.PHONE.dta"
merge 1:1 match_phone using "m`i'.PHONE.dta"
gen status_PHONE1998_`i' = _merge //record each sample's match status
drop _merge
save "m1998.PHONE.dta",replace
}
use "m1998.PHONE.dta",clear
save "merge.1998-2006.PHONE.dta",replace
use "merge.1998-2006.PHONE.dta",clear
keep if status_PHONE1998_2006==3 & status_PHONE1998_2005==3 & status_PHONE1998_2004==3 & ////
status_PHONE1998_2003==3 & status_PHONE1998_2002==3 & status_PHONE1998_2001==3 & ////
status_PHONE1998_2000==3 & status_PHONE1998_1999==3
save "matched.1998-2006.PHONE.dta",replace
use "merge.1998-2006.PHONE.dta",clear
drop if status_PHONE1998_2006==3 & status_PHONE1998_2005==3 & status_PHONE1998_2004==3 & ////
status_PHONE1998_2003==3 & status_PHONE1998_2002==3 & status_PHONE1998_2001==3 & ////
status_PHONE1998_2000==3 & status_PHONE1998_1999==3
save "unmatched.1998-2006.PHONE.dta",replace
**------------------------------------------------------------------------------
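* PART4: match on the legal representative (REP)
**------------------------------------------------------------------------------
*Match the datasets using the legal representative as the matching variable; this round is independent of the other rounds
*Note: different firms may have legal representatives with the same name. To handle this, we prefix the
*representative's name with the first four digits of the region code (the firm's location) and the first three
*digits of the industry code (the firm's three-digit industry group) to form a new representative key.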
forval i = 1998/2006{
use "unmatched.1998-2006.PHONE.dta",clear
keep *`i'
drop if corp_representive`i'==""
gen match_rep=substr(dq`i',1,4)+substr(nic`i',1,3)+corp_representive`i'
bysort match_rep : drop if _N>1
save "m`i'.REP.dta",replace
}
*Match across all 9 years
forval i = 1999/2006{
use "m1998.REP.dta"
merge 1:1 match_rep using "m`i'.REP.dta"
gen status_REP1998_`i' = _merge //record each sample's match status
drop _merge
save "m1998.REP.dta",replace
}
use "m1998.REP.dta",clear
save "merge.1998-2006.REP.dta",replace
use "merge.1998-2006.REP.dta",clear
keep if status_REP1998_2006==3 & status_REP1998_2005==3 & status_REP1998_2004==3 & ////
status_REP1998_2003==3 & status_REP1998_2002==3 & status_REP1998_2001==3 & ////
status_REP1998_2000==3 & status_REP1998_1999==3
save "matched.1998-2006.REP.dta",replace
use "merge.1998-2006.REP.dta",clear
drop if status_REP1998_2006==3 & status_REP1998_2005==3 & status_REP1998_2004==3 & ////
status_REP1998_2003==3 & status_REP1998_2002==3 & status_REP1998_2001==3 & ////
status_REP1998_2000==3 & status_REP1998_1999==3
save "unmatched.1998-2006.REP.dta",replace
**------------------------------------------------------------------------------
* PART5: build the balanced panel (9 years)
**------------------------------------------------------------------------------
use "matched.1998-2006.ID.dta",clear
append using "matched.1998-2006.NAME.dta"
append using "matched.1998-2006.PHONE.dta"
append using "matched.1998-2006.REP.dta"
save "balanced.1998-2006.dta",replace
**------------------------------------------------------------------------------
* PART6: build the unbalanced panel (9 years)
**------------------------------------------------------------------------------
forval i = 1998/2006{
use "unmatched.1998-2006.REP.dta",clear
keep *`i'
drop if id`i'==""
gen match_id=id`i'
save "m`i'.RESIDUAL.dta",replace
}
*Match across all 9 years
forval i = 1999/2006{
use "m1998.RESIDUAL.dta"
merge 1:1 match_id using "m`i'.RESIDUAL.dta"
drop _merge
save "m1998.RESIDUAL.dta",replace
}
use "m1998.RESIDUAL.dta",clear
save "merge.1998-2006.RESIDUAL.dta",replace
use "balanced.1998-2006.dta",clear
append using "merge.1998-2006.RESIDUAL.dta"
save "unbalanced.1998-2006.dta",replace
*Delete the files produced by the intermediate steps
forval i = 1998/2006{
erase "m`i'.dta"
erase "m`i'.ID.dta"
erase "m`i'.NAME.dta"
erase "m`i'.PHONE.dta"
erase "m`i'.REP.dta"
erase "m`i'.RESIDUAL.dta"
}
erase "matched.1998-2006.ID.dta"
erase "matched.1998-2006.NAME.dta"
erase "matched.1998-2006.PHONE.dta"
erase "matched.1998-2006.REP.dta"
erase "unmatched.1998-2006.ID.dta"
erase "unmatched.1998-2006.NAME.dta"
erase "unmatched.1998-2006.PHONE.dta"
erase "unmatched.1998-2006.REP.dta"
erase "merge.1998-2006.ID.dta"
erase "merge.1998-2006.NAME.dta"
erase "merge.1998-2006.PHONE.dta"
erase "merge.1998-2006.REP.dta"
erase "merge.1998-2006.RESIDUAL.dta"
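That completes the matching procedure for the Industrial Enterprise Database (《工业企业数据库》). Data for 2002 and earlier use the 1994 edition of the national industry classification codes (《国民经济行业分类代码》), while data for 2003 and later use the 2002 edition. My matching procedure uses the firm's industry (nic) as an auxiliary matching condition, so the more rigorous approach would be to harmonize the two editions of the classification before matching. That work is not finished yet, and the matching code given here ignores the differences between the two editions. I take only the first three digits of the industry code, which reduces the impact of those differences as far as possible. In addition, very few samples are matched through keys that include the industry code, so even though the two editions are not fully harmonized, the matching results should not be noticeably affected.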
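To make the role of the three-digit industry prefix concrete, here is a minimal sketch of the PART3 key construction on two made-up records. All values are hypothetical and are not taken from the database; the variable names dq, nic and phone are unsuffixed only to keep the example short. The two rows stand for the same firm in two different years: one with a six-digit region code and a bare phone number, the other with a twelve-digit region code, the area code fused into the phone number, and a four-digit industry code that differs in its last digit.

```stata
* Toy data: hypothetical values, for illustration only
clear
input str12 dq str4 nic str11 phone
"110105"       "1721" "64281234"
"110105001002" "1722" "01064281234"
end

* Same key construction as in PART3:
* first 4 digits of region code + first 3 digits of industry code + last 7 digits of phone
gen match_phone = substr(dq,1,4) + substr(nic,1,3) + substr(phone,-7,7)
list dq nic phone match_phone

* Both rows yield the same key, so they would be linked across years
assert match_phone == "11011724281234"
```

Because only the first three digits of nic enter the key, a change in the fourth digit leaves the key unchanged; a change within the first three digits would still break the match, which is why harmonizing the two classification editions remains the more rigorous option. Note also that within a single year, records sharing a key are dropped by the `bysort match_phone : drop if _N>1` step rather than matched.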