用SQL的方式表达下吧(在R中作为示例)
# 户主及配偶&男
t1 <- sqldf(“SELECT hhid, edu AS edu1 FROM tbl WHERE pline IN (1 , 2) AND gender = 0”)
# 户主及配偶&女
t2 <- sqldf(“SELECT hhid, edu AS edu2 FROM tbl WHERE pline IN (1 , 2) AND gender = 1”)
# 户主之子
t6M <- sqldf(“SELECT hhid, edu AS edu6m FROM tbl WHERE pline =6 AND gender = 0”)
# 户主之女
t6F <- sqldf(“SELECT hhid, edu AS edu6f FROM tbl WHERE pline =6 AND gender = 1”)
# 户主之媳
t7F <- sqldf(“SELECT hhid, edu AS edu7f FROM tbl WHERE pline =7 AND gender = 1”)
# 户主之婿
t7M <- sqldf(“SELECT hhid, edu AS edu7m FROM tbl WHERE pline =7 AND gender = 0”)
# 户主之孙子女
t8 <- sqldf(“SELECT hhid, edu AS edu8 FROM tbl WHERE pline =8”)
# 数据合并
dat <- merge(t1, t2, by = 'hhid', all.x = T)
dat <- merge(dat, t6M, by = 'hhid', all.x = T)
dat <- merge(dat, t6F, by = 'hhid', all.x = T)
dat <- merge(dat, t7M, by = 'hhid', all.x = T)
dat <- merge(dat, t7F, by = 'hhid', all.x = T)
dat <- merge(dat, t8, by = 'hhid', all.x = T)
这种方式按照家庭,把人口结构给体现出来了(同一个家庭中,子女的关系是不唯一的,6/7标签的,对应标签为8,无法完全准确判断是否子女关系啊,和人口普查统计的数据格式还是有一些差别的吧)
如上的这种数据格式,可能便于分析。
不用SQL表达,直接用R的方式修改如下:
t1 <- subset(tbl, (pline %in% c(1, 2)) & (gender = 0), select = c(hhid, edu))
names(t1) <- c('hhid', 'edu1')
或者用tidyverse包来搞
t1 <- (tbl %>% filter(pline %in% c(1, 2), gender = 0) %>% select(hhid, edu))
names(t1) <- c('hhid', 'edu1')
其他类似
############################################################
如果要把两代家庭,和三代家庭的分开,也可以用拆分数据的方式得到
比如两代和三代的分类
tftyp2 <- sqldf("SELECT hhid, MAX(pline) AS ftype FROM tbl GROUP BY hhid HAVING MAX(pline) IN (6, 7)")
tftyp3 <- sqldf("SELECT hhid, MAX(pline) AS ftype FROM tbl GROUP BY hhid HAVING MAX(pline) = 8")
基本思路是不管咋折腾,围绕hhid来弄,用pline识别结构(如果设计数据的时候,性别包含在pline中可能会更加方便点,是不是为了体现男女平等,所以没明确规定吧)
|