1. 研究问题
葡萄牙某银行拟根据现有客户资料建立预测模型,以配合其数据库营销策略,营销方式为电话直销,销售产品为某金融产品(term deposit),数据分析的目标为通过预测模型识别对该金融产品有较高购买意愿的用户群。
2. 数据来源
网上公开数据,数据下载链接为(zip包,包含数据字段说明): http://archive.ics.uci.edu/ml/ma ... ases/00222/bank.zip
3. 数据载入,整理
a) 数据载入
- bank <- read.csv("bank-full.csv", sep = ";", header = T)
复制代码
b) 描述性统计分析 对数据进行summary分析,了解每个字段的分布。
- summary(bank)
复制代码
- ## age job marital education
- ## Min. :18.0 blue-collar:9732 divorced: 5207 primary : 6851
- ## 1st Qu.:33.0 management :9458 married :27214 secondary:23202
- ## Median :39.0 technician :7597 single :12790 tertiary :13301
- ## Mean :40.9 admin. :5171 unknown : 1857
- ## 3rd Qu.:48.0 services :4154
- ## Max. :95.0 retired :2264
- ## (Other) :6835
- ## default balance housing loan contact
- ## no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
- ## yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
- ## Median : 448 unknown :13020
- ## Mean : 1362
- ## 3rd Qu.: 1428
- ## Max. :102127
- ##
- ## day month duration campaign
- ## Min. : 1.0 may :13766 Min. : 0 Min. : 1.00
- ## 1st Qu.: 8.0 jul : 6895 1st Qu.: 103 1st Qu.: 1.00
- ## Median :16.0 aug : 6247 Median : 180 Median : 2.00
- ## Mean :15.8 jun : 5341 Mean : 258 Mean : 2.76
- ## 3rd Qu.:21.0 nov : 3970 3rd Qu.: 319 3rd Qu.: 3.00
- ## Max. :31.0 apr : 2932 Max. :4918 Max. :63.00
- ## (Other): 6060
- ## pdays previous poutcome y
- ## Min. : -1.0 Min. : 0.00 failure: 4901 no :39922
- ## 1st Qu.: -1.0 1st Qu.: 0.00 other : 1840 yes: 5289
- ## Median : -1.0 Median : 0.00 success: 1511
- ## Mean : 40.2 Mean : 0.58 unknown:36959
- ## 3rd Qu.: -1.0 3rd Qu.: 0.00
- ## Max. :871.0 Max. :275.00
- ##
复制代码
4. 运用决策树模型对数据做初步分类建模和变量选择
- require(rpart)
- require(caret)
- require(ggplot2)
- require(gplots)
- bank.tree <- rpart(y ~ ., data = bank, method = "class", cp = 0.001)
- treeImp <- varImp(bank.tree, scale = TRUE, surrogates = FALSE, competes = TRUE)
- treeImp$Variable <- rownames(treeImp)
- treeImp.sort <- treeImp[order(-treeImp$Overall), ]
- ggplot(treeImp, aes(Variable, Overall)) + geom_bar(stat = "identity") + coord_flip()
复制代码
根据cp plot对树做裁剪
- plotcp(bank.tree)
复制代码
- printcp(bank.tree)
- ##
- ## Classification tree:
- ## rpart(formula = y ~ ., data = bank, method = "class", cp = 0.001)
- ##
- ## Variables actually used in tree construction:
- ## [1] age balance contact day duration education housing
- ## [8] job marital month pdays poutcome previous
- ##
- ## Root node error: 5289/45211 = 0.12
- ##
- ## n= 45211
- ##
- ## CP nsplit rel error xerror xstd
- ## 1 0.0380 0 1.00 1.00 0.013
- ## 2 0.0253 3 0.89 0.89 0.012
- ## 3 0.0170 4 0.86 0.86 0.012
- ## 4 0.0080 5 0.84 0.85 0.012
- ## 5 0.0042 7 0.83 0.84 0.012
- ## 6 0.0040 10 0.81 0.84 0.012
- ## 7 0.0034 13 0.80 0.84 0.012
- ## 8 0.0022 15 0.80 0.82 0.012
- ## 9 0.0020 21 0.78 0.82 0.012
- ## 10 0.0018 24 0.78 0.82 0.012
- ## 11 0.0016 26 0.77 0.81 0.012
- ## 12 0.0015 30 0.77 0.81 0.012
- ## 13 0.0014 32 0.76 0.81 0.012
- ## 14 0.0013 40 0.75 0.81 0.012
- ## 15 0.0012 44 0.75 0.81 0.012
- ## 16 0.0011 51 0.74 0.81 0.012
- ## 17 0.0010 61 0.72 0.81 0.012
- ## 18 0.0010 63 0.72 0.81 0.012
复制代码
- bank.tree <- rpart(y ~ ., data = bank, method = "class", cp = 0.0022373)
- plot(bank.tree, branch = 0, margin = 0.1, uniform = T)
- text(bank.tree, use.n = T, col = "red", cex = 0.6)
复制代码
5. 变量初选,分析和变换
根据决策树分析的结果,我们选择变量重要性最高的前5个变量做进一步研究,依次是:
- Duration : last contact duration, in seconds (numeric)
- month : last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
- poutcome : outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")
- pdays : number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
- previous : number of contacts performed before this campaign and for this client (numeric)
a) Duration
- bank$y_dummy = ifelse(bank$y == "yes", 1, 0)
- summary(bank$duration)
- ## Min. 1st Qu. Median Mean 3rd Qu. Max.
- ## 0 103 180 258 319 4920
- ggplot(bank, aes(duration, y_dummy)) + geom_smooth() + geom_point()
- ## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
复制代码
根据拟合线的形态,需要对duration做一个二次项。- bank$duration.sq <- bank$duration * bank$duration
复制代码
b) month
- summary(bank$month)
- ## apr aug dec feb jan jul jun mar may nov oct sep
- ## 2932 6247 214 2649 1403 6895 5341 477 13766 3970 738 579
- plotMeans(bank$y_dummy, bank$month, error.bars = "se")
复制代码
- bank$month.sel <- ifelse(bank$month == "dec", 1, 0)
- bank$month.sel <- ifelse(bank$month == "mar", 1, bank$month)
- bank$month.sel <- ifelse(bank$month == "oct", 1, bank$month)
- bank$month.sel <- ifelse(bank$month == "sep", 1, bank$month)
复制代码
c) poutcome
- summary(bank$poutcome)
- ## failure other success unknown
- ## 4901 1840 1511 36959
- plotMeans(bank$y_dummy, bank$poutcome, error.bars = "se")
复制代码
- bank$poutcome.success <- ifelse(bank$poutcome == "success", 1, 0)
复制代码
d) pdays
- summary(bank$pdays)
- ## Min. 1st Qu. Median Mean 3rd Qu. Max.
- ## -1.0 -1.0 -1.0 40.2 -1.0 871.0
- bank$nocontact <- ifelse(bank$pdays == -1, 1, 0)
- bank$pdays <- ifelse(bank$pdays == -1, 0, bank$pdays)
- summary(bank$nocontact)
- ## Min. 1st Qu. Median Mean 3rd Qu. Max.
- ## 0.000 1.000 1.000 0.817 1.000 1.000
- plotMeans(bank$y_dummy, as.factor(bank$nocontact), error.bars = "se")
复制代码
- ggplot(bank, aes(log(pdays + 1))) + geom_histogram()
复制代码
- ggplot(bank, aes(log(pdays + 1), y_dummy)) + geom_smooth() + geom_point()
复制代码
e) previous
- summary(bank$previous)
- ## Min. 1st Qu. Median Mean 3rd Qu. Max.
- ## 0.00 0.00 0.00 0.58 0.00 275.00
- ggplot(bank, aes(log(previous + 1))) + geom_histogram()
复制代码
- ggplot(bank, aes(log(previous + 1), y_dummy)) + geom_smooth() + geom_point()
复制代码
- bank$previous.0 <- as.factor(ifelse(bank$previous == 0, 1, 0))
- bank$previous.1 <- as.factor(ifelse(bank$previous == 1, 1, 0))
- bank$previous.2 <- as.factor(ifelse(bank$previous == 2, 1, 0))
- bank$previous.2plus <- as.factor(ifelse(bank$previous > 2, 1, 0))
复制代码
6. 逻辑回归建模
- logistic.full <- glm(y_dummy ~ duration + duration.sq + month.sel + poutcome.success +
- bank$nocontact + log(pdays + 1) + bank$previous.0 + bank$previous.1 + bank$previous.2 +
- bank$previous.2plus, data = bank)
- summary(logistic.full)
复制代码
- ##
- ## Call:
- ## glm(formula = y_dummy ~ duration + duration.sq + month.sel +
- ## poutcome.success + bank$nocontact + log(pdays + 1) + bank$previous.0 +
- ## bank$previous.1 + bank$previous.2 + bank$previous.2plus,
- ## data = bank)
- ##
- ## Deviance Residuals:
- ## Min 1Q Median 3Q Max
- ## -1.1567 -0.1148 -0.0418 0.0131 1.0833
- ##
- ## Coefficients: (2 not defined because of singularities)
- ## Estimate Std. Error t value Pr(>|t|)
- ## (Intercept) 1.58e-01 2.35e-02 6.74 1.6e-11 ***
- ## duration 6.57e-04 9.61e-06 68.44 < 2e-16 ***
- ## duration.sq -1.35e-07 6.15e-09 -21.97 < 2e-16 ***
- ## month.sel -6.72e-03 4.35e-04 -15.46 < 2e-16 ***
- ## poutcome.success 4.55e-01 8.08e-03 56.34 < 2e-16 ***
- ## bank$nocontact -1.75e-01 2.34e-02 -7.49 7.0e-14 ***
- ## log(pdays + 1) -2.11e-02 4.33e-03 -4.86 1.2e-06 ***
- ## bank$previous.01 NA NA NA NA
- ## bank$previous.11 -2.52e-02 7.13e-03 -3.54 0.0004 ***
- ## bank$previous.21 -1.64e-02 7.71e-03 -2.13 0.0334 *
- ## bank$previous.2plus1 NA NA NA NA
- ## ---
- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- ##
- ## (Dispersion parameter for gaussian family taken to be 0.07691)
- ##
- ## Null deviance: 4670.3 on 45210 degrees of freedom
- ## Residual deviance: 3476.3 on 45202 degrees of freedom
- ## AIC: 12340
- ##
- ## Number of Fisher Scoring iterations: 2
复制代码
- logistic.step <- step(logistic.full, direction = "both", k = 2)
复制代码
- ## Start: AIC=12340
- ## y_dummy ~ duration + duration.sq + month.sel + poutcome.success +
- ## bank$nocontact + log(pdays + 1) + bank$previous.0 + bank$previous.1 +
- ## bank$previous.2 + bank$previous.2plus
- ##
- ##
- ## Step: AIC=12340
- ## y_dummy ~ duration + duration.sq + month.sel + poutcome.success +
- ## bank$nocontact + log(pdays + 1) + bank$previous.0 + bank$previous.1 +
- ## bank$previous.2
- ##
- ##
- ## Step: AIC=12340
- ## y_dummy ~ duration + duration.sq + month.sel + poutcome.success +
- ## bank$nocontact + log(pdays + 1) + bank$previous.1 + bank$previous.2
- ##
- ## Df Deviance AIC
- ## <none> 3476 12340
- ## - bank$previous.2 1 3477 12343
- ## - bank$previous.1 1 3477 12351
- ## - log(pdays + 1) 1 3478 12362
- ## - bank$nocontact 1 3481 12395
- ## - month.sel 1 3495 12577
- ## - duration.sq 1 3513 12819
- ## - poutcome.success 1 3720 15407
- ## - duration 1 3837 16797
复制代码
- summary(logistic.step)
- ##
- ## Call:
- ## glm(formula = y_dummy ~ duration + duration.sq + month.sel +
- ## poutcome.success + bank$nocontact + log(pdays + 1) + bank$previous.1 +
- ## bank$previous.2, data = bank)
- ##
- ## Deviance Residuals:
- ## Min 1Q Median 3Q Max
- ## -1.1567 -0.1148 -0.0418 0.0131 1.0833
- ##
- ## Coefficients:
- ## Estimate Std. Error t value Pr(>|t|)
- ## (Intercept) 1.58e-01 2.35e-02 6.74 1.6e-11 ***
- ## duration 6.57e-04 9.61e-06 68.44 < 2e-16 ***
- ## duration.sq -1.35e-07 6.15e-09 -21.97 < 2e-16 ***
- ## month.sel -6.72e-03 4.35e-04 -15.46 < 2e-16 ***
- ## poutcome.success 4.55e-01 8.08e-03 56.34 < 2e-16 ***
- ## bank$nocontact -1.75e-01 2.34e-02 -7.49 7.0e-14 ***
- ## log(pdays + 1) -2.11e-02 4.33e-03 -4.86 1.2e-06 ***
- ## bank$previous.11 -2.52e-02 7.13e-03 -3.54 0.0004 ***
- ## bank$previous.21 -1.64e-02 7.71e-03 -2.13 0.0334 *
- ## ---
- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- ##
- ## (Dispersion parameter for gaussian family taken to be 0.07691)
- ##
- ## Null deviance: 4670.3 on 45210 degrees of freedom
- ## Residual deviance: 3476.3 on 45202 degrees of freedom
- ## AIC: 12340
- ##
- ## Number of Fisher Scoring iterations: 2
复制代码
7 模型scoring和ROC评估
- require(ROCR)
- bank.pred<-1/(1+exp(-predict(logistic.step)))
- roc.data <- prediction(bank.pred, labels = bank$y)
- roc.data <- performance(roc.data, "tpr", "fpr")
- plot(roc.data)
复制代码
score的分布为- score<-data.frame("prob.y"=bank.pred,"y"=as.factor(bank$y_dummy))
- ggplot(score, aes(x=prob.y, fill=y)) + geom_histogram(position="identity", binwidth=0.01,alpha=0.5)
复制代码
通过对ROC和Score分布的分析,逻辑回归Score的分类效果还是不错的。具体的score cutoff值需要根据业务要求和营销成本而定。


雷达卡




京公网安备 11010802022788号







