[Other] [Exclusive release] spambase data: a dataset for classification
[Figure: screenshot of the data after it has been read in]
The last variable is the class label, coded as 1 = spam and 0 = email (non-spam).
The problem to solve: using the first 57 variables, classify each record as spam or email. How high an accuracy can we reach?
The code and a walkthrough follow:
Load the packages:
### Example: linear discriminant analysis (LDA) on the spambase data
#### Load packages
if (!require(MASS)) {
  install.packages("MASS")   # note: install.packages(), not installed.packages()
  library(MASS)
}
Read the data and fit the LDA model:
### Load the dataset (replace the path with the location of spambase.data on your machine)
data.class <- read.table('文件路径/spambase.data', sep = ',')
lda.fit <- lda(V58 ~ ., data = data.class)   # V58 (spam/email) is the class to predict
summary(data.class)                          # descriptive statistics for all 58 variables
plot(lda.fit)                                # histograms of the discriminant scores by class
The summary(data.class) call above gives descriptive statistics for the data just read in; its output is:
V1 V2 V3 V4 V5 V6
Min. :0.0000 Min. : 0.000 Min. :0.0000 Min. : 0.00000 Min. : 0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.:0.0000
Median :0.0000 Median : 0.000 Median :0.0000 Median : 0.00000 Median : 0.0000 Median :0.0000
Mean :0.1046 Mean : 0.213 Mean :0.2807 Mean : 0.06542 Mean : 0.3122 Mean :0.0959
3rd Qu.:0.0000 3rd Qu.: 0.000 3rd Qu.:0.4200 3rd Qu.: 0.00000 3rd Qu.: 0.3800 3rd Qu.:0.0000
Max. :4.5400 Max. :14.280 Max. :5.1000 Max. :42.81000 Max. :10.0000 Max. :5.8800
V7 V8 V9 V10 V11 V12
Min. :0.0000 Min. : 0.0000 Min. :0.00000 Min. : 0.0000 Min. :0.00000 Min. :0.0000
1st Qu.:0.0000 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.:0.0000
Median :0.0000 Median : 0.0000 Median :0.00000 Median : 0.0000 Median :0.00000 Median :0.1000
Mean :0.1142 Mean : 0.1053 Mean :0.09007 Mean : 0.2394 Mean :0.05982 Mean :0.5417
3rd Qu.:0.0000 3rd Qu.: 0.0000 3rd Qu.:0.00000 3rd Qu.: 0.1600 3rd Qu.:0.00000 3rd Qu.:0.8000
Max. :7.2700 Max. :11.1100 Max. :5.26000 Max. :18.1800 Max. :2.61000 Max. :9.6700
V13 V14 V15 V16 V17 V18
Min. :0.00000 Min. : 0.00000 Min. :0.0000 Min. : 0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.00000 1st Qu.: 0.00000 1st Qu.:0.0000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.00000 Median : 0.00000 Median :0.0000 Median : 0.0000 Median :0.0000 Median :0.0000
Mean :0.09393 Mean : 0.05863 Mean :0.0492 Mean : 0.2488 Mean :0.1426 Mean :0.1847
3rd Qu.:0.00000 3rd Qu.: 0.00000 3rd Qu.:0.0000 3rd Qu.: 0.1000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :5.55000 Max. :10.00000 Max. :4.4100 Max. :20.0000 Max. :7.1400 Max. :9.0900
V19 V20 V21 V22 V23 V24
Min. : 0.000 Min. : 0.00000 Min. : 0.0000 Min. : 0.0000 Min. :0.0000 Min. : 0.00000
1st Qu.: 0.000 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.: 0.00000
Median : 1.310 Median : 0.00000 Median : 0.2200 Median : 0.0000 Median :0.0000 Median : 0.00000
Mean : 1.662 Mean : 0.08558 Mean : 0.8098 Mean : 0.1212 Mean :0.1016 Mean : 0.09427
3rd Qu.: 2.640 3rd Qu.: 0.00000 3rd Qu.: 1.2700 3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.: 0.00000
Max. :18.750 Max. :18.18000 Max. :11.1100 Max. :17.1000 Max. :5.4500 Max. :12.50000
V25 V26 V27 V28 V29 V30
Min. : 0.0000 Min. : 0.0000 Min. : 0.0000 Min. :0.0000 Min. : 0.00000 Min. :0.0000
1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.: 0.00000 1st Qu.:0.0000
Median : 0.0000 Median : 0.0000 Median : 0.0000 Median :0.0000 Median : 0.00000 Median :0.0000
Mean : 0.5495 Mean : 0.2654 Mean : 0.7673 Mean :0.1248 Mean : 0.09892 Mean :0.1029
3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.: 0.00000 3rd Qu.:0.0000
Max. :20.8300 Max. :16.6600 Max. :33.3300 Max. :9.0900 Max. :14.28000 Max. :5.8800
V31 V32 V33 V34 V35 V36
Min. : 0.00000 Min. :0.00000 Min. : 0.00000 Min. :0.00000 Min. : 0.0000 Min. :0.00000
1st Qu.: 0.00000 1st Qu.:0.00000 1st Qu.: 0.00000 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.:0.00000
Median : 0.00000 Median :0.00000 Median : 0.00000 Median :0.00000 Median : 0.0000 Median :0.00000
Mean : 0.06475 Mean :0.04705 Mean : 0.09723 Mean :0.04784 Mean : 0.1054 Mean :0.09748
3rd Qu.: 0.00000 3rd Qu.:0.00000 3rd Qu.: 0.00000 3rd Qu.:0.00000 3rd Qu.: 0.0000 3rd Qu.:0.00000
Max. :12.50000 Max. :4.76000 Max. :18.18000 Max. :4.76000 Max. :20.0000 Max. :7.69000
V37 V38 V39 V40 V41 V42
Min. :0.000 Min. :0.0000 Min. : 0.00000 Min. :0.00000 Min. :0.00000 Min. : 0.0000
1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.: 0.0000
Median :0.000 Median :0.0000 Median : 0.00000 Median :0.00000 Median :0.00000 Median : 0.0000
Mean :0.137 Mean :0.0132 Mean : 0.07863 Mean :0.06483 Mean :0.04367 Mean : 0.1323
3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.: 0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.: 0.0000
Max. :6.890 Max. :8.3300 Max. :11.11000 Max. :4.76000 Max. :7.14000 Max. :14.2800
V43 V44 V45 V46 V47 V48
Min. :0.0000 Min. : 0.0000 Min. : 0.0000 Min. : 0.0000 Min. :0.000000 Min. : 0.00000
1st Qu.:0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.000000 1st Qu.: 0.00000
Median :0.0000 Median : 0.0000 Median : 0.0000 Median : 0.0000 Median :0.000000 Median : 0.00000
Mean :0.0461 Mean : 0.0792 Mean : 0.3012 Mean : 0.1798 Mean :0.005444 Mean : 0.03187
3rd Qu.:0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.1100 3rd Qu.: 0.0000 3rd Qu.:0.000000 3rd Qu.: 0.00000
Max. :3.5700 Max. :20.0000 Max. :21.4200 Max. :22.0500 Max. :2.170000 Max. :10.00000
V49 V50 V51 V52 V53 V54
Min. :0.00000 Min. :0.000 Min. :0.00000 Min. : 0.0000 Min. :0.00000 Min. : 0.00000
1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.: 0.00000
Median :0.00000 Median :0.065 Median :0.00000 Median : 0.0000 Median :0.00000 Median : 0.00000
Mean :0.03857 Mean :0.139 Mean :0.01698 Mean : 0.2691 Mean :0.07581 Mean : 0.04424
3rd Qu.:0.00000 3rd Qu.:0.188 3rd Qu.:0.00000 3rd Qu.: 0.3150 3rd Qu.:0.05200 3rd Qu.: 0.00000
Max. :4.38500 Max. :9.752 Max. :4.08100 Max. :32.4780 Max. :6.00300 Max. :19.82900
V55 V56 V57 V58
Min. : 1.000 Min. : 1.00 Min. : 1.0 Min. :0.000
1st Qu.: 1.588 1st Qu.: 6.00 1st Qu.: 35.0 1st Qu.:0.000
Median : 2.276 Median : 15.00 Median : 95.0 Median :0.000
Mean : 5.191 Mean : 52.17 Mean : 283.3 Mean :0.394
3rd Qu.: 3.706 3rd Qu.: 43.00 3rd Qu.: 266.0 3rd Qu.:1.000
Max. :1102.500 Max. :9989.00 Max. :15841.0 Max. :1.000
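Before predicting, it can also help to look at the fitted model itself. A minimal sketch of what to inspect (these are simply components that MASS::lda stores in the fitted object):
lda.fit$prior           # estimated class proportions for email (0) and spam (1)
lda.fit$means[, 1:5]    # per-class means of the first few predictors
head(lda.fit$scaling)   # coefficients of the single linear discriminant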
Note: the analysis below is not rigorous; it is only meant as a case showing how one might work through this dataset.
### Take a slice of the original dataset to use as test data (in practice the data should be split at the very start into training, test, and validation sets; for simplicity that step is skipped here)
data.test1 <- data.class[10:200, ]      # a block of rows from the spam part of the file
data.test2 <- data.class[4500:4600, ]   # a block of rows from the email part of the file
data.test <- rbind(data.test1, data.test2)
real <- data.test[, 58]                 # true class labels of the test rows
The statements above take one block of rows from the true spam records and one from the true email records to serve as test data; for simplicity no random sampling is done. The last line stores the real spam/email class of each test record.
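If you prefer a proper random split instead of fixed row ranges, here is a minimal sketch; the 70/30 ratio, the seed, and the names data.train and data.test.r are illustrative choices, not part of the original analysis:
set.seed(2023)                                   # any fixed seed, for reproducibility
n <- nrow(data.class)
train.idx <- sample(seq_len(n), size = round(0.7 * n))
data.train  <- data.class[train.idx, ]           # would be used to refit lda()
data.test.r <- data.class[-train.idx, ]          # held-out test set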
Next, the fitted model is used to predict on the test data:
data.test.t <- data.test[, 1:57]            # drop the class column, keep the 57 predictors
lda.pred <- predict(lda.fit, data.test.t)   # returns class, posterior probabilities, and scores
The statement below puts the predicted class and the true class side by side:
### combine the predictions and the real labels
re.fal <- data.frame(lda.pred$class, real)
Compute the accuracy:
#### compute the accuracy
names(lda.pred)              # components returned by predict(): class, posterior, x
lda.class <- lda.pred$class
table(lda.class, real)       # confusion matrix of predicted vs. real class
mean(lda.class == real)      # proportion classified correctly
Output of the statements above:
> names(lda.pred)
[1] "class"     "posterior" "x"
> lda.class <- lda.pred$class
> table(lda.class,real)
         real
lda.class   0   1
        0 100  47
        1   1 144
> mean(lda.class==real)
[1] 0.8356164
So the overall accuracy of the classification is 0.8356164.
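The predict() result also carries posterior probabilities (the "posterior" component listed above), so the 0/1 decision can be reproduced, or the cut-off changed, by thresholding them directly. A minimal sketch, assuming the default 0.5 cut-off; the column names "0" and "1" come from the factor levels of V58, and pred.thr is just an illustrative name:
head(lda.pred$posterior)                                  # P(email) and P(spam) for each test row
pred.thr <- ifelse(lda.pred$posterior[, "1"] > 0.5, 1, 0)
mean(pred.thr == real)                                    # should match the accuracy above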
The results above are shown only as 0/1. To see spam and email directly, we map 0 and 1 back to the labels they stand for:
#### add the labels
lda.class1 <- factor(lda.pred$class, levels = c(1, 0), labels = c('spam', 'email'))
real1 <- factor(real, levels = c(1, 0), labels = c('spam', 'email'))
table(lda.class1, real1)
mean(lda.class1 == real1)
This gives the following output:
          real1
lda.class1 spam email
     spam   144     1
     email   47   100
> mean(lda.class1==real1)
[1] 0.8356164
So the classification accuracy is roughly 83.6%.
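The labelled confusion table carries more information than the single accuracy number. A small sketch of reading the per-class rates out of it (the name conf is introduced here only for illustration):
conf <- table(lda.class1, real1)
conf["spam", "spam"] / sum(conf[, "spam"])     # share of real spam that was caught
conf["email", "email"] / sum(conf[, "email"])  # share of real email correctly kept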
A few points to note:
1. The data were not properly partitioned. The split used in this case is not rigorous, and since this is only a demonstration we did not go into how it should really be done.
2. No cross-validation was performed. The amount of data analysed here is fairly large, but in general the analysis should be cross-validated; see the sketch below.
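As one possible way to address point 2, MASS::lda itself supports leave-one-out cross-validation through its CV argument. A minimal sketch (this refits the model on the full data and is separate from lda.fit above; lda.cv is an illustrative name):
lda.cv <- lda(V58 ~ ., data = data.class, CV = TRUE)   # leave-one-out class predictions
mean(lda.cv$class == data.class$V58)                   # LOO-CV accuracy estimate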


