无效值或缺失值处理:常用的处理方法有:估算,整例删除,变量删除和成对删除
1、缺失值检测及处理
(1)is.na():缺失值检测; table(is.na()):分类统计;sum(is.na()):统计缺失值个数;mean(is.na()):统计缺失值个数占比;(2)na.rm=TRUE:分析中排出缺失值;
(3)na.omit():删除含缺失值所在行;
(4)异常值重新编码为缺失值;
code及示例:(1)
>is.na(leadership)
manager date country gender age q1 q2 q3 q4 q5
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> table(is.na(leadership))
FALSE TRUE
48 2
> sum(is.na(leadership))
[1] 2
> mean(is.na(leadership))
[1] 0.04
(2)
manager<-c(1,2,3,4,5)
date<-c("10/24/08","10/28/08","10/01/08","10/12/08","05/01/08")
country<-c("US","US","UK","UK","UK")
gender<-c("M","F","F","M","F")
age<-c(32,45,25,39,99)
q1<-c(5,3,3,3,2)
q2<-c(4,5,5,3,2)
q3<-c(5,2,5,4,1)
q4<-c(5,5,5,NA,2)
q5<-c(5,5,2,NA,1)
leadership<-data.frame(manager,date,country,gender,age,q1,q2,q3,q4,q5,str= FALSE)
> sum(leadership[,6:10],na.rm = TRUE)
[1] 82
(3)
>na.omit(leadership)
manager date country gender age q1 q2 q3 q4 q5
1 1 10/24/08 US M 32 5 4 5 5 5
2 2 10/28/08 US F 45 3 5 2 5 5
3 3 10/01/08 UK F 25 3 5 5 5 2
5 5 05/01/08 UK F 99 2 2 1 2 1
(4)
> leadership$age[leadership$age==99]<-NA
> leadership
manager date country gender age q1 q2 q3 q4 q5
1 1 10/24/08 US M 32 5 4 5 5 5
2 2 10/28/08 US F 45 3 5 2 5 5
3 3 10/01/08 UK F 25 3 5 5 5 2
4 4 10/12/08 UK M 39 3 3 4 NA NA
5 5 05/01/08 UK F NA 2 2 1 2 1
2、变量名类型判断及格式转换(1)
(2)日期格式、运算及其与字符转换方法
注:日期值的默认格式:yyyy-mm-dd。Sys.Date():返回当天的日期。date:返回当前的日期和时间;format(x,format="output_format"):输出指定格式的日期值;
difftime(time1, time2, tz,units = c("auto", "secs", "mins", "hours", "days", "weeks")):计算time1-time2的时间间隔;
as.date(x,日期格式):转换时间格式;
(1)日期格式转换
> b<-format(leadership[,2],format="%b %d %Y")
> b
[1] "10/24/08" "10/28/08" "10/01/08" "10/12/08" "05/01/08"
> b<-format(leadership[,2],format="%b %d %y")
> b
[1] "10/24/08" "10/28/08" "10/01/08" "10/12/08" "05/01/08"
> a<-as.Date(leadership[,2],"%m/%d/%y")
> a
[1] "2008-10-24" "2008-10-28" "2008-10-01" "2008-10-12" "2008-05-01"
> today<-Sys.Date()
> c<-difftime(today,as.Date(leadership[1,2],"%b/%d/%y"),units = "day") /*两个时间的格式需一致(用as.Date()来转换)*/
> c
Time difference of 3520 days
3、变量的重命名、及重编码
names():变量重命名;
variable[condition]<-expression:在condition的值为TRUE时之行赋值;
within(data,expr,……):变量重编码,其中data为数据组、expr:为分类表达式;
code及示例
(1)
> names(leadership)
[1] "manager" "date" "country" "gender" "age" "q1" "q2"
[8] "q3" "q4" "q5"
> names(leadership)[2]<-"testdate"
> leadership
manager testdate country gender age q1 q2 q3 q4 q5
1 1 10/24/08 US M 32 5 4 5 5 5
2 2 10/28/08 US F 45 3 5 2 5 5
3 3 10/01/08 UK F 25 3 5 5 5 2
4 4 10/12/08 UK M 39 3 3 4 NA NA
5 5 05/01/08 UK F 99 2 2 1 2 1
> names(leadership)[6:10]<-c("item1","item2","item3","item4","item5")
> leadership
manager testdate country gender age item1 item2 item3 item4 item5
1 1 10/24/08 US M 32 5 4 5 5 5
2 2 10/28/08 US F 45 3 5 2 5 5
3 3 10/01/08 UK F 25 3 5 5 5 2
4 4 10/12/08 UK M 39 3 3 4 NA NA
5 5 05/01/08 UK F 99 2 2 1 2 1
(2)
name<-c("john davis","angels williams","bullwinkle moose",
"david jones","janice markhammer","cheryl cushing","reuven ytzrhak",
"greg knox","hoel england","mary rayburn")
mscore<-c(502,600,412,358,495,512,410,625,573,522)
sscore<-c(95,99,80,82,75,85,80,95,89,86)
escore<-c(25,22,18,15,20,28,15,30,27,18)
students_score<-data.frame(name,mscore,sscore,escore)
score_function<-function(students_score,i=length(students_score)) {
min=0
max=0
while (is.numeric(students_score[,i])) {
if (i>max) max<-i else min<-i
; i<- i-1}
z<-scale(students_score[,min:max])
score<-apply(z,1,mean)
students_score<-cbind(students_score,score)
y<-quantile(students_score$score,c(0.2,0.4,0.6,0.8))
students_score<-within(students_score,{
grade <- NA
grade[score<y[1]] <- "F"
grade[score<y[2] & score>=y[1]] <- "D"
grade[score<y[3] & score>=y[2]] <- "C"
grade[score<y[4] & score>=y[3]] <- "B"
grade[score>=y[4]] <- "A"})
return(students_score)}
score_function(students_score)
4、数据的合并及子集的提取(1)添加列(即按某列唯一标识合并)
merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x",".y"), incomparables = NULL, ...)x,y:数据组
by = intersect(names(x), names(y)):用于连接两个数据集的列,intersect(a,b)值向量a,b的交集,names(x)指提取数据集x的列名
by.x,by.y:指定依据哪些行合并数据框,默认值为相同列名的列
all,all.x,all.y:指定x和y的行是否应该全在输出文件
sort:by指定的列(即公共列)是否要排序
suffixes:指定除by外相同列名的后缀incomparables:指定by中哪些单元不进行合并
cbind():按列整合数据组或矩阵。如cbind(A,B),其中A和B必须要有相同的行数,且顺序一致;
(2)添加行
rbind():两个具有相同变量的数据框或矩阵直接按行合并数据;
(3)子集提取
1、利用数据组下标提取,如leadership[,6:10]选取6-10列,
2、利用剔除法提取:leadership[c(-8,-9)]剔除第8和9列;
3、逻辑函数提取:如newdata <- leadership[which(leadership$gender == "M" & leadership$age > 30),]
4、subset提取,如:
newdata <- subset(leadership, age >= 35 | age < 24, select = c(q1, q2, q3, q4))
newdata <- subset(leadership, gender == "M" & age > 25, select = gender:q4)
5、sample():随机抽样,如:
leadership 数据集中随机抽取一个大小为3的样本
mysample <- leadership[sample(1:nrow(leadership),3,replace = F),]
code及例子
(1)merge的inner连接
name <- c('A','B','A','A','C','D')
school <- c('s1','s2','s1','s1','s1','s3')
class <- c(10, 5, 4, 11, 1, 8)
English <- c(85, 50, 90 ,90, 12, 96)
w <- data.frame(name, school, class, English)
w
name <- c('A','B','C','F')
school <- c('s3','s2','s1','s2')
class <- c(5, 5, 1,3)
maths <- c(80,89,55,90)
English <- c(88, 89, 32, 89)
q <- data.frame(name, school, class, maths, English)
q
merge(w,q,by=intersect(names(w)[1],names(q)[1]))
merge(w,q,by='name')
merge(w,q,by.x = 'name')
运行结果
name school.x class.x English.x school.y class.y maths English.y1
A s1 10 85 s3 5 80 882
A s1 4 90 s3 5 80 883
A s1 11 90 s3 5 80 884
B s2 5 50 s2 5 89 895
C s1 1 12 s1 1 55 32
(2)merge的outer连接> merge(w, q, all=TRUE, sort=TRUE) name school class English maths1
A s1 4 90 NA2
A s1 10 85 NA3
A s1 11 90 NA4
A s3 5 88 805
B s2 5 50 NA6
B s2 5 89 897
C s1 1 12 NA8
C s1 1 32 559
D s3 8 96 NA10
F s2 3 89 9
(3)merge的left连接
> merge(w ,q ,all.x=TRUE,sort=TRUE)
name school class English maths
1 A s1 4 90 NA
2 A s1 10 85 NA
3 A s1 11 90 NA
4 B s2 5 50 NA
5 C s1 1 12 NA
6 D s3 8 96 NA
> merge(w, q, by = 'name',all.x = TRUE, sort = TRUE)
name school.x class.x English.x school.y class.y maths English.y
1 A s1 10 85 s3 5 80 88
2 A s1 4 90 s3 5 80 88
3 A s1 11 90 s3 5 80 88
4 B s2 5 50 s2 5 89 89
5 C s1 1 12 s1 1 55 32
6 D s3 8 96 <NA> NA NA NA
(4)merge的right连接
> merge(w ,q ,by = 'name', all.y=TRUE,sort=TRUE)
name school.x class.x English.x school.y class.y maths English.y
1 A s1 10 85 s3 5 80 88
2 A s1 4 90 s3 5 80 88
3 A s1 11 90 s3 5 80 88
4 B s2 5 50 s2 5 89 89
5 C s1 1 12 s1 1 55 32
6 F <NA> NA NA s2 3 90 89