To draw a word cloud in R, the first step is word segmentation. Two R packages can segment Chinese text, Rwordseg and jiebaR; a careful comparison shows that jiebaR does the better job and provides more utility functions. Here we use the text of Dream of the Red Chamber (红楼梦) as the example for text analysis. This post works through three tasks:
- Part 1: draw a word cloud for chapter 110 of Dream of the Red Chamber;
- Part 2: count the frequencies of prepositions and particles;
- Part 3: count the frequencies of specified words.
First, read the text into R with scan(), locate the chapter headings with a regular expression, and then split the text into chapters with sapply():
```r
rm(list = ls())
file.data <- scan("hongloumeng.txt", sep = "\n", what = "")
# locate the chapter headings ("第...回 ") with a regular expression
chapter <- grep(pattern = "第.+回 ", file.data)
# collect the lines of each chapter into a single string
txt <- sapply(seq_along(chapter), function(i) {
  if (i < length(chapter)) {
    paste(file.data[chapter[i]:(chapter[i + 1] - 1)], collapse = "")
  } else {
    paste(file.data[chapter[i]:length(file.data)], collapse = "")
  }
})
# chapter 110 is then simply txt[[110]]
```
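As a quick sanity check (a sketch, assuming the file contains the complete 120-chapter novel), we can confirm that the split found the expected number of chapter headings:

```r
# should report 120 for a complete copy of the novel
length(chapter)
# peek at the opening of chapter 110 to confirm the split is aligned
substr(txt[[110]], 1, 30)
```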
```r
library(jiebaR)

## PART 1: word cloud of content words
# the stop-word list is loaded from a plain-text file, one word per line
cutter <- worker(stop_word = "stop_word.txt")
```
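Note that jiebaR does not ship this stop-word list; `stop_word.txt` is a UTF-8 plain-text file you prepare yourself, with one stop word per line, for example:

```
的
了
是
我们
```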
```r
# add new words such as 贾宝玉 to the user dictionary
new_user_word(cutter, "贾宝玉")
# segment chapter 110
segwords <- segment(txt[[110]], cutter)
# keep only words longer than one character
segwords <- segwords[which(nchar(segwords) > 1)]
# drop specific unwanted words
segwords <- filter_segment(input = segwords, filter_words = "一个")
# build the word frequency table and sort it
fq <- freq(segwords)
fq <- fq[order(fq$freq, decreasing = TRUE), ]
fq[1:100, ]
fq <- fq[1:500, ]
library(wordcloud2)
wordcloud2(fq[1:300, ], size = 0.5, minSize = 0, shape = "star", ellipticity = 0.85)
```
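`wordcloud2()` returns an htmlwidget that renders in the RStudio viewer. To keep a standalone copy, one option is to write it out with the htmlwidgets package (a sketch; the file name is just an example):

```r
library(htmlwidgets)
# save the rendered cloud as a self-contained HTML page
wc <- wordcloud2(fq[1:300, ], size = 0.5, shape = "star", ellipticity = 0.85)
saveWidget(wc, "hlm_chapter110_cloud.html", selfcontained = TRUE)
```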
The code for Part 2 is as follows:
```r
## PART 2: frequency counts of function words
cutter2 <- worker("tag")
# the <= operator runs the tagger: values are words, names are POS tags
classification <- cutter2 <= txt[[110]]
# "u" marks particles and "p" marks prepositions; for the full tag set see
# https://wenku.baidu.com/view/a093f16ab84ae45c3b358c8c.html
mywords <- c("u", "p")
xuci <- classification[which(names(classification) %in% mywords)]
myfreq <- freq(xuci)
myfreq <- myfreq[order(-myfreq$freq), ]
myfreq
```

Printing `myfreq` yields a data frame with a `char` column (the word) and a `freq` column (its count), sorted in decreasing order of frequency.
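Because the tag worker stores each token's POS tag as the vector's names, it is also easy to inspect the chapter's overall tag distribution; a minimal sketch:

```r
# how many tokens received each POS tag in chapter 110
tag.counts <- table(names(classification))
sort(tag.counts, decreasing = TRUE)
```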
The code for Part 3 is as follows:
```r
## PART 3: count the specified function words "之", "其", and "或"
cutter3 <- worker("tag")
classification <- cutter3 <= txt[[110]]
# keep the tokens whose word (not tag) matches one of the targets
mywords2 <- classification[which(unname(classification) %in% c("之", "其", "或"))]
freq(mywords2)
# browse the full package documentation for more
help(package = "jiebaR")
```

which prints:

```
  char freq
1   或    3
2   之    6
3   其    4
```
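The same idea scales to the whole novel. For instance, here is a sketch (assuming `txt` holds all 120 chapters) that counts how often "之" appears in each chapter:

```r
# frequency of "之" per chapter, a simple stylometric profile
zhi.per.chapter <- sapply(seq_along(txt), function(i) {
  words <- cutter3 <= txt[[i]]
  sum(unname(words) == "之")
})
head(zhi.per.chapter)
```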
大珞珞, R language enthusiast
May 8, 2018