To make a word cloud in R, you first need to segment the text. Two R packages, Rwordseg and jiebaR, can do Chinese word segmentation; comparing the two, jiebaR works better and provides more utility functions. Here we use the text of Dream of the Red Chamber (红楼梦) as our example for text analysis. This article covers three tasks:
- Part 1: build a word cloud for Chapter 110 of Dream of the Red Chamber;
- Part 2: count the frequencies of prepositions and particles;
- Part 3: count the frequencies of specified words.
First, we read the text into R with `scan`, locate the chapter headings with a regular expression, and then split the text into chapters with `sapply`:
```r
rm(list = ls())
file.data <- scan("hongloumeng.txt", sep = "\n", what = "")
# Lines that open a chapter look like "第...回 ", so grep for that pattern
chapter <- grep(pattern = "第.+回 ", file.data)
# Collapse each chapter's lines into a single string
txt <- sapply(seq_along(chapter), function(i) {
  if (i < length(chapter)) {
    paste(file.data[chapter[i]:(chapter[i + 1] - 1)], collapse = "")
  } else {
    paste(file.data[chapter[i]:length(file.data)], collapse = "")
  }
})
```
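The splitting logic can be checked on a small synthetic vector before running it on the full novel. The three chapter headers below are real chapter titles, but the body lines are made up purely for illustration:

```r
# Toy corpus: three chapter headers, each followed by body lines
file.data <- c("第一回 甄士隐梦幻识通灵", "body 1a", "body 1b",
               "第二回 贾夫人仙逝扬州城", "body 2",
               "第三回 托内兄如海荐西宾", "body 3")
# Positions of the chapter-header lines
chapter <- grep(pattern = "第.+回 ", file.data)
# Each chapter spans from its header up to (but not including) the next header
txt <- sapply(seq_along(chapter), function(i) {
  if (i < length(chapter)) {
    paste(file.data[chapter[i]:(chapter[i + 1] - 1)], collapse = "")
  } else {
    paste(file.data[chapter[i]:length(file.data)], collapse = "")
  }
})
length(txt)  # one string per chapter
```

Note that the last chapter needs the `else` branch: there is no following header, so it runs to the end of the file.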
```r
#### Chapter 110 can now be referenced as txt[[110]]
library(jiebaR)
## PART 1: word cloud of content words
cutter <- worker(stop_word = "stop_word.txt")
# Add new words such as 贾宝玉 (Jia Baoyu) to the user dictionary
new_user_word(cutter, "贾宝玉")
# Segment the chapter into words
segwords <- segment(txt[[110]], cutter)
# Keep only words longer than one character
segwords <- segwords[nchar(segwords) > 1]
# Filter out uninformative words such as "一个" ("a/one")
segwords <- filter_segment(input = segwords, filter_words = "一个")
# Word frequencies, sorted in decreasing order
fq <- freq(segwords)
fq <- fq[order(fq$freq, decreasing = TRUE), ]
fq[1:100, ]
fq <- fq[1:500, ]
library(wordcloud2)
wordcloud2(fq[1:300, ], size = 0.5, minSize = 0, shape = "star", ellipticity = 0.85)
```
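The frequency-counting step can also be reproduced in base R with `table()`, which gives roughly the same `char`/`freq` data frame that `jiebaR::freq` returns. The segmented words below are a made-up sample, not actual output from Chapter 110:

```r
# Hypothetical segmentation output, for illustration only
segwords <- c("宝玉", "黛玉", "宝玉", "袭人", "宝玉", "黛玉")
# Count occurrences of each word
tab <- table(segwords)
# Rebuild the char/freq data frame and sort by decreasing frequency
fq <- data.frame(char = names(tab), freq = as.integer(tab),
                 stringsAsFactors = FALSE)
fq <- fq[order(fq$freq, decreasing = TRUE), ]
head(fq)  # most frequent word first
```

This is handy when you want to sanity-check the ranking, or when you only need counts for a handful of specified words and can subset `fq` by `char`.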