Intro to Text Analysis with R

oliyiyi posted on 2016-1-23 19:56:49


One of the most powerful aspects of using R is that you can download free packages for so many tools and types of analysis. Text analysis is still somewhat in its infancy, but is very promising. It is estimated that as much as 80% of the world’s data is unstructured, while most types of analysis only work with structured data. In this paper, we will explore the potential of R packages to analyze unstructured text.

Two R packages for working with unstructured text are tm and sentiment. tm can be installed in the usual way. Unfortunately, sentiment was archived in 2012 and is therefore more difficult to install. However, it can still be installed from the CRAN archive using the following method, according to Frank Wang (Wang).

install.packages("devtools")
require(devtools)
install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.1.tar.gz")
install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz")

The remaining required packages can be installed as follows.

install.packages("plyr")
install.packages("ggplot2")
install.packages("wordcloud")
install.packages("RColorBrewer")
install.packages("tm")
install.packages("SnowballC")

Once initially installed, each can be loaded later as library(name).
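
As a rough sketch of that loading step, the snippet below checks which of the packages named above are installed and attaches only those. The helper name is_installed is hypothetical and not part of the original post; it relies only on base R's requireNamespace() and library().

```r
# Package names taken from the install steps above
needed <- c("plyr", "ggplot2", "wordcloud", "RColorBrewer",
            "tm", "SnowballC", "sentiment")

# Hypothetical helper: returns TRUE for each package that can be loaded
is_installed <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

status <- is_installed(needed)

# Report anything that still needs to be installed
missing_pkgs <- names(status)[!status]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach only the packages that are available
for (pkg in names(status)[status]) {
  library(pkg, character.only = TRUE)
}
```

Checking first avoids a hard error from library() partway through a script when one package, most likely the archived sentiment package, is missing.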

The next step is to load the data. I chose to download comments from a newspaper vent line (Charleston Gazette-Mail). This data was saved to a text file, then loaded and processed as follows.

### Get the data
data <- readLines("./Data/Comments/vent.txt")
df <- data.frame(data, stringsAsFactors = FALSE)
# Pull the comments out as a plain character vector
textdata <- df$data
textdata = gsub("[[:punct:]]", "", textdata)

Next, we remove nonessential characters such as punctuation, numbers, and web addresses from the text before we begin processing the actual words themselves. The code that follows was partially adapted from Gaston Sanchez in his work with sentiment analysis of Twitter data (Sanchez).

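# Remove punctuation, digits, and web addresses
textdata = gsub("[[:punct:]]", "", textdata)
textdata = gsub("[[:digit:]]", "", textdata)
textdata = gsub("http\\w+", "", textdata)
# Collapse runs of spaces or tabs to a single space so adjacent words are not joined
textdata = gsub("[ \t]{2,}", " ", textdata)
# Trim leading and trailing whitespace
textdata = gsub("^\\s+|\\s+$", "", textdata)

# Lowercase a string, returning NA if tolower() raises an error
try.error = function(x)
{
  y = NA
  try_error = tryCatch(tolower(x), error=function(e) e)
  if (!inherits(try_error, "error"))
    y = tolower(x)
  return(y)
}
textdata = sapply(textdata, try.error)

# Drop comments that could not be lowercased and discard the names sapply() adds
textdata = textdata[!is.na(textdata)]
names(textdata) = NULL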
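
To see what these cleaning steps do, here is a minimal sketch that applies them to three made-up comments. The function name clean_text and the sample comments are assumptions for illustration only. Note that this sketch collapses repeated whitespace to a single space so that neighboring words stay separate.

```r
# Hypothetical helper wrapping the cleaning steps; not part of the original post
clean_text <- function(x) {
  x <- gsub("[[:punct:]]", "", x)   # strip punctuation
  x <- gsub("[[:digit:]]", "", x)   # strip digits
  x <- gsub("http\\w+", "", x)      # strip what is left of web addresses
  x <- gsub("[ \t]{2,}", " ", x)    # collapse repeated spaces or tabs
  x <- gsub("^\\s+|\\s+$", "", x)   # trim leading and trailing whitespace
  tolower(x)
}

# Invented sample comments
sample_comments <- c(
  "Gas is 3 dollars a gallon!!  Why?",
  "See http://example.com/story for more.",
  "   Thanks, AEP.   "
)

cleaned <- clean_text(sample_comments)
print(cleaned)
```

Because punctuation is removed first, a web address is reduced to a single run of letters beginning with "http", which the third pattern then deletes as a whole.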

Next, we perform the sentiment analysis. Each comment is assigned an emotion and a polarity (positive, negative, or neutral) using the package's Bayesian classifier (algorithm="bayes"). Finally, the comment, emotion, and polarity are combined in a single data frame.

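library(sentiment)

# Classify emotion; column 7 of the result holds the best-fitting emotion
class_emo = classify_emotion(textdata, algorithm="bayes", prior=1.0)
emotion = class_emo[,7]
emotion[is.na(emotion)] = "unknown"

# Classify polarity; column 4 of the result holds the best-fitting polarity
class_pol = classify_polarity(textdata, algorithm="bayes")
polarity = class_pol[,4]

# Combine comment, emotion, and polarity in one data frame
sent_df = data.frame(text=textdata, emotion=emotion, polarity=polarity, stringsAsFactors=FALSE)
# Order the emotion levels from most to least frequent
sent_df = within(sent_df, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))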
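
Since the sentiment package can be hard to install, the following base R sketch may help illustrate the general idea of assigning a polarity to each comment. It is a deliberately simplified stand-in that counts lexicon matches; it is not the Bayesian classifier used by classify_polarity. The lexicon, the function toy_polarity, and the comments are all invented for illustration.

```r
# Toy lexicon: these word lists are invented for illustration only
toy_lexicon <- list(
  positive = c("good", "great", "love", "thanks"),
  negative = c("bad", "awful", "hate", "angry")
)

# Label each comment by whichever word list matches more of its words
toy_polarity <- function(comments, lexicon = toy_lexicon) {
  vapply(comments, function(comment) {
    words <- strsplit(comment, "\\s+")[[1]]
    pos <- sum(words %in% lexicon$positive)
    neg <- sum(words %in% lexicon$negative)
    if (pos > neg) "positive" else if (neg > pos) "negative" else "neutral"
  }, character(1), USE.NAMES = FALSE)
}

# Invented, already-cleaned comments
toy_comments <- c(
  "thanks for the great service",
  "i hate this awful traffic",
  "the meeting is on tuesday"
)

toy_result <- data.frame(
  text = toy_comments,
  polarity = toy_polarity(toy_comments),
  stringsAsFactors = FALSE
)
print(toy_result)
```

The result has the same shape as the text and polarity columns of sent_df, so the later steps can be tried on it even without the archived package.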

Now that we have processed the comments, we can graph the emotions and polarities.

library(ggplot2)

# Bar chart of comment counts by emotion
ggplot(sent_df, aes(x=emotion)) +
  geom_bar(aes(y=..count.., fill=emotion)) +
  scale_fill_brewer(palette="Dark2") +
  labs(x="emotion categories", y="")

# Bar chart of comment counts by polarity
ggplot(sent_df, aes(x=polarity)) +
  geom_bar(aes(y=..count.., fill=polarity)) +
  scale_fill_brewer(palette="RdGy") +
  labs(x="polarity categories", y="")

We now prepare the data for creating a word cloud. This includes removing common English stop words.

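library(tm)
library(wordcloud)
library(RColorBrewer)

# Build one combined document per emotion
emos = levels(factor(sent_df$emotion))
nemo = length(emos)
emo.docs = rep("", nemo)
for (i in 1:nemo)
{
  tmp = textdata[emotion == emos[i]]
  emo.docs[i] = paste(tmp, collapse=" ")
}

# Remove common English stop words
emo.docs = removeWords(emo.docs, stopwords("english"))

# Term-document matrix: one row per term, one column per emotion
corpus = Corpus(VectorSource(emo.docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = emos

# Word cloud comparing term frequencies across emotions
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
                 scale = c(3,.5), random.order = FALSE,
                 title.size = 1.5)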
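
To make the structure passed to comparison.cloud more concrete, the sketch below builds a small term-by-emotion count matrix in base R. It only approximates what tm's TermDocumentMatrix produces; the comments, emotion labels, and short stop word list are invented for illustration.

```r
# Invented, already-cleaned comments with invented emotion labels
toy_text <- c("guns and more guns", "my pet is lost", "guns again")
toy_emotion <- c("anger", "sadness", "anger")

# One combined document per emotion
toy_emos <- sort(unique(toy_emotion))
toy_docs <- vapply(
  toy_emos,
  function(e) paste(toy_text[toy_emotion == e], collapse = " "),
  character(1)
)

# Hypothetical short stop word list (tm's stopwords("english") is much longer)
toy_stop <- c("and", "my", "is", "more")

# Split each document into words and drop the stop words
toy_words <- lapply(strsplit(toy_docs, "\\s+"),
                    function(w) w[!w %in% toy_stop])

# Count every term in every document: rows are terms, columns are emotions
toy_terms <- sort(unique(unlist(toy_words)))
toy_tdm <- sapply(toy_words,
                  function(w) as.integer(table(factor(w, levels = toy_terms))))
rownames(toy_tdm) <- toy_terms
print(toy_tdm)
```

Each column gives the word counts for one emotion, which is the layout the word cloud uses to decide which words to show under which emotion.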

What do we gain from this analysis besides an attractive word cloud? We can analyze the word cloud itself. The sentiment package has identified the most frequently occurring, important words and their likely association with emotions. For instance, ‘guns’ was associated with anger, while ‘hillary’ was associated with fear. ‘pet’ was associated with sadness, and ‘aep’ was associated with surprise. With very little work, we have automatically extracted the important topics from the unstructured text.
More importantly, we also have a table of the comments themselves with the emotions and polarity attached. If we desire, we can sort them by emotion or polarity and continue our analysis. If this had been corporate satisfaction data, for example, we might want to dig deeper into angry comments and joyous comments for different reasons. We could use this as a tool to intelligently select comments for Quality Assurance analysis rather than relying on blind random selection. Text and sentiment analysis may be in their infancy, but they can also be the beginning of further analysis.
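
As a small sketch of that follow-up step, the code below sorts and filters a table shaped like sent_df using base R. The data frame toy_df and its rows are invented stand-ins, not results from the vent line data.

```r
# Toy stand-in for sent_df; the rows are invented for illustration
toy_df <- data.frame(
  text = c("great job on the road repairs",
           "why is my power bill so high",
           "lost my dog near the park",
           "love the new library hours"),
  emotion = c("joy", "anger", "sadness", "joy"),
  polarity = c("positive", "negative", "negative", "positive"),
  stringsAsFactors = FALSE
)

# Sort comments by emotion, then by polarity
sorted_df <- toy_df[order(toy_df$emotion, toy_df$polarity), ]

# Pull out the angry and joyful comments for targeted review
angry <- subset(toy_df, emotion == "anger")
joyful <- subset(toy_df, emotion == "joy")

# Cross-tabulate emotion against polarity
counts <- table(toy_df$emotion, toy_df$polarity)

print(sorted_df)
print(counts)
```

Selecting review samples from angry or joyful alone is one way to replace blind random selection with a targeted one.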


hjtoh posted on 2016-1-23 20:07:02 (from mobile)

oliyiyi posted on 2016-1-23 19:56
One of the most powerful aspects of using R is that you can download free packages for so many tools ...

It is well worth mastering a language.


seahhj posted on 2016-1-24 00:12:00

good material, thanks for sharing
