Rwordseg_Vignette_CN.pdf - 经管之家 resource download - 人大经济论坛 (Renda Economics Forum)


Attachment download


Thread: Sharing documentation for the R text-mining packages tm and Rwordseg
File name: Rwordseg_Vignette_CN.pdf
Download link: https://bbs.pinggu.org/a-1719649.html
Attachment size: 315.88 KB
The material is attached. Sample code from it:

### R code from vignette source 'tm.Rnw'
### Encoding: UTF-8

###################################################
### code chunk number 1: Init
###################################################
library("tm")
data("crude")

###################################################
### code chunk number 2: Ovid
###################################################
txt <- system.file("texts", "txt", package = "tm")
(ovid <- Corpus(DirSource(txt, encoding = "UTF-8"),
                readerControl = list(language = "lat")))

###################################################
### code chunk number 3: VectorSource
###################################################
docs <- c("This is a text.", "This another one.")
Corpus(VectorSource(docs))

###################################################
### code chunk number 4: Reuters
###################################################
reut21578 <- system.file("texts", "crude", package = "tm")
reuters <- Corpus(DirSource(reut21578),
                  readerControl = list(reader = readReut21578XML))

###################################################
### code chunk number 5: tm.Rnw:120-121 (eval = FALSE)
###################################################
## writeCorpus(ovid)

###################################################
### code chunk number 6: tm.Rnw:132-133
###################################################
inspect(ovid[1:2])

###################################################
### code chunk number 7: tm.Rnw:137-138
###################################################
identical(ovid[[2]], ovid[["ovid_2.txt"]])

###################################################
### code chunk number 8: tm.Rnw:156-157
###################################################
reuters <- tm_map(reuters, as.PlainTextDocument)

###################################################
### code chunk number 9: tm.Rnw:165-166
###################################################
reuters <- tm_map(reuters, stripWhitespace)

###################################################
### code chunk number 10: tm.Rnw:171-172
###################################################
reuters <- tm_map(reuters, tolower)

###################################################
### code chunk number 11: Stopwords
###################################################
reuters <- tm_map(reuters, removeWords, stopwords("english"))

###################################################
### code chunk number 12: Stemming
###################################################
tm_map(reuters, stemDocument)

###################################################
### code chunk number 13: tm.Rnw:204-206
###################################################
query <- "id == '237' & heading == 'INDONESIA SEEN AT CROSSROADS OVER ECONOMIC CHANGE'"
tm_filter(reuters, FUN = sFilter, query)

###################################################
### code chunk number 14: DublinCore
###################################################
DublinCore(crude[[1]], "Creator") <- "Ano Nymous"
meta(crude[[1]])

###################################################
### code chunk number 15: tm.Rnw:237-241
###################################################
meta(crude, tag = "test", type = "corpus") <- "test meta"
meta(crude, type = "corpus")
meta(crude, "foo") <- letters[1:20]
meta(crude)

###################################################
### code chunk number 16: tm.Rnw:258-260
###################################################
dtm <- DocumentTermMatrix(reuters)
inspect(dtm[1:5, 100:105])

###################################################
### code chunk number 17: tm.Rnw:269-270
###################################################
findFreqTerms(dtm, 5)

###################################################
### code chunk number 18: tm.Rnw:275-276
###################################################
findAssocs(dtm, "opec", 0.8)

###################################################
### code chunk number 19: tm.Rnw:288-289
###################################################
inspect(removeSparseTerms(dtm, 0.4))

###################################################
### code chunk number 20: tm.Rnw:303-305
###################################################
inspect(DocumentTermMatrix(reuters,
                           list(dictionary = c("prices", "crude", "oil"))))

### R code from vignette source 'extensions.Rnw'

###################################################
### code chunk number 1: Init
###################################################
library("tm")
library("XML")

###################################################
### code chunk number 2: extensions.Rnw:71-76
###################################################
VecSource <- function(x) {
    s <- Source(length = length(x), names = names(x), class = "VectorSource")
    s$Content <- as.character(x)
    s
}

###################################################
### code chunk number 3: extensions.Rnw:85-89
###################################################
getElem.VectorSource <-
function(x) list(content = x$Content[x$Position], uri = NA)
pGetElem.VectorSource <-
function(x) lapply(x$Content, function(y) list(content = y, uri = NA))

###################################################
### code chunk number 4: extensions.Rnw:114-117
###################################################
readPlain <-
function(elem, language, id)
    PlainTextDocument(elem$content, id = id, language = language)

###################################################
### code chunk number 5: extensions.Rnw:145-150
###################################################
df <- data.frame(contents = c("content 1", "content 2", "content 3"),
                 title    = c("title 1", "title 2", "title 3"),
                 authors  = c("author 1", "author 2", "author 3"),
                 topics   = c("topic 1", "topic 2", "topic 3"),
                 stringsAsFactors = FALSE)

###################################################
### code chunk number 6: extensions.Rnw:156-157
###################################################
names(attributes(PlainTextDocument()))

###################################################
### code chunk number 7: Mapping
###################################################
m <- list(Content = "contents", Heading = "title",
          Author = "authors", Topic = "topics")

###################################################
### code chunk number 8: myReader
###################################################
myReader <- readTabular(mapping = m)

###################################################
### code chunk number 9: extensions.Rnw:180-181
###################################################
(corpus <- Corpus(DataframeSource(df),
                  readerControl = list(reader = myReader)))

###################################################
### code chunk number 10: extensions.Rnw:186-188
###################################################
corpus[[1]]
meta(corpus[[1]])

###################################################
### code chunk number 11: CustomXMLFile
###################################################
custom.xml <- system.file("texts", "custom.xml", package = "tm")
print(readLines(custom.xml), quote = FALSE)

###################################################
### code chunk number 12: mySource
###################################################
mySource <- function(x, encoding = "UTF-8")
    XMLSource(x, function(tree) XML::xmlChildren(XML::xmlRoot(tree)),
              myXMLReader, encoding)

###################################################
### code chunk number 13: myXMLReader
###################################################
myXMLReader <- readXML(
    spec = list(Author = list("node", "/document/writer"),
                Content = list("node", "/document/description"),
                DateTimeStamp = list("function",
                                     function(x) as.POSIXlt(Sys.time(), tz = "GMT")),
                Description = list("attribute", "/document/@short"),
                Heading = list("node", "/document/caption"),
                ID = list("function", function(x) tempfile()),
                Origin = list("unevaluated", "My private bibliography"),
                Type = list("node", "/document/type")),
    doc = PlainTextDocument())

###################################################
### code chunk number 14: extensions.Rnw:301-302
###################################################
corpus <- Corpus(mySource(custom.xml))

###################################################
### code chunk number 15: extensions.Rnw:306-308
###################################################
corpus[[1]]
meta(corpus[[1]])