OP: Lisrelchen

[Case Study] Mining Association Rules Using R


Lisrelchen (OP) posted 2015-3-21 08:03:36

+2 论坛币
k人 参与回答

经管之家送您一份

应届毕业生专属福利!

求职就业群
赵安豆老师微信:zhaoandou666

经管之家联合CDA

送您一个全额奖学金名额~ !

感谢您参与论坛问题回答

经管之家送您两个论坛币!

+2 论坛币

Authors:

Michael Hahsler, Bettina Grün, Kurt Hornik

Title:

arules - A Computational Environment for Mining Association Rules and Frequent Item Sets

Reference:

Journal of Statistical Software, Vol. 14, Issue 15, September 2005 (submitted 2005-04-15, accepted 2005-09-29)

Type:

Article

Abstract:

Mining frequent itemsets and association rules is a popular and well researched approach for discovering interesting relationships between variables in large databases. The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules. The package also includes interfaces to two fast mining algorithms, the popular C implementations of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.
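As a minimal sketch of the workflow the paper describes (assuming only that the arules package is installed; the basket data below are made up for illustration), transactions can be built from a plain list of item vectors and mined with apriori():

```r
library("arules")

## A toy transaction database: each element is one transaction (a basket of items).
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("beer", "bread"),
  c("butter", "milk"),
  c("bread", "butter", "beer")
)

## Coerce the list to the sparse 'transactions' representation used by arules.
trans <- as(baskets, "transactions")
summary(trans)

## Mine association rules with Borgelt's Apriori implementation.
rules <- apriori(trans, parameter = list(support = 0.4, confidence = 0.8))
inspect(rules)
```

The same as() coercion works for data frames and binary matrices, which is the mechanism Examples 1 and 2 below rely on.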



Lisrelchen posted 2015-3-21 08:08:15

Example 1: Analyzing and preparing a transaction data set


In this example, we show how a data set can be analyzed and manipulated before associations are mined. This is important for finding problems in the data set which could make the mined associations useless or at least inferior to associations mined on a properly prepared data set.
For the example, we look at the Epub transaction data contained in package arules. This data set contains downloads of documents from the Electronic Publication platform of the Vienna University of Economics and Business available via http://epub.wu-wien.ac.at from January 2003 to December 2008.

library("arules")
data("Epub")
Epub

summary(Epub)
year <- strftime(as.POSIXlt(transactionInfo(Epub)[["TimeStamp"]]), "%Y")
table(year)

Epub2003 <- Epub[year == "2003"]
length(Epub2003)
image(Epub2003)
transactionInfo(Epub2003[size(Epub2003) > 20])
inspect(Epub2003[1:5])

as(Epub2003[1:5], "list")
EpubTidLists <- as(Epub, "tidLists")
EpubTidLists

as(EpubTidLists[1:3], "list")


Lisrelchen posted 2015-3-21 08:10:11

Example 2: Preparing and mining a questionnaire data set


As a second example, we prepare and mine questionnaire data. We use the Adult data set from the UCI machine learning repository (Asuncion and Newman 2007), provided by package arules. This data set is similar to the marketing data set used by Hastie et al. (2001) in their chapter on association rule mining. The data originate from the U.S. Census Bureau database and contain 48842 instances with 14 attributes such as age, work class, and education. In the original applications of the data, the attributes were used to predict the income level of individuals. We added the attribute income with levels small and large, representing an income of ≤ USD 50,000 and > USD 50,000, respectively. This data is included in arules as the data set AdultUCI.
data("AdultUCI")
dim(AdultUCI)
AdultUCI[1:2, ]

AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL
AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)),
  labels = c("Young", "Middle-aged", "Senior", "Old"))
AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]],
  c(0, 25, 40, 60, 168)),
  labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]],
  c(-Inf, 0, median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]), Inf)),
  labels = c("None", "Low", "High"))
AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]],
  c(-Inf, 0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]), Inf)),
  labels = c("none", "low", "high"))

Adult <- as(AdultUCI, "transactions")
Adult
summary(Adult)
itemFrequencyPlot(Adult, support = 0.1, cex.names = 0.8)

## Call apriori() to find all rules (the default association type for
## apriori()) with a minimum support of 1% and a confidence of 0.6.
rules <- apriori(Adult, parameter = list(support = 0.01, confidence = 0.6))
rules
summary(rules)

rulesIncomeSmall <- subset(rules, subset = rhs %in% "income=small" & lift > 1.2)
rulesIncomeLarge <- subset(rules, subset = rhs %in% "income=large" & lift > 1.2)

inspect(head(sort(rulesIncomeSmall, by = "confidence"), n = 3))
inspect(head(sort(rulesIncomeLarge, by = "confidence"), n = 3))

write(rulesIncomeSmall, file = "data.csv", sep = ",", col.names = NA)
write.PMML(rulesIncomeSmall, file = "data.xml")
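The quality measures reported by apriori() and used in the subset() filter above follow the standard definitions: support(X => Y) = supp(X ∪ Y), confidence = supp(X ∪ Y) / supp(X), and lift = confidence / supp(Y). A stand-alone sketch with made-up counts (not taken from the Adult data) shows how the three measures relate:

```r
## Hypothetical counts for a rule X => Y in a database of 1000 transactions.
nTrans <- 1000; nX <- 200; nY <- 300; nXY <- 150

support    <- nXY / nTrans                # 0.15
confidence <- nXY / nX                    # 0.75
lift       <- confidence / (nY / nTrans)  # 0.75 / 0.3 = 2.5
c(support = support, confidence = confidence, lift = lift)
```

A lift above 1 (here 2.5) means Y occurs in transactions containing X more often than expected under independence, which is why the example filters on lift > 1.2.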



Lisrelchen posted 2015-3-21 08:17:30

Example 3: Extending arules with a new interest measure

data("Adult")

fsets <- eclat(Adult, parameter = list(support = 0.05),
  control = list(verbose = FALSE))

## For the denominator of all-confidence we need all mined single items and
## their corresponding support values. We create a named vector where the
## names are the column numbers of the items and the values are their support.
singleItems <- fsets[size(items(fsets)) == 1]

## Get the column numbers we have support for.
singleSupport <- quality(singleItems)$support
names(singleSupport) <- unlist(LIST(items(singleItems), decode = FALSE))
head(singleSupport, n = 5)

## Next, we can calculate the all-confidence for all itemsets.
itemsetList <- LIST(items(fsets), decode = FALSE)
allConfidence <- quality(fsets)$support /
  sapply(itemsetList, function(x) max(singleSupport[as.character(x)]))
quality(fsets) <- cbind(quality(fsets), allConfidence)
summary(fsets)

fsetsEducation <- subset(fsets, subset = items %pin% "education")
inspect(sort(fsetsEducation[size(fsetsEducation) > 1],
  by = "allConfidence")[1:3])
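The measure computed above is all-confidence, defined for an itemset X as supp(X) / max over items i in X of supp({i}). A tiny stand-alone check of the same arithmetic, using made-up support values rather than values from the Adult data:

```r
## Hypothetical supports for single items {a}, {b} and the itemset {a, b}.
singleSupport <- c(a = 0.6, b = 0.3)
itemsetSupport <- 0.25

## all-confidence(X) = supp(X) / max_{i in X} supp({i})
allConfidence <- itemsetSupport / max(singleSupport[c("a", "b")])
allConfidence  # 0.25 / 0.6
```

Because the denominator is the support of the most frequent item in the set, all-confidence is a lower bound on the confidence of every rule that can be generated from the itemset.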

Lisrelchen posted 2015-3-21 08:36:31

Example 4: Sampling


In this example, we show how sampling can be used in arules. We again use the Adult data set.

data("Adult")
Adult
supp <- 0.05
epsilon <- 0.1
c <- 0.1
n <- -2 * log(c) / (supp * epsilon^2)
n
AdultSample <- sample(Adult, n, replace = TRUE)

itemFrequencyPlot(AdultSample, population = Adult, support = supp,
  cex.names = 0.7)

## Alternatively, a sample can be compared with the population using the lift
## ratio (with lift = TRUE). The lift ratio for each item i is
## P(i | sample) / P(i | population), where the probabilities are estimated by
## the item frequencies. A lift ratio of one indicates that the item occurs in
## the sample in the same proportion as in the population; a ratio greater
## than one indicates that the item is over-represented in the sample, and
## vice versa. With this plot, large relative deviations for less frequent
## items can be identified visually (see Figure 8).
itemFrequencyPlot(AdultSample, population = Adult, support = supp,
  lift = TRUE, cex.names = 0.9)

## To compare the speed-up reached by sampling, we use the Eclat algorithm to
## mine frequent itemsets on both the database and the sample and compare the
## system time (in seconds) used for mining.
time <- system.time(itemsets <- eclat(Adult,
  parameter = list(support = supp), control = list(verbose = FALSE)))
time

timeSample <- system.time(itemsetsSample <- eclat(AdultSample,
  parameter = list(support = supp), control = list(verbose = FALSE)))
timeSample

## The first element of the vector returned by system.time() gives the (user)
## CPU time needed to execute the statement in its argument. Mining the
## sample instead of the whole database therefore gives a speed-up factor of:
time[1] / timeSample[1]

## To evaluate the accuracy of the itemsets mined from the sample, we analyze
## the difference between the two sets.
itemsets
itemsetsSample

## The two sets have roughly the same size. To check whether the sets contain
## similar itemsets, we match the sets and see what fraction of the frequent
## itemsets found in the database were also found in the sample.
match <- match(itemsets, itemsetsSample, nomatch = 0)
## Remove no matches.
sum(match > 0) / length(itemsets)

## Almost all frequent itemsets were found using the sample. Summarize the
## support of the frequent itemsets which were not found in the sample, and
## of the itemsets which were frequent in the sample although they were
## infrequent in the database:
summary(quality(itemsets[which(!match)])$support)
summary(quality(itemsetsSample[-match])$support)

## For the frequent itemsets which were found in both the database and the
## sample, we can calculate accuracy from the error rate.
supportItemsets <- quality(itemsets[which(match > 0)])$support
supportSample <- quality(itemsetsSample[match])$support
accuracy <- 1 - abs(supportSample - supportItemsets) / supportItemsets
summary(accuracy)
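The sample size n above comes from a Chernoff-style bound: with acceptable relative error epsilon at the given support level supp and confidence level 1 - c, the formula is n = -2 ln(c) / (supp * epsilon^2). Plugging in the values used in this example as a quick arithmetic check:

```r
supp <- 0.05; epsilon <- 0.1; c <- 0.1
n <- -2 * log(c) / (supp * epsilon^2)
round(n)  # approximately 9210 transactions
```

So the sample is about 9210 transactions, far smaller than the 48842 instances in the full Adult database, which is where the mining speed-up comes from.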
