OP: Lisrelchen

[Case Study] Mining Association Rules Using R


Lisrelchen (OP) posted 2015-3-21 08:03:36

+2 论坛币
k人 参与回答

经管之家送您一份

应届毕业生专属福利!

求职就业群
赵安豆老师微信:zhaoandou666

经管之家联合CDA

送您一个全额奖学金名额~ !

感谢您参与论坛问题回答

经管之家送您两个论坛币!

+2 论坛币

Authors:

Michael Hahsler, Bettina Grün, Kurt Hornik

Title:

arules - A Computational Environment for Mining Association Rules and Frequent Item Sets

Reference:

Journal of Statistical Software, Vol. 14, Issue 15, September 2005 (submitted 2005-04-15, accepted 2005-09-29)

Type:

Article

Abstract:

Mining frequent itemsets and association rules is a popular and well researched approach for discovering interesting relationships between variables in large databases. The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules. The package also includes interfaces to two fast mining algorithms, the popular C implementations of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.
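As a minimal sketch of the workflow the paper describes (assuming only that the arules package is installed; the basket data below are made up for illustration), transactions can be built from a plain list of item vectors and mined with apriori():

```r
library("arules")

## A toy transaction database: each element is one transaction (a basket of items).
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("beer", "bread"),
  c("butter", "milk"),
  c("bread", "butter", "beer")
)

## Coerce the list to the sparse 'transactions' representation used by arules.
trans <- as(baskets, "transactions")
summary(trans)

## Mine association rules with Borgelt's Apriori implementation.
rules <- apriori(trans, parameter = list(support = 0.4, confidence = 0.8))
inspect(rules)
```

The same as() coercion works for data frames and binary matrices, which is the mechanism Examples 1 and 2 below rely on.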



Lisrelchen posted 2015-3-21 08:08:15

Example 1: Analyzing and preparing a transaction data set


In this example, we show how a data set can be analyzed and manipulated before associations are mined. This is important for finding problems in the data set which could make the mined associations useless or at least inferior to associations mined on a properly prepared data set.
For the example, we look at the Epub transaction data contained in package arules. This data set contains downloads of documents from the Electronic Publication platform of the Vienna University of Economics and Business available via http://epub.wu-wien.ac.at from January 2003 to December 2008.

library("arules")
data("Epub")
Epub

summary(Epub)
year <- strftime(as.POSIXlt(transactionInfo(Epub)[["TimeStamp"]]), "%Y")
table(year)

Epub2003 <- Epub[year == "2003"]
length(Epub2003)
image(Epub2003)
transactionInfo(Epub2003[size(Epub2003) > 20])
inspect(Epub2003[1:5])

as(Epub2003[1:5], "list")
EpubTidLists <- as(Epub, "tidLists")
EpubTidLists

as(EpubTidLists[1:3], "list")


Lisrelchen posted 2015-3-21 08:10:11

Example 2: Preparing and mining a questionnaire data set


As a second example, we prepare and mine questionnaire data. We use the Adult data set from the UCI machine learning repository (Asuncion and Newman 2007), provided by package arules. This data set is similar to the marketing data set used by Hastie et al. (2001) in their chapter on association rule mining. The data originate from the U.S. Census Bureau database and contain 48842 instances with 14 attributes such as age, work class, and education. In the original applications of the data, the attributes were used to predict the income level of individuals. We added the attribute income with levels small and large, representing an income of ≤ USD 50,000 and > USD 50,000, respectively. This data is included in arules as the data set AdultUCI.
data("AdultUCI")
dim(AdultUCI)
AdultUCI[1:2, ]

AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL
AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)),
  labels = c("Young", "Middle-aged", "Senior", "Old"))
AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]],
  c(0, 25, 40, 60, 168)),
  labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]],
  c(-Inf, 0, median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]), Inf)),
  labels = c("None", "Low", "High"))
AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]],
  c(-Inf, 0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]), Inf)),
  labels = c("none", "low", "high"))

Adult <- as(AdultUCI, "transactions")
Adult
summary(Adult)
itemFrequencyPlot(Adult, support = 0.1, cex.names = 0.8)

## Call apriori() to find all rules (the default association type for
## apriori()) with a minimum support of 1% and a confidence of 0.6.
rules <- apriori(Adult, parameter = list(support = 0.01, confidence = 0.6))
rules
summary(rules)

rulesIncomeSmall <- subset(rules, subset = rhs %in% "income=small" & lift > 1.2)
rulesIncomeLarge <- subset(rules, subset = rhs %in% "income=large" & lift > 1.2)

inspect(head(sort(rulesIncomeSmall, by = "confidence"), n = 3))
inspect(head(sort(rulesIncomeLarge, by = "confidence"), n = 3))

write(rulesIncomeSmall, file = "data.csv", sep = ",", col.names = NA)
write.PMML(rulesIncomeSmall, file = "data.xml")
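The quality measures reported by apriori() and used in the subset() filter above follow the standard definitions: support(X => Y) = supp(X ∪ Y), confidence = supp(X ∪ Y) / supp(X), and lift = confidence / supp(Y). A stand-alone sketch with made-up counts (not taken from the Adult data) shows how the three measures relate:

```r
## Hypothetical counts for a rule X => Y in a database of 1000 transactions.
nTrans <- 1000; nX <- 200; nY <- 300; nXY <- 150

support    <- nXY / nTrans                # 0.15
confidence <- nXY / nX                    # 0.75
lift       <- confidence / (nY / nTrans)  # 0.75 / 0.3 = 2.5
c(support = support, confidence = confidence, lift = lift)
```

A lift above 1 (here 2.5) means Y occurs in transactions containing X more often than expected under independence, which is why the example filters on lift > 1.2.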



Lisrelchen posted 2015-3-21 08:17:30

Example 3: Extending arules with a new interest measure

data("Adult")

fsets <- eclat(Adult, parameter = list(support = 0.05),
  control = list(verbose = FALSE))

## For the denominator of all-confidence we need all mined single items and
## their corresponding support values. We create a named vector where the
## names are the column numbers of the items and the values are their support.
singleItems <- fsets[size(items(fsets)) == 1]

## Get the column numbers we have support for.
singleSupport <- quality(singleItems)$support
names(singleSupport) <- unlist(LIST(items(singleItems), decode = FALSE))
head(singleSupport, n = 5)

## Next, we can calculate the all-confidence for all itemsets.
itemsetList <- LIST(items(fsets), decode = FALSE)
allConfidence <- quality(fsets)$support /
  sapply(itemsetList, function(x) max(singleSupport[as.character(x)]))
quality(fsets) <- cbind(quality(fsets), allConfidence)
summary(fsets)

fsetsEducation <- subset(fsets, subset = items %pin% "education")
inspect(sort(fsetsEducation[size(fsetsEducation) > 1],
  by = "allConfidence")[1:3])
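The measure computed above is all-confidence, defined for an itemset X as supp(X) / max over items i in X of supp({i}). A tiny stand-alone check of the same arithmetic, using made-up support values rather than values from the Adult data:

```r
## Hypothetical supports for single items {a}, {b} and the itemset {a, b}.
singleSupport <- c(a = 0.6, b = 0.3)
itemsetSupport <- 0.25

## all-confidence(X) = supp(X) / max_{i in X} supp({i})
allConfidence <- itemsetSupport / max(singleSupport[c("a", "b")])
allConfidence  # 0.25 / 0.6
```

Because the denominator is the support of the most frequent item in the set, all-confidence is a lower bound on the confidence of every rule that can be generated from the itemset.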

Lisrelchen posted 2015-3-21 08:36:31

Example 4: Sampling


In this example, we show how sampling can be used in arules. We again use the Adult data set.

data("Adult")
Adult
supp <- 0.05
epsilon <- 0.1
c <- 0.1
n <- -2 * log(c) / (supp * epsilon^2)
n
AdultSample <- sample(Adult, n, replace = TRUE)

itemFrequencyPlot(AdultSample, population = Adult, support = supp,
  cex.names = 0.7)

## Alternatively, a sample can be compared with the population using the lift
## ratio (with lift = TRUE). The lift ratio for each item i is
## P(i | sample) / P(i | population), where the probabilities are estimated by
## the item frequencies. A lift ratio of one indicates that the item occurs in
## the sample in the same proportion as in the population; a ratio greater
## than one indicates that the item is over-represented in the sample, and
## vice versa. With this plot, large relative deviations for less frequent
## items can be identified visually (see Figure 8).
itemFrequencyPlot(AdultSample, population = Adult, support = supp,
  lift = TRUE, cex.names = 0.9)

## To compare the speed-up reached by sampling, we use the Eclat algorithm to
## mine frequent itemsets on both the database and the sample and compare the
## system time (in seconds) used for mining.
time <- system.time(itemsets <- eclat(Adult,
  parameter = list(support = supp), control = list(verbose = FALSE)))
time

timeSample <- system.time(itemsetsSample <- eclat(AdultSample,
  parameter = list(support = supp), control = list(verbose = FALSE)))
timeSample

## The first element of the vector returned by system.time() gives the (user)
## CPU time needed to execute the statement in its argument. Mining the
## sample instead of the whole database therefore gives a speed-up factor of:
time[1] / timeSample[1]

## To evaluate the accuracy of the itemsets mined from the sample, we analyze
## the difference between the two sets.
itemsets
itemsetsSample

## The two sets have roughly the same size. To check whether the sets contain
## similar itemsets, we match the sets and see what fraction of the frequent
## itemsets found in the database were also found in the sample.
match <- match(itemsets, itemsetsSample, nomatch = 0)
## Remove no matches.
sum(match > 0) / length(itemsets)

## Almost all frequent itemsets were found using the sample. Summarize the
## support of the frequent itemsets which were not found in the sample, and
## of the itemsets which were frequent in the sample although they were
## infrequent in the database:
summary(quality(itemsets[which(!match)])$support)
summary(quality(itemsetsSample[-match])$support)

## For the frequent itemsets which were found in both the database and the
## sample, we can calculate accuracy from the error rate.
supportItemsets <- quality(itemsets[which(match > 0)])$support
supportSample <- quality(itemsetsSample[match])$support
accuracy <- 1 - abs(supportSample - supportItemsets) / supportItemsets
summary(accuracy)
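The sample size n above comes from a Chernoff-style bound: with acceptable relative error epsilon at the given support level supp and confidence level 1 - c, the formula is n = -2 ln(c) / (supp * epsilon^2). Plugging in the values used in this example as a quick arithmetic check:

```r
supp <- 0.05; epsilon <- 0.1; c <- 0.1
n <- -2 * log(c) / (supp * epsilon^2)
round(n)  # approximately 9210 transactions
```

So the sample is about 9210 transactions, far smaller than the 48842 instances in the full Adult database, which is where the mining speed-up comes from.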
