[Case Study] Data Mining kNN Algorithms using R

Introduction

This chapter introduces the k-Nearest Neighbors (kNN) algorithm for classification. kNN, originally proposed by Fix and Hodges [1], is a very simple 'instance-based' learning algorithm. Despite its simplicity, it can offer very good performance on some problems. We present a high-level overview of the algorithm, explaining the relevant parameters and implementation choices, followed by a step-by-step case study.


k-Nearest Neighbors

The kNN algorithm, like other instance-based algorithms, is unusual from a classification perspective in its lack of explicit model training. While a training dataset is required, it is used solely to populate a sample of the search space with instances whose class is known. No actual model is built during this phase; for this reason, these algorithms are also known as lazy learning algorithms.

Different distance metrics can be used, depending on the nature of the data. Euclidean distance is typical for continuous variables, but other metrics can be used for categorical data, and specialized metrics are often useful for specific problems, such as text classification.

When an instance whose class is unknown is presented for evaluation, the algorithm computes its k closest neighbors, and the class is assigned by voting among those neighbors. To prevent ties, one typically uses an odd value of k for binary classification. For more than two classes, one can use either plurality or majority voting; the latter can sometimes result in no class being assigned to an instance, while the former can result in classifications being made with very low support from the neighborhood. One can also weight each neighbor's vote by an inverse function of its distance to the instance being classified.
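
As a small illustration of the metric choices just mentioned, the base-R snippet below computes a Euclidean distance for numeric attributes and a simple mismatch count for categorical ones; the toy vectors are made up purely for illustration.

  # Euclidean distance between two numeric instances
  x <- c(5.1, 3.5, 1.4, 0.2)
  y <- c(7.0, 3.2, 4.7, 1.4)
  sqrt(sum((x - y)^2))

  # A simple mismatch count (Hamming-style) for categorical attributes
  a <- c("red", "small", "round")
  b <- c("red", "large", "round")
  sum(a != b)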

The main advantages of kNN for classification are:

  • Very simple implementation.
  • Robust with regard to the search space; for instance, classes don't have to be linearly separable.
  • Classifier can be updated online at very little cost as new instances with known classes are presented.
  • Few parameters to tune: distance metric and k.

The main disadvantages of the algorithm are:

  • Expensive testing of each instance, as its distance to all known instances must be computed. This is especially problematic for datasets with a large number of attributes. Specialized algorithms and heuristics exist for specific problems and distance functions, which can mitigate this issue; when the number of instances is much larger than the number of attributes, an R-tree or a kd-tree can be used to store instances, allowing for fast exact neighbor identification.
  • Sensitivity to noisy or irrelevant attributes, which can make the computed distances less meaningful. Scaling and/or feature selection are typically used in combination with kNN to mitigate this issue (a brief scaling sketch follows this list).
  • Sensitivity to very unbalanced datasets, where most instances belong to one or a few classes and infrequent classes are therefore often dominated in most neighborhoods. This can be alleviated by balanced sampling of the more popular classes in the training stage, possibly coupled with ensembles.
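
As a brief illustration of the scaling remark above, the snippet below standardizes the numeric attributes before any distances are computed; the use of R's built-in copy of the Iris data and the column indices are assumptions made for the sketch.

  # Standardize the numeric attributes so that no single attribute dominates the distance.
  data(iris)                                                # R's built-in copy of the Iris data
  scaled <- scale(iris[, 1:4])                              # center to mean 0, scale to unit variance
  iris_scaled <- data.frame(scaled, class = iris$Species)   # reattach the class labels
  round(colMeans(scaled), 10)                               # column means are now (numerically) zero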

Algorithm Description

The training phase for kNN consists of simply storing all known instances and their class labels. A tabular representation can be used, or a specialized structure such as a kd-tree. If we want to tune the value of 'k' and/or perform feature selection, n-fold cross-validation can be used on the training dataset. The testing phase for a new instance 't', given a known set 'I', is as follows (a minimal R sketch appears after the list):

  • Compute the distance between 't' and each instance in 'I'
  • Sort the distances in increasing numerical order and pick the first 'k' elements
  • Compute and return the most frequent class in the 'k' nearest neighbors, optionally weighting each instance's class by the inverse of its distance to 't'
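
The following is a minimal base-R sketch of these three steps for a single new instance 't'; the function name knn_classify and the assumed data layout (numeric attributes plus a 'class' column) are illustrative choices, not part of any package.

  # Sketch of the testing phase: distances, k nearest neighbors, (optionally weighted) vote.
  knn_classify <- function(I, t, k = 3, weighted = FALSE) {
    X <- as.matrix(I[, setdiff(names(I), "class")])          # numeric attributes of the known set 'I'
    d <- sqrt(rowSums(sweep(X, 2, t)^2))                     # step 1: Euclidean distance from 't' to every instance
    nn <- order(d)[seq_len(k)]                               # step 2: indices of the k smallest distances
    w <- if (weighted) 1 / (d[nn] + 1e-8) else rep(1, k)     # optional inverse-distance weights
    votes <- tapply(w, droplevels(I$class[nn]), sum)         # step 3: (weighted) vote per class
    names(which.max(votes))                                  # most frequent / heaviest class
  }

  # Usage on R's built-in Iris data (class column renamed to match the sketch):
  ir <- iris; names(ir)[5] <- "class"
  knn_classify(ir, t = c(6.1, 2.8, 4.7, 1.2), k = 5, weighted = TRUE)
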
Available Implementations

There are at least three implementations of kNN classification for R, all available on CRAN:

  • knn, in the class package
  • kknn
  • RWeka, which is a bridge to the popular WEKA [2] machine learning and data mining toolkit, and provides a kNN implementation as well as dozens of algorithms for classification, clustering, regression, and data engineering.

We choose RWeka for this tutorial, as it provides a lot more than simply kNN classification, and the sample session shown below can be used to generate other classifiers with little modification.


Installing and Running RWeka

RWeka is a CRAN package, so it can be installed from within R:

  > install.packages("RWeka", dependencies = TRUE)

Most R graphical user interfaces also provide package management through their UIs. Once installed, RWeka can be loaded as a library:

  > library(RWeka)
It comes with several well-known datasets, which can be loaded from ARFF files (Weka's default file format). We now load a sample dataset, the famous Iris dataset [3], and learn a kNN classifier for it, using default parameters:
  > iris <- read.arff(system.file("arff", "iris.arff", package = "RWeka"))
  > classifier <- IBk(class ~ ., data = iris)
  > summary(classifier)
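
Predictions for new, unlabeled instances can then be obtained with predict(); the two rows below are hypothetical, and the attribute names are assumed to match those in Weka's iris.arff:

  # Hypothetical unlabeled instances; the attribute names follow Weka's iris.arff
  # (use names(iris) to check them in your own session).
  new_flowers <- data.frame(sepallength = c(5.0, 6.4),
                            sepalwidth  = c(3.4, 2.9),
                            petallength = c(1.5, 4.3),
                            petalwidth  = c(0.2, 1.3))
  predict(classifier, newdata = new_flowers)                        # predicted class labels
  predict(classifier, newdata = new_flowers, type = "probability")  # class membership probabilities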


While in the above session we have only used default parameters, RWeka allows us to customize the kNN classifier in several ways, aside from setting the value of k (an illustrative example follows the list):

  • We can weight neighbors by the inverse of their distance to the target instance, or by 1 - distance.
  • We can use leave-one-out cross-validation to choose the optimal value for k in the training data.
  • We can use a sliding window of training instances, so once the classifier knows about W instances, older instances are dropped as new ones are added.
  • We can customize the way the algorithm stores the known instances, which allows us to use kd-trees and similar data structures for faster neighbor search in large datasets.
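
As a hedged illustration, the snippet below lists IBk's options and builds a distance-weighted variant; the option letters are Weka's IBk flags as reported by RWeka's WOW() helper, and the particular values chosen are arbitrary:

  # List the options that IBk understands; the names mirror Weka's IBk command-line flags.
  WOW("IBk")

  # Example: 5 neighbors, with votes weighted by the inverse of the distance (the I option).
  weighted_classifier <- IBk(class ~ ., data = iris,
                             control = Weka_control(K = 5, I = TRUE))
  summary(weighted_classifier)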

Case Study

We will now perform a more detailed exploration of the Iris dataset, using cross-validation to obtain realistic test statistics, and also performing some parameter tuning.


Dataset

The Iris dataset contains 150 instances, corresponding to three equally frequent species of iris plant (Iris setosa, Iris versicolor, and Iris virginica). An Iris versicolor is shown below, courtesy of Wikimedia Commons.


[Figure: Iris versicolor]

Each instance contains four attributes: sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. The next picture shows each attribute plotted against the others, with the different classes in color.
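
A scatterplot matrix along these lines can be generated with base R's pairs() function; the column indices and the use of the class column for coloring assume the Iris data frame as loaded from RWeka's ARFF file:

  # Scatterplot matrix of the four numeric attributes, colored by class.
  pairs(iris[, 1:4], col = iris$class, pch = 19,
        main = "Iris attributes plotted against each other")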


[Figure: Plotting the Iris attributes]

Execution and Results

We will generate a kNN classifier, but we'll let RWeka automatically find the best value for k, between 1 and 20. We'll also use 10-fold cross-validation to evaluate our classifier:

  > classifier <- IBk(class ~ ., data = iris,
                      control = Weka_control(K = 20, X = TRUE))
  > evaluate_Weka_classifier(classifier, numFolds = 10)

As cross-validation generates random partitions of the dataset, your results may vary. Typically, however, the model misclassifies fewer than 10 of the 150 instances.
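
If reproducible numbers are needed, the evaluation can be rerun with a fixed seed; the snippet below is a sketch that assumes evaluate_Weka_classifier's seed and class arguments and the confusionMatrix component of the returned evaluation object, with 42 as an arbitrary seed:

  # Repeat the 10-fold evaluation with a fixed seed so the result can be reproduced,
  # and request per-class statistics in addition to the overall summary.
  evaluation <- evaluate_Weka_classifier(classifier, numFolds = 10,
                                         seed = 42, class = TRUE)
  evaluation$confusionMatrix   # rows are actual classes, columns are predicted classes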

Analysis

This simple case study shows that a kNN classifier makes few mistakes on a dataset that, although simple, is not linearly separable, as the scatterplots suggest. A look at the confusion matrix shows that all misclassifications are between Iris versicolor and Iris virginica instances. The case study also shows how RWeka makes it trivially easy to learn classifiers (as well as predictors and clustering models) and to experiment with their parameters.

References
  • [1] Fix, E., Hodges, J.L. (1951). Discriminatory analysis, nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field, Texas.
  • [2] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1).
  • [3] Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179-188.

