Thread starter: oliyiyi

3 methods to deal with outliers



An outlier is a data point that is distant from other, similar points. Outliers may be due to variability in the measurement or may indicate experimental errors. If possible, outliers should be excluded from the data set. However, detecting such anomalous instances can be very difficult, and is not always possible.

Introduction

Machine learning algorithms are very sensitive to the range and distribution of attribute values. Data outliers can spoil and mislead the training process, resulting in longer training times, less accurate models and, ultimately, poorer results.

In this article, we are going to discuss 3 different methods of dealing with outliers:

  • Univariate method: This method looks for data points with extreme values on a single variable.
  • Multivariate method: Here we look for unusual combinations across all the variables.
  • Minkowski error: This method reduces the contribution of potential outliers during the training process.

To illustrate these methods, we will use a data set obtained from the following function.

y = sin(π·x)

Once we have our data set, we replace two y values with others that are far from our function. The next graph depicts this data set.

The points A = (-0.5, -1.5) and B = (0.5, 0.5) are outliers. Point A is outside the range defined by the y data, while Point B is inside that range. As we will see, this makes them different in nature, and we will need different methods to detect and treat them.
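As a minimal sketch of how such a data set could be built (the grid of 41 evenly spaced x values is an assumption; the article does not specify the sampling):

    # Sketch: build the y = sin(pi*x) data set and inject the two outliers.
    # The grid of 41 evenly spaced x values is an assumption.
    import numpy as np

    x = np.linspace(-1, 1, 41)
    y = np.sin(np.pi * x)

    # Replace two y values with the outliers A = (-0.5, -1.5) and B = (0.5, 0.5).
    y[np.argmin(np.abs(x + 0.5))] = -1.5   # Point A
    y[np.argmin(np.abs(x - 0.5))] = 0.5    # Point B

The later sketches in this post reuse these x and y arrays.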

1. Univariate method

One of the simplest methods for detecting outliers is the use of box plots. A box plot is a graphical display for describing the distribution of the data. Box plots use the median and the lower and upper quartiles.

Tukey's method defines outliers as those values of the data set that fall far from the central point, the median. The maximum distance to the center of the data that is allowed is called the cleaning parameter. If the cleaning parameter is very large, the test becomes less sensitive to outliers; if it is too small, many values will be detected as outliers.

The following chart shows the box plot for the variable y. The minimum of the variable is -1.5, the first quartile is -0.707, the second quartile or median is 0, the third quartile is 0.588 and the maximum is 0.988.

As we can see, the minimum is far away from the first quartile and the median. If we set the cleaning parameter to 0.6, Tukey's method detects Point A as an outlier and removes it from the data set.
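A minimal sketch of this cleaning step, assuming the cleaning parameter acts as the multiplier on the interquartile range in the usual Tukey fences (the post does not spell out the exact formula):

    # Sketch: Tukey's method with a configurable cleaning parameter.
    # Treating the cleaning parameter as the IQR multiplier in the Tukey
    # fences is an assumption; with 0.6 it flags Point A (y = -1.5).
    import numpy as np

    def tukey_outliers(values, cleaning_parameter=0.6):
        q1, q3 = np.percentile(values, [25, 75])
        iqr = q3 - q1
        lower = q1 - cleaning_parameter * iqr
        upper = q3 + cleaning_parameter * iqr
        return (values < lower) | (values > upper)

    mask = tukey_outliers(y)                # True where a value is an outlier
    x_clean, y_clean = x[~mask], y[~mask]   # Point A is removed

With the quartiles quoted above (Q1 = -0.707, Q3 = 0.588), the lower fence is -0.707 - 0.6 * 1.295 ≈ -1.48, so only Point A (y = -1.5) falls outside it.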

Plotting the box plot for that variable again, we can see that the outlier has been removed. As a consequence, the distribution of the data is now much healthier. The minimum of y is now -0.9858, the first quartile is -0.588, the second quartile or median is 0.078, the third quartile is 0.707 and the maximum is 0.988.

However, this univariate method has not detected Point B, and therefore we are not finished.

2. Multivariate method

Outliers do not need to be extreme values. Therefore, as we have seen with Point B, the univariate method does not always work well. The multivariate method tries to solve this by building a model using all the available data, and then removing those instances whose errors exceed a given threshold.

In this case, we have trained a neural network using all the available data (except Point A, which was excluded by the univariate method). Once we have our predictive model, we perform a linear regression analysis to obtain the next graph. The predicted values are plotted against the actual ones as squares. The coloured line indicates the best linear fit; the grey line would indicate a perfect fit.

As we can see, there is a point that falls too far from the model. This point is spoiling the model, so we can suspect that it is another outlier.

To find that point quantitatively, we can calculate the errors between the outputs of the model and the targets in the data, and rank them. The following table lists the 5 instances with the maximum errors.

We can see that instance 11 stands out for having a large error in comparison with the others (0.430 versus 0.069, ...). If we look at the linear regression graph, this instance matches the point that is far away from the model.

If we set the maximum-error threshold to 20%, this method identifies Point B as an outlier and removes it from the data set. We can verify this by performing the linear regression analysis again.
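A rough sketch of the whole procedure, with scikit-learn's MLPRegressor standing in for the original neural network (the specific model, its settings, and reading the 20% threshold as a fraction of the target range are all assumptions):

    # Sketch: flag instances whose prediction error exceeds 20% of the
    # target range. MLPRegressor is a stand-in for the article's network.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    model = MLPRegressor(hidden_layer_sizes=(24,), max_iter=5000, random_state=0)
    model.fit(x_clean.reshape(-1, 1), y_clean)

    errors = np.abs(model.predict(x_clean.reshape(-1, 1)) - y_clean)
    print(np.argsort(errors)[::-1][:5])     # the 5 instances with maximum errors

    threshold = 0.2 * (y_clean.max() - y_clean.min())
    keep = errors <= threshold              # Point B should fail this test
    x_final, y_final = x_clean[keep], y_clean[keep]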

There are no more outliers in our data set, so the generalization capabilities of our model will improve notably.

3. Minkowski error

Now we are going to talk about a different method for dealing with outliers. Unlike the univariate and multivariate methods, it does not detect and remove the outliers. Instead, it reduces the impact that outliers have on the model.

The Minkowski error is a loss index that is less sensitive to outliers than the standard sum squared error. The sum squared error squares each instance error, so outliers make an outsized contribution to the total error. The Minkowski error mitigates this by raising each instance error to a power smaller than 2, for instance 1.5, which reduces the contribution of outliers to the total error. For instance, if an outlier has an error of 10, the squared error for that instance will be 100, while the Minkowski error will be 10^1.5 ≈ 31.62.
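As a minimal sketch, with predictions y_i and targets t_i the two loss indices are E_SSE = Σ (y_i - t_i)^2 and E_Minkowski = Σ |y_i - t_i|^1.5:

    # Sketch: Minkowski error with exponent p; p < 2 damps the influence
    # of large instance errors compared with the sum squared error.
    import numpy as np

    def minkowski_error(predictions, targets, p=1.5):
        return np.sum(np.abs(predictions - targets) ** p)

    # A single instance error of 10 contributes 100 to the sum squared
    # error but only 10**1.5 ~= 31.62 to the Minkowski error.
    print(10.0 ** 2, 10.0 ** 1.5)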

To illustrate this method, we are going to build two different neural network models from our data set containing the two outliers (A and B). The architecture selected for this network is 1:24:1. The first model will be trained with the sum squared error, and the second one with the Minkowski error.

The model trained with sum squared error is plotted in the next figure. As we can see, two outliers are spoiling the model.

Now we are going to train the same neural network with the Minkowski error. The resulting model is depicted next. As we can see, the Minkowski error has made the training process less sensitive to outliers than the sum squared error.
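To make the comparison reproducible in a few lines, here is a sketch that fits one model under each loss; a degree-5 polynomial replaces the 1:24:1 network purely for brevity, so the model choice is an assumption:

    # Sketch: fit the same model by minimizing the sum squared error
    # (p = 2) versus the Minkowski error (p = 1.5) on the data set that
    # still contains outliers A and B.
    import numpy as np
    from scipy.optimize import minimize

    def loss(coeffs, p):
        residuals = np.polyval(coeffs, x) - y
        return np.sum(np.abs(residuals) ** p)

    c0 = np.zeros(6)                        # degree-5 polynomial
    sse_fit = minimize(lambda c: loss(c, 2.0), c0, method="Nelder-Mead",
                       options={"maxiter": 20000}).x
    mink_fit = minimize(lambda c: loss(c, 1.5), c0, method="Nelder-Mead",
                        options={"maxiter": 20000}).x
    # The Minkowski fit stays closer to sin(pi*x) near the two outliers.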

As a result, the Minkowski error has improved the quality of our model notably.

Conclusions

We have seen that outliers are one of the main problems when building a predictive model. Indeed, they cause data scientists to achieve poorer results than they otherwise could. To solve this, we need effective methods to detect those spurious points and remove them.

In this article, we have seen 3 different methods for dealing with outliers: the univariate method, the multivariate method and the Minkowski error. These methods are complementary and, if our data set has many difficult outliers, we might need to try them all.




Reply #1
paulinokok posted on 2017-2-25 18:54:07
thank you


Reply #2
ivyvivi posted on 2017-4-5 16:45:15
Looking at it this way, the Minkowski error method does seem very good. May I ask where this text comes from? I'd like to read the whole thing. Thanks, OP.

