Original poster: oliyiyi

Removing Outliers Using Standard Deviation in Python

oliyiyi posted on 2017-2-17 14:58:12


Standard Deviation, a quick recap


Standard deviation is a measure of spread, i.e. how far the individual data points lie from the mean.



(Image omitted; source: Wikipedia.)

For example, consider the two data sets:

     27 23 25 22 23 20 20 25 29 29


and

     12 31 31 16 28 47 9 5 40 47


Both have a mean of roughly 25 (24.3 and 26.6, respectively). However, the values in the first data set stay close to the mean, while those in the second are much more spread out.

To be more precise, the standard deviation for the first dataset is 3.13 and for the second set is 14.67.
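These figures can be reproduced with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names `first` and `second` are illustrative and not part of the original article. It uses `statistics.pstdev`, the population standard deviation, which is the same quantity `numpy.std` computes by default.

```python
import statistics

# The two example data sets from the text
first = [27, 23, 25, 22, 23, 20, 20, 25, 29, 29]
second = [12, 31, 31, 16, 28, 47, 9, 5, 40, 47]

for name, data in (("first", first), ("second", second)):
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population standard deviation (divides by n)
    print(f"{name}: mean={mean:.2f}, sd={sd:.2f}")
```

If the sample standard deviation (dividing by n - 1) is wanted instead, `statistics.stdev` can be swapped in; the values will come out slightly larger.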

However, it's not easy to wrap your head around numbers like 3.13 or 14.67. Right now, we only know that the second data set is more “spread out” than the first one.

Let’s put this to a more practical use.


What is a normal distribution?


When we perform analytics, we often come across data in which the values cluster around a mean, with roughly as many results below it as above it, e.g.

  • heights of people
  • blood pressure values
  • test marks

Such values follow a normal distribution.

According to the Wikipedia article on normal distribution, about 68% of values drawn from a normal distribution are within one standard deviation σ away from the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations.

This fact is known as the 68-95-99.7 (empirical) rule, or the 3-sigma rule.
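The three percentages can be checked numerically. For a normal distribution, the fraction of values within k standard deviations of the mean equals erf(k / √2). The sketch below evaluates that expression with the standard library; the helper name `fraction_within` is made up for this illustration.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sd: {fraction_within(k):.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```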


Remove Outliers Using Normal Distribution and S.D.


I applied this rule successfully when I had to clean up data from millions of IoT devices attached to heating equipment. Each data point contained the electricity usage at a point in time.

However, sometimes the devices weren’t 100% accurate and would give very high or very low values.

We needed to remove these outlier values because they were making the scales on our graph unrealistic. The challenge was that the number of these outlier values was never fixed. Sometimes we would get all valid values and sometimes these erroneous readings would cover as much as 10% of the data points.

Our approach was to remove the outlier points by eliminating any points that were above (Mean + 2*SD) and any points below (Mean - 2*SD) before plotting the frequencies.

You don’t have to use 2, though. You can tweak the multiplier to get an outlier detection rule that suits your data better.
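To make the multiplier easy to tweak, the rule can be wrapped in a small function. This is only a sketch: the function name `remove_outliers`, the parameter `k` and the sample `readings` are invented for illustration and are not the code used in the project described above.

```python
import statistics

def remove_outliers(values, k=2.0):
    """Return the values lying strictly within k standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD, matching numpy.std's default
    lower, upper = mean - k * sd, mean + k * sd
    return [x for x in values if lower < x < upper]

# Made-up readings with one obviously bad value (500)
readings = [40, 60, 45, 55, 50, 52, 48, 58, 42, 500]

print(remove_outliers(readings))       # k = 2: the bad reading is dropped
print(remove_outliers(readings, k=3))  # k = 3: the bad reading survives
```

On a sample this small, a single extreme value inflates the standard deviation so much that it still falls inside three standard deviations, which is one reason the multiplier is worth tuning against real data.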

Here’s an example in Python. Most of the data set looks roughly normally distributed, but it contains a few values, such as 10 and 20, that would distort our analysis and ruin the scales on our graphs.


import numpy

arr = [10, 386, 479, 627, 20, 523, 482, 483, 542, 699, 535, 617, 577, 471, 615, 583, 441, 562, 563, 527, 453, 530, 433, 541, 585, 704, 443, 569, 430, 637, 331, 511, 552, 496, 484, 566, 554, 472, 335, 440, 579, 341, 545, 615, 548, 604, 439, 556, 442, 461, 624, 611, 444, 578, 405, 487, 490, 496, 398, 512, 422, 455, 449, 432, 607, 679, 434, 597, 639, 565, 415, 486, 668, 414, 665, 763, 557, 304, 404, 454, 689, 610, 483, 441, 657, 590, 492, 476, 437, 483, 529, 363, 711, 543]

elements = numpy.array(arr)

# Mean and (population) standard deviation of all readings
mean = numpy.mean(elements)
sd = numpy.std(elements)

# Keep only the values strictly within 2 standard deviations of the mean
final_list = [x for x in arr if mean - 2 * sd < x < mean + 2 * sd]

print(final_list)








Running this prints the list below. The outlier values 10, 20 and 763 are gone, so a plot of this data set will look much better.

  [386, 479, 627, 523, 482, 483, 542, 699, 535, 617, 577, 471, 615, 583, 441, 562, 563,   527, 453, 530, 433, 541, 585, 704, 443, 569, 430, 637, 331, 511, 552, 496, 484, 566,   554, 472, 335, 440, 579, 341, 545, 615, 548, 604, 439, 556, 442, 461, 624, 611, 444,   578, 405, 487, 490, 496, 398, 512, 422, 455, 449, 432, 607, 679, 434, 597, 639, 565,   415, 486, 668, 414, 665, 557, 304, 404, 454, 689, 610, 483, 441, 657, 590, 492, 476,   437, 483, 529, 363, 711, 543]

As you can see, we were able to remove the outliers. I wouldn’t recommend this method for every statistical analysis, though. Outliers have an important function in statistics, and they are often there for a reason!

In our case, however, the outliers were clearly caused by errors in the data, and the data was approximately normally distributed, so filtering by standard deviation made sense.

Bio: Punit Jajodia is an entrepreneur and software developer from Kathmandu, Nepal. Versatility is his biggest strength, as he has worked on a variety of projects from real-time 3D simulations on the browser and big data analytics to Windows application development. He's also the co-founder of Programiz.com, one of the largest tutorial websites on Python and R.


