Naive Bayes: A Baseline Model for Machine Learning Classification Performance

oliyiyi posted on 2019-5-8 19:42:49


By Asel Mendis, KDnuggets.


Bayes Theorem

P(A|B) = P(B|A) * P(A) / P(B)

The equation above is Bayes' Theorem. It gives the probability of an event A occurring given that a related event B has occurred, by updating our prior knowledge of A, P(A), with the evidence from B.

Let's explore the parts of Bayes' Theorem:

P(A|B) - Posterior Probability
- The conditional probability that event A occurs given that event B has occurred.
P(A) - Prior Probability
- The probability of event A.
P(B) - Evidence
- The probability of event B.
P(B|A) - Likelihood
- The conditional probability of B occurring given event A has occurred.

Now, let's explore the parts of Bayes' Theorem through the eyes of someone doing machine learning:

P(A|B) - Posterior Probability
- The conditional probability of the response variable (target variable) given the training data inputs.
P(A) - Prior Probability
- The probability of the response variable (target variable).
P(B) - Evidence
- The probability of the training data.
P(B|A) - Likelihood
- The conditional probability of the training data given the response variable.

In classification notation, P(c|x) = P(x|c) * P(c) / P(x), where:

P(c|x) - Posterior probability of the target/class (c) given predictors (x).
P(c) - Prior probability of the class (target).
P(x|c) - Probability of the predictor (x) given the class/target (c).
P(x) - Prior probability of the predictor (x).
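
The evidence P(x) is often not given directly. By the law of total probability it can be obtained by summing P(x|c) * P(c) over all classes. A minimal sketch of that idea follows; the helper name `posterior_from_parts` and the two-class numbers are hypothetical and are not taken from the article:

```python
def posterior_from_parts(priors, likelihoods, target):
    """Return P(target | x) from per-class priors P(c) and likelihoods P(x | c).

    `priors` and `likelihoods` are dicts keyed by class label.
    """
    # Evidence P(x) = sum over classes of P(x | c) * P(c)
    evidence = sum(likelihoods[c] * priors[c] for c in priors)
    return likelihoods[target] * priors[target] / evidence


# Illustrative numbers only (hypothetical, not from the tennis data)
priors = {"c1": 0.2, "c2": 0.8}
likelihoods = {"c1": 0.5, "c2": 0.1}

p_c1 = posterior_from_parts(priors, likelihoods, "c1")
p_c2 = posterior_from_parts(priors, likelihoods, "c2")
print(round(p_c1, 3), round(p_c2, 3))
```

Because the evidence is shared by every class, the posteriors computed this way always sum to 1.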

Example of using Bayes theorem:
I'll be using the tennis weather dataset.


oliyiyi posted on 2019-5-8 19:43:38

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline  # IPython/Jupyter magic; omit in a plain Python script

    tennis = pd.read_csv('tennis.csv')
    tennis

        outlook   temp  humidity  windy  play
    0   sunny     hot   high      False  no
    1   sunny     hot   high      True   no
    2   overcast  hot   high      False  yes
    3   rainy     mild  high      False  yes
    4   rainy     cool  normal    False  yes
    5   rainy     cool  normal    True   no
    6   overcast  cool  normal    True   yes
    7   sunny     mild  high      False  no
    8   sunny     cool  normal    False  yes
    9   rainy     mild  normal    False  yes
    10  sunny     mild  normal    True   yes
    11  overcast  mild  high      True   yes
    12  overcast  hot   normal    False  yes
    13  rainy     mild  high      True   no

Let's take a look at how each category looks in a frequency table:

    outlook = tennis.groupby(['outlook', 'play']).size()
    temp = tennis.groupby(['temp', 'play']).size()
    humidity = tennis.groupby(['humidity', 'play']).size()
    windy = tennis.groupby(['windy', 'play']).size()
    play = tennis.play.value_counts()

    print(temp)
    print('------------------')
    print(humidity)
    print('------------------')
    print(windy)
    print('------------------')
    print(outlook)
    print('------------------')
    print('play')
    print(play)

    temp  play
    cool  no      1
          yes     3
    hot   no      2
          yes     2
    mild  no      2
          yes     4
    dtype: int64
    ------------------
    humidity  play
    high      no      4
              yes     3
    normal    no      1
              yes     6
    dtype: int64
    ------------------
    windy  play
    False  no      2
           yes     6
    True   no      3
           yes     3
    dtype: int64
    ------------------
    outlook   play
    overcast  yes     4
    rainy     no      2
              yes     3
    sunny     no      3
              yes     2
    dtype: int64
    ------------------
    play
    yes    9
    no     5
    Name: play, dtype: int64

What is the probability of playing tennis given it is rainy?

P(outlook=rainy|play=yes)
- frequency of (outlook=rainy) when (play=yes) / frequency of (play=yes) = 3/9
P(play=yes)
- frequency of (play=yes) / total(play) = 9/14
P(outlook=rainy)
- frequency of (outlook=rainy) / total(outlook) = 5/14

(3/9)*(9/14)/(5/14)

0.6

The probability of playing tennis when it is rainy is 60%. The process is very simple once you obtain the frequencies for each category.
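
The frequency counting above can also be done in a few lines of plain Python. This is only a sketch: the (outlook, play) pairs are copied from the 14-row table earlier in the post, and the variable names are my own rather than the article's:

```python
from collections import Counter

# (outlook, play) pairs copied from the 14-row tennis table, in row order
rows = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]

n = len(rows)
joint = Counter(rows)                     # counts of each (outlook, play) pair
play = Counter(p for _, p in rows)        # counts of each play value
outlook = Counter(o for o, _ in rows)     # counts of each outlook value

likelihood = joint[("rainy", "yes")] / play["yes"]   # P(rainy | yes) = 3/9
prior = play["yes"] / n                              # P(yes) = 9/14
evidence = outlook["rainy"] / n                      # P(rainy) = 5/14

posterior = likelihood * prior / evidence

# Cross-check: the same quantity read directly as a conditional frequency, 3/5
direct = joint[("rainy", "yes")] / outlook["rainy"]
print(round(posterior, 3), round(direct, 3))
```

Both routes agree because, with a single predictor, Bayes' Theorem is just a rearrangement of the same counts.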

Here is a simple function to help newcomers remember the parts of Bayes' equation:

    def bayestheorem():
        print('Posterior [P(c|x)] - Posterior probability of the target/class (c) given predictors (x)')
        print('Prior [P(c)] - Prior probability of the class (target)')
        print('Likelihood [P(x|c)] - Probability of the predictor (x) given the class/target (c)')
        print('Evidence [P(x)] - Prior probability of the predictor (x)')

Here is a simple function that calculates the posterior probability for you, but you must be able to find each part of Bayes' equation yourself.

    def bayesposterior(prior, likelihood, evidence, string):
        print('Prior=', prior)
        print('Likelihood=', likelihood)
        print('Evidence=', evidence)
        print('Equation =', '(Prior*Likelihood)/Evidence')
        print(string, (prior*likelihood)/evidence)

Let's look at another way to find the posterior probability, this time using contingency tables in Python:

    ct = pd.crosstab(tennis['outlook'], tennis['play'], margins=True)
    ct.columns = ["no", "yes", "rowtotal"]
    ct.index = ["overcast", "rainy", "sunny", "coltotal"]
    print(ct)

              no  yes  rowtotal
    overcast   0    4         4
    rainy      2    3         5
    sunny      3    2         5
    coltotal   5    9        14

Dividing every cell by the grand total gives the joint proportions:

    ct / ct.loc["coltotal", "rowtotal"]

                    no       yes  rowtotal
    overcast  0.000000  0.285714  0.285714
    rainy     0.142857  0.214286  0.357143
    sunny     0.214286  0.142857  0.357143
    coltotal  0.357143  0.642857  1.000000

To get the proportions within each column, divide by the column totals:

    ct / ct.loc["coltotal"]

               no       yes  rowtotal
    overcast  0.0  0.444444  0.285714
    rainy     0.4  0.333333  0.357143
    sunny     0.6  0.222222  0.357143
    coltotal  1.0  1.000000  1.000000

To get the proportions within each row, divide by the row totals:

    ct.div(ct["rowtotal"], axis=0)

                    no       yes  rowtotal
    overcast  0.000000  1.000000       1.0
    rainy     0.400000  0.600000       1.0
    sunny     0.600000  0.400000       1.0
    coltotal  0.357143  0.642857       1.0

These tables are all pandas dataframe objects. Therefore using pandas subsetting and the bayesposterior function I made, we can arrive at the same conclusion:

    bayesposterior(prior = ct.iloc[3,1]/ct.iloc[3,2],        # P(play=yes) = 9/14
                   likelihood = ct.iloc[1,1]/ct.iloc[3,1],   # P(rainy|play=yes) = 3/9
                   evidence = ct.iloc[1,2]/ct.iloc[3,2],     # P(rainy) = 5/14
                   string = 'Probability of Tennis given Rain =')

    Prior= 0.6428571428571429
    Likelihood= 0.3333333333333333
    Evidence= 0.35714285714285715
    Equation = (Prior*Likelihood)/Evidence
    Probability of Tennis given Rain = 0.6

Naive Bayes Algorithm

Naive Bayes is a supervised machine learning algorithm based on Bayes' theorem and the principles of conditional probability. It is a classification algorithm for both binary and multi-class problems. It is called "naive" because it assumes that the predictors are conditionally independent of one another given the class, which lets it combine the probability of each attribute value within each class by simple multiplication to make a prediction.
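
Written out for predictors x_1, ..., x_n and a class c, and using the "naive" assumption that the predictors are conditionally independent given the class, the model can be sketched as follows (the symbols n and \hat{c} are notation introduced here, not the article's):

```latex
\[
P(c \mid x_1, \dots, x_n)
  = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}
\]
\[
P(x_1, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)
\quad \text{(naive independence assumption)}
\]
\[
P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
\]
```

The denominator does not depend on c, so it can be dropped when the goal is only to pick the most probable class.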

Example
What is the probability of playing tennis when it is sunny, hot, highly humid and windy? So using the tennis dataset, we need to use the Naive Bayes method to predict the probability of someone playing tennis given the mentioned weather conditions.

    pd.crosstab(tennis['outlook'], tennis['play'], margins = True)

    play      no  yes  All
    outlook
    overcast   0    4    4
    rainy      2    3    5
    sunny      3    2    5
    All        5    9   14

    pd.crosstab(tennis['temp'], tennis['play'], margins = True)

    play  no  yes  All
    temp
    cool   1    3    4
    hot    2    2    4
    mild   2    4    6
    All    5    9   14

    pd.crosstab(tennis['humidity'], tennis['play'], margins = True)

    play      no  yes  All
    humidity
    high       4    3    7
    normal     1    6    7
    All        5    9   14

    pd.crosstab(tennis['windy'], tennis['play'], margins = True)

    play   no  yes  All
    windy
    False   2    6    8
    True    3    3    6
    All     5    9   14

    pd.crosstab(index=tennis['play'], columns="count", margins=True)

    col_0  count  All
    play
    no         5    5
    yes        9    9
    All       14   14

Now by using the above contingency tables, we will go through how the Naive Bayes algorithm calculates the posterior probability.

Calculate P(x|play=yes). In this case x refers to all the predictors: 'outlook', 'temp', 'humidity' and 'windy'.
- P(outlook=sunny|play=yes) → 2/9
- P(temp=hot|play=yes) → 2/9
- P(humidity=high|play=yes) → 3/9
- P(windy=True|play=yes) → 3/9

    p_x_yes = ((2/9)*(2/9)*(3/9)*(3/9))
    print('The probability of the predictors given playing tennis is', '%.3f'%p_x_yes)

    The probability of the predictors given playing tennis is 0.005

Calculate P(x|play=no) using the same method as above.
- P(outlook=sunny|play=no) → 3/5
- P(temp=hot|play=no) → 2/5
- P(humidity=high|play=no) → 4/5
- P(windy=True|play=no) → 3/5

    p_x_no = ((3/5)*(2/5)*(4/5)*(3/5))
    print('The probability of the predictors given not playing tennis is', '%.3f'%p_x_no)

    The probability of the predictors given not playing tennis is 0.115

Calculate P(play=yes) and P(play=no).
- P(play=yes) → 9/14
- P(play=no) → 5/14

    yes = (9/14)
    no = (5/14)
    print('The probability of playing tennis is', '%.3f'% yes)
    print('The probability of not playing tennis is', '%.3f'% no)

    The probability of playing tennis is 0.643
    The probability of not playing tennis is 0.357


oliyiyi posted on 2019-5-8 19:44:04

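Calculate the probability of playing and not playing tennis given the predictors. Each value below is the prior multiplied by the likelihood; the evidence P(x) is left out because it is the same for both classes, so the results are proportional to the posterior probabilities rather than equal to them.

    yes_x = p_x_yes*yes
    print('The probability of playing tennis given the predictors is', '%.3f'%yes_x)
    no_x = p_x_no*no
    print('The probability of not playing tennis given the predictors is', '%.3f'%no_x)

    The probability of playing tennis given the predictors is 0.004
    The probability of not playing tennis given the predictors is 0.041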
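
The two values above do not sum to 1 because the evidence was not divided out. If proper probabilities are wanted, one option is to normalize the scores by their sum, which is the evidence under the model's independence assumption. A small sketch using the fractions from the steps above (the names `p_yes` and `p_no` are introduced here for illustration):

```python
# Unnormalized scores from the steps above: P(x | class) * P(class)
yes_x = (2/9) * (2/9) * (3/9) * (3/9) * (9/14)
no_x = (3/5) * (2/5) * (4/5) * (3/5) * (5/14)

# Under the naive independence assumption, the evidence is the sum of the scores
evidence = yes_x + no_x

p_yes = yes_x / evidence
p_no = no_x / evidence
print('Normalized P(play=yes | x) = %.3f' % p_yes)
print('Normalized P(play=no | x) = %.3f' % p_no)
```

Normalizing does not change which class wins; it only rescales the two scores so they can be read as probabilities (roughly 8% for playing and 92% for not playing).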
The prediction will be whichever probability is higher:

    if yes_x > no_x:
        print('The probability of playing tennis is higher when the outlook is sunny, the temperature is hot, there is high humidity and it is windy.')
    else:
        print('The probability of not playing tennis is higher when the outlook is sunny, the temperature is hot, there is high humidity and it is windy.')

    The probability of not playing tennis is higher when the outlook is sunny, the temperature is hot, there is high humidity and it is windy.
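
The hand calculation can be generalized so that the counts are derived from the data rather than read off the tables. The following is a minimal from-scratch sketch in plain Python, not the article's code and not scikit-learn's implementation; the helpers `fit` and `score` are hypothetical names, and the rows are copied from the tennis table earlier in the thread:

```python
from collections import Counter, defaultdict

# Rows copied from the tennis table: (outlook, temp, humidity, windy, play)
data = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def fit(rows):
    """Count class frequencies and per-feature value frequencies within each class."""
    class_counts = Counter(row[-1] for row in rows)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for *features, label in rows:
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts


def score(x, class_counts, feature_counts):
    """Return the unnormalized score P(c) * prod_i P(x_i | c) for every class c."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_c in class_counts.items():
        s = n_c / total  # prior P(c)
        for i, value in enumerate(x):
            # likelihood P(x_i | c); no smoothing, so an unseen pair gives 0
            s *= feature_counts[(i, label)][value] / n_c
        scores[label] = s
    return scores


class_counts, feature_counts = fit(data)
scores = score(("sunny", "hot", "high", True), class_counts, feature_counts)
prediction = max(scores, key=scores.get)
print(prediction, {c: round(s, 3) for c, s in scores.items()})
```

For the sunny, hot, high humidity, windy case this reproduces the scores worked out by hand (about 0.004 for "yes" and 0.041 for "no") and therefore predicts "no".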

Types of Naive Bayes Algorithm

Python's scikit-learn gives the user access to several Naive Bayes models, including the following three.

Gaussian
- The Gaussian NB algorithm assumes that all features (predictors) are continuous and that each follows a Gaussian (normal) distribution within each class.
Multinomial
- Multinomial NB is suited to discrete data made up of frequencies and counts. Spam filtering and text/document classification are two very well-known use cases.
Bernoulli
- Bernoulli NB is similar to Multinomial NB except that it is for boolean/binary features. Like the multinomial method, it can be used for spam filtering and document classification, with each term represented as binary (i.e. the occurrence of a word in a document is recorded as True or False).
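
The practical difference between these variants is the form assumed for the per-feature likelihood P(x_i|c). As a rough illustration of the Gaussian and Bernoulli cases only, here is a sketch in plain Python; the function names and parameter values are hypothetical and this is not scikit-learn's API:

```python
import math


def gaussian_likelihood(x, mean, var):
    """Normal density N(x; mean, var), the per-feature likelihood Gaussian NB assumes."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def bernoulli_likelihood(x, p):
    """P(x | class) for a binary feature x in {0, 1}, where P(x=1 | class) = p."""
    return p if x == 1 else 1 - p


# Hypothetical per-class parameters, for illustration only
print(gaussian_likelihood(0.0, mean=0.0, var=1.0))
print(bernoulli_likelihood(1, p=0.3))
```

In a fitted model, the mean, variance and p would be estimated separately for each feature within each class from the training data.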

Let's implement a Multinomial and a Gaussian model with scikit-learn:

    from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import *
