OP: oliyiyi

How To Create A Perfect Decision Tree


Let us consider a scenario where a new planet is discovered by a group of astronomers. Now the question is whether it could be ‘the next Earth’. The answer to this question will revolutionize the way people live. Well, literally!

There are a number of deciding factors that need to be thoroughly researched before an intelligent decision can be made. These factors include whether water is present on the planet, what the temperature is, whether the surface is prone to continuous storms, and whether flora and fauna can survive the climate.

Let us create a decision tree to find out whether we have discovered a new habitable planet. The test conditions are:

Does the temperature fall into the habitable range of 0 to 100 degrees Celsius?

Is water present?

Do flora and fauna flourish?

Does the planet have a stormy surface?

Thus, we have a decision tree.

Classification Rules:

Classification rules are the paths from the root to a leaf node; all scenarios are taken into consideration and a class variable is assigned to each.

Class Variable:

Each leaf node is assigned a class variable. A class variable is the final output which leads to our decision.

Let us derive the classification rules from the decision tree we created (a code sketch follows the list):

1. If the temperature is not between 273 and 373 K -> Survival Difficult

2. If the temperature is between 273 and 373 K, and water is not present -> Survival Difficult

3. If the temperature is between 273 and 373 K, water is present, and flora and fauna are not present -> Survival Difficult

4. If the temperature is between 273 and 373 K, water is present, flora and fauna are present, and a stormy surface is not present -> Survival Probable

5. If the temperature is between 273 and 373 K, water is present, flora and fauna are present, and a stormy surface is present -> Survival Difficult
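To make these rules concrete, here is a minimal sketch of the five rules as a single function (Python; the function and parameter names are illustrative, not from the original article):

def survival(temp_kelvin, water, flora_fauna, stormy):
    """Apply the five classification rules derived from the decision tree."""
    if not (273 <= temp_kelvin <= 373):
        return "Survival Difficult"   # Rule 1: temperature out of range
    if not water:
        return "Survival Difficult"   # Rule 2: no water
    if not flora_fauna:
        return "Survival Difficult"   # Rule 3: no flora and fauna
    if stormy:
        return "Survival Difficult"   # Rule 5: stormy surface
    return "Survival Probable"        # Rule 4: every condition favorable

print(survival(300, water=True, flora_fauna=True, stormy=False))  # Survival Probable

Note that each rule corresponds to one root-to-leaf path, so the checks mirror the order of the tree's test conditions.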

Decision Tree

A decision tree has the following constituents:

Root Node: The factor of ‘temperature’ is considered as the root in this case.

Internal Node: A node with one incoming edge and 2 or more outgoing edges.

Leaf Node: A terminal node with no outgoing edges.

Once the decision tree is constructed, we start at the root node, check the test condition, and pass control to one of the outgoing edges; at the next node the condition is tested again, and so on. The decision tree is complete when all test conditions lead to a leaf node. The leaf nodes contain the class labels, which vote in favor of or against the decision.

Now, you might wonder: why did we start with the ‘temperature’ attribute at the root? If you choose any other attribute, the decision tree constructed will be different.

Correct. For a particular set of attributes, there can be numerous different trees. We need to choose the optimal tree, which is done by following an algorithmic approach. We will now see ‘the greedy approach’ for creating a perfect decision tree.

The Greedy Approach

“The greedy approach is based on the concept of heuristic problem solving: making the locally optimal choice at each node. By making these locally optimal choices, we reach an approximately optimal solution globally.”

The algorithm can be summarized as:

1. At each stage (node), pick out the best feature to use as the test condition.

2. Split the node into the possible outcomes (internal nodes).

3. Repeat the above steps until all test conditions have been exhausted into leaf nodes (a code sketch of this recursion follows the list).
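A minimal sketch of this recursion, assuming the data is a list of dicts with a 'label' key and that the best-feature scoring is supplied as a callback (illustrative names, not code from the original article):

from collections import Counter

def build_tree(rows, features, choose_best_feature):
    """Greedy, ID3-style tree construction."""
    labels = [r["label"] for r in rows]
    # Stop when the node is pure or no features remain: emit a leaf
    # carrying the majority class label.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_feature(rows, features)   # step 1: best test condition
    tree = {best: {}}
    for value in {r[best] for r in rows}:        # step 2: split on each outcome
        subset = [r for r in rows if r[best] == value]
        remaining = [f for f in features if f != best]
        tree[best][value] = build_tree(subset, remaining, choose_best_feature)  # step 3: repeat
    return tree

The choose_best_feature callback is where entropy and information gain come in, as explained next.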

When you start to implement the algorithm, the first question is: ‘How do we pick the starting test condition?’

The answer to this question lies in the values of ‘Entropy’ and ‘Information Gain’. Let us see what they are and how they impact the creation of our decision tree.

Entropy: In a decision tree, entropy stands for homogeneity. If the data is completely homogeneous, the entropy is 0; if the data is split 50-50 between the classes, the entropy is 1.

Information Gain: Information gain is the decrease in entropy when a node is split on an attribute.
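In formula form, these are the standard ID3 definitions, where D is the data set, p_i the probability of class i, and A an attribute whose v outcomes split D into subsets D_1, ..., D_v:

$$\text{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

$$\text{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\text{Info}(D_j), \qquad \text{Gain}(A) = \text{Info}(D) - \text{Info}_A(D)$$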

An attribute should have the highest information gain to be selected for splitting. Based on the computed values of Entropy and Information Gain, we choose the best attribute at any particular step.
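A minimal sketch of these two computations (Python; function names are mine, not from the original article):

import math
from collections import Counter

def entropy(labels):
    """Info(D): entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, feature):
    """Gain(feature): entropy of the whole set minus the weighted
    entropy of the subsets produced by splitting on the feature."""
    total_entropy = entropy([r["label"] for r in rows])
    split_entropy = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r["label"] for r in rows if r[feature] == value]
        split_entropy += len(subset) / len(rows) * entropy(subset)
    return total_entropy - split_entropy

# 9 'yes' and 5 'no' labels give the 0.940 figure used below.
print(round(entropy(["yes"] * 9 + ["no"] * 5), 3))  # 0.94

Plugging a function like information_gain into the choose_best_feature callback of the earlier sketch completes the greedy algorithm.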

Let us consider the following data:

There can be any number of decision trees formulated from this set of attributes.
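This walkthrough appears to use the classic ‘buys_computer’ training set from Han and Kamber’s Data Mining textbook (an assumption, but its 14 records reproduce every figure quoted below: 9 ‘yes’, 5 ‘no’, and all four gain values). It can be encoded as:

from collections import Counter

data = [
    {"age": "<=30",   "income": "high",   "student": "no",  "credit": "fair",      "label": "no"},
    {"age": "<=30",   "income": "high",   "student": "no",  "credit": "excellent", "label": "no"},
    {"age": "31..40", "income": "high",   "student": "no",  "credit": "fair",      "label": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "credit": "fair",      "label": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "credit": "fair",      "label": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "credit": "excellent", "label": "no"},
    {"age": "31..40", "income": "low",    "student": "yes", "credit": "excellent", "label": "yes"},
    {"age": "<=30",   "income": "medium", "student": "no",  "credit": "fair",      "label": "no"},
    {"age": "<=30",   "income": "low",    "student": "yes", "credit": "fair",      "label": "yes"},
    {"age": ">40",    "income": "medium", "student": "yes", "credit": "fair",      "label": "yes"},
    {"age": "<=30",   "income": "medium", "student": "yes", "credit": "excellent", "label": "yes"},
    {"age": "31..40", "income": "medium", "student": "no",  "credit": "excellent", "label": "yes"},
    {"age": "31..40", "income": "high",   "student": "yes", "credit": "fair",      "label": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "credit": "excellent", "label": "no"},
]

# 9 records with label 'yes', 5 with label 'no'
print(Counter(r["label"] for r in data))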

Tree Creation Trial-1:

Here we take up the attribute ‘Student’ as the initial test condition.

Tree Creation Trial-2:

Similarly, why choose ‘Student’? We could just as well choose ‘Income’ as the initial test condition.

Creating the Perfect Decision Tree With Greedy Approach

Let us follow the ‘Greedy Approach’ and construct the optimal decision tree.

There are two classes involved: ‘Yes’, i.e. the person buys a computer, and ‘No’, i.e. he does not. To calculate entropy and information gain, we first compute the probability of each of these two classes.
Positive: For ‘buys_computer = yes’, the probability comes out to be P(yes) = 9/14.

Negative: For ‘buys_computer = no’, the probability comes out to be P(no) = 5/14.

Entropy in D: We now calculate the entropy by putting these probability values into the formula stated above:

Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

We have already classified the values of Entropy, which are:

Entropy = 0 : the data is completely homogeneous (pure)

Entropy = 1 : the data is divided 50-50 between the classes (impure)

Our value of entropy is 0.940, which means our set is highly impure (close to 1).

Let us delve deeper to find the most suitable attribute, and calculate the information gain.

What is the information gain if we split on ‘Age’?

This data represents how many people in each specific age bracket buy and do not buy the product. For example, among people aged 30 or less, 2 buy (Yes) and 3 do not buy (No) the product. Info(D) is calculated for each of these 3 categories of people, as represented in the last column.

Info(D) for the Age attribute is computed as the weighted total over these 3 ranges of age values. Now, the question is: what is the ‘information gain’ if we split on the ‘Age’ attribute?

The difference between the total information value (0.940) and the information computed for the Age attribute (0.694) gives the ‘information gain’.
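Spelled out, with the bracket counts taken from the training set above (ages <=30: 2 yes/3 no; 31..40: 4 yes/0 no; >40: 3 yes/2 no):

$$\text{Info}_{\text{age}}(D) = \tfrac{5}{14}\,\text{Info}(2,3) + \tfrac{4}{14}\,\text{Info}(4,0) + \tfrac{5}{14}\,\text{Info}(3,2) = \tfrac{5}{14}(0.971) + 0 + \tfrac{5}{14}(0.971) = 0.694$$

$$\text{Gain}(\text{age}) = 0.940 - 0.694 = 0.246$$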

This is the deciding factor for whether we should split at ‘Age’ or any other attribute. Similarly, we calculate the ‘information gain’ for the rest of the attributes:
Information Gain (Age) = 0.246
Information Gain (Income) = 0.029
Information Gain (Student) = 0.151
Information Gain (credit_rating) = 0.048

On comparing these values of gain for all the attributes, we find that the ‘information gain’ for ‘Age’ is the highest. Thus, splitting on ‘Age’ is a good decision.
Similarly, at each subsequent split we compare the information gains to find out which attribute should be chosen for the split.
Thus, the optimal tree created looks like:
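(Sketched here from the classification rules listed below.)

                      [ Age? ]
          /              |              \
       <=30           31..40            >40
         |               |                |
   [ Student? ]         YES      [ Credit_rating? ]
     /       \                      /          \
    no       yes               excellent      fair
    |         |                    |            |
   NO        YES                  NO           YES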

The classification rules for this tree can be jotted down as:

If a person’s age is less than 30 and he is not a student, he will not buy the product.

Age(<30) ^ student(no) = NO

If a person’s age is less than 30 and he is a student, he will buy the product.

Age(<30) ^ student(yes) = YES

If a person’s age is between 31 and 40, he is most likely to buy.

Age(31…40) = YES

If a person’s age is greater than 40 and he has an excellent credit rating, he will not buy.

Age(>40) ^ credit_rating(excellent) = NO

If a person’s age is greater than 40, with a fair credit rating, he will probably buy.

Age(>40) ^ credit_rating(fair) = YES
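As a quick check, the five rules can be folded into a single predict function (a sketch; run against the data list encoded earlier, it reproduces the label of all 14 training records):

def predict(row):
    """Classify a record using the five rules of the final tree."""
    if row["age"] == "<=30":
        return "yes" if row["student"] == "yes" else "no"
    if row["age"] == "31..40":
        return "yes"
    # age > 40: the decision depends on the credit rating
    return "no" if row["credit"] == "excellent" else "yes"

# With the data list from above: all(predict(r) == r["label"] for r in data) -> True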

Thus, we achieve the perfect decision tree!

Replies:

西门高 (2017-2-22 22:34:52): Here to learn.

robertdoo (2017-2-22 22:44:27): Thanks!!

h2h2 (2017-2-23 00:53:20): Thanks for sharing
