OP: ReneeBK

[Case Study] Data Mining using Decision Trees


OP
ReneeBK posted on 2016-5-12 10:27:54

  1. # Data Mining using Decision Trees
  2. #### Part 1: Decision Trees -------------------

  3. ## Understanding Decision Trees ----
  4. # calculate entropy of a two-class segment
  5. -0.60 * log2(0.60) - 0.40 * log2(0.40)   # entropy of a 60/40 split, about 0.971 bits

  6. curve(-x * log2(x) - (1 - x) * log2(1 - x),
  7.       col="red", xlab = "x", ylab = "Entropy", lwd=4)

  8. ## Example: Identifying Risky Bank Loans ----
  9. ## Step 2: Exploring and preparing the data ----
  10. credit <- read.csv("credit.csv")
  11. str(credit)

  12. # look at two characteristics of the applicant
  13. table(credit$checking_balance)
  14. table(credit$savings_balance)

  15. # look at two characteristics of the loan
  16. summary(credit$months_loan_duration)
  17. summary(credit$amount)

  18. # look at the class variable
  19. table(credit$default)

  20. # create a random sample for training and test data
  21. # use set.seed to use the same random number sequence as the tutorial
  22. set.seed(12345)
  23. credit_rand <- credit[order(runif(1000)), ]

  24. # compare the credit and credit_rand data frames
  25. summary(credit$amount)
  26. summary(credit_rand$amount)
  27. head(credit$amount)
  28. head(credit_rand$amount)

  29. # split the data frames
  30. credit_train <- credit_rand[1:900, ]
  31. credit_test  <- credit_rand[901:1000, ]

  32. # check the proportion of class variable
  33. prop.table(table(credit_train$default))
  34. prop.table(table(credit_test$default))

  35. ## Step 3: Training a model on the data ----
  36. # build the simplest decision tree
  37. library(C50)
  38. credit_model <- C5.0(credit_train[-17], credit_train$default)

  39. # display simple facts about the tree
  40. credit_model

  41. # display detailed information about the tree
  42. summary(credit_model)

  43. ## Step 4: Evaluating model performance ----
  44. # create a factor vector of predictions on test data
  45. credit_pred <- predict(credit_model, credit_test)

  46. # cross tabulation of predicted versus actual classes
  47. library(gmodels)
  48. CrossTable(credit_test$default, credit_pred,
  49.            prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
  50.            dnn = c('actual default', 'predicted default'))

  51. ## Step 5: Improving model performance ----

  52. ## Boosting the accuracy of decision trees
  53. # boosted decision tree with 10 trials
  54. credit_boost10 <- C5.0(credit_train[-17], credit_train$default,
  55.                        trials = 10)
  56. credit_boost10
  57. summary(credit_boost10)

  58. credit_boost_pred10 <- predict(credit_boost10, credit_test)
  59. CrossTable(credit_test$default, credit_boost_pred10,
  60.            prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
  61.            dnn = c('actual default', 'predicted default'))

  62. # boosted decision tree with 100 trials (not shown in text)
  63. credit_boost100 <- C5.0(credit_train[-17], credit_train$default,
  64.                         trials = 100)
  65. credit_boost_pred100 <- predict(credit_boost100, credit_test)
  66. CrossTable(credit_test$default, credit_boost_pred100,
  67.            prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
  68.            dnn = c('actual default', 'predicted default'))

  69. ## Making some mistakes more costly than others
  70. # create a cost matrix
  71. error_cost <- matrix(c(0, 1, 4, 0), nrow = 2)
  72. error_cost

  73. # apply the cost matrix to the tree
  74. credit_cost <- C5.0(credit_train[-17], credit_train$default,
  75.                           costs = error_cost)
  76. credit_cost_pred <- predict(credit_cost, credit_test)

  77. CrossTable(credit_test$default, credit_cost_pred,
  78.            prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
  79.            dnn = c('actual default', 'predicted default'))
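A brief follow-up to Steps 4 and 5 (not part of the original script): with the prediction objects above still in the workspace, the overall test-set accuracy of each model can be read off directly rather than from the cross tables.

# overall accuracy on the 100 held-out applicants
mean(credit_pred == credit_test$default)           # single C5.0 tree
mean(credit_boost_pred10 == credit_test$default)   # boosted, 10 trials
mean(credit_cost_pred == credit_test$default)      # cost-sensitive tree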


2
ReneeBK posted on 2016-5-12 10:32:28
  1. /**
  2. *
  3. */

  4. package dt;

  5. import java.util.*;

  6. public class DecisionTree {
  7.   /**
  8.    * Contains the set of available attributes.
  9.    */
  10.   private LinkedHashSet<String> attributes;

  11.   /**
  12.    * Maps an attribute name to a set of possible decisions for that attribute.
  13.    */
  14.   private Map<String, Set<String> > decisions;
  15.   private boolean decisionsSpecified;

  16.   /**
  17.    * Contains the examples to be processed into a decision tree.
  18.    *
  19.    * The 'attributes' and 'decisions' member variables should be updated
  20.    * prior to adding examples that refer to new attributes or decisions.
  21.    */
  22.   private Examples examples;

  23.   /**
  24.    * Indicates if the provided data has been processed into a decision tree.
  25.    *
  26.    * This value is initially false, and is reset any time additional data is
  27.    * provided.
  28.    */
  29.   private boolean compiled;

  30.   /**
  31.    * Contains the top-most attribute of the decision tree.
  32.    *
  33.    * For a tree where the decision requires no attributes,
  34.    * the rootAttribute yields a boolean classification.
  35.    *
  36.    */
  37.   private Attribute rootAttribute;

  38.   private Algorithm algorithm;

  39.   public DecisionTree() {
  40.     algorithm = null;
  41.     examples = new Examples();
  42.     attributes = new LinkedHashSet<String>();
  43.     decisions = new HashMap<String, Set<String> >();
  44.     decisionsSpecified = false;
  45.   }

  46.   private void setDefaultAlgorithm() {
  47.     if ( algorithm == null )
  48.       setAlgorithm(new ID3Algorithm(examples));
  49.   }

  50.   public void setAlgorithm(Algorithm algorithm) {
  51.     this.algorithm = algorithm;
  52.   }

  53.   /**
  54.    * Saves the array of attribute names in an insertion ordered set.
  55.    *
  56.    * The ordering of attribute names is used when addExamples is called to
  57.    * determine which values correspond with which names.
  58.    *
  59.    */
  60.   public DecisionTree setAttributes(String[] attributeNames) {
  61.     compiled = false;

  62.     decisions.clear();
  63.     decisionsSpecified = false;

  64.     attributes.clear();

  65.     for ( int i = 0 ; i < attributeNames.length ; i++ )
  66.       attributes.add(attributeNames[i]);

  67.     return this;
  68.   }

  69.   /**
  70.    */
  71.   public DecisionTree setDecisions(String attributeName, String[] decisions) {
  72.     if ( !attributes.contains(attributeName) ) {
  73.       // TODO some kind of warning or something
  74.       return this;
  75.     }

  76.     compiled = false;
  77.     decisionsSpecified = true;

  78.     Set<String> decisionsSet = new HashSet<String>();
  79.     for ( int i = 0 ; i < decisions.length ; i++ )
  80.       decisionsSet.add(decisions[i]);

  81.     this.decisions.put(attributeName, decisionsSet);

  82.     return this;
  83.   }

  84.   /**
  85.    */
  86.   public DecisionTree addExample(String[] attributeValues, boolean classification) throws UnknownDecisionException {
  87.     String[] attributes = this.attributes.toArray(new String[0]);

  88.     if ( decisionsSpecified )
  89.       for ( int i = 0 ; i < attributeValues.length ; i++ )
  90.         if ( !decisions.get(attributes[i]).contains(attributeValues[i]) ) {
  91.           throw new UnknownDecisionException(attributes[i], attributeValues[i]);
  92.         }

  93.     compiled = false;

  94.     examples.add(attributes, attributeValues, classification);
  95.    
  96.     return this;
  97.   }

  98.   public DecisionTree addExample(Map<String, String> attributes, boolean classification) throws UnknownDecisionException {
  99.     compiled = false;

  100.     examples.add(attributes, classification);

  101.     return this;
  102.   }

  103.   public boolean apply(Map<String, String> data) throws BadDecisionException {
  104.     compile();

  105.     return rootAttribute.apply(data);
  106.   }

  107.   private Attribute compileWalk(Attribute current, Map<String, String> chosenAttributes, Set<String> usedAttributes) {
  108.     // if the current attribute is a leaf, then there are no decisions and thus no
  109.     // further attributes to find.
  110.     if ( current.isLeaf() )
  111.       return current;

  112.     // get decisions for the current attribute (from this.decisions)
  113.     String attributeName = current.getName();

  114.     // remove this attribute from all further consideration
  115.     usedAttributes.add(attributeName);

  116.     for ( String decisionName : decisions.get(attributeName) ) {
  117.       // overwrite the attribute decision for each value considered
  118.       chosenAttributes.put(attributeName, decisionName);

  119.       // find the next attribute to choose for the considered decision
  120.       // build the subtree from this new attribute, pre-order
  121.       // insert the newly-built subtree into the open decision slot
  122.       current.addDecision(decisionName, compileWalk(algorithm.nextAttribute(chosenAttributes, usedAttributes), chosenAttributes, usedAttributes));
  123.     }

  124.     // remove the attribute decision before we walk back up the tree.
  125.     chosenAttributes.remove(attributeName);

  126.     // return the subtree so that it can be inserted into the parent tree.
  127.     return current;
  128.   }

  129.   public void compile() {
  130.     // skip compilation if already done.
  131.     if ( compiled )
  132.       return;

  133.     // if no algorithm is set beforehand, select the default one.
  134.     setDefaultAlgorithm();

  135.     Map<String, String> chosenAttributes = new HashMap<String, String>();
  136.     Set<String> usedAttributes = new HashSet<String>();

  137.     if ( !decisionsSpecified )
  138.       decisions = examples.extractDecisions();

  139.     // find the root attribute (either leaf or non)
  140.     // walk the tree, adding attributes as needed under each decision
  141.     // save the original attribute as the root attribute.
  142.     rootAttribute = compileWalk(algorithm.nextAttribute(chosenAttributes, usedAttributes), chosenAttributes, usedAttributes);

  143.     compiled = true;
  144.   }

  145.   public String toString() {
  146.     compile();

  147.     if ( rootAttribute != null )
  148.       return rootAttribute.toString();
  149.     else
  150.       return "";
  151.   }

  152.   public Attribute getRoot() {
  153.     return rootAttribute;
  154.   }
  155. }
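  // Usage sketch (not part of the original file; the attribute names and values
  // below are made up, and exception handling is omitted for brevity):
  //   DecisionTree tree = new DecisionTree();
  //   tree.setAttributes(new String[]{"outlook", "windy"})
  //       .addExample(new String[]{"sunny", "false"}, true)
  //       .addExample(new String[]{"rainy", "true"}, false);
  //   Map<String, String> query = new HashMap<String, String>();
  //   query.put("outlook", "sunny"); query.put("windy", "false");
  //   boolean result = tree.apply(query);   // compiles the tree, then classifies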


https://github.com/saebyn/java-decision-tree/

3
ReneeBK posted on 2016-5-12 10:33:40

Decision Tree in Java

  1. /**
  2. *
  3. */

  4. package dt;

  5. import java.util.*;


  6. class Examples {
  7.   class Example {
  8.     private Map<String, String> values;
  9.     private boolean classifier;
  10.   
  11.     public Example(String[] attributeNames, String[] attributeValues,
  12.                    boolean classifier) {
  13.       assert(attributeNames.length == attributeValues.length);
  14.       values = new HashMap<String, String>();

  15.       for ( int i = 0 ; i < attributeNames.length ; i++ ) {
  16.         values.put(attributeNames[i], attributeValues[i]);
  17.       }
  18.    
  19.       this.classifier = classifier;
  20.     }

  21.     public Example(Map<String, String> attributes, boolean classifier) {
  22.       this.classifier = classifier;
  23.       this.values = attributes;
  24.     }

  25.     public Set<String> getAttributes() {
  26.       return values.keySet();
  27.     }

  28.     public String getAttributeValue(String attribute) {
  29.       return values.get(attribute);
  30.     }

  31.     public boolean matchesClass(boolean classifier) {
  32.       return classifier == this.classifier;
  33.     }
  34.   }

  35.   private List<Example> examples;

  36.   public Examples() {
  37.     examples = new LinkedList<Example>();
  38.   }
  39.   
  40.   public void add(String[] attributeNames, String[] attributeValues,
  41.                   boolean classifier) {
  42.     examples.add(new Example(attributeNames, attributeValues, classifier));
  43.   }

  44.   public void add(Map<String, String> attributes, boolean classifier) {
  45.     examples.add(new Example(attributes, classifier));
  46.   }

  47.   /**
  48.    * Returns the number of examples where the attribute has the specified
  49.    * 'decision' value
  50.    */
  51.   int countDecisions(String attribute, String decision) {
  52.     int count = 0;

  53.     for ( Example e : examples ) {
  54.       if ( e.getAttributeValue(attribute).equals(decision) )
  55.         count++;
  56.     }

  57.     return count;
  58.   }

  59.   /**
  60.    * Returns a map from each attribute name to a set of all values used in the
  61.    * examples for that attribute.
  62.    */
  63.   public Map<String, Set<String> > extractDecisions() {
  64.     Map<String, Set<String> > decisions = new HashMap<String, Set<String> >();

  65.     for ( String attribute : extractAttributes() ) {
  66.       decisions.put(attribute, extractDecisions(attribute));
  67.     }

  68.     return decisions;
  69.   }

  70.   public int countNegative(String attribute, String decision,
  71.                            Map<String, String> attributes) {
  72.     return countClassifier(false, attribute, decision, attributes);
  73.   }
  74.   
  75.   public int countPositive(String attribute, String decision,
  76.                            Map<String, String> attributes) {
  77.     return countClassifier(true, attribute, decision, attributes);
  78.   }
  79.   
  80.   public int countNegative(Map<String, String> attributes) {
  81.     return countClassifier(false, attributes);
  82.   }
  83.   
  84.   public int countPositive(Map<String, String> attributes) {
  85.     return countClassifier(true, attributes);
  86.   }
  87.   
  88.   public int count(String attribute, String decision, Map<String, String> attributes) {
  89.     attributes = new HashMap(attributes);
  90.     attributes.put(attribute, decision);

  91.     return count(attributes);
  92.   }
  93.   
  94.   public int count(Map<String, String> attributes) {
  95.     int count = 0;

  96. nextExample:
  97.     for ( Example e : examples ) {
  98.       for ( Map.Entry<String, String> attribute : attributes.entrySet() )
  99.         if ( !(e.getAttributeValue(attribute.getKey()).equals(attribute.getValue())) )
  100.           continue nextExample;

  101.       // All of the provided attributes match the example.
  102.       count++;
  103.     }

  104.     return count;
  105.   }
  106.   
  107.   public int countClassifier(boolean classifier, Map<String, String> attributes) {
  108.     int count = 0;

  109. nextExample:
  110.     for ( Example e : examples ) {
  111.       for ( Map.Entry<String, String> attribute : attributes.entrySet() )
  112.         if ( !(e.getAttributeValue(attribute.getKey()).equals(attribute.getValue())) )
  113.           continue nextExample;

  114.       // All of the provided attributes match the example.
  115.       // If the example matches the classifier, then include it in the count.
  116.       if ( e.matchesClass(classifier) )
  117.         count++;
  118.     }

  119.     return count;
  120.   }

  121.   public int countClassifier(boolean classifier, String attribute,
  122.                              String decision, Map<String, String> attributes) {
  123.     attributes = new HashMap(attributes);
  124.     attributes.put(attribute, decision);

  125.     return countClassifier(classifier, attributes);
  126.   }
  127.   
  128.   /**
  129.    * Returns the number of examples.
  130.    */
  131.   public int count() {
  132.     return examples.size();
  133.   }

  134.   /**
  135.    * Returns a set of attribute names used in the examples.
  136.    */
  137.   public Set<String> extractAttributes() {
  138.     Set<String> attributes = new HashSet<String>();

  139.     for ( Example e : examples ) {
  140.       attributes.addAll(e.getAttributes());
  141.     }

  142.     return attributes;
  143.   }

  144.   private Set<String> extractDecisions(String attribute) {
  145.     Set<String> decisions = new HashSet<String>();

  146.     for ( Example e : examples ) {
  147.       decisions.add(e.getAttributeValue(attribute));
  148.     }

  149.     return decisions;
  150.   }
  151. }



https://github.com/saebyn/java-decision-tree/

4
ReneeBK posted on 2016-5-12 10:36:23

Decision Tree using Python

  1. import math

  2. #find item in a list
  3. def find(item, list):
  4.     for i in list:
  5.         if i == item:
  6.             return True
  7.     # not found anywhere in the list
  8.     return False

  9. #find most common value for an attribute
  10. def majority(attributes, data, target):
  11.     #find target attribute
  12.     valFreq = {}
  13.     #find target in data
  14.     index = attributes.index(target)
  15.     #calculate frequency of values in target attr
  16.     for tuple in data:
  17.         if (valFreq.has_key(tuple[index])):
  18.             valFreq[tuple[index]] += 1
  19.         else:
  20.             valFreq[tuple[index]] = 1
  21.     max = 0
  22.     major = ""
  23.     for key in valFreq.keys():
  24.         if valFreq[key]>max:
  25.             max = valFreq[key]
  26.             major = key
  27.     return major

  28. #Calculates the entropy of the given data set for the target attr
  29. def entropy(attributes, data, targetAttr):

  30.     valFreq = {}
  31.     dataEntropy = 0.0
  32.    
  33.     #find index of the target attribute
  34.     i = 0
  35.     for entry in attributes:
  36.         if (targetAttr == entry):
  37.             break
  38.         i += 1
  39.    
  40.     # Calculate the frequency of each of the values in the target attr
  41.     for entry in data:
  42.         if (valFreq.has_key(entry[i])):
  43.             valFreq[entry[i]] += 1.0
  44.         else:
  45.             valFreq[entry[i]]  = 1.0

  46.     # Calculate the entropy of the data for the target attr
  47.     for freq in valFreq.values():
  48.         dataEntropy += (-freq/len(data)) * math.log(freq/len(data), 2)
  49.         
  50.     return dataEntropy

  51. def gain(attributes, data, attr, targetAttr):
  52.     """
  53.     Calculates the information gain (reduction in entropy) that would
  54.     result by splitting the data on the chosen attribute (attr).
  55.     """
  56.     valFreq = {}
  57.     subsetEntropy = 0.0
  58.    
  59.     #find index of the attribute
  60.     i = attributes.index(attr)

  61.     # Calculate the frequency of each of the values in the target attribute
  62.     for entry in data:
  63.         if (valFreq.has_key(entry[i])):
  64.             valFreq[entry[i]] += 1.0
  65.         else:
  66.             valFreq[entry[i]]  = 1.0
  67.     # Calculate the sum of the entropy for each subset of records weighted
  68.     # by their probability of occurring in the training set.
  69.     for val in valFreq.keys():
  70.         valProb        = valFreq[val] / sum(valFreq.values())
  71.         dataSubset     = [entry for entry in data if entry[i] == val]
  72.         subsetEntropy += valProb * entropy(attributes, dataSubset, targetAttr)

  73.     # Subtract the entropy of the chosen attribute from the entropy of the
  74.     # whole data set with respect to the target attribute (and return it)
  75.     return (entropy(attributes, data, targetAttr) - subsetEntropy)

  76. #choose best attribute
  77. def chooseAttr(data, attributes, target):
  78.     best = attributes[0]
  79.     maxGain = 0;
  80.     for attr in attributes:
  81.         newGain = gain(attributes, data, attr, target)
  82.         if newGain>maxGain:
  83.             maxGain = newGain
  84.             best = attr
  85.     return best

  86. #get values in the column of the given attribute
  87. def getValues(data, attributes, attr):
  88.     index = attributes.index(attr)
  89.     values = []
  90.     for entry in data:
  91.         if entry[index] not in values:
  92.             values.append(entry[index])
  93.     return values

  94. def getExamples(data, attributes, best, val):
  95.     examples = [[]]
  96.     index = attributes.index(best)
  97.     for entry in data:
  98.         #find entries with the given value
  99.         if (entry[index] == val):
  100.             newEntry = []
  101.             #add value if it is not in best column
  102.             for i in range(0,len(entry)):
  103.                 if(i != index):
  104.                     newEntry.append(entry[i])
  105.             examples.append(newEntry)
  106.     examples.remove([])
  107.     return examples

  108. def makeTree(data, attributes, target, recursion):
  109.     recursion += 1
  110.     #Returns a new decision tree based on the examples given.
  111.     data = data[:]
  112.     vals = [record[attributes.index(target)] for record in data]
  113.     default = majority(attributes, data, target)

  114.     # If the dataset is empty or the attributes list is empty, return the
  115.     # default value. When checking the attributes list for emptiness, we
  116.     # need to subtract 1 to account for the target attribute.
  117.     if not data or (len(attributes) - 1) <= 0:
  118.         return default
  119.     # If all the records in the dataset have the same classification,
  120.     # return that classification.
  121.     elif vals.count(vals[0]) == len(vals):
  122.         return vals[0]
  123.     else:
  124.         # Choose the next best attribute to best classify our data
  125.         best = chooseAttr(data, attributes, target)
  126.         # Create a new decision tree/node with the best attribute and an empty
  127.         # dictionary object--we'll fill that up next.
  128.         tree = {best:{}}
  129.    
  130.         # Create a new decision tree/sub-node for each of the values in the
  131.         # best attribute field
  132.         for val in getValues(data, attributes, best):
  133.             # Create a subtree for the current value under the "best" field
  134.             examples = getExamples(data, attributes, best, val)
  135.             newAttr = attributes[:]
  136.             newAttr.remove(best)
  137.             subtree = makeTree(examples, newAttr, target, recursion)
  138.    
  139.             # Add the new subtree to the empty dictionary object in our new
  140.             # tree/node we just created.
  141.             tree[best][val] = subtree
  142.    
  143.     return tree
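  # Usage sketch (not part of the original file; the attribute names and rows
  # below are made up purely to show the expected call shape):
  #   attributes = ["outlook", "windy", "play"]
  #   data = [["sunny", "false", "yes"],
  #           ["rainy", "true",  "no"]]
  #   tree = makeTree(data, attributes, "play", 0)
  #   print(tree)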

https://github.com/NinjaSteph/DecisionTree/

5
ReneeBK posted on 2016-5-12 10:39:52
  1. function [classifications] = ClassifyByTree(tree, attributes, instance)
  2. % ClassifyByTree   Classifies data instance by given tree
  3. % args:
  4. %       tree            - tree data structure
  5. %       attributes      - cell array of attribute strings (no CLASS)
  6. %       instance        - data including correct classification (end col.)
  7. % return:
  8. %       classifications     - 2 numbers, first given by tree, 2nd given by
  9. %                             instance's last column
  10. % tree struct:
  11. %       value               - will be the string for the splitting
  12. %                             attribute, or 'true' or 'false' for leaf
  13. %       left                - left pointer to another tree node (left means
  14. %                             the splitting attribute was false)
  15. %       right               - right pointer to another tree node (right
  16. %                             means the splitting attribute was true)

  17. % Store the actual classification
  18. actual = instance(1, length(instance));

  19. % Recursion with 3 cases

  20. % Case 1: Current node is labeled 'true'
  21. % So trivially return the classification as 1
  22. if (strcmp(tree.value, 'true'));
  23.     classifications = [1, actual];
  24.     return
  25. end

  26. % Case 2: Current node is labeled 'false'
  27. % So trivially return the classification as 0
  28. if (strcmp(tree.value, 'false'));
  29.     classifications = [0, actual];
  30.     return
  31. end

  32. % Case 3: Current node is labeled an attribute
  33. % Follow correct branch by looking up index in attributes, and recur
  34. index = find(ismember(attributes,tree.value)==1);
  35. if (instance(1, index)); % attribute is true for this instance
  36.     % Recur down the right side
  37.     classifications = ClassifyByTree(tree.right, attributes, instance);
  38. else
  39.     % Recur down the left side
  40.     classifications = ClassifyByTree(tree.left, attributes, instance);
  41. end

  42. return
  43. end

https://github.com/gwheaton/ID3-Decision-Tree

6
ReneeBK posted on 2016-5-12 10:41:30
  1. % George Wheaton
  2. % EECS 349
  3. % Homework 1 Problem 7
  4. % October 7, 2012

  5. % ID3 Decision Tree Algorithm

  6. function[] = decisiontree(inputFileName, trainingSetSize, numberOfTrials,...
  7.                           verbose)
  8. % DECISIONTREE Create a decision tree by following the ID3 algorithm
  9. % args:
  10. %   inputFileName       - the fully specified path to input file
  11. %   trainingSetSize     - integer specifying number of examples from input
  12. %                         used to train the dataset
  13. %   numberOfTrials      - integer specifying how many times decision tree
  14. %                         will be built from a randomly selected subset
  15. %                         of the training examples
  16. %   verbose             - string that must be either '1' or '0', if '1'
  17. %                         output includes training and test sets, else
  18. %                         it will only contain description of tree and
  19. %                         results for the trials

  20. % Read in the specified text file containing the examples
  21. fid = fopen(inputFileName, 'rt');
  22. dataInput = textscan(fid, '%s');
  23. % Close the file
  24. fclose(fid);

  25. % Reformat the data into attribute array and data matrix of 1s and 0s for
  26. % true or false
  27. i = 1;
  28. % First store the attributes into a cell array
  29. while (~strcmp(dataInput{1}{i}, 'CLASS'));
  30.     i = i + 1;
  31. end
  32. attributes = cell(1,i);
  33. for j=1:i;
  34.     attributes{j} = dataInput{1}{j};
  35. end

  36. % NOTE: The classification will be the final attribute in the data rows
  37. % below
  38. numAttributes = i;
  39. numInstances = (length(dataInput{1}) - numAttributes) / numAttributes;
  40. % Then store the data into matrix
  41. data = zeros(numInstances, numAttributes);
  42. i = i + 1;
  43. for j=1:numInstances
  44.     for k=1:numAttributes
  45.         data(j, k) = strcmp(dataInput{1}{i}, 'true');
  46.         i = i + 1;
  47.     end
  48. end

  49. % Here is where the trials start
  50. for i=1:numberOfTrials;
  51.    
  52.     % Print the trial number
  53.     fprintf('TRIAL NUMBER: %d\n\n', i);
  54.    
  55.     % Split data into training and testing sets randomly
  56.     % Use randsample to get a vector of row numbers for the training set
  57.     rows = sort(randsample(numInstances, trainingSetSize));
  58.     % Initialize two new matrices, training set and test set
  59.     trainingSet = zeros(trainingSetSize, numAttributes);
  60.     testingSetSize = (numInstances - trainingSetSize);
  61.     testingSet = zeros(testingSetSize, numAttributes);
  62.     % Loop through data matrix, copying relevant rows to each matrix
  63.     training_index = 1;
  64.     testing_index = 1;
  65.     for data_index=1:numInstances;
  66.         if (rows(training_index) == data_index);
  67.             trainingSet(training_index, :) = data(data_index, :);
  68.             if (training_index < trainingSetSize);
  69.                 training_index = training_index + 1;
  70.             end
  71.         else
  72.             testingSet(testing_index, :) = data(data_index, :);
  73.             if (testing_index < testingSetSize);
  74.                 testing_index = testing_index + 1;
  75.             end
  76.         end
  77.     end
  78.    
  79.     % If verbose, print out training set
  80.     if (strcmp(verbose, '1'));
  81.         for ii=1:numAttributes;
  82.             fprintf('%s\t', attributes{ii});
  83.         end
  84.         fprintf('\n');
  85.         for ii=1:trainingSetSize;
  86.             for jj=1:numAttributes;
  87.                 if (trainingSet(ii, jj));
  88.                     fprintf('%s\t', 'true');
  89.                 else
  90.                     fprintf('%s\t', 'false');
  91.                 end
  92.             end
  93.             fprintf('\n');
  94.         end
  95.     end
  96.    
  97.     % Estimate the expected prior probability of TRUE and FALSE based on
  98.     % training set
  99.     if (sum(trainingSet(:, numAttributes)) >= trainingSetSize / 2);
  100.         expectedPrior = 'true';
  101.     else
  102.         expectedPrior = 'false';
  103.     end
  104.    
  105.     % Construct a decision tree on the training set using the ID3 algorithm
  106.     activeAttributes = ones(1, length(attributes) - 1);
  107.     new_attributes = attributes(1:length(attributes)-1);
  108.     tree = ID3(trainingSet, attributes, activeAttributes);
  109.    
  110.     % Print out the tree
  111.     fprintf('DECISION TREE STRUCTURE:\n');
  112.     PrintTree(tree, 'root');
  113.    
  114.     % Run tree and expected prior against testing set, recording
  115.     % classifications
  116.     % The second column is for actual classification, first for calculated
  117.     ID3_Classifications = zeros(testingSetSize,2);
  118.     ExpectedPrior_Classifications = zeros(testingSetSize,2);
  119.     ID3_numCorrect = 0; ExpectedPrior_numCorrect = 0;
  120.     for k=1:testingSetSize; %over the testing set
  121.         % Call a recursive function to follow the tree nodes and classify
  122.         ID3_Classifications(k,:) = ...
  123.             ClassifyByTree(tree, new_attributes, testingSet(k,:));
  124.         
  125.         ExpectedPrior_Classifications(k, 2) = testingSet(k,numAttributes);
  126.         if (strcmp(expectedPrior, 'true'));
  127.             ExpectedPrior_Classifications(k, 1) = 1;
  128.         else
  129.             ExpectedPrior_Classifications(k, 1) = 0;
  130.         end
  131.         
  132.         if (ID3_Classifications(k,1) == ID3_Classifications(k, 2)); %correct
  133.             ID3_numCorrect = ID3_numCorrect + 1;
  134.         end
  135.         if (ExpectedPrior_Classifications(k,1) == ExpectedPrior_Classifications(k,2));
  136.             ExpectedPrior_numCorrect = ExpectedPrior_numCorrect + 1;
  137.         end     
  138.     end
  139.    
  140.     % If verbose, print the testing data with final two columns ID3 Class
  141.     % and Prior Class
  142.     if (strcmp(verbose, '1'));
  143.         for ii=1:numAttributes;
  144.             fprintf('%s\t', attributes{ii});
  145.         end
  146.         fprintf('%s\t%s\t', 'ID3 Class', 'Prior Class');
  147.         fprintf('\n');
  148.         for ii=1:testingSetSize;
  149.             for jj=1:numAttributes;
  150.                 if (testingSet(ii, jj));
  151.                     fprintf('%s\t', 'true');
  152.                 else
  153.                     fprintf('%s\t', 'false');
  154.                 end
  155.             end
  156.             if (ID3_Classifications(ii,1));
  157.                 fprintf('%s\t', 'true');
  158.             else
  159.                 fprintf('%s\t', 'false');
  160.             end
  161.             if (ExpectedPrior_Classifications(ii,1));
  162.                 fprintf('%s\t', 'true');
  163.             else
  164.                 fprintf('%s\t', 'false');
  165.             end
  166.             fprintf('\n');
  167.         end
  168.     end
  169.    
  170.     % Calculate the proportions correct and print out
  171.     if (testingSetSize);
  172.         ID3_Percentage = round(100 * ID3_numCorrect / testingSetSize);
  173.         ExpectedPrior_Percentage = round(100 * ExpectedPrior_numCorrect / testingSetSize);
  174.     else
  175.         ID3_Percentage = 0;
  176.         ExpectedPrior_Percentage = 0;
  177.     end
  178.     ID3_Percentages(i) = ID3_Percentage;
  179.     ExpectedPrior_Percentages(i) = ExpectedPrior_Percentage;
  180.    
  181.     fprintf('\tPercent of test cases correctly classified by an ID3 decision tree = %d\n' ...
  182.         , ID3_Percentage);
  183.     fprintf('\tPercent of test cases correctly classified by using prior probabilities from the training set = %d\n\n' ...
  184.         , ExpectedPrior_Percentage);
  185. end

  186. meanID3 = round(mean(ID3_Percentages));
  187. meanPrior = round(mean(ExpectedPrior_Percentages));

  188. % Print out remaining details
  189. fprintf('example file used = %s\n', inputFileName);
  190. fprintf('number of trials = %d\n', numberOfTrials);
  191. fprintf('training set size for each trial = %d\n', trainingSetSize);
  192. fprintf('testing set size for each trial = %d\n', testingSetSize);
  193. fprintf('mean performance (percentage correct) of decision tree over all trials = %d\n', meanID3);
  194. fprintf('mean performance (percentage correct) of prior probability from training set = %d\n\n', meanPrior);
  195. end
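  % Example invocation sketch (not part of the original file; the file name and
  % argument values below are illustrative only):
  %   decisiontree('tennis-examples.txt', 10, 5, '0');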


https://github.com/gwheaton/ID3-Decision-Tree

7
ReneeBK posted on 2016-5-12 10:43:47

Decision Tree using Javascript

  1. //ID3 Decision Tree Algorithm


  2. //main algorithm and prediction functions

  3. var id3 = function(_s,target,features){
  4.     var targets = _.unique(_s.pluck(target));
  5.     if (targets.length == 1){
  6.         console.log("end node! "+targets[0]);
  7.         return {type:"result", val: targets[0], name: targets[0],alias:targets[0]+randomTag() };
  8.     }
  9.     if(features.length == 0){
  10.         console.log("returning the most dominant feature!!!");
  11.         var topTarget = mostCommon(_s.pluck(target));
  12.         return {type:"result", val: topTarget, name: topTarget, alias: topTarget+randomTag()};
  13.     }
  14.     var bestFeature = maxGain(_s,target,features);
  15.     var remainingFeatures = _.without(features,bestFeature);
  16.     var possibleValues = _.unique(_s.pluck(bestFeature));
  17.     console.log("node for "+bestFeature);
  18.     var node = {name: bestFeature,alias: bestFeature+randomTag()};
  19.     node.type = "feature";
  20.     node.vals = _.map(possibleValues,function(v){
  21.         console.log("creating a branch for "+v);
  22.         var _newS = _(_s.filter(function(x) {return x[bestFeature] == v}));
  23.         var child_node = {name:v,alias:v+randomTag(),type: "feature_value"};
  24.         child_node.child =  id3(_newS,target,remainingFeatures);
  25.         return child_node;
  26.         
  27.     });
  28.     return node;
  29. }

  30. var predict = function(id3Model,sample) {
  31.     var root = id3Model;
  32.     while(root.type != "result"){
  33.         var attr = root.name;
  34.         var sampleVal = sample[attr];
  35.         var childNode = _.detect(root.vals,function(x){return x.name == sampleVal});
  36.         root = childNode.child;
  37.     }
  38.     return root.val;
  39. }



  40. //necessary math functions

  41. var entropy = function(vals){
  42.     var uniqueVals = _.unique(vals);
  43.     var probs = uniqueVals.map(function(x){return prob(x,vals)});
  44.     var logVals = probs.map(function(p){return -p*log2(p) });
  45.     return logVals.reduce(function(a,b){return a+b},0);
  46. }

  47. var gain = function(_s,target,feature){
  48.     var attrVals = _.unique(_s.pluck(feature));
  49.     var setEntropy = entropy(_s.pluck(target));
  50.     var setSize = _s.size();
  51.     var entropies = attrVals.map(function(n){
  52.         var subset = _s.filter(function(x){return x[feature] === n});
  53.         return (subset.length/setSize)*entropy(_.pluck(subset,target));
  54.     });
  55.     var sumOfEntropies =  entropies.reduce(function(a,b){return a+b},0);
  56.     return setEntropy - sumOfEntropies;
  57. }

  58. var maxGain = function(_s,target,features){
  59.     return _.max(features,function(e){return gain(_s,target,e)});
  60. }

  61. var prob = function(val,vals){
  62.     var instances = _.filter(vals,function(x) {return x === val}).length;
  63.     var total = vals.length;
  64.     return instances/total;
  65. }

  66. var log2 = function(n){
  67.     return Math.log(n)/Math.log(2);
  68. }


  69. var mostCommon = function(l){
  70.    return  _.sortBy(l,function(a){
  71.         return count(a,l);
  72.     }).reverse()[0];
  73. }

  74. var count = function(a,l){
  75.     return _.filter(l,function(b) { return b === a}).length
  76. }

  77. var randomTag = function(){
  78.     return "_r"+Math.round(Math.random()*1000000).toString();
  79. }

  80. //Display logic

  81. var drawGraph = function(id3Model,divId){
  82.     var g = new Array();
  83.     g = addEdges(id3Model,g).reverse();
  84.     window.g = g;
  85.     var data = google.visualization.arrayToDataTable(g.concat(g));
  86.     var chart = new google.visualization.OrgChart(document.getElementById(divId));
  87.     google.visualization.events.addListener(chart, 'ready',function(){
  88.     _.each($('.google-visualization-orgchart-node'),function(x){
  89.         var oldVal = $(x).html();
  90.         if(oldVal){
  91.             var cleanVal = oldVal.replace(/_r[0-9]+/,'');
  92.             $(x).html(cleanVal);
  93.         }
  94. });
  95.     });
  96.     chart.draw(data, {allowHtml: true});

  97. }

  98. var addEdges = function(node,g){
  99.     if(node.type == 'feature'){
  100.         _.each(node.vals,function(m){
  101.             g.push([m.alias,node.alias,'']);
  102.             g = addEdges(m,g);
  103.         });
  104.         return g;
  105.     }
  106.     if(node.type == 'feature_value'){

  107.         g.push([node.child.alias,node.alias,'']);
  108.         if(node.child.type != 'result'){
  109.             g = addEdges(node.child,g);
  110.         }
  111.         return g;
  112.     }
  113.     return g;
  114. }


  115. var renderSamples = function(samples,$el,model,target,features){
  116.     _.each(samples,function(s){
  117.         var features_for_sample = _.map(features,function(x){return s[x]});
  118.         $el.append("<tr><td>"+features_for_sample.join('</td><td>')+"</td><td><b>"+predict(model,s)+"</b></td><td>actual: "+s[target]+"</td></tr>");
  119.     })
  120. }

  121. var renderTrainingData = function(_training,$el,target,features){
  122.     _training.each(function(s){
  123.         $el.append("<tr><td>"+_.map(features,function(x){return s[x]}).join('</td><td>')+"</td><td>"+s[target]+"</td></tr>");
  124.     })
  125. }

  126. var calcError = function(samples,model,target){
  127.     var total = 0;
  128.     var correct = 0;
  129.     _.each(samples,function(s){
  130.         total++;
  131.         var pred = predict(model,s);
  132.         var actual = s[target];
  133.         if(pred == actual){
  134.             correct++;
  135.         }
  136.     });
  137.     return correct/total;
  138. }
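  // Usage sketch (not part of the original file; assumes Underscore.js is loaded,
  // and the sample objects and feature names below are made up):
  //   var training = _([{outlook: "sunny", windy: "false", play: "no"},
  //                     {outlook: "rainy", windy: "true",  play: "no"}]);
  //   var model = id3(training, "play", ["outlook", "windy"]);
  //   predict(model, {outlook: "sunny", windy: "false"});   // -> "no"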


https://github.com/willkurt/ID3-Decision-Tree/tree/master/js

8
ReneeBK posted on 2016-5-12 10:54:07
  1. > library("party")

  2. > str(iris)
  3. 'data.frame':   150 obs. of  5 variables:
  4. $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
  5. $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
  6. $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
  7. $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
  8. $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

  1. > iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)

  2. > print(iris_ctree)

  3.          Conditional inference tree with 4 terminal nodes

  4. Response:  Species
  5. Inputs:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
  6. Number of observations:  150

  7. 1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
  8.   2)*  weights = 50
  9. 1) Petal.Length > 1.9
  10.   3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
  11.     4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
  12.       5)*  weights = 46
  13.     4) Petal.Length > 4.8
  14.       6)*  weights = 8
  15.   3) Petal.Width > 1.7
  16.     7)*  weights = 46

  17. > plot(iris_ctree)
  1. > plot(iris_ctree, type="simple")
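A quick sanity check that is not shown above but follows directly from the fitted object: cross-tabulate the fitted classes against the true species.

> table(predict(iris_ctree), iris$Species)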

http://www.rdatamining.com/examples/decision-tree

9
ReneeBK posted on 2016-5-12 10:56:02

Decision Tree using R

  1. # Load the party package. It will automatically load other dependent packages.
  2. library(party)

  3. # Print some records from data set readingSkills.
  4. print(head(readingSkills))
  5. # When we execute the above code, it produces the following result and chart:

  6. #   nativeSpeaker   age   shoeSize      score
  7. # 1           yes     5   24.83189   32.29385
  8. # 2           yes     6   25.95238   36.63105
  9. # 3            no    11   30.42170   49.60593
  10. # 4           yes     7   28.66450   40.28456
  11. # 5           yes    11   31.88207   55.46085
  12. # 6           yes    10   30.07843   52.83124
  13. # Loading required package: methods
  14. # Loading required package: grid
  15. # ...............................
  16. # ...............................

  17. # Example:
  18. # We will use the ctree() function to create the decision tree and view its graph.

  19. # Load the party package. It will automatically load other dependent packages.
  20. library(party)

  21. # Create the input data frame.
  22. input.dat <- readingSkills[c(1:105),]

  23. # Give the chart file a name.
  24. png(file = "decision_tree.png")

  25. # Create the tree.
  26.   output.tree <- ctree(
  27.   nativeSpeaker ~ age + shoeSize + score,
  28.   data = input.dat)

  29. # Plot the tree.
  30. plot(output.tree)

  31. # Save the file.
  32. dev.off()

http://www.tutorialspoint.com/r/r_decision_tree.htm
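
A hedged follow-up that is not in the tutorial: since only rows 1 to 105 were used for training, the remaining rows of readingSkills can be scored with the fitted tree.

# score the held-out rows with the fitted tree and compare to the truth
test.dat <- readingSkills[-(1:105), ]
pred     <- predict(output.tree, newdata = test.dat)
table(pred, test.dat$nativeSpeaker)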
