Friday, 1 May 2015

Creating Decision Tree with R

                                              DECISION TREE USING R

Today I'll show how to build a decision tree using R. But before that, let me explain how it works; that will give you a better understanding and make it easier to follow the code and the construction of the tree.
As we know, a decision tree builds classification or regression models in the form of a tree structure.
It breaks down a data set into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.

  •      A decision node (a test on one of the predictor variables) has two or more branches
  •      A leaf node represents a classification or decision

The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
A decision tree can handle numerical as well as categorical data.

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided between the classes the entropy is one.
              Entropy = -p*log2(p) - q*log2(q)

Information Gain is based on the decrease in entropy after a dataset is split on an attribute:
              Gain(t,x) = Entropy(t) - Entropy(t,x)
Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e. the most homogeneous branches).
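To make the two formulas concrete, here is a small illustrative R sketch (not part of the Carseats example; the names entropy and info_gain are just made up for this post) that computes the entropy of a split and the information gain of a candidate attribute:

>entropy=function(p){                      # p = vector of class proportions
+  p=p[p > 0]                              # treat 0*log2(0) as 0
+  -sum(p*log2(p))
+ }
>entropy(c(0.5, 0.5))                      # equally divided sample
[1] 1
>entropy(c(1, 0))                          # completely homogeneous sample
[1] 0
>info_gain=function(target, attribute){    # Gain(t,x) = Entropy(t) - Entropy(t,x)
+  parent=entropy(prop.table(table(target)))
+  w=prop.table(table(attribute))          # weight of each branch
+  child=tapply(target, attribute, function(t) entropy(prop.table(table(t))))
+  parent - sum(w*child)
+ }

The attribute whose split gives the largest info_gain value is the one ID3 would place at the node.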

Enough talking, now let's get our hands dirty.
The packages used are "ISLR" and "tree" (the tree() function and its pruning helpers live in the tree package).
The data set used in this example is "Carseats", which is available with the ISLR package.
Here I'll show you how to create a decision tree and how to prune it if required.

>library(ISLR)   # contains the Carseats data set
>library(tree)   # provides tree(), cv.tree() and prune.misclass()
>range(Carseats$Sales)   # to see the range of the target variable
[1]  0.00 16.27
>high=ifelse(Carseats$Sales >= 8, "good", "bad")   # recode Sales as a categorical variable
>high=as.factor(high)
>carseats=Carseats
>carseats$high=high
>head(carseats)

  CompPrice Income Advertising Population Price ShelveLoc Age Education Urban  US high
1       138     73          11        276   120       Bad  42        17   Yes Yes good
2       111     48          16        260    83      Good  65        10   Yes Yes good
3       113     35          10        269    80    Medium  59        12   Yes Yes good
4       117    100           4        466    97    Medium  55        14   Yes Yes  bad
5       141     64           3        340   128       Bad  38        13   Yes  No  bad
6       124    113          13        501    72       Bad  78        16    No Yes good

Here we have loaded the two required packages, ISLR and tree, and then calculated the range of our target variable, Sales, since that helps us convert it into a categorical variable. We see that Sales runs from 0 to 16.27, so in the next step we label sales greater than or equal to 8 as "good" and the rest as "bad". We then convert the new variable into a factor and attach it to the original data set.
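As a quick extra check (an illustrative addition, not part of the original walkthrough), you can see how the two classes are balanced before modelling:

>table(carseats$high)   # number of observations labelled bad / good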

>carseats=carseats[,-1]   # drop the original Sales column
>train= sample(1:nrow(carseats), nrow(carseats)/2)   # indices of the training half
>test=-train              # everything not in train
>traindata=carseats[train,]
>testdata=carseats[test,]
>testhigh=carseats$high[test]   # true classes for the test set

Now we have removed the original Sales variable, since in the step above we converted it into a dichotomous variable, which is our target. Then we divide the data set into a training set and a test set so that we can later check the accuracy of our model.
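One caveat: sample() draws a different split on every run, so the error rates reported below can vary slightly. If you want a reproducible split, you can fix a seed before sampling (the seed value 1 here is arbitrary):

>set.seed(1)   # any fixed seed makes the train/test split reproducible
>train= sample(1:nrow(carseats), nrow(carseats)/2)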

>treemod=tree(high~., traindata)   # fit the classification tree on the training data
>plot(treemod)
>text(treemod, pretty = 0)         # label the splits with the full category names
>tree_pred=predict(treemod, testdata, type="class")
>mean(tree_pred != testhigh)       # misclassification error on the test set
[1] 0.23

Here we have fitted the tree model on the training data and plotted it.
If you don't use pretty = 0, the plot shows abbreviated letters instead of the real category names for qualitative splits. We can see that the tree is heavily branched, so we check the model on the test data: the last two commands compute the misclassification error, which comes out quite high at 23%. We therefore need to prune the tree, and for that we use cross-validation to choose the tree size (number of terminal nodes) to keep.
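Beyond the overall error rate, a confusion matrix of predicted versus actual classes can show where the model goes wrong; this is an extra illustrative step, not part of the original walkthrough:

>table(tree_pred, testhigh)   # rows = predicted class, columns = actual class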

>set.seed(2)
>cv_tree=cv.tree(treemod, FUN=prune.misclass)   # cross-validation guided by misclassification error
>names(cv_tree)                                 # dev = cross-validation error rate
>plot(cv_tree$size, cv_tree$dev, type="b")

We set the seed and use cv.tree to run the cross-validation that guides pruning. Looking at the plot, we pick the size with the minimum deviance, i.e. the lowest cross-validation error rate, while being careful not to choose an unnecessarily large tree.
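Instead of reading the best size off the plot, you can also pull it straight out of the cv.tree object (a small convenience sketch using the cv_tree object from above):

>best_size=cv_tree$size[which.min(cv_tree$dev)]   # tree size with the lowest cross-validation error
>best_size

The value best=5 used below was read off the plot; you could equally pass best_size to prune.misclass.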

>prunedmod=prune.misclass(treemod, best=5)   # prune back to 5 terminal nodes
>plot(prunedmod)
>text(prunedmod, pretty=0)

>tree_predict=predict(prunedmod, testdata, type="class")
>mean(tree_predict != testhigh)   # misclassification error of the pruned tree
[1] 0.10

After pruning the tree back to the size chosen by cross-validation, we get the figure above, which looks much cleaner. Checking the misclassification error rate again, we now get 10%, which is lower than the unpruned model's 23%. And that is how we create a decision tree and prune it.
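For completeness, a similar classification tree can also be grown with the rpart package, which handles pruning through its complexity parameter cp; the sketch below reuses the traindata and testdata objects from above and is only a rough equivalent, not a line-for-line translation:

>library(rpart)
>rpartmod=rpart(high~., data=traindata, method="class")   # grow the tree
>rpart_pred=predict(rpartmod, testdata, type="class")
>mean(rpart_pred != testhigh)   # misclassification error of the rpart tree
>printcp(rpartmod)              # cross-validated error for each cp value, used for pruning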
