DECISION TREE USING R
Today I'll show how to build a decision tree using R. But before that, let me explain how it works; that will give you a better understanding and make it much easier to follow the code and the construction of the tree.
As we know, a decision tree builds classification or regression models in the form of a tree structure.
It breaks a data set down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes:
- A decision node (a test on one of the attributes) has two or more branches.
- A leaf node represents a classification or decision.
The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
It can handle numerical as well as categorical data.
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (i.e. homogeneous subsets). The ID3 algorithm uses entropy to measure the homogeneity of a sample: if the sample is completely homogeneous the entropy is zero, and if the sample is equally divided between the classes the entropy is one.
Entropy = -p log2(p) - q log2(q), where p and q are the proportions of the two classes in the sample.
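For instance, for a sample that is split evenly between the two classes (p = q = 0.5), you can verify in the R console that the entropy is exactly one:
>p = 0.5; q = 0.5
>-p*log2(p) - q*log2(q)
[1] 1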
Information gain is the decrease in entropy after a dataset is split on an attribute:
Gain(T, X) = Entropy(T) - Entropy(T, X)
where Entropy(T, X) is the weighted average entropy of the subsets produced by splitting T on attribute X.
Constructing a decision tree is all about finding, at each step, the attribute that returns the highest information gain (i.e. the most homogeneous branches).
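To make the idea concrete, here is a small R sketch, independent of the Carseats example that follows; the helpers entropy() and info_gain() are written just for this illustration:
entropy = function(y) {
  p = table(y) / length(y)            # class proportions
  -sum(p[p > 0] * log2(p[p > 0]))     # -sum p*log2(p), skipping empty classes
}
info_gain = function(y, x) {
  # weighted average entropy of the subsets created by splitting y on x
  after = sum(sapply(split(y, x), function(s) length(s) / length(y) * entropy(s)))
  entropy(y) - after
}
y = c("good", "good", "bad", "bad")   # equally divided target: entropy = 1
x = c("A", "A", "B", "B")             # attribute that separates the classes perfectly
entropy(y)                            # 1
info_gain(y, x)                       # 1: this split removes all the uncertainty
An attribute that produced a less clean split, say x = c("A", "B", "A", "B"), would give a gain of 0, so the perfect separator above is the one a tree would choose.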
Enough of talking, now let's get our hands dirty.
The packages used are "ISLR" and "tree": ISLR supplies the Carseats data set used in this example, and the tree package provides the tree(), cv.tree() and prune.misclass() functions that build and prune the model.
Here I'll show you how to create a decision tree and how to prune it if required.
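If either package is not installed yet, a one-time install from CRAN is all that's needed:
>install.packages(c("ISLR", "tree"))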
>library(ISLR)
>library(tree)
>range(Carseats$Sales) # to see the range
[1] 0.00 16.27
>high=ifelse(Carseats$Sales >=8, "good", "bad") # convert Sales into a categorical variable
>high=as.factor(high)
>carseats=Carseats
>carseats$high=high
>head(carseats)
CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US high
1 138 73 11 276 120 Bad 42 17 Yes Yes good
2 111 48 16 260 83 Good 65 10 Yes Yes good
3 113 35 10 269 80 Medium 59 12 Yes Yes good
4 117 100 4 466 97 Medium 55 14 Yes Yes bad
5 141 64 3 340 128 Bad 38 13 Yes No bad
6 124 113 13 501 72 Bad 78 16 No Yes good
Here we have loaded the two required packages, ISLR and tree, and then looked at the range of Sales, our target variable, since this helps us convert it into a categorical variable. The range is 0 to 16.27, so in the next step we label sales greater than or equal to 8 as "good" and the rest as "bad". We then convert the new variable into a factor to make it easier to work with and attach it to a copy of the original dataset.
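Before splitting the data it is also worth checking how balanced the two classes are; a quick way to do that (output omitted here) is:
>table(carseats$high) # counts of "bad" and "good"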
>carseats=carseats[,-1] # drop the original Sales column, now encoded in high
>train= sample(1:nrow(carseats),nrow(carseats)/2) # random half of the row indices for training
>test=-train # the remaining rows form the test set
>traindata=carseats[train,]
>testdata=carseats[test,]
>testhigh=carseats$high[test] # true labels of the test observations
Now we have removed the original Sales variable, since in the step above we converted it into a dichotomous variable, high, which is our target. Then we split the dataset into a training set and a test set so that we can later check the accuracy of our model.
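One step is not shown in the listing below: the model has to be fitted and plotted before text() can label it. With the tree package, that step would look roughly like this:
>treemod=tree(high~., data=traindata) # fit a classification tree on the training half
>plot(treemod) # draw the unpruned tree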
>text(treemod, pretty = 0)
>tree_pred=predict(treemod,testdata, type="class")
>mean(tree_pred != testhigh)
[1] 0.23
Here we have built the tree model on the training data and plotted it.
The pretty = 0 argument in text() makes R print the actual category names as the split labels instead of abbreviated codes. We can see that the tree is very bushy, so we now check the model on the test data: the two commands above compute the misclassification error, which comes out quite high at 23%. We therefore need to prune the tree, and to decide how many levels to prune we use cross-validation.
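If you want more detail than a single error rate, a confusion matrix of predicted against actual classes takes one extra line (output omitted here):
>table(tree_pred, testhigh) # rows: predictions, columns: true labels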
>set.seed(2)
>cv_tree=cv.tree(treemod, FUN= prune.misclass) ## cross-validation to decide how far to prune
>names(cv_tree) ## dev = cross validation error rate.
>plot(cv_tree$size,cv_tree$dev,type="b") ## cross-validation error against tree size
We have set the seed for reproducibility, and cv.tree() runs the cross-validation that guides the pruning. Looking at the figure, we pick the tree size with the minimum deviance (i.e. the cross-validation error rate), while being careful not to pick a size that is too large.
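Instead of reading the plot by eye, the best size can also be pulled straight out of the cv.tree object; a small sketch using the components listed by names(cv_tree):
>best_size=cv_tree$size[which.min(cv_tree$dev)] # tree size with the lowest CV error
>best_size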
>prunedmod=prune.misclass(treemod, best=5) ## prune back to the subtree with 5 terminal nodes
>plot(prunedmod)
>text(prunedmod , pretty= 0)
>tree_predict=predict(prunedmod, testdata, type= "class")
>mean(tree_predict != testhigh)
[1] 0.10
After pruning the tree to the chosen size we get the figure above, which looks much more manageable, but we still have to check the misclassification error rate. The last two commands do exactly that, and the error comes out at 10%, lower than the 23% of the unpruned model. And that is how we create a decision tree in R and prune it.