-
Notifications
You must be signed in to change notification settings - Fork 25
/
Copy pathTreeBasedTechniques.Rmd
222 lines (114 loc) · 5.6 KB
/
TreeBasedTechniques.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
---
title: "Decision Trees in R"
output:
html_document: default
html_notebook: default
---
### This is a article on how to implement Tree based Learning Technique in R to do Predictive Modelling.
Trees involve stratifying or sagmenting the Predictor($X_i$) space into a number of simple Regions.The tree based Methods generate a set of $Splitting \ Rules$ which are used to sagment the Predictor Space.These techniques of sagmenting and stratifying data into different Regions $R_j$ are called __Decision Trees__.Decision Trees are used in both Regression and Classification Problems.
These are Statistical Learning Techniques which are easier to understand and Simpler in terms of interpretablity.
The Rules generated are of form --
$$R_j = If (X = X_1 \cap X_2 \cap ....X_p) --> Y_i $$.
-----------
#### Loading the required Packages
```{r,warning=FALSE,message=FALSE}
require(ISLR) #package containing data
require(ggplot2)
require(tree)
#Using the Carseats data set
attach(Carseats)
?Carseats
```
Carseats is a simulated data set containing sales of child car seats at 400 different stores.
```{r}
#Checking the distribution of Sales
ggplot(aes(x = Sales),data = Carseats) +
geom_histogram(color="black",fill = 'purple',alpha = 0.6, bins=30) +
labs(x = "Unit Sales in Thousands", y = "Frequency")
```
As the histogram suggests - It is Normally distributed
Highest frequency of around 8000 Unit Sales
```{r}
#Making a Factor variable from Sales
HighSales<-ifelse(Sales <= 8,"No","Yes")
head(HighSales)
#Making a Data frame
Carseats<-data.frame(Carseats,HighSales)
```
----------
## Fitting a Binary Classification Tree
Now we are going to fit a __Tree__ to the Carseats Data to predict if we are going to have High Sales or not.The __tree()__ function uses a *__Top-down Greedy__* approch to fit a Tree which is also known as *__Recursive Binary Splitting__*.It is Greedy because it dosen't finds the best split amongst all possible splits,but only the best splits at the immediate place its looking i.e the best Split at that particular step.
```{r}
#We will use the tree() function to fit a Desicion Tree
?tree
#Excluding the Sales atrribute
CarTree<-tree(HighSales ~ . -Sales , data = Carseats,split = c("deviance","gini"))
#split argument split to specify the splitting criterion to use.
CarTree #Outputs a Tree with various Splits at different Variables and Response at Terminals Nodes
#The numeric values within the braces are the Proportions of Yes and No for each split.
#Summary of the Decision Tree
summary(CarTree)
```
The summary of the Model consists of the imporatant variables used for splitting the data which minimizes the deviance(Error) Rate and another Splitting criterion used is __Gini Indes__, which is also called *Purity Index*.
---------------
#Plotting the Decision Tree
```{r}
plot(CarTree)
#Adding Predictors as text to plot
text(CarTree ,pretty = 1 )
```
This tree is quiet Complicated and hard to understand due to lots of Splits and lots of variables included in the predictor space.The leaf nodes consists of the Response value i.e __Yes / No __.
---------------
### Splitting data to Training and Test Set
```{r}
set.seed(1001)
#A training sample of 250 examples sampled without replacement
train<-sample(1:nrow(Carseats), 250)
#Fitting another Model
tree1<-tree(HighSales ~ .-Sales , data = Carseats, subset = train)
summary(tree1)
#Plotting
plot(tree1);text(tree1)
```
Now the tree is somewhat different and detailed but is quiet hard to interpret too due to lots of splits.
__Predicting on Test Set__
```{r}
#Predicting the Class labels for Test set
pred<-predict(tree1, newdata = Carseats[-train,],type = "class")
head(pred)
#Confusion Matrix to check number of Misclassifications
with(Carseats[-train,],table(pred,HighSales))
#Misclassification Error Rate on Test Set
mean(pred!=Carseats[-train,]$HighSales)
```
The __Diagonals__ are the correctly classified Test Examples , whereas the __off-diagonals__ represent the misclassified examples.The Mean Error Rate is $\text{26%}$.
The above tree was grown to Full length and might have lots of variables in it which might be degrading the Perfomance.We will now use 10 fold __Cross Validation__ to *Prune* the Tree.
------
## Pruning The tree using Cross Validation
```{r}
#10 fold CV
#Performing Cost Complexity Pruning
cv.tree1<-cv.tree(tree1, FUN=prune.misclass)
cv.tree1
plot(cv.tree1)
#Deviance minimum for tree size 15 i.e 15 Splits
prune.tree1<-prune.misclass(tree1,best = 15)
plot(prune.tree1);text(prune.tree1)
```
__Testing the pruned Tree on Test Set__ -
```{r}
pred1<-predict(prune.tree1 , Carseats[-train,],type="class")
#Confusion Matrix
with(Carseats[-train,],table(pred1,HighSales))
#Misclassification Rate
ErrorPrune<-mean(pred1!=Carseats[-train,]$HighSales)
ErrorPrune
#Error reduced to 25 %
```
------------
## Conclusion
As we can notice by the perfomance on Test Set the Pruned Tree dosen't performs better as the Error rate reduced only by a factor of 0.1 % i.e from 26% to 25%.
It's just that Pruning lead us to a more __simpler__ Tree with *lesser Splits and a subset of predictors* which is somewhat easier to interpret and understand.
Usually *Trees* don't actually give good perfomance on Test Sets , and is called a __Weak Learner__.
Applying Ensembling Techniques such as __Random Forests , Bagging and Boosting__ improves the Perfomance of Trees a lot by combining a lot of Trees trained on samples from training examples and finally *__combining(averaging)__* the Trees to form a single Strong Tree which performs nicely.
Hope you guys liked the article , make sure to share and like it.