Tuesday, 26 May 2015

TIME SERIES using R

FITTING  ARIMA MODEL in R

The Auto Regressive Integrated Moving Average (ARIMA) model is a generalisation of the Auto Regressive Moving Average (ARMA) model and is used to predict future points in a time series.
But what is a time series? According to Google, a "time series is a sequence of data points, typically consisting of successive measurements made over a time interval. Examples of time series are ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average." More generally, it is a series of values of a quantity obtained at successive times, often with equal intervals between them.

ARIMA models are defined for stationary time series. A stationary time series is one whose mean, variance and autocorrelation are all constant over time. To check whether a time series is stationary, we have several statistical tests, namely:

1) Box.test (Ljung-Box): a p-value < 0.05 indicates significant autocorrelation in the series (this test checks for autocorrelation rather than stationarity directly)
2) For the adf.test, if p-value < 0.05 => stationary
3) For the kpss.test, if p-value > 0.05 => stationary
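Note that adf.test() and kpss.test() are not in base R; assuming the packages are installed, they can be loaded like this (the forecast package is used later in this post for forecasting):

> library(tseries)    # provides adf.test() and kpss.test()
> library(forecast)   # provides forecast.Arima() and plot.forecast(), used further below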
We will use time series data on the ages at death of successive kings of England, starting with William the Conqueror. We will load the data in R and look at the plot:
> kings <- scan("http://robjhyndman.com/tsdldata/misc/kings.dat",skip=3)
> plot.ts(kings)

Now we check whether the time series is stationary using the statistical tests above.

> kpss.test(kings)


KPSS Test for Level Stationarity

data:  kings
KPSS Level = 0.6906, Truncation lag parameter = 1, p-value = 0.0144


> adf.test(kings)
  Augmented Dickey-Fuller Test

data:  kings
Dickey-Fuller = -2.1132, Lag order = 3, p-value = 0.529
alternative hypothesis: stationary

As mentioned before, for the KPSS test the p-value should be greater than 0.05 for a time series to be stationary, but here it is smaller, which suggests that the series is not stationary.

Likewise, for the ADF test performed above the p-value is greater than 0.05, whereas it should be less than 0.05 for a stationary series. We can also simply look at the plot of the data: by the definition of stationarity the mean, variance and so on should be constant over time, and here they are not.

Thus we conclude that the time series is not stationary and we cannot fit an ARIMA model to it directly, but don't worry, it's not the end.
Now we will difference the time series until we obtain a stationary one. But what does differencing mean? The first difference of a time series is the series of changes from one period to the next: if Yt denotes the value of the time series Y at period t, then the first difference of Y at period t is equal to Yt - Y(t-1). If you have to difference the time series d times to obtain a stationary series, then you have an ARIMA(p,d,q) model, where d is the order of differencing used.
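As a quick illustration of that formula on a made-up vector (not the kings data):

> diff(c(60, 43, 67, 50))   # gives -17 24 -17, i.e. Y2-Y1, Y3-Y2, Y4-Y3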
We now use diff() in R to difference the kings time series:

> king_diff=diff(kings, differences=1)
> plot.ts(king_diff)
Now we check whether differencing has made the series stationary:

> kpss.test(king_diff)

KPSS Test for Level Stationarity

data:  king_diff
KPSS Level = 0.0347, Truncation lag parameter = 1, p-value = 0.1

> adf.test(king_diff)

Augmented Dickey-Fuller Test

data:  king_diff
Dickey-Fuller = -4.0754, Lag order = 3, p-value = 0.01654
alternative hypothesis: stationary

The above tests give us positive results: the p-value of kpss.test is greater than 0.05 and the p-value of adf.test is less than 0.05, suggesting that the time series is now stationary. Since we differenced the series once, an ARIMA(p,1,q) model is probably appropriate for the time series of the ages at death of the kings of England.

Now that our time series is stationary, it is time to select an appropriate ARIMA model, which means finding the most appropriate values of p and q for an ARIMA(p,d,q) model. To do this, we usually examine the correlogram and partial correlogram of the stationary time series. First we look at the ACF plot.

> acf(king_diff)

> acf(king_diff, plot=F)
     0      1      2      3      4      5      6      7      8      9     10     11     12 
 1.000 -0.360 -0.162 -0.050  0.227 -0.042 -0.181  0.095  0.064 -0.116 -0.071  0.206 -0.017 
    13     14     15     16 
-0.212  0.130  0.114 -0.009 

We see from the correlogram that the autocorrelation at lag 1 (-0.360) exceeds the significance bounds, but none of the autocorrelations at lags 2 to 16 do.
Now we look at the PACF plot:

> pacf(king_diff)

> pacf(king_diff, plot=F) 
     1      2      3      4      5      6      7      8      9     10     11     12     13 
-0.360 -0.335 -0.321  0.005  0.025 -0.144 -0.022 -0.007 -0.143 -0.167  0.065  0.034 -0.161 
    14     15     16 
 0.036  0.066  0.081 

The partial correlogram shows that the partial autocorrelations at lags 1, 2 and 3 exceed the significance bounds, are negative, and are slowly decreasing in magnitude with increasing lag (lag 1: -0.360, lag 2: -0.335, lag 3:-0.321). The partial autocorrelations tail off to zero after lag 3.
Since the correlogram cuts off to zero after lag 1 (suggesting an MA(1) term) and the partial correlogram tails off to zero after lag 3 (suggesting an AR(3) term), we can try an ARIMA(3,1,1) model.
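As a cross-check on this manual choice, the forecast package also offers auto.arima(), which searches over candidate orders and picks one by AICc; it may well suggest a more parsimonious model than the one chosen by eye:

> auto.arima(kings)   # automatic order selection, for comparison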

FORECASTING ARIMA MODEL

Now that we have our ARIMA(3,1,1) model, we can forecast future values with the help of the "forecast" package.

> king_arima=arima(kings, order=c(3,1,1))

> king_forecast=forecast.Arima(king_arima, h=3)
> king_forecast
   Point Forecast    Lo 80    Hi 80    Lo 95     Hi 95
43       64.79235 45.69668 83.88803 35.58804  93.99667
44       67.48443 46.98649 87.98238 36.13553  98.83334
45       68.43173 47.31180 89.55165 36.13159 100.73186

The original time series for the English kings includes the ages at death of 42 kings. The forecast.Arima() function gives us forecasts of the ages at death of the next 3 kings (kings 43-45), as well as 80% and 95% prediction intervals for those predictions. The age at death of the 42nd king was 56 years (the last observed value in our time series), and the ARIMA model forecasts the age at death of the 43rd king as about 64.8 years. We can also plot the forecasts, as seen below.

> plot.forecast(king_forecast)
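Before trusting these forecasts it is worth checking that the residuals of the fitted model look like white noise; a quick sketch using base R functions:

> acf(king_forecast$residuals, lag.max=20)                     # residual correlogram
> Box.test(king_forecast$residuals, lag=20, type="Ljung-Box")  # a p-value above 0.05 suggests no remaining autocorrelation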



Monday, 4 May 2015

CLUSTERING WITH R

Today I'll show you how to do k-means, k-medoid, hierarchical and density-based clustering in R.


K- MEANS CLUSTERING

K-means is one of the simplest unsupervised learning algorithms. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. The number of clusters has to be chosen to suit the data, and an incorrect choice will invalidate the whole process. An empirical way to find the best number of clusters is to try k-means with different numbers of clusters and compare the resulting within-cluster sums of squares (a small sketch of this appears at the end of the k-means example below).

ADVANTAGES
  • Fast, robust and easier to understand.
  • Relatively efficient: O(tknd), where n is # objects, k is # clusters, d is # dimensions of each object, and t is # iterations. Normally, k, t, d << n.
  • Gives best result when data set are distinct or well separated from each other.
DISADVANTAGES
  • Requires a priori specification of the number of cluster centres.
  • If two clusters overlap heavily, k-means will not be able to resolve that there are two clusters.
  • Fails for categorical data.
  • It uses the centroid to represent a cluster and is therefore unable to handle noisy data and outliers.
  • The algorithm fails for non-linear data sets.
  • It does not work well with clusters (in the original data) of different sizes and different densities.



Now I'll show k-means clustering with the iris dataset.



>data(iris)
>str(iris)
'data.frame':  150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

We'll remove Species from the data since, as mentioned above, k-means does not work with categorical data.

>iris1=iris[,-5]
>head(iris1)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

Now we'll cluster the iris1 dataset using the kmeans() function with 3 clusters. Note that kmeans() starts from random centres, so the exact cluster labels and sizes can differ between runs.

>k_mean=kmeans(iris1,3)
>k_mean
K-means clustering with 3 clusters of sizes 96, 21, 33
 
Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     6.314583    2.895833     4.973958   1.7031250
2     4.738095    2.904762     1.790476   0.3523810
3     5.175758    3.624242     1.472727   0.2727273
 
Clustering vector:
  [1] 3 2 2 2 3 3 3 3 2 2 3 3 2 2 3 3 3 3 3 3 3 3 3 3 2 2 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 2 2 3 3 2 3 2 3 3 1 1 1 1 1 1
 [57] 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
[113] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 
Within cluster sum of squares by cluster:
[1] 118.651875  17.669524   6.432121
 (between_SS / total_SS =  79.0 %)
 
Available components:
 
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"        
[8] "iter"         "ifault"

The clustering result is then compared with the class label "Species" to check whether similar objects have been grouped together.

>table(iris$Species,k_mean$cluster)
              1  2  3
  setosa     50  0  0
  versicolor  0 48  2
  virginica   0 14 36

We see that setosa is easily separated into one cluster, but versicolor and virginica have some amount of overlap. Now we will look at this graphically.

>plot(iris1[c("Sepal.Length", "Sepal.Width")], col = k_mean$cluster)
>points(k_mean$centers[,c("Sepal.Length", "Sepal.Width")], col = 1:3,pch = 3, cex=2)




We can see that the black dots are separated as one cluster, whereas some of the red and green dots overlap, as we saw in the table for versicolor and virginica.
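As mentioned in the introduction, an empirical way to choose the number of clusters is to run kmeans() for several values of k and compare the total within-cluster sum of squares, looking for an "elbow" where adding more clusters stops helping much. A minimal sketch (the variable names here are my own):

>wss=sapply(1:10, function(k) kmeans(iris1, centers=k, nstart=20)$tot.withinss)
>plot(1:10, wss, type="b", xlab="Number of clusters k", ylab="Total within-cluster sum of squares")

For the iris measurements the bend typically appears around 2 or 3 clusters, in line with the three species.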


K- MEDOID CLUSTERING 


K-medoids clustering uses medoids rather than centroids to represent clusters; a medoid is the most centrally located data object in a cluster. First, k data objects are selected randomly as medoids to represent k clusters, and every remaining data object is placed in the cluster whose medoid is nearest (or most similar) to it. After processing all data objects, new medoids are determined that represent the clusters better, and the whole process is repeated: all data objects are again bound to the clusters based on the new medoids. In each iteration the medoids change their location step by step, and this continues until no medoid moves. As a result, k clusters are found that represent the set of n data objects.

We'll demonstrate k-medoids clustering with the functions pam() and pamk(). K-medoids clustering is more robust than k-means in the presence of outliers. PAM (Partitioning Around Medoids) is a classic algorithm for k-medoids clustering. While PAM is inefficient for clustering large data sets, the CLARA algorithm is an enhancement of PAM that draws multiple samples of the data, applies PAM to each sample and then returns the best clustering; it performs better than PAM on larger data.

We will "fpc" package for this clustering because in this package it  does not require a user to choose k. Instead, it calls the function pam() or clara() to perform a partitioning around medoids clustering with the number of clusters estimated by optimum average
silhouette width.

I have used the same iris data which was used in the k-means clustering

>library(fpc)   # provides pamk()
>medoid=pamk(iris1)
>medoid$nc
[1] 2
>table(medoid$pamobject$clustering,iris$Species)
    setosa versicolor virginica
  1     50          1         0
  2      0         49        50
>layout(matrix(c(1,2),1,2))
>plot(medoid$pamobject)
>layout(matrix(1))



We see that pamk() automatically created 2 clusters. We then check the clusters against the actual Species. The layout() call lets us see the two graphs on one screen, and then we plot the clusters. On the left are the two clusters, one for setosa and one for the combination of versicolor and virginica, and the red line shows the distance between the two clusters. On the right is the silhouette: a large Si (close to 1) suggests that the corresponding observations are very well clustered, a small Si (around 0) means that the observation lies between two clusters, and observations with a negative Si are probably placed in the wrong cluster. Since the average Si values are 0.81 and 0.62 respectively, the two identified clusters are well clustered.
You can also specify the number of clusters yourself if you are sure about it, as sketched below.
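A minimal sketch of fixing the number of clusters at 3 with pam() from the "cluster" package (the object names here are my own):

>library(cluster)        # provides pam()
>pam3=pam(iris1, k=3)    # k-medoids with k fixed at 3
>table(pam3$clustering, iris$Species)

Alternatively, pamk(iris1, krange=3) restricts pamk() itself to three clusters.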


HIERARCHICAL CLUSTERING 


The hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
  • Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  • Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
The two approaches are essentially the reverse of each other. Here I'll use hclust(), which implements the agglomerative (bottom-up) approach.
In this example I'll use the same iris data.
>hc=hclust(dist(iris1))
>plot(hc)
In the resulting dendrogram we can see that we can form 3 clusters; it is up to us where to cut the tree. We can see that setosa is easily separated (in the red box), whereas versicolor and virginica fall into the next boxes respectively.
We can also run hierarchical clustering first to get an idea of the number of clusters and then use that number in k-means clustering; this is an alternative way to choose the value of k.
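To actually extract cluster memberships from the dendrogram, we can cut the tree at a chosen number of groups with cutree(); a small sketch using the hc object from above:

>groups=cutree(hc, k=3)          # cut the dendrogram into 3 clusters
>table(groups, iris$Species)     # compare the groups with the actual species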


DENSITY - BASED CLUSTERING

The dbscan() function from the "fpc" package provides density-based clustering for numeric data. The idea of density-based clustering is to group objects into one cluster if they are connected to one another by a densely populated area. There are two key parameters in DBSCAN:

  • eps: reachability distance, which defines the size of the neighbourhood; and
  • MinPts: the minimum number of points.

If the number of points in the neighbourhood of a point is no less than MinPts, then that point is a dense point, and all the points in its neighbourhood are density-reachable from it and are put into the same cluster as it. The strengths of density-based clustering are that it can discover clusters of various shapes and sizes and that it is insensitive to noise. In comparison, the k-means algorithm tends to find spherical clusters of similar sizes.

In this again we will use same Iris data 

>densityclus<- dbscan(iris1, eps=0.42, MinPts=5)
>table(densityclus$cluster,iris$Species)
    setosa versicolor virginica
  0      2         10        17
  1     48          0         0
  2      0         37         0
  3      0          3        33

In the above table, "1" to "3" in the first column are the three identified clusters, while "0" stands for noise or outliers, i.e., objects that are not assigned to any cluster. The noise points are shown as black circles in the figure below.

>plot(densityclus,iris1)
>plotcluster(iris1, densityclus$cluster)

Again we can see that setosa (red, marked "1") is clustered cleanly, whereas versicolor and virginica (green "2" and blue "3") show some overlap.

Friday, 1 May 2015

Creating Decision Tree with R

                                              DECISION TREE USING R

Today I'll show how to make a decision tree using R. But before that, let me explain how it works; this will give you a better understanding and make it easier to follow the code and the construction of the trees.
As we know, a decision tree builds classification or regression models in the form of a tree structure.
It breaks a data set down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes:

  • A decision node (a test on a predictor variable) has two or more branches
  • A leaf node represents a classification or decision

The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
A decision tree can handle numerical as well as categorical data.

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided it has an entropy of one. For a two-class node with class proportions p and q:
              Entropy = -p log2(p) - q log2(q)

Information gain is based on the decrease in entropy after a dataset is split on an attribute:
              Gain(t, x) = Entropy(t) - Entropy(t, x)
Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e. the most homogeneous branches); a small worked example follows.
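As a quick illustration of these two formulas in R (a toy example with made-up class counts, not the Carseats data used below):

>entropy=function(p) -sum(p[p > 0] * log2(p[p > 0]))   # entropy of class proportions; p[p > 0] avoids log2(0)
>entropy(c(9/14, 5/14))   # parent node with a 9/5 class split, about 0.94
>child=(8/14)*entropy(c(6/8, 2/8)) + (6/14)*entropy(c(3/6, 3/6))   # split into branches of size 8 (6/2) and 6 (3/3)
>entropy(c(9/14, 5/14)) - child   # information gain of that split, about 0.05

The attribute giving the largest such gain would be chosen for the split.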

Enough talking, now let's get our hands dirty.
The packages used are "ISLR" and "tree" (the tree() function and its pruning helpers used below come from the tree package).
The dataset used in this example is "Carseats", which comes with the ISLR package.
Here I'll show you how to create a decision tree and how to prune it if required.

>library(ISLR)
>library(tree)
>range(Carseats$Sales) # to see the range
[1]  0.00 16.27
>high=ifelse(Carseats$Sales >=8, "good", "bad") # to change sales in categorical
>high=as.factor(high)
>carseats=Carseats
>carseats$high=high
>head(carseats)

  CompPrice Income Advertising Population Price ShelveLoc Age Education Urban  US high
1       138     73          11        276   120       Bad  42        17   Yes Yes good
2       111     48          16        260    83      Good  65        10   Yes Yes good
3       113     35          10        269    80    Medium  59        12   Yes Yes good
4       117    100           4        466    97    Medium  55        14   Yes Yes  bad
5       141     64           3        340   128       Bad  38        13   Yes  No  bad
6       124    113          13        501    72       Bad  78        16    No Yes good

Here we have loaded the two required packages, ISLR and tree, and then looked at the range of our target variable, Sales, which runs from 0 to about 16.3. This helps us convert Sales into a categorical variable: in the next step we label sales greater than or equal to 8 as "good" and the rest as "bad". We then convert the labels to a factor, to make them easier to work with, and attach the new variable to a copy of the original dataset.

>carseats=carseats[,-1]
>train= sample(1:nrow(carseats),nrow(carseats)/2)
>test=-train
>traindata=carseats[train,]
>testdata=carseats[test,]
>testhigh=carseats$high[test]

Now we remove the original Sales variable, since in the step above we converted it into a dichotomous variable, which is our target. Then we split the dataset into training and test sets so that we can later check the accuracy of our model.

>treemod=tree(high~.,traindata)
>plot(treemod)

>text(treemod, pretty = 0)
>tree_pred=predict(treemod,testdata, type="class")
>mean(tree_pred != testhigh)
[1] 0.23

Here we have fitted the tree model on the training data and plotted it.
The pretty = 0 argument to text() makes it print the actual category names for qualitative predictors instead of abbreviated labels. We can see that the tree is very bushy, so now we check the model on the test data: the next two commands compute the misclassification error, which comes out fairly high at 23% (a quick confusion matrix sketch follows). We therefore need to prune the tree, and to decide how many levels to prune we use cross-validation.
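To see where those errors occur, a confusion matrix of predicted against actual classes helps (a small sketch; no output is shown here because the random train/test split changes the exact counts):

>table(tree_pred, testhigh)   # rows = predicted class, columns = actual class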

>set.seed(2)
>cv_tree=cv.tree(treemod, FUN= prune.misclass)   ## cross-validation to find the best tree size for pruning
>names(cv_tree)   ## dev = cross validation error rate
>plot(cv_tree$size,cv_tree$dev,type="b")

We have set the seed for reproducibility, and cv.tree() with FUN = prune.misclass runs the cross-validation used to prune the tree. Looking at the figure, we pick the size with the minimum deviance, i.e. the lowest cross-validation error rate, while being careful that the tree does not get too big; this size can also be read off programmatically, as sketched below.
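If you prefer to pick that size programmatically instead of from the plot, a small sketch using the cv_tree object from above (below, a size of 5 is used):

>cv_tree$size[which.min(cv_tree$dev)]   # tree size with the lowest cross-validation error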

>prunedmod=prune.misclass(treemod, best=5)
>plot(prunedmod)
>text(prunedmod , pretty= 0)

>tree_predict=predict(prunedmod, testdata, type= "class")
>mean(tree_predict != testhigh)
[1] 0.10

After pruning the tree to the chosen size we get the figure above, which looks much cleaner, but we still have to check the misclassification error rate. The next commands do exactly that and give 10%, which is lower than for the unpruned model. And that is how we create a decision tree and prune it.