Thursday, 30 April 2015

OUTLIER DETECTION USING R

OUTLIER DETECTION – With LOF (Local Outlier Factor) & Clustering


Univariate Outlier Detection

First I’ll show the univariate outlier method, followed by its application to multivariate data. In this example, univariate outlier detection is done with the function boxplot.stats():
boxplot.stats(x, coef = 1.5, do.conf = TRUE, do.out = TRUE)

In the result returned by the above function, one component is out, which gives a list of outliers. More specifically, it lists data points lying beyond the extremes of the whiskers, i.e. more than coef times the interquartile range (IQR) away from the box. The coef argument can therefore be used to control how far the whiskers extend out from the box of a boxplot.

Let’s explain it with an example:

> set.seed(701)
> x<- rnorm(140)
> summary(x) 
  Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.46100 -0.56250  0.03992  0.03355  0.66270  2.92900 
> boxplot.stats(x)$out
[1]  2.929094 -2.461077  2.608763

>boxplot(x)




In the above code we first set the seed and then created x, a vector of 140 random normal deviates. Passing it to summary() gives the quartile values, and boxplot.stats(x)$out pulls out the outliers present in x; we can see 3 numbers. Finally, boxplot() draws a boxplot that shows the outliers as 3 circles: 2 above the upper whisker and one below the lower whisker, which matches what we would guess from the outlier values themselves.
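To see the effect of the coef argument, here is a minimal sketch reusing the same x from above; with a larger coefficient the whiskers extend further and fewer points are flagged (the exact output depends on the simulated data, so it is not shown):

> boxplot.stats(x, coef = 1.5)$out    ## default whiskers: flags the 3 points seen above
> boxplot.stats(x, coef = 3)$out      ## whiskers at 3 * IQR: only more extreme points are flagged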

The above method can also be used to detect outliers in multivariate data. Enough talking; let’s show it with an example.

>y<- rnorm(140)
>boxplot.stats(y)$out
[1] 2.469998 2.576715 2.386554
>df<-data.frame(x,y)

>head(df) 
           x           y
1  0.5465374  1.48274691
2  0.3066107 -1.66959326
3  0.6155518 -0.66057890
4 -0.3204095  0.40181322
5  0.9276831  0.03933041
6  0.4898570  0.01733414

> (a <- which(x %in% boxplot.stats(x)$out))
 [1]  79 113 129
> (b <- which(y %in% boxplot.stats(y)$out))
 [1]   9  19 114
> (outlier_all <- union(a,b))
 [1]  79 113 129   9  19 114
>  plot(df)
>  points(df[outlier_all,], col="red", pch="x", cex=2)


Again I created random deviates, named y, of the same length as x, since equal lengths are needed to create a data frame. We then checked for the outliers in y and got 3 numbers. Then we created the data frame with the function data.frame(), passing in x and y. After this, to mark the outliers in the scatter plot, we gathered the indices of the outliers with:

(a <- which(x %in% boxplot.stats(x)$out))    ## for x
(b <- which(y %in% boxplot.stats(y)$out))    ## for y

Then, to mark all the outliers, we take the union of the two index vectors generated above; the final two commands draw the scatter plot and mark the outliers in red. Note, though, that in real-world applications domain knowledge is required to decide whether such points are genuinely outliers.
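As a variation (a sketch, not part of the original example), if we wanted only the points that are extreme in both coordinates rather than in either one, we could replace union() with intersect(); for random data this set may well be empty:

> (outlier_both <- intersect(a, b))    ## indices that are outliers in x AND in y
> points(df[outlier_both,], col="blue", pch="+", cex=2)    ## mark them on the existing plot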


OUTLIER DETECTION WITH LOF

The local outlier factor (LOF) is an algorithm for finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbors.
As the name indicates, the local outlier factor is based on the concept of local density, where locality is given by the nearest neighbors, whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, and points that have a substantially lower density than their neighbors; these are considered to be outliers.

This method has its own advantages as well as disadvantages. The advantage is that, due to the local approach, LOF is able to identify outliers in a data set that would not be outliers in another area of the data set. For example, a point at a "small" distance from a very dense cluster can be an outlier, while a point within a sparse cluster might exhibit similar distances to its neighbors without being one. The LOF family of methods can be easily generalized and then applied to various other problems, such as detecting outliers in geographic data, video streams, or authorship networks.

The disadvantage is that the resulting values are quotients (ratios of local densities) and hard to interpret. A value of 1 or even less indicates a clear inlier, but there is no clear rule for when a point is an outlier: in one data set a value of 1.1 may already be an outlier, while in another data set and parameterization (with strong local fluctuations) a value of 2 could still be an inlier. Also, it works only on numeric data.
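Because of this, a common practical step is to inspect the distribution of scores and pick a data-set-specific cutoff by hand. A minimal sketch (assuming iris1 and the DMwR package as prepared in the example below; the cutoff 1.5 is an arbitrary illustration, not a rule):

> scores <- lofactor(iris1, k=5)        ## LOF scores, computed as in the example below
> sort(scores, decreasing=TRUE)[1:10]   ## look at the largest scores for a natural gap
> which(scores > 1.5)                   ## flag points above a hand-picked cutoff (assumed value)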

The packages required are DMwR [Torgo, 2010] and dprep.
Enough talking; let’s see some examples.

In this example our very own iris data set is used, which comes built into R.

>library("DMwR")    ## the library is loaded with this command
>data(iris)
>head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

>iris1 <- iris[,1:4]    ## as we know LOF works only with numeric data, so we remove "Species"
>outlier.scores <- lofactor(iris1, k=5)    ## lofactor() calculates local outlier factors using the LOF algorithm, where k is the number of neighbors used in the calculation
>plot(density(outlier.scores))    ## this will give the density plot of the outlier scores


>outliers <- order(outlier.scores, decreasing=TRUE)[1:5]    ## the top 5 outliers are picked
>print(outliers)    ## print the indices of the outliers

[1]  42 107  23 110  63
>print(iris1[outliers,])
Sepal.Length Sepal.Width Petal.Length Petal.Width
42           4.5         2.3          1.3         0.3
107          4.9         2.5          4.5         1.7
23           4.6         3.6          1.0         0.2
110          7.2         3.6          6.1         2.5
63           6.0         2.2          4.0         1.0
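To see where these five points sit relative to the rest of the data, one option (a sketch, not part of the original output) is a pairs plot with the outliers highlighted:

> n <- nrow(iris1)
> pch <- rep(".", n); pch[outliers] <- "+"       ## draw outliers as "+", all other points as dots
> col <- rep("black", n); col[outliers] <- "red"
> pairs(iris1, pch=pch, col=col)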




OUTLIER DETECTION BY CLUSTERING

Cluster analysis, or clustering, is the task of assigning a set of objects into groups called clusters, so that the objects in the same cluster are more similar in some sense to each other than to those in other clusters. Clustering is unsupervised classification, meaning we have no predefined classes.

Here we’ll use the k-means algorithm to detect the outliers. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
Since each observation belongs to the cluster with the nearest mean, we can calculate the distance (or dissimilarity) between each object and its cluster center, and pick those with the largest distances as outliers.
We’ll continue the example with the iris data.

>iris1 <- iris[,1:4]    ## as we know the k-means algorithm is not directly applicable to categorical data, we remove the column "Species"

>kmeans.result <- kmeans(iris1, centers=3)    ## cluster the remaining data with 3 centers (one heuristic for choosing k: start with a large value and keep removing centroids, reducing k, until doing so no longer reduces the description length)

>kmeans.result$centers    ## to find the centers

Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.901613    2.748387     4.393548    1.433871
2     6.850000    3.073684     5.742105    2.071053
3     5.006000    3.428000     1.462000    0.246000



>centers <- kmeans.result$centers[kmeans.result$cluster, ]    ## the center of the cluster each observation belongs to (the centers matrix indexed by the cluster vector)

>head(centers)    ## the centers matched to the first 6 observations; the row names are their cluster numbers

  Sepal.Length Sepal.Width Petal.Length Petal.Width
3        5.006       3.428        1.462       0.246
3        5.006       3.428        1.462       0.246
3        5.006       3.428        1.462       0.246
3        5.006       3.428        1.462       0.246
3        5.006       3.428        1.462       0.246
3        5.006       3.428        1.462       0.246

>distances <- sqrt(rowSums((iris1 - centers)^2))    ## the Euclidean distance of each observation from its cluster center

>outliers <- order(distances, decreasing=TRUE)[1:5]    ## pick the 5 observations with the largest distances
>print(outliers)
>plot(iris1[,c("Sepal.Length", "Sepal.Width")], pch="o", col=kmeans.result$cluster, cex=0.3)
>points(kmeans.result$centers[,c("Sepal.Length", "Sepal.Width")], col=1:3, pch=2, cex=1.5)
>points(iris1[outliers, c("Sepal.Length", "Sepal.Width")], pch="+", col=4, cex=1.5)




The above plot shows us the cluster centers (the triangles) and the outliers, marked in blue with "+" signs.
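Instead of fixing the number of outliers in advance, one could also flag every point whose distance exceeds a chosen quantile of all the distances; a minimal sketch (the 95th percentile is an arbitrary illustration):

> cutoff <- quantile(distances, 0.95)    ## assumed threshold: the top 5% of distances
> which(distances > cutoff)              ## indices of points beyond the cutoff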
