OUTLIER DETECTION – With LOF (Local Outlier Factor) & Clustering
Univariate Outlier Detection
First I’ll show the univariate outlier method, then follow with its application to multivariate data. In this example, univariate outlier detection is done with the function boxplot.stats():
boxplot.stats(x, coef = 1.5, do.conf = TRUE, do.out = TRUE)
In the result returned by this function, one component is out, which gives a list of the outliers; more specifically, it lists the data points lying beyond the extremes of the whiskers. The coef argument controls how far the whiskers extend out from the box of the boxplot (the default of 1.5 corresponds to the usual 1.5 * IQR rule).
Let’s explain it with an example:
> set.seed(701)
> x<- rnorm(140)
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.46100 -0.56250 0.03992 0.03355 0.66270 2.92900
> boxplot.stats(x)$out
[1] 2.929094 -2.461077 2.608763
>boxplot(x)
In the above code we first set the seed and then create x, a vector of 140 random normal deviates. Passing x to summary() gives the quartile values, and boxplot.stats(x)$out pulls out the outliers present in x, of which there are three. Finally, boxplot() draws a boxplot that shows the outliers as three circles: two above the upper whisker and one below the lower whisker, matching the three outlier values we saw just before.
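To see what the coef = 1.5 argument does, here is a minimal sketch (my own addition, not part of the original example) that computes the 1.5 * IQR fences by hand. Note that boxplot.stats() builds its whiskers from Tukey's hinges (via fivenum()) rather than from quantile(), so the fences can differ very slightly, but on this data the flagged points should be the same.
> q <- quantile(x, c(0.25, 0.75))   ## first and third quartiles
> iqr <- q[2] - q[1]                ## interquartile range
> lower <- q[1] - 1.5 * iqr         ## lower fence
> upper <- q[2] + 1.5 * iqr         ## upper fence
> x[x < lower | x > upper]          ## should match boxplot.stats(x)$out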
The same method can also be used to detect outliers in multivariate data. Enough talking, let’s show it with an example:
>y<- rnorm(140)
>boxplot.stats(y)$out
[1] 2.469998 2.576715 2.386554
>df<-data.frame(x,y)
>head(df)
x y
1 0.5465374 1.48274691
2 0.3066107 -1.66959326
3 0.6155518 -0.66057890
4 -0.3204095 0.40181322
5 0.9276831 0.03933041
6 0.4898570 0.01733414
> (a <- which(x %in% boxplot.stats(x)$out))
[1] 79 113 129
> (b <- which(y %in% boxplot.stats(y)$out))
[1] 9 19 114
> (outlier_all <- union(a,b))
[1] 79 113 129 9 19 114
> plot(df)
> points(df[outlier_all,], col="red", pch="x", cex=2)
Again I created y, a vector of random deviates of the same length as x, since equal lengths are needed to build a data frame. We then checked y for outliers and got three numbers. Next we created the data frame with data.frame(x, y). To mark the outliers in the scatter plot we gathered their indexes with
(a <- which(x %in% boxplot.stats(x)$out))
for x, and
(b <- which(y %in% boxplot.stats(y)$out))
for y.
Marking all the outliers then requires the union of these two sets of indexes, which union(a, b) computes. With the last two commands the scatter plot is drawn and the outliers are marked on it.
Note, though, that in real-world applications domain knowledge is still needed to judge which of the flagged points are genuine outliers.
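With more than two columns, repeating this by hand gets tedious. A possible generalisation (a sketch of my own; the helper name univariate_outliers is not from the original post) is to wrap the rule in a function and apply it to every numeric column:
> univariate_outliers <- function(v, coef = 1.5) {
+   which(v %in% boxplot.stats(v, coef = coef)$out)  ## indexes of univariate outliers in v
+ }
> out_idx <- lapply(df, univariate_outliers)  ## list of outlier indexes, one entry per column
> Reduce(union, out_idx)                      ## same result as union(a, b) above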
OUTLIER DETECTION WITH LOF
The local outlier factor (LOF) is an algorithm for finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbors.
As the name indicates, the local outlier factor is based on a concept of local density, where locality is given by the nearest neighbors, whose distances are used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, as well as points that have a substantially lower density than their neighbors. These are considered to be outliers.
This method has its own advantages as well as disadvantages. The advantage is that, due to the local approach, LOF is able to identify outliers in a data set that would not be outliers in another area of the data set. For example, a point at a "small" distance from a very dense cluster is an outlier, while a point within a sparse cluster might exhibit similar distances to its neighbors. The LOF family of methods can be easily generalized and then applied to various other problems, such as detecting outliers in geographic data, video streams or authorship networks.
The disadvantage is that the resulting values are quotients (ratios of densities) and hard to interpret. A value of 1 or less indicates a clear inlier, but there is no clear rule for when a point is an outlier. In one data set a value of 1.1 may already be an outlier, while in another data set and parameterization (with strong local fluctuations) a value of 2 could still be an inlier. It also works only on numeric data.
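Because LOF scores are ratios with no universal cutoff, a common practical (and admittedly ad hoc) approach is simply to rank the scores, or to pick a threshold suited to the data at hand. A minimal sketch, assuming a numeric vector scores of LOF values has already been computed:
> threshold <- 1.5                        ## data-dependent choice, not a fixed rule
> suspects <- which(scores > threshold)   ## indexes of points flagged as outliers
> head(sort(scores, decreasing = TRUE))   ## or just inspect the largest scores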
The packages required are DMwR [Torgo, 2010] and dprep. Enough talking, let’s see an example.
In this example the familiar iris data set, which comes built into R, is used:
>library("DMwR") ## the library is loaded with this command
>data(iris)
>head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> iris1 <- iris[,1:4] ## LOF works only on numeric data, so the "Species" column is removed
> outlier.scores <- lofactor(iris1, k=5) ## lofactor() calculates local outlier factors using the LOF algorithm; k is the number of neighbors used
> plot(density(outlier.scores)) ## this gives a density plot of the outlier scores
> outliers <- order(outlier.scores, decreasing=T)[1:5] ## with this the top 5 outliers are picked
> print(outliers) ## print the indexes of the outliers
[1] 42 107 23 110 63
>print(iris1[outliers,])
Sepal.Length Sepal.Width Petal.Length Petal.Width
42 4.5 2.3 1.3 0.3
107 4.9 2.5 4.5 1.7
23 4.6 3.6 1.0 0.2
110 7.2 3.6 6.1 2.5
63 6.0 2.2 4.0 1.0
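To see where these five points sit relative to the rest of the data, one option (my own addition, not in the original post) is to colour them in a scatterplot matrix:
> n <- nrow(iris1)
> cols <- rep("black", n)        ## default colour for ordinary points
> cols[outliers] <- "red"        ## LOF outliers highlighted in red
> pairs(iris1, col = cols, pch = 19, cex = 0.7)  ## scatterplot matrix of the four numeric columns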
OUTLIER DETECTION BY CLUSTERING
Cluster analysis, or clustering, is the task of assigning a set of objects into groups, called clusters, so that the objects in the same cluster are more similar to each other (in some sense) than to those in other clusters. Clustering is unsupervised classification, meaning there are no predefined classes.
Here we’ll use the k-means algorithm to detect outliers. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
Since each observation belongs to the cluster with the nearest mean, we can calculate the distance (or dissimilarity) between each object and its cluster center and pick those with the largest distances as outliers.
We’ll continue the example with the iris data only.
> iris1 <- iris[,1:4] ## the k-means algorithm is not directly applicable to categorical data, so the column "Species" is removed
> kmeans.result <- kmeans(iris1, centers=3) ## cluster the remaining data with 3 centers (one way to choose k is to start with a large value and keep removing centroids, reducing k, until doing so no longer reduces the description length)
> kmeans.result$centers ## the cluster centers
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.901613 2.748387 4.393548 1.433871
2 6.850000 3.073684 5.742105 2.071053
3 5.006000 3.428000 1.462000 0.246000
> centers <- kmeans.result$centers[kmeans.result$cluster, ] ## the center of the cluster that each observation is assigned to
> head(centers) ## first 6 observations' assigned cluster centers
Sepal.Length Sepal.Width Petal.Length Petal.Width
3 5.006 3.428 1.462 0.246
3 5.006 3.428 1.462 0.246
3 5.006 3.428 1.462 0.246
3 5.006 3.428 1.462 0.246
3 5.006 3.428 1.462 0.246
3 5.006 3.428 1.462 0.246
> distances <- sqrt(rowSums((iris1 - centers)^2)) ## Euclidean distance of each observation from its cluster center
> outliers <- order(distances, decreasing=T)[1:5] ## pick the 5 observations farthest from their centers
> print(outliers)
> plot(iris1[, c("Sepal.Length", "Sepal.Width")], pch="o", col=kmeans.result$cluster, cex=0.3)
> points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")], col=1:3, pch=2, cex=1.5)
> points(iris1[outliers, c("Sepal.Length", "Sepal.Width")], pch="+", col=4, cex=1.5)
The above plot shows the cluster centers as triangles and the outliers marked in blue with a '+' sign (only the Sepal.Length and Sepal.Width dimensions are plotted).
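The whole distance-from-center procedure can also be packaged into a small helper for reuse. This is a sketch under the same assumptions as above; the function name kmeans_outliers and its defaults are mine, not from the post, and because kmeans() starts from random centers you should call set.seed() first if you want reproducible results.
> kmeans_outliers <- function(data, centers = 3, n_out = 5) {
+   km  <- kmeans(data, centers = centers)   ## cluster the numeric data
+   ctr <- km$centers[km$cluster, ]          ## center assigned to each observation
+   d   <- sqrt(rowSums((data - ctr)^2))     ## Euclidean distance to the assigned center
+   order(d, decreasing = TRUE)[1:n_out]     ## indexes of the n_out farthest observations
+ }
> kmeans_outliers(iris1)   ## reproduces the analysis above in one call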