2015年12月6日 星期日

Assessing clustering tendency: A vital issue - Unsupervised

Clustering algorithms, including partitioning methods (K-means, PAM, CLARA and FANNY) and hierarchical clustering, are used to split the dataset into groups or clusters of similar objects.
Before applying any clustering method on the dataset, a natural question is:
Does the dataset contains any inherent clusters?
A big issue, in unsupervised machine learning, is that clustering methods will return clusters even if the data does not contain any clusters. In other words, if you blindly apply a clustering analysis on a dataset, it will divide the data into clusters because that is what it supposed to do.
Therefore before choosing a clustering approach, the analyst has to decide whether the dataset contains meaningful clusters (i.e nonrandom structures) or not. If yes, then how many clusters are there. This process is defined as the assessing of clustering tendency or the feasibility of the clustering analysis.

In this chapter:
  • We describe why we should evaluate the clustering tendency (i.e., clusterability) before applying any cluster analysis on a dataset.
  • We describe statistical and visual methods for assessing the clustering tendency
  • R lab sections containing many examples are also provided for computing clustering tendency and visualizing clusters

1 Required packages

The following R packages are required in this chapter:
  • factoextra for data visualization
  • clustertend for assessing clustering tendency
  • seriation for visually assessment of cluster tendency
  1. factoextra can be installed as follow:
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")
 
  1. Install clustertend and seriation:
install.packages("clustertend")
install.packages("seriation")
 
  1. Load required packages:
library(factoextra)
library(clustertend)
library(seriation)
 

2 Data preparation

We’ll use two datasets: the built-in R dataset faithful and a simulated dataset.

2.1 faithful dataset

faithful dataset contains the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park (Wyoming, USA).
# Load the data
data("faithful")
df <- faithful
head(df)
  
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
  
An illustration of the data can be drawn using ggplot2 package as follow:
library("ggplot2")
ggplot(df, aes(x=eruptions, y=waiting)) +
  geom_point() +  # Scatter plot
  geom_density2d() # Add 2d density estimation
  
Clustering tendency - R data visualization

2.2 Random uniformly distributed dataset

The R code below generates a random uniform data with the same dimension as the faithful dataset. The function runif(n, min, max) is used for generating uniform distribution on the interval from min to max.
# Generate random dataset
set.seed(123)
n <- nrow(df)
random_df <- data.frame(
  x = runif(nrow(df), min(df$eruptions), max(df$eruptions)),
  y = runif(nrow(df), min(df$waiting), max(df$waiting)))
# Plot the data
ggplot(random_df, aes(x, y)) + geom_point()
  
Clustering tendency - R data visualization

Note that for a given real dataset, random uniform data can be generated in a single line function call as follow:
random_df <- apply(df, 2, 
                function(x, n){runif(n, min(x), (max(x)))}, n)
   

3 Why assessing clustering tendency?

As shown above, we know that faithful dataset contains 2 real clusters. However the randomly generated dataset doesn’t contain any meaningful clusters.
The R code below computes k-means clustering and/or hierarchical clustering on the two datasets. The function fviz_cluster() and fviz_dend() [in factoextra] will be used to visualize the results.
library(factoextra)
set.seed(123)
# K-means on faithful dataset
km.res1 <- kmeans(df, 2)
fviz_cluster(list(data = df, cluster = km.res1$cluster),
             frame.type = "norm", geom = "point", stand = FALSE)
 
Clustering tendency - R data visualization
# K-means on the random dataset
km.res2 <- kmeans(random_df, 2)
fviz_cluster(list(data = random_df, cluster = km.res2$cluster),
             frame.type = "norm", geom = "point", stand = FALSE)
 
Clustering tendency - R data visualization
# Hierarchical clustering on the random dataset
fviz_dend(hclust(dist(random_df)), k = 2,  cex = 0.5)
 
Clustering tendency - R data visualization
It can be seen that, k-means algorithm and hierarchical clustering impose a classification on the random uniformly distributed dataset even if there are no meaningful clusters present in it.
Clustering tendency assessment methods are used to avoid this issue.

4 Methods for assessing clustering tendency

Clustering tendency assessment determines whether a given dataset contains meaningful clusters (i.e., non-random structure).
In this section, we’ll describe two methods for determining the clustering tendency: i) a statistical (Hopkins statistic) and ii) a visual methods (Visual Assessment of cluster Tendency (VAT) algorithm).

4.1 Hopkins statistic

Hopkins statistic is used to assess the clustering tendency of a dataset by measuring the probability that a given dataset is generated by a uniform data distribution. In other words it tests the spatial randomness of the data.

4.1.1 Algorithm

Let D be a real dataset. The Hopkins statistic can be calculated as follow:

  1. Sample uniformly n points (p1,…, pn) from D.
  2. For each point piD, find it’s nearest neighbor pj; then compute the distance between pi and pj and denote it as xi=dist(pi,pj)
  3. Generate a simulated dataset (randomD) drawn from a random uniform distribution with n points (q1,…, qn) and the same variation as the original real dataset D.
  4. For each point qirandomD, find it’s nearest neighbor qj in D; then compute the distance between qi and qj and denote it yi=dist(qi,qj)
  5. Calculate the Hopkins statistic (H) as the mean nearest neighbor distance in the random dataset divided by the sum of the mean nearest neighbor distances in the real and across the simulated dataset.
The formula is defined as follow:
H=i=1nyii=1nxi+i=1nyi
A value of H about 0.5 means that i=1nyi and i=1nxi are close to each other, and thus the data D is uniformly distributed.
The null and the alternative hypotheses are defined as follow:
  • Null hypothesis: the dataset D is uniformly distributed (i.e., no meaningful clusters)
  • Alternative hypothesis: the dataset D is not uniformly distributed (i.e., contains meaningful clusters)
If the value of Hopkins statistic is close to zero, then we can reject the null hypothesis and conclude that the dataset D is significantly a clusterable data.

4.1.2 R function for computing Hopkins statistic

The function hopkins() [in clustertend package] can be used to statistically evaluate clustering tendency in R. The simplified format is:
hopkins(data, n, byrow = F, header = F)
   

  • data: a data frame or matrix
  • n: the number of points to be selected from the data
  • byrow: logical value. If FALSE (default), the variables is taken by columns, otherwise the variables is taken by rows
  • header: logical. If FALSE (the default) the first column (or row) will be deleted in the calculation
library(clustertend)
# Compute Hopkins statistic for faithful dataset
set.seed(123)
hopkins(faithful, n = nrow(faithful)-1)
   
## $H
## [1] 0.1588201
   
# Compute Hopkins statistic for a random dataset
set.seed(123)
hopkins(random_df, n = nrow(random_df)-1)
   
## $H
## [1] 0.5388899
   
It can be seen that faithful dataset is highly clusterable (the H value = 0.15 which is far below the threshold 0.5). However the random_df dataset is not clusterable (H=0.53)

4.2 VAT: Visual Assessment of cluster Tendency

The visual assessment of cluster tendency (VAT) has been originally described by Bezdek and Hathaway (2002). This approach can be used to visually inspect the clustering tendency of the dataset.

4.2.1 VAT Algorithm

The algorithm of VAT is as follow:

  1. Compute the dissimilarity (DM) matrix between the objects in the dataset using Euclidean distance measure
  2. Reorder the DM so that similar objects are close to one another. This process create an ordered dissimilarity matrix (ODM)
  3. The ODM is displayed as an ordered dissimilarity image (ODI), which is the visual output of VAT

4.2.2 R functions for VAT

We start by scaling the data using the function scale(). Next we compute the dissimilarity matrix between observations using the function dist(). finally the function dissplot() [in the package seriation] is used to display an ordered dissimilarity image.
The R code below computes VAT algorithm for the faithful dataset
library("seriation")
# faithful data: ordered dissimilarity image
df_scaled <- scale(faithful)
df_dist <- dist(df_scaled) 
dissplot(df_dist)
   
Clustering tendency - R data visualization
The gray level is proportional to the value of the dissimilarity between observations: pure black if dist(xi,xj)=0 and pure white if dist(xi,xj)=1. Objects belonging to the same cluster are displayed in consecutive order. 
The VAT detects the clustering tendency in a visual form by counting the number of square shaped dark blocks along the diagonal in a VAT image.
The figure above suggests two clusters represented by two well-formed black blocks.
The same analysis can be done with the random dataset:
# faithful data: ordered dissimilarity image
random_df_scaled <- scale(random_df)
random_df_dist <- dist(random_df_scaled) 
dissplot(random_df_dist)
   
Clustering tendency - R data visualization
It can be seen that the random_df dataset doesn’t contain any evident clusters.
Now, we can perform k-means on faithful dataset and add cluster labels on the dissimilarity plot:
set.seed(123)
km.res <- kmeans(scale(faithful), 2)
dissplot(df_dist, labels = km.res$cluster)
   
Clustering tendency - R data visualization
After showing that the data is clusterable, the next step is to determine the number of optimal clusters in the data. This will be described in the next chapter.

5 A single function for Hopkins statistic and VAT

The function get_clust_tendency() [in factoextra package] can be used to compute Hopkins statistic and provides also an ordered dissimilarity image using ggplot2, in a single function call. The ordering of dissimilarity matrix is done using hierarchical clustering.
# Cluster tendency
clustend <- get_clust_tendency(scale(faithful), 100)
 
Clustering tendency - R data visualization
# Hopkins statistic
clustend$hopkins_stat
 
## [1] 0.1482683
 
# Customize the plot
clustend$plot + 
  scale_fill_gradient(low = "steelblue", high = "white")
 
Clustering tendency - R data visualization

6 Infos

This analysis has been performed using R software (ver. 3.2.1)

沒有留言:

張貼留言