Sunday, December 20, 2015

BayesiaLab-Like Network Graphs for Free with R

My screen has been filled with ads from BayesiaLab since I downloaded their free book. Just as I began to have regrets, I received an email invitation to try out their demo datasets. I was especially interested in their perfume ratings data. In this monadic product test, each of 1,321 French women was presented with only one of 11 perfumes and asked to evaluate on a 10-point scale a series of fragrance-related adjectives along with a few user-imagery descriptors. I have added the 6-point purchase intent item to the analysis in order to assess its position in this network.
Can we start by looking at the partial correlation network? I will refer you to my post on Driver Analysis vs. Partial Correlation Analysis and will not repeat that more detailed overview.
Each node is a variable (e.g., purchase intent is located on the far right). An edge drawn between two nodes shows the partial correlation between them after controlling for all the other variables in the network. The color indicates the sign of the partial correlation, green for positive and red for negative, and the size of the partial correlation is indicated by the thickness of the edge.
Simply scanning the map reveals the underlying structure of global connections among even more strongly joined regions:
  • Northwest – In Love / Romantic / Passionate / Radiant,
  • Southwest – Bold / Active / Character / Fulfilled / Trust / Free, 
  • Mid-South – Classical / Tenacious / Quality / Timeless / High End, 
  • Mid-North – Wooded / Spiced, 
  • Center – Chic / Elegant / Rich / Modern, 
  • Northeast – Sweet / Fruity / Flowery / Fresh, and
  • Southeast – Easy to Wear / Please Others / Pleasure. 
Unlike the Probabilistic Structural Equation Model (PSEM) in Chapter 8 of BayesiaLab’s book, my network is undirected because I can find no justification for assigning causality. Yet, the structure appears to be much the same for the two analyses, for example, compare this partial correlation network with BayesiaLab’s Figure 8.2.3.
All this looks very familiar to those of us who have analyzed consumer rating scales. First, we expect negative skew and high collinearity because consumers tend to give ratings in the upper end of the scale and their responses often are highly intercorrelated. In fact, the first principal component accounted for 64% of the total variation, and it would have been higher had Wooded and Spiced been excluded from the battery.
A more cautious researcher might stop after extracting a single dimension and simply conclude that the women either liked or disliked the perfumes they tested and rated everything uniformly higher or lower. They would speak of halo effects and question whether anything more than an overall score could be extracted from the data. Nevertheless, as we see from the above partial correlation network, there is an interpretable local structure even when all the variables are highly interrelated.
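For readers who want to reproduce that kind of figure, a quick check of the first component's share of variance might look like the sketch below, assuming (as in the code at the end of this post) that the rating items sit in columns 3:48 of the perfume data frame:
# Rough check of the first principal component's share of variance
# (assumes the rating items occupy columns 3:48 of perfume, as in the code below)
ev <- eigen(cor(perfume[, 3:48], use = "pairwise.complete.obs"))$values
ev[1] / sum(ev)   # proportion of total variance; reported as roughly 64% in the text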
I have discussed this issue before in a post about separating global from specific factors. The bifactor model outlined in that post provides another view into the structure of the perfume rating data. What if there were a global factor explaining what we might call the “halo effect” (i.e., uniformly high correlations among all the variables) and then additional specific factors accounting for the extra correlation among different subsets of variables (e.g., the regions in the above partial correlation network map)?
The bifactor diagram shown below may not be pretty with so many variables to be arrayed. However, you can see the high factor loadings radiating out from the global factor g and how the specific factors F1* through F6* provide a secondary level of local structure corresponding to the regions identified in the above network.
I will end with a technical note. The 1,321 observations were nested within the 11 perfumes, with each respondent seeing only one perfume. Although we would not expect the specific perfume rated to alter the correlations (factorial invariance), mean-level differences between the perfumes could inflate the correlations calculated over the entire sample. To test this, I reran the analysis with deviation scores, subtracting the corresponding mean perfume score from each respondent's original ratings (a sketch of this centering step appears after the code below). The results were essentially the same.
R Code Needed to Import CSV File and Produce Plots
# Set working directory and import data file
setwd("C:/directory where file located")
perfume<-read.csv("Perfume.csv", sep=";")
# Check each variable's response distribution and missing values
apply(perfume, 2, function(x) table(x, useNA="always"))
 
# Calculates Sparse Partial Correlation Matrix
library("qgraph")
sparse_matrix<-EBICglasso(cor(perfume[,2:48]), n=1321)
qgraph(sparse_matrix, layout="spring", 
       label.scale=FALSE, labels=names(perfume)[2:48],
       label.cex=1, node.width=.5)
 
library(psych)
# Purchase Intent Not Included
scree(perfume[,3:48])
omega(perfume[,3:48], nfactors=6)
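The deviation-score check described in the technical note can be sketched roughly as below. This assumes that the first column of perfume identifies which of the 11 perfumes each respondent rated; the actual column layout is not shown above, so treat this as a hedged sketch rather than the original analysis.
# Center each rating on its perfume's mean (assumes column 1 is the perfume identifier)
perfume_dev <- perfume
perfume_dev[, 2:48] <- lapply(perfume[, 2:48], function(x)
  x - ave(x, perfume[, 1], FUN = function(v) mean(v, na.rm = TRUE)))
# Re-estimate the sparse partial correlation network on the centered ratings
sparse_matrix_dev <- EBICglasso(cor(perfume_dev[, 2:48], use = "pairwise"), n = 1321)
qgraph(sparse_matrix_dev, layout = "spring",
       label.scale = FALSE, labels = names(perfume)[2:48],
       label.cex = 1, node.width = .5)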

Sunday, December 6, 2015

Tools Galore: R, Python, SAS …

If you have been in the analytics profession for a couple of years, you have had your fair share of discussions about the right tool for a given job. You may have your favorite, or you may be a master of many. If you are new to analytics, you may be learning one now, sometimes driven by interest but often driven by what your organization supports or works with. Regardless of your experience in the world of data science, you will have constantly faced the challenge and frustration of keeping pace with new and upcoming tools and software.
As this field grows and new tools and technologies keep emerging – and the rise of Big Data hasn't helped matters – one must weigh the effort of mastering yet another language or tool against that of mastering new machine learning theory. Let's take a look around and see what's out there and what each tool is good for.

SAS – Statistical Analysis Software

SAS is one of the most common tools out there for data processing and model development. When analytics functions started emerging in the financial services sector a couple of decades ago, SAS became the common choice because of its simplicity and the wealth of support and documentation. SAS comes in handy both for step-by-step data processing and for automated scripting. All is well, except that SAS wasn't and isn't cheap. SAS also has limited visualization capabilities and almost no support for parallelization.

R

R is perhaps the first choice of most data scientists and modelers, because R was designed for this very purpose. R has robust support for machine learning algorithms, and its library base keeps growing every day. There is hardly a discipline – be it biotechnology or geospatial processing – for which a ready-made package is not available in R. R is also fairly easy to learn and has good tutorials and support available, though being a free product, support often comes in the form of forums and examples rather than documentation. R is also great for data and plot visualization, which is almost always necessary for data analysis.

Python

Python isn't new, per se, but Python for analytics is a recent phenomenon. Another free language, Python has great capabilities for general-purpose programming. Python works well for web scraping, text processing, file manipulation, and simple or complex visualizations. Compared to R, Python is advancing – but not yet there – in dealing with structured data and analytical models. Python also doesn't support data visualization in as much detail.

SQL

SQL – whether MySQL or another dialect – isn't generally considered a data analysis language since it has no support for model training, but SQL is great for data manipulation and summarization. SQL is also fairly easy to learn and is often used as a pre-processing tool for other model development environments.

And more

Of course, there are many more tools for general-purpose data analysis, let alone specialized tools for specific applications. Compiled languages like C++ and Java are still the go-to choices when system integration and production run-time speed are important. Other high-level languages like Ruby and Perl also support data handling. MATLAB comes in handy for data processing, visualization, and optimization, but perhaps not for model development. By the way, when I say a tool doesn't readily support model development, I mean that the tool has no native support for training, say, a logistic regression model. That doesn't prevent anyone from coding the gradient descent of the logistic cost function manually.
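To make that last point concrete, here is a minimal R sketch, on simulated data, of training a logistic regression by gradient descent on the logistic cost function; nothing here is tied to any particular tool or dataset discussed above.
# Toy sketch: logistic regression fit by gradient descent on the logistic
# (cross-entropy) cost, using simulated data
set.seed(1)
n <- 500
x <- cbind(1, rnorm(n), rnorm(n))            # design matrix with an intercept column
true_beta <- c(-0.5, 1.2, -0.8)
y <- rbinom(n, 1, plogis(x %*% true_beta))   # simulated 0/1 outcomes
sigmoid <- function(z) 1 / (1 + exp(-z))
beta  <- rep(0, ncol(x))                     # start at zero
alpha <- 0.1                                 # learning rate
for (i in 1:5000) {
  p    <- sigmoid(x %*% beta)
  grad <- t(x) %*% (p - y) / n               # gradient of the average log-loss
  beta <- beta - alpha * grad
}
# Compare the hand-rolled fit with R's built-in glm()
cbind(gradient_descent = as.vector(beta),
      glm_fit = coef(glm(y ~ x[, -1], family = binomial)))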

Some words on Big Data Tools

While general-purpose analysis tools occupy most of our workdays, none of us should be immune to market trends and the future of analytics. Hadoop kicked off the Big Data revolution, but thankfully, over time many technologies have come along to abstract away the complexity of manually writing mappers and reducers and to provide a general wrapper for working with Big Data. Pig, Hive, and Google BigQuery provide SQL-like environments for handling large tables, while Spark provides general-purpose data processing and analytic modeling functionality. Storm is currently considered most suitable for handling streaming data, and MongoDB, Vertica, and Couchbase provide advanced data storage solutions.

Confused?

Criterion            | SAS        | SQL                                                  | R             | Python
Cost                 | Paid       | MySQL paid, but free variants like SQLite available  | Free          | Free
Learning Difficulty  | Low-Medium | Low                                                  | Medium-High   | Medium-High
Data Manipulation    | High       | Medium                                               | High          | High
Analytic Modeling    | High       | Low                                                  | High          | Medium
Graphical Capability | Low        | Low                                                  | High          | Medium
Text Processing      | Low        | Low                                                  | Medium        | High
Big Data             | Medium     | Low                                                  | Low           | Medium
Common Usage         | High       | Low                                                  | High          | Medium
If you feel confused by this laundry list of tools, I understand. Every situation is different, of course, but R and Python are both high-level languages and generally simple to pick up (yes, they take time to master), so you should be comfortable working in either. If you are looking for a first language to pick up for machine learning, I recommend R. You can then foray into Python if your work goes beyond structured data. Otherwise, invest in learning open-source technologies and tools, since the world is generally moving away from enterprise software products.

Assessing clustering tendency: A vital issue - Unsupervised Machine Learning

Clustering algorithms, including partitioning methods (K-means, PAM, CLARA and FANNY) and hierarchical clustering, are used to split the dataset into groups or clusters of similar objects.
Before applying any clustering method on the dataset, a natural question is:
Does the dataset contain any inherent clusters?
A big issue in unsupervised machine learning is that clustering methods will return clusters even if the data does not contain any. In other words, if you blindly apply a cluster analysis to a dataset, it will divide the data into clusters, because that is what it is supposed to do.
Therefore, before choosing a clustering approach, the analyst has to decide whether the dataset contains meaningful clusters (i.e., nonrandom structure) and, if so, how many. This process is referred to as assessing clustering tendency, or the feasibility of the cluster analysis.

In this chapter:
  • We describe why we should evaluate the clustering tendency (i.e., clusterability) before applying any cluster analysis on a dataset.
  • We describe statistical and visual methods for assessing the clustering tendency.
  • R lab sections containing many examples are also provided for computing clustering tendency and visualizing clusters.

1 Required packages

The following R packages are required in this chapter:
  • factoextra for data visualization
  • clustertend for assessing clustering tendency
  • seriation for visual assessment of clustering tendency
  1. factoextra can be installed as follows:
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")
 
  2. Install clustertend and seriation:
install.packages("clustertend")
install.packages("seriation")
 
  3. Load the required packages:
library(factoextra)
library(clustertend)
library(seriation)
 

2 Data preparation

We’ll use two datasets: the built-in R dataset faithful and a simulated dataset.

2.1 faithful dataset

The faithful dataset contains the waiting time between eruptions and the duration of each eruption for the Old Faithful geyser in Yellowstone National Park (Wyoming, USA).
# Load the data
data("faithful")
df <- faithful
head(df)
  
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
  
The data can be illustrated using the ggplot2 package as follows:
library("ggplot2")
ggplot(df, aes(x=eruptions, y=waiting)) +
  geom_point() +  # Scatter plot
  geom_density2d() # Add 2d density estimation
  
Clustering tendency - R data visualization

2.2 Random uniformly distributed dataset

The R code below generates a random uniform dataset with the same dimensions as the faithful dataset. The function runif(n, min, max) generates n values uniformly distributed on the interval from min to max.
# Generate random dataset
set.seed(123)
n <- nrow(df)
random_df <- data.frame(
  x = runif(nrow(df), min(df$eruptions), max(df$eruptions)),
  y = runif(nrow(df), min(df$waiting), max(df$waiting)))
# Plot the data
ggplot(random_df, aes(x, y)) + geom_point()
  
Clustering tendency - R data visualization

Note that, for a given real dataset, random uniform data can be generated with a single function call as follows:
random_df <- apply(df, 2,
                   function(x, n) runif(n, min(x), max(x)), n)
   

3 Why assess clustering tendency?

As shown above, we know that the faithful dataset contains 2 real clusters, whereas the randomly generated dataset does not contain any meaningful clusters.
The R code below computes k-means clustering and hierarchical clustering on the two datasets. The functions fviz_cluster() and fviz_dend() [in factoextra] will be used to visualize the results.
library(factoextra)
set.seed(123)
# K-means on faithful dataset
km.res1 <- kmeans(df, 2)
fviz_cluster(list(data = df, cluster = km.res1$cluster),
             frame.type = "norm", geom = "point", stand = FALSE)
 
Clustering tendency - R data visualization
# K-means on the random dataset
km.res2 <- kmeans(random_df, 2)
fviz_cluster(list(data = random_df, cluster = km.res2$cluster),
             frame.type = "norm", geom = "point", stand = FALSE)
 
Clustering tendency - R data visualization
# Hierarchical clustering on the random dataset
fviz_dend(hclust(dist(random_df)), k = 2,  cex = 0.5)
 
Clustering tendency - R data visualization
It can be seen that the k-means algorithm and hierarchical clustering impose a classification on the random uniformly distributed dataset even though no meaningful clusters are present in it.
Clustering tendency assessment methods are used to avoid this issue.

4 Methods for assessing clustering tendency

Clustering tendency assessment determines whether a given dataset contains meaningful clusters (i.e., non-random structure).
In this section, we'll describe two methods for determining the clustering tendency: i) a statistical method (the Hopkins statistic) and ii) a visual method (the Visual Assessment of cluster Tendency (VAT) algorithm).

4.1 Hopkins statistic

The Hopkins statistic is used to assess the clustering tendency of a dataset by measuring the probability that the dataset was generated by a uniform data distribution. In other words, it tests the spatial randomness of the data.

4.1.1 Algorithm

Let D be a real dataset. The Hopkins statistic can be calculated as follows:

  1. Sample uniformly $n$ points ($p_1, \ldots, p_n$) from D.
  2. For each point $p_i \in D$, find its nearest neighbor $p_j$; then compute the distance between $p_i$ and $p_j$ and denote it $x_i = dist(p_i, p_j)$.
  3. Generate a simulated dataset ($random_D$) drawn from a random uniform distribution with $n$ points ($q_1, \ldots, q_n$) and the same variation as the original real dataset D.
  4. For each point $q_i \in random_D$, find its nearest neighbor $q_j$ in D; then compute the distance between $q_i$ and $q_j$ and denote it $y_i = dist(q_i, q_j)$.
  5. Calculate the Hopkins statistic (H) as the sum of the nearest-neighbor distances for the simulated dataset divided by the sum of the nearest-neighbor distances for the real and the simulated datasets.
The formula is defined as follows:

$$H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$$

A value of H around 0.5 means that $\sum_{i=1}^{n} y_i$ and $\sum_{i=1}^{n} x_i$ are close to each other, and thus the data D is approximately uniformly distributed.
The null and alternative hypotheses are defined as follows:
  • Null hypothesis: the dataset D is uniformly distributed (i.e., no meaningful clusters)
  • Alternative hypothesis: the dataset D is not uniformly distributed (i.e., it contains meaningful clusters)
If the value of the Hopkins statistic is close to zero, we can reject the null hypothesis and conclude that the dataset D contains significant, clusterable structure.
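To make the five steps above concrete, here is a rough from-scratch sketch in base R. It follows the formula exactly as written above; note that published implementations differ in which sum goes in the numerator, so the value returned by a packaged function need not match this sketch numerically.
# From-scratch sketch of the Hopkins statistic following the five steps above.
# Caution: some implementations place sum(x) in the numerator instead, in which
# case they effectively report 1 - H relative to this sketch.
hopkins_sketch <- function(data, m = 10) {
  d <- as.matrix(data)
  nn_dist <- function(point, ref) min(sqrt(colSums((t(ref) - point)^2)))
  idx <- sample(nrow(d), m)                    # 1. sample m real points from D
  x <- sapply(idx, function(i)                 # 2. nearest-neighbor distances within D
    nn_dist(d[i, ], d[-i, , drop = FALSE]))
  q <- sapply(seq_len(ncol(d)), function(j)    # 3. m uniform points over each variable's range
    runif(m, min(d[, j]), max(d[, j])))
  y <- sapply(seq_len(m), function(i)          # 4. their nearest-neighbor distances to D
    nn_dist(q[i, ], d))
  sum(y) / (sum(x) + sum(y))                   # 5. H as defined above
}
hopkins_sketch(faithful, m = 50)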

4.1.2 R function for computing Hopkins statistic

The function hopkins() [in clustertend package] can be used to statistically evaluate clustering tendency in R. The simplified format is:
hopkins(data, n, byrow = F, header = F)
   

  • data: a data frame or matrix
  • n: the number of points to be selected from the data
  • byrow: logical value. If FALSE (the default), the variables are taken by columns; otherwise, the variables are taken by rows
  • header: logical. If FALSE (the default), the first column (or row) will be deleted in the calculation
library(clustertend)
# Compute Hopkins statistic for faithful dataset
set.seed(123)
hopkins(faithful, n = nrow(faithful)-1)
   
## $H
## [1] 0.1588201
   
# Compute Hopkins statistic for a random dataset
set.seed(123)
hopkins(random_df, n = nrow(random_df)-1)
   
## $H
## [1] 0.5388899
   
It can be seen that the faithful dataset is highly clusterable (H = 0.15, far below the threshold of 0.5). However, the random_df dataset is not clusterable (H = 0.53).

4.2 VAT: Visual Assessment of cluster Tendency

The Visual Assessment of cluster Tendency (VAT) approach was originally described by Bezdek and Hathaway (2002). It can be used to visually inspect the clustering tendency of a dataset.

4.2.1 VAT Algorithm

The VAT algorithm is as follows:

  1. Compute the dissimilarity matrix (DM) between the objects in the dataset using the Euclidean distance measure.
  2. Reorder the DM so that similar objects are close to one another. This process creates an ordered dissimilarity matrix (ODM).
  3. Display the ODM as an ordered dissimilarity image (ODI), which is the visual output of VAT.
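A rough base-R sketch of these three steps is shown below; it uses a hierarchical-clustering order as a simple stand-in for VAT's own reordering heuristic, which the dissplot() call in the next section handles properly.
# Rough sketch of the three VAT steps on the faithful data (hclust ordering is a
# stand-in for the VAT reordering heuristic, not the exact VAT algorithm)
d   <- dist(scale(faithful))                     # 1. dissimilarity matrix (DM)
ord <- hclust(d)$order                           # 2. an ordering placing similar objects together
odm <- as.matrix(d)[ord, ord]                    #    ordered dissimilarity matrix (ODM)
image(odm, col = gray.colors(50), axes = FALSE)  # 3. ordered dissimilarity image (ODI)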

4.2.2 R functions for VAT

We start by scaling the data using the function scale(). Next, we compute the dissimilarity matrix between observations using the function dist(). Finally, the function dissplot() [in the seriation package] is used to display an ordered dissimilarity image.
The R code below applies the VAT approach to the faithful dataset:
library("seriation")
# faithful data: ordered dissimilarity image
df_scaled <- scale(faithful)
df_dist <- dist(df_scaled) 
dissplot(df_dist)
   
Clustering tendency - R data visualization
The gray level is proportional to the value of the dissimilarity between observations: pure black if $dist(x_i, x_j) = 0$ and pure white if $dist(x_i, x_j) = 1$. Objects belonging to the same cluster are displayed in consecutive order.
The VAT detects the clustering tendency in a visual form by counting the number of square shaped dark blocks along the diagonal in a VAT image.
The figure above suggests two clusters represented by two well-formed black blocks.
The same analysis can be done with the random dataset:
# Random data: ordered dissimilarity image
random_df_scaled <- scale(random_df)
random_df_dist <- dist(random_df_scaled) 
dissplot(random_df_dist)
   
Clustering tendency - R data visualization
It can be seen that the random_df dataset doesn’t contain any evident clusters.
Now, we can perform k-means on faithful dataset and add cluster labels on the dissimilarity plot:
set.seed(123)
km.res <- kmeans(scale(faithful), 2)
dissplot(df_dist, labels = km.res$cluster)
   
Clustering tendency - R data visualization
After showing that the data is clusterable, the next step is to determine the number of optimal clusters in the data. This will be described in the next chapter.

5 A single function for Hopkins statistic and VAT

The function get_clust_tendency() [in the factoextra package] can be used to compute the Hopkins statistic and also provides an ordered dissimilarity image drawn with ggplot2, in a single function call. The ordering of the dissimilarity matrix is done using hierarchical clustering.
# Cluster tendency
clustend <- get_clust_tendency(scale(faithful), 100)
 
Clustering tendency - R data visualization
# Hopkins statistic
clustend$hopkins_stat
 
## [1] 0.1482683
 
# Customize the plot
clustend$plot + 
  scale_fill_gradient(low = "steelblue", high = "white")
 
Clustering tendency - R data visualization

6 Infos

This analysis has been performed using R software (ver. 3.2.1)

Determining the optimal number of clusters: 3 must-know methods - Unsupervised Machine Learning

The first step in clustering analysis is to assess whether the dataset is clusterable. This has been described in a chapter entitled Assessing Clustering Tendency.
Partitioning methods, such as k-means clustering, also require the user to specify the number of clusters to be generated.
One fundamental question is: If the data is clusterable, then how to choose the right number of expected clusters (k)?
Unfortunately, there is no definitive answer to this question. The optimal clustering is somewhat subjective and depends on the method used for measuring similarities and the parameters used for partitioning.
A simple and popular solution consists of inspecting the dendrogram produced using hierarchical clustering to see if it suggests a particular number of clusters. Unfortunately this approach is, again, subjective.
In this article, we'll describe different methods for determining the optimal number of clusters for k-means, PAM, and hierarchical clustering. These methods include direct methods and statistical testing methods.

  • Direct methods consist of optimizing a criterion, such as the within-cluster sums of squares or the average silhouette. The corresponding methods are named the elbow and silhouette methods, respectively.
  • Testing methods consist of comparing evidence against a null hypothesis. An example is the gap statistic.
In addition to the elbow, silhouette, and gap statistic methods, more than thirty other indices and methods have been published for identifying the optimal number of clusters. We'll provide R code for computing these indices in order to decide on the best number of clusters using the "majority rule".
For each of these methods:
  • We’ll describe the basic idea, the algorithm and the key mathematical concept
  • We'll provide easy-to-use R code with many examples for determining the optimal number of clusters and visualizing the output

1 Required packages

The following packages will be used:
  • cluster for computing pam and for analyzing cluster silhouettes
  • factoextra for visualizing clusters using ggplot2 plotting system
  • NbClust for finding the optimal number of clusters
Install the factoextra package as follows:
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")
 
The remaining packages can be installed using the code below:
pkgs <- c("cluster",  "NbClust")
install.packages(pkgs)
 
Load packages:
library(factoextra)
library(cluster)
library(NbClust)
 

2 Data preparation

The data set iris is used. We start by excluding the species column and scaling the data using the function scale():
# Load the data
data(iris)
head(iris)
 
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
 
# Remove species column (5) and scale the data
iris.scaled <- scale(iris[, -5])
 
This iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

3 Example of partitioning method results

The functions kmeans() [in the stats package] and pam() [in the cluster package] are described in this section. We'll split the data into 3 clusters as follows:
# K-means clustering
set.seed(123)
km.res <- kmeans(iris.scaled, 3, nstart = 25)
# k-means group number of each observation
km.res$cluster
 
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
##  [71] 2 3 3 3 3 2 2 2 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 2 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3
 
# Visualize k-means clusters
fviz_cluster(km.res, data = iris.scaled, geom = "point",
             stand = FALSE, frame.type = "norm")
 
Optimal number of clusters - R data visualization
# PAM clustering
library("cluster")
pam.res <- pam(iris.scaled, 3)
pam.res$cluster
 
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
##  [71] 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 2 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3
 
# Visualize pam clusters
fviz_cluster(pam.res, stand = FALSE, geom = "point",
             frame.type = "norm")
 
Optimal number of clusters - R data visualization
Read more about partitioning methods: Partitioning clustering

4 Example of hierarchical clustering results

The built-in R function hclust() is used:
# Compute the pairwise distance matrix
dist.res <- dist(iris.scaled, method = "euclidean")
# Hierarchical clustering results
hc <- hclust(dist.res, method = "complete")
# Visualization of hclust
plot(hc, labels = FALSE, hang = -1)
# Add rectangle around 3 groups
rect.hclust(hc, k = 3, border = 2:4) 
 
Optimal number of clusters - R data visualization
# Cut into 3 groups
hc.cut <- cutree(hc, k = 3)
head(hc.cut, 20)
 
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 
Read more about hierarchical clustering: Hierarchical clustering

5 Three popular methods for determining the optimal number of clusters

In this section we describe the three most popular methods: i) the elbow method, ii) the silhouette method, and iii) the gap statistic.

5.1 Elbow method

5.1.1 Concept

Recall that the basic idea behind partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation (known as the total within-cluster variation or total within-cluster sum of squares) is minimized:

$$\text{minimize}\left(\sum_{k=1}^{K} W(C_k)\right),$$

where $C_k$ is the $k$th cluster and $W(C_k)$ is the within-cluster variation (for k-means, the sum of squared distances between the points in $C_k$ and its centroid).
The total within-cluster sum of squares (wss) measures the compactness of the clustering, and we want it to be as small as possible.

5.1.2 Algorithm

The optimal number of clusters can be determined as follows:

  1. Compute the clustering algorithm (e.g., k-means) for different values of k, for instance by varying k from 1 to 10 clusters.
  2. For each k, calculate the total within-cluster sum of squares (wss).
  3. Plot the curve of wss according to the number of clusters k.
  4. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.

5.1.3 R codes

5.1.3.1 Elbow method for k-means clustering

set.seed(123)
# Compute and plot wss for k = 2 to k = 15
k.max <- 15 # Maximal number of clusters
data <- iris.scaled
wss <- sapply(1:k.max, 
        function(k){kmeans(data, k, nstart=10 )$tot.withinss})
plot(1:k.max, wss,
       type="b", pch = 19, frame = FALSE, 
       xlab="Number of clusters K",
       ylab="Total within-clusters sum of squares")
abline(v = 3, lty =2)
    
Optimal number of clusters - R data visualization
The elbow method suggests a 3-cluster solution.
The elbow method is implemented in the factoextra package and can easily be computed using the function fviz_nbclust(), whose format is:
fviz_nbclust(x, FUNcluster, method = c("silhouette", "wss"))
    

  • x: numeric matrix or data frame
  • FUNcluster: a partitioning function such as kmeans, pam, clara etc
  • method: the method to be used for determining the optimal number of clusters.
The R code below computes the elbow method for kmeans():
fviz_nbclust(iris.scaled, kmeans, method = "wss") +
    geom_vline(xintercept = 3, linetype = 2)
    
Optimal number of clusters - R data visualization
Three clusters are suggested.

5.1.3.2 Elbow method for PAM clustering

It's possible to use the function fviz_nbclust() as follows:
fviz_nbclust(iris.scaled, pam, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)
    
Optimal number of clusters - R data visualization
Three clusters are suggested.

5.1.3.3 Elbow method for hierarchical clustering

We'll use the helper function hcut() [in the factoextra package], which computes the hierarchical clustering (HC) algorithm and cuts the dendrogram into k clusters:
fviz_nbclust(iris.scaled, hcut, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)
    
Optimal number of clusters - R data visualization
Three clusters are suggested.
Note that the elbow method is sometimes ambiguous. An alternative is the average silhouette method (Kaufman and Rousseeuw, 1990), which can also be used with any clustering approach.

5.2 Average silhouette method

5.2.1 Concept

The average silhouette approach will be described comprehensively in the chapter on cluster validation statistics. Briefly, it measures the quality of a clustering; that is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering.
The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k (Kaufman and Rousseeuw, 1990).

5.2.2 Algorithm

The algorithm is similar to the elbow method and can be computed as follows:

  1. Compute the clustering algorithm (e.g., k-means) for different values of k, for instance by varying k from 1 to 10 clusters.
  2. For each k, calculate the average silhouette of the observations (avg.sil).
  3. Plot the curve of avg.sil according to the number of clusters k.
  4. The location of the maximum is considered the appropriate number of clusters.

5.2.3 R codes

The function silhouette() [in cluster package] is used to compute the average silhouette width.

5.2.3.1 Average silhouette method for k-means clustering

The R code below determines the optimal number of clusters k for k-means clustering:
library(cluster)
k.max <- 15
data <- iris.scaled
sil <- rep(0, k.max)
# Compute the average silhouette width for 
# k = 2 to k = 15
for(i in 2:k.max){
  km.res <- kmeans(data, centers = i, nstart = 25)
  ss <- silhouette(km.res$cluster, dist(data))
  sil[i] <- mean(ss[, 3])
}
# Plot the  average silhouette width
plot(1:k.max, sil, type = "b", pch = 19, 
     frame = FALSE, xlab = "Number of clusters k")
abline(v = which.max(sil), lty = 2)
    
Optimal number of clusters - R data visualization
The function fviz_nbclust() [in the factoextra package] can also be used. It just requires the cluster package to be installed:
require(cluster)
fviz_nbclust(iris.scaled, kmeans, method = "silhouette")
    
Optimal number of clusters - R data visualization
Two clusters are suggested.

5.2.3.2 Average silhouette method for PAM clustering

require(cluster)
fviz_nbclust(iris.scaled, pam, method = "silhouette")
    
Optimal number of clusters - R data visualization
Two clusters are suggested.

5.2.3.3 Average silhouette method for hierarchical clustering

require(cluster)
fviz_nbclust(iris.scaled, hcut, method = "silhouette",
             hc_method = "complete")
    
Optimal number of clusters - R data visualization
Three clusters are suggested.

5.3 Conclusions about elbow and silhouette methods

  • A three-cluster solution is suggested by k-means, PAM, and hierarchical clustering in combination with the elbow method.
  • The average silhouette method gives a two-cluster solution with the k-means and PAM algorithms, while combining hierarchical clustering with the silhouette method returns 3 clusters.
According to these observations, it's possible to define k = 3 as the optimal number of clusters in the data.
The disadvantage of the elbow and average silhouette methods is that they measure only a global clustering characteristic. A more sophisticated approach is the gap statistic, which provides a statistical procedure to formalize the elbow/silhouette heuristic in order to estimate the optimal number of clusters.

5.4 Gap statistic method

5.4.1 Concept

The gap statistic was published by R. Tibshirani, G. Walther, and T. Hastie (Stanford University, 2001). The approach can be applied to any clustering method (k-means clustering, hierarchical clustering, …).
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data, i.e., a distribution with no obvious clustering.
Recall that the total within-cluster variation for a given number of clusters k is the total within-cluster sum of squares, $W_k$.
The reference dataset is generated using Monte Carlo simulations of the sampling process. That is, for each variable ($x_i$) in the dataset we compute its range $[\min(x_i), \max(x_i)]$ and generate values for the n points uniformly over the interval from min to max.
Note that the function runif(n, min, max) can be used to generate such a random uniform distribution.
For the observed data and the reference data, the total intra-cluster variation is computed for different values of k. The gap statistic for a given k is defined as follows:

$$\text{Gap}_n(k) = E_n^*\{\log(W_k)\} - \log(W_k)$$

where $E_n^*$ denotes the expectation under a sample of size n from the reference distribution. $E_n^*$ is estimated via bootstrapping by generating B copies of the reference dataset and computing the average $\log(W_k^*)$.
Note that the logarithm of the $W_k$ values is used, as they can be quite large.
The gap statistic measures the deviation of the observed $W_k$ value from its expected value under the null hypothesis.
The estimate of the optimal number of clusters $\hat{k}$ is the value that maximizes $\text{Gap}_n(k)$ (i.e., that yields the largest gap statistic). This means that the clustering structure is far away from a uniform distribution of points.
Note that using B = 500 gives quite precise results, so that the gap plot is basically unchanged after another run.
The standard deviation ($sd_k$) of $\log(W_k^*)$ is also computed in order to define the standard error ($s_k$) of the simulation as follows:

$$s_k = sd_k \times \sqrt{1 + 1/B}$$

Finally, a more robust approach is to choose the optimal number of clusters as the smallest k such that:

$$\text{Gap}(k) \geq \text{Gap}(k+1) - s_{k+1}$$

That is, we choose the smallest value of k such that the gap statistic is within one standard error of the gap at k+1.

5.4.2 Algorithm

The algorithm involves the following steps (see the original paper on the gap statistic):

  1. Cluster the observed data, varying the number of clusters from k = 1, …, $k_{max}$, and compute the corresponding $W_k$.
  2. Generate B reference datasets and cluster each of them with a varying number of clusters k = 1, …, $k_{max}$. Compute the estimated gap statistic $\text{Gap}(k) = \frac{1}{B}\sum_{b=1}^{B} \log(W^*_{kb}) - \log(W_k)$.
  3. Let $\bar{w} = (1/B)\sum_b \log(W^*_{kb})$; compute the standard deviation $sd_k = \sqrt{(1/B)\sum_b \left(\log(W^*_{kb}) - \bar{w}\right)^2}$ and define $s_k = sd_k \times \sqrt{1 + 1/B}$.
  4. Choose the number of clusters as the smallest k such that $\text{Gap}(k) \geq \text{Gap}(k+1) - s_{k+1}$.

5.4.3 R codes

5.4.3.1 R function for computing the gap statistic

The R function clusGap() [in cluster package ] can be used to estimate the number of clusters in the data by applying the gap statistic.
A simplified format is:
clusGap(x, FUNcluster, K.max, B = 100, verbose = TRUE, ...)
    

  • x: numeric matrix or data frame
  • FUNcluster: a function (e.g., kmeans, pam, …) which accepts i) a data matrix like x as its first argument and ii) the desired number of clusters (k >= 2) as its second argument, and returns a list containing a component named cluster, a vector of length n = nrow(x) of integers in 1:k determining the clustering or grouping of the n observations
  • K.max: the maximum number of clusters to consider, must be at least two.
  • B: the number of Monte Carlo (“bootstrap”) samples.
  • verbose: if TRUE, the computing progression is shown.
  • ...: further arguments for FUNcluster(); see the kmeans example below
The clusGap() function returns an object of class "clusGap" whose main component is Tab, a matrix with K.max rows and 4 columns named "logW", "E.logW", "gap" and "SE.sim". Recall that gap = E.logW - logW, and SE.sim is the standard error of gap.

5.4.3.2 Gap statistic for k-means clustering

The R code below shows an example using the clusGap() function.
We'll use B = 50 to keep the function speedy. Note that it's recommended to use B = 500 for your own analysis.
The output of the clusGap() function can be visualized using the function fviz_gap_stat() [in factoextra].
# Compute gap statistic
library(cluster)
set.seed(123)
gap_stat <- clusGap(iris.scaled, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)
# Print the result
print(gap_stat, method = "firstmax")
    
## Clustering Gap statistic ["clusGap"].
## B=50 simulated reference sets, k = 1..10
##  --> Number of clusters (method 'firstmax'): 3
##           logW   E.logW       gap     SE.sim
##  [1,] 4.534565 4.754595 0.2200304 0.02504585
##  [2,] 4.021316 4.489687 0.4683711 0.02742112
##  [3,] 3.806577 4.295715 0.4891381 0.02384746
##  [4,] 3.699263 4.143675 0.4444115 0.02093871
##  [5,] 3.589284 4.052262 0.4629781 0.02036366
##  [6,] 3.519726 3.972254 0.4525278 0.02049566
##  [7,] 3.448288 3.905945 0.4576568 0.02106987
##  [8,] 3.398210 3.850807 0.4525967 0.01969193
##  [9,] 3.334279 3.802315 0.4680368 0.01905974
## [10,] 3.250246 3.759661 0.5094149 0.01928183
    
# Base plot of gap statistic
plot(gap_stat, frame = FALSE, xlab = "Number of clusters k")
abline(v = 3, lty = 2)
    
Optimal number of clusters - R data visualization
# Use factoextra
fviz_gap_stat(gap_stat)
    
Optimal number of clusters - R data visualization
In our example, the algorithm suggests k = 3.
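As a quick check against the printed table above: the gap value rises from 0.220 (k = 1) to 0.468 (k = 2) and 0.489 (k = 3), then falls to 0.444 (k = 4), so k = 3 is the first local maximum of the gap statistic, which is exactly what the "firstmax" criterion reports.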
The optimal number of clusters, k, is computed using the "firstmax" method (see ?cluster::maxSE). The criterion proposed by Tibshirani et al. (2001) can be used as follows:
# Print
print(gap_stat, method = "Tibs2001SEmax")
# Plot
fviz_gap_stat(gap_stat, 
              maxSE = list(method = "Tibs2001SEmax"))
# Relax the gap test to be within two standard deviations
fviz_gap_stat(gap_stat, 
          maxSE = list(method = "Tibs2001SEmax", SE.factor = 2))
    

5.4.3.3 Gap statistic for PAM clustering

We don't need the argument nstart, which is specific to the kmeans() function.
# Compute gap statistic
set.seed(123)
gap_stat <- clusGap(iris.scaled, FUN = pam, K.max = 10, B = 50)
# Plot gap statistic
fviz_gap_stat(gap_stat)
    
Optimal number of clusters - R data visualization
A three-cluster solution is suggested.

5.4.3.4 Gap statistic for hierarchical clustering

# Compute gap statistic
set.seed(123)
gap_stat <- clusGap(iris.scaled, FUN = hcut, K.max = 10, B = 50)
# Plot gap statistic
fviz_gap_stat(gap_stat)
    
Optimal number of clusters - R data visualization
A three-cluster solution is suggested.

6 NbClust: A Package providing 30 indices for determining the best number of clusters

6.1 Overview of NbClust package

As mentioned in the introduction of this article, many indices have been proposed in the literature for determining the optimal number of clusters in a partitioning of a data set during the clustering process.
The NbClust package, published by Charrad et al. (2014), provides 30 indices for determining the relevant number of clusters and proposes to the user the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.
An important advantage of NbClust is that the user can compute multiple indices simultaneously and determine the number of clusters in a single function call.
The indices provided in the NbClust package include the gap statistic, the silhouette method, and 28 other indices described comprehensively in the original paper of Charrad et al. (2014).

6.2 NbClust R function

The simplified format of the function NbClust() is:
NbClust(data = NULL, diss = NULL, distance = "euclidean",
        min.nc = 2, max.nc = 15, method = NULL, index = "all")
  

  • data: matrix
  • diss: dissimilarity matrix to be used. By default, diss=NULL, but if it is replaced by a dissimilarity matrix, distance should be “NULL”
  • distance: the distance measure to be used to compute the dissimilarity matrix. Possible values include “euclidean”, “manhattan” or “NULL”.
  • min.nc, max.nc: minimal and maximal number of clusters, respectively
  • method: The cluster analysis method to be used including “ward.D”, “ward.D2”, “single”, “complete”, “average” and more
  • index: the index to be calculated including “silhouette”, “gap” and more.
The value returned by the NbClust() function includes the following elements:
  • All.index: Values of indices for each partition of the dataset obtained with a number of clusters between min.nc and max.nc
  • All.CriticalValues: Critical values of some indices for each partition obtained with a number of clusters between min.nc and max.nc
  • Best.nc: Best number of clusters proposed by each index and the corresponding index value
  • Best.partition: Partition that corresponds to the best number of clusters

6.3 Examples of usage

Note that the user can request indices one by one by setting the argument index to the name of the index of interest, for example index = "gap".
In this case, the NbClust() function displays:
  • the gap statistic values of the partitions obtained with number of clusters varying from min.nc to max.nc ($All.index)
  • the optimal number of clusters ($Best.nc)
  • and the partition corresponding to the best number of clusters ($Best.partition)

6.3.1 Compute only an index of interest

The following example determines the number of clusters using the gap statistic:
library("NbClust")
set.seed(123)
res.nb <- NbClust(iris.scaled, distance = "euclidean",
                  min.nc = 2, max.nc = 10, 
                  method = "complete", index ="gap") 
res.nb # print the results
   
## $All.index
##       2       3       4       5       6       7       8       9      10 
## -0.2899 -0.2303 -0.6915 -0.8606 -1.0506 -1.3223 -1.3303 -1.4759 -1.5551 
## 
## $All.CriticalValues
##       2       3       4       5       6       7       8       9      10 
## -0.0539  0.4694  0.1787  0.2009  0.2848  0.0230  0.1631  0.0988  0.1708 
## 
## $Best.nc
## Number_clusters     Value_Index 
##          3.0000         -0.2303 
## 
## $Best.partition
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 3 3 3 2 3 2 3 2 3 2 2 3 2 3 3 3 3 2 2 2
##  [71] 3 3 3 3 3 3 3 3 3 2 2 2 2 3 3 3 3 2 3 2 2 3 2 2 2 3 3 3 2 2 3 3 3 3 3
## [106] 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3
   
The elements returned by the function NbClust() are accessible using the R code below:
# All gap statistic values
res.nb$All.index
# Best number of clusters
res.nb$Best.nc
# Best partition
res.nb$Best.partition
   

6.3.2 Compute all the 30 indices

The following example computes all 30 indices, in a single function call, to determine the number of clusters and suggest the best clustering scheme to the user. Descriptions of the indices are available in the NbClust documentation (see ?NbClust).
To compute multiple indices simultaneously, the possible values for the argument index are i) "alllong" or ii) "all". The option "alllong" requires more time, as the computation of some indices, such as Gamma, Tau, Gap, and Gplus, is very expensive. The user can avoid computing these four indices by setting the argument index to "all", in which case only 26 indices are calculated.
With the “alllong” option, the output of the NbClust function contains:

  • all validation indices
  • critical values for Duda, Gap, PseudoT2 and Beale indices
  • the number of clusters corresponding to the optimal score for each index
  • the best number of clusters proposed by NbClust according to the majority rule
  • the best partition
The R code below computes NbClust() with index = “all”:
nb <- NbClust(iris.scaled, distance = "euclidean", min.nc = 2,
        max.nc = 10, method = "complete", index ="all")
   
# Print the result
nb
   
It's possible to visualize the result using the function fviz_nbclust() [in factoextra], as follows:
fviz_nbclust(nb) + theme_minimal()
   
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 2 proposed  2 as the best number of clusters
## * 18 proposed  3 as the best number of clusters
## * 3 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * Accoridng to the majority rule, the best number of clusters is  3 .
   
Optimal number of clusters - R data visualization

  • 2 indices proposed 2 as the best number of clusters
  • 18 indices proposed 3 as the best number of clusters
  • 3 indices proposed 10 as the best number of clusters
According to the majority rule, the best number of clusters is 3.

7 Infos

This analysis has been performed using R software (ver. 3.2.1)
  • Charrad M., Ghazzali N., Boiteau V., Niknafs A. (2014). NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, 61(6), 1-36.
  • Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
  • Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of data clusters via the gap statistic. Journal of the Royal Statistical Society B, 63, 411-423.