In today’s data-driven world, businesses and researchers deal with vast amounts of unstructured and unlabeled data. Identifying patterns in this data is crucial for decision-making. For example, e-commerce platforms analyze customer purchase histories to recommend products, financial institutions detect fraudulent transactions by recognizing unusual spending behaviors, and healthcare providers group patients based on symptoms to personalize treatments. These tasks require unsupervised learning techniques, which automatically find patterns in data without predefined labels. One of the most effective methods for this is clustering, which groups similar data points together based on their characteristics.
Clustering in Machine Learning
- What is clustering?
- Clustering is a technique used to group similar data points together in an unlabeled dataset, helping to reveal hidden structures in data.
- Why is clustering important?
- Helps businesses segment customers for targeted marketing.
- Assists in anomaly detection, such as identifying fraudulent activities.
- Enhances recommendation systems by grouping similar users or products.
- Supports medical research by categorizing patients based on symptoms or genetic profiles.
- Common Clustering Techniques:
- K-Means Clustering – Partitions data into a fixed number of clusters based on similarity.
- Hierarchical Clustering – Builds a tree-like structure of clusters, allowing for flexible grouping.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) – Detects clusters based on data density, useful for irregularly shaped groups.
Each of the above techniques has unique advantages and is suited to different types of data and applications. In the following sections, we will explore each method in detail: how it works, its benefits, and its real-world applications.
Comparison of Clustering Techniques
| Feature | K-Means | Hierarchical | DBSCAN |
|---|---|---|---|
| Predefined Number of Clusters Required | ✅ | ❌ | ❌ |
| Handles Outliers Well | ❌ | ❌ | ✅ |
| Works with Non-Spherical Clusters | ❌ | ✅ | ✅ |
| Scalable for Large Datasets | ✅ | ❌ | ⚠️ (Moderate) |
| Provides Hierarchical Structure | ❌ | ✅ | ❌ |
| Detects Arbitrary-Shaped Clusters | ❌ | ✅ | ✅ |
K-means
- The k-means algorithm searches for a pre-determined number of clusters within an unlabeled multidimensional dataset.
- It accomplishes this using a simple conception of what the optimal clustering looks like:
- The “cluster center” is the arithmetic mean of all the points belonging to the cluster.
- Each point is closer to its cluster center than to other cluster centers.
- Those two assumptions are the basis of the k-means model.
- We will soon dive into exactly how the algorithm reaches this solution, but for now let’s take a look at a simple dataset and see the k-means result.
- First, let’s generate a two-dimensional dataset containing four distinct blobs.
- To emphasize that this is an unsupervised algorithm, we will leave the labels out of the visualization.
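A minimal sketch of this demo, assuming scikit-learn's make_blobs and KMeans are available (the sample counts, seeds, and plotting choices are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate a two-dimensional dataset with four distinct blobs.
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Plot the raw points without their labels (unsupervised setting).
plt.scatter(X[:, 0], X[:, 1], s=20)
plt.show()

# Run k-means with the pre-determined number of clusters, k = 4.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Visualize the k-means result: points colored by assigned cluster,
# with the learned cluster centers marked.
plt.scatter(X[:, 0], X[:, 1], c=labels, s=20, cmap="viridis")
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c="red", s=100, marker="X")
plt.show()
```

With well-separated blobs like these, the recovered clusters typically match the blobs that generated the data, and each center sits at the mean of its cluster's points.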
How does k-means work?
- Decide the number of clusters, k.
- The optimal number of clusters k can be found with the elbow method: plot the number of clusters against the WCSS (Within-Cluster Sum of Squared distances), sometimes called inertia, and pick the point where the curve bends (see the sketch after this list).
- Choose k data points as the initial centroids and assign each point to a cluster.
- The cluster centroids will then be calculated.
- Update the centroids iteratively until they stop changing, i.e., the previous and updated centroids are the same. In each iteration:
- The sum of squared distances between the data points and the centroids is calculated first.
- Next, each data point is allocated to the nearest centroid's cluster.
- Finally, the centroids are recalculated by averaging all the data points in each cluster.
- When using k-means, keep these points in mind:
- Normalization is required, since k-means relies on distances.
- Because of the iterative nature of k-means and the random initialization of centroids, the algorithm may become stuck in a local optimum and fail to converge to the global optimum. It is therefore advisable to try several different centroid initializations.
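To illustrate the elbow method mentioned above, here is a minimal sketch assuming scikit-learn; the dataset and the range of candidate k values are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Illustrative dataset: four blobs in two dimensions.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
X = StandardScaler().fit_transform(X)  # k-means is distance-based, so scale first

# Compute WCSS (inertia) for a range of candidate k values.
wcss = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

# The "elbow" (bend) in this curve suggests a reasonable k.
plt.plot(k_values, wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method")
plt.show()
```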
- Graphical representation
STEP 1: Let us pick the number of clusters, i.e., k = 2, to separate the dataset into its appropriate clusters. We will select two random points to function as the cluster centroids.
STEP 2: Now, each data point will be assigned to the nearest of the two centroids. This is accomplished by drawing the median line (the perpendicular bisector) between the two centroids. Consider the following illustration:
STEP 3: The points on the line's left side are close to the blue centroid, while the points on the right side are close to the yellow centroid. The points on the left form a cluster with the blue centroid, whereas the points on the right form a cluster with the yellow centroid.
STEP 4: Repeat the procedure, this time with updated centroids. To choose the new centroids, we determine each cluster's center of gravity, as represented below:
STEP 5: After that, we re-assign each data point to its nearest new centroid, repeating the procedure outlined before (using a median line). Any yellow data point that now falls on the blue side of the median line joins the blue cluster.

STEP 6: Now that reassignment has occurred, we will repeat the previous step of locating new centroids.

STEP 7: We will repeat the procedure outlined above for determining each cluster's center of gravity (the new centroids), as shown below.

STEP 8: Like the previous stages, we will draw the median line and reassign the data points after locating the new centroids.

STEP 9: We will finally group points depending on their distance from the median line, ensuring that two distinct groups are established and that no dissimilar points are included in a single group.

The final clusters are as follows:

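To make the walkthrough above concrete, here is a compact from-scratch sketch of the same iteration in NumPy (the data, k, and iteration cap are illustrative, and the sketch assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means: random init, assign to nearest centroid, update centroids."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (squared Euclidean distance).
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of the points assigned to it
        # (assumes every cluster keeps at least one point).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Illustrative usage on two well-separated groups of random points.
X = np.vstack([np.random.randn(50, 2) + [0, 0], np.random.randn(50, 2) + [5, 5]])
labels, centroids = kmeans(X, k=2)
```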
- Advantages and Disadvantages
- Advantages
- It works faster than Hierarchical clustering when we have a high number of variables.
- When compared to Hierarchical clustering, K-means produces tighter clusters.
- Disadvantages
- It relies on distances, so scaling is required.
- It is not well suited to clusters with complicated, non-convex geometric shapes.
- K-means will also fail if the clusters have irregular shapes or if the data contains noise.
Hierarchical Clustering
Hierarchical clustering is a method of grouping data into a tree-like structure, making it easier to visualize relationships between clusters. It does not require specifying the number of clusters beforehand, unlike K-means. This technique is categorized into two types:
1. Agglomerative Clustering (Bottom-Up Approach)
- Starts with each data point as its own cluster.
- Iteratively merges the closest clusters based on a similarity measure (e.g., Euclidean distance).
- Continues merging until only one cluster remains or a stopping criterion is met.
- Common linkage methods used (a short sketch comparing them appears after this list):
- Single linkage – Merges clusters based on the shortest distance between points.
- Complete linkage – Merges clusters based on the farthest distance between points.
- Average linkage – Uses the average distance between all points in two clusters.
- Example: Used in gene expression analysis, social network analysis, and document clustering.
2. Divisive Clustering (Top-Down Approach)
- Starts with the entire dataset as a single cluster.
- Recursively splits clusters into smaller sub-clusters.
- The process continues until each data point is in its own cluster or a stopping criterion is reached.
- It is more computationally expensive than agglomerative clustering.
- Example: Useful in text classification, anomaly detection, and topic modeling.
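To make the linkage options concrete, here is a minimal sketch assuming scikit-learn's AgglomerativeClustering; the dataset and the distance threshold used to cut the tree are illustrative:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Illustrative two-dimensional dataset with three blobs.
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

# Bottom-up merging with each common linkage criterion. Cutting the tree at an
# illustrative distance threshold shows how the linkage choice affects the result.
for linkage in ("single", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0,
                                    linkage=linkage)
    model.fit(X)
    print(f"{linkage} linkage -> {model.n_clusters_} clusters")
```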
Agglomerative clustering
- How does agglomerative clustering work?
- We assign each point to an individual cluster in this technique.
- Suppose there are 4 data points. We will assign each of these points to a cluster and hence will have 4 clusters at the beginning.

- Now you create a proximity matrix, which is simply the matrix of distances between the clusters.
- Next, you will merge the two clusters with the closest distance.
- Again, create a proximity matrix; the number of clusters will now be n-1 for the current step.
- You iterate the above steps until only a single cluster remains.
- How to choose the number of optimal clusters?
- Here, the concept of Dendrogram is used.
- A dendrogram is a tree-like structure that records the sequence of merges or splits.
- We plot the samples on the x-axis and the merge distance on the y-axis.

- The taller a vertical line in the dendrogram, the greater the distance between the clusters it joins.
- Now, we can set a threshold distance and draw a horizontal line at that height (generally, we try to set the threshold so that it cuts the tallest vertical line).

- The number of clusters will be the number of vertical lines which are being intersected by the line drawn using the threshold.
- In the above diagram, the threshold line intersects 2 vertical lines, so the number of clusters here is 2 (see the dendrogram sketch after this list).
- Divisive clustering is the opposite of agglomerative clustering. It starts with a single cluster and separates the cluster until every data point becomes a single cluster.
- Merging clusters is computationally cheaper than splitting them, which is why agglomerative clustering is used more often.
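A minimal sketch of building a dendrogram and cutting it at a threshold, assuming SciPy's hierarchical-clustering utilities; the dataset and the threshold distance are illustrative:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

# Illustrative dataset.
X, _ = make_blobs(n_samples=50, centers=2, random_state=0)

# Agglomerative clustering: 'linkage' records the sequence of merges.
Z = linkage(X, method="average")

# Plot the dendrogram: samples on the x-axis, merge distance on the y-axis.
threshold = 10  # illustrative cut height; in practice read it off the dendrogram
dendrogram(Z)
plt.axhline(y=threshold, linestyle="--")
plt.xlabel("Samples")
plt.ylabel("Distance")
plt.show()

# Cut the tree at the threshold; points in the same branch get the same label.
labels = fcluster(Z, t=threshold, criterion="distance")
print("Number of clusters:", len(set(labels)))
```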
DBSCAN
- DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
- DBSCAN is a density-based clustering algorithm that works on the assumption that clusters are dense regions in space separated by regions of lower density.
- If data points are densely grouped, DBSCAN gathers them into a single cluster.
- It is robust to outliers.
- It also does not require the number of clusters to be specified beforehand, unlike K-Means, where we have to give the number of centroids.
- It requires only two parameters: Epsilon and MinPoints.
- Epsilon is the radius of the circle that is created around each data point to check the density.
- MinPoints is the minimum number of points required inside that circle for a data point to be classified as a core point.
- DBSCAN creates a circle of Epsilon radius around every data point and classifies each one as a Core point, a Border point, or Noise.
- A data point is called a Core point if the circle around it contains at least MinPoints points.
- A data point is called a Border point if the circle around it contains fewer than MinPoints points but at least one of them is a core point.
- A data point that is neither a core point nor a border point is treated as noise.

- Here MinPoints = 3: red points are core points, yellow points are border points, and blue points are noise.
- Euclidean distance is used to locate data points in a space.
- There are two concepts in DBSCAN:
- Reachability:
- It states if a data point can be accessed from another data point either directly or indirectly.
- Connectivity
- It states whether two data points are in the same cluster or not.
- In terms of reachability and connectivity, DBSCAN defines the following notions:
- Directly Density-Reachable
- A point X is directly density-reachable from a point Y w.r.t. epsilon and minPoints if:
- X belongs to the neighborhood of Y, i.e., dist(X, Y) <= epsilon.
- Y is a core point.
- Here, X is directly density-reachable from Y, but the reverse does not necessarily hold.
- Density-Reachable
- A point X is density-reachable from a point Y w.r.t. epsilon and minPoints if there is a chain of points p1, p2, p3, …, pn with p1 = Y and pn = X such that pi+1 is directly density-reachable from pi.

- Here, X is density-reachable from Y with X being directly density-reachable from P2, P2 from P3, and P3 from Y.
- Density-Connected
- A point X is density-connected to a point Y w.r.t. epsilon and minPoints if there exists a point O such that both X and Y are density-reachable from O w.r.t. epsilon and minPoints.

- Effect of Epsilon and MinPoints
- The value of MinPoints should be greater than the number of features. If MinPoints is set to 1, the algorithm will conclude that each data point is its own cluster. A common rule of thumb is to set MinPoints to twice the number of dimensions.
- The value of Epsilon can be chosen from a k-distance graph. If Epsilon is too small, a larger number of clusters will be created and more data points will be considered noise. If it is chosen too large, small clusters will merge into big clusters and information will be lost (see the sketch at the end of this section).
- Though DBSCAN can find clusters of arbitrary shape, it struggles when clusters have widely varying densities, since a single Epsilon and MinPoints setting cannot suit them all.
- Also, as the dimensionality of the data increases, it becomes difficult for DBSCAN to form clusters, and it falls prey to the Curse of Dimensionality.
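A minimal DBSCAN sketch, assuming scikit-learn; the two-moons dataset and the eps value are illustrative (scikit-learn calls MinPoints min_samples and marks noise with the label -1):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Illustrative dataset: two crescent-shaped clusters with some noise.
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)
X = StandardScaler().fit_transform(X)  # DBSCAN is distance-based, so scale first

# Rule of thumb from the text: MinPoints = 2 * number of dimensions.
min_points = 2 * X.shape[1]

# k-distance graph for choosing Epsilon: sort each point's distance to its
# min_points-th nearest neighbor and look for the "knee" when plotted.
nn = NearestNeighbors(n_neighbors=min_points + 1).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])  # plot this curve to read off eps

# Fit DBSCAN; scikit-learn's min_samples plays the role of MinPoints.
db = DBSCAN(eps=0.3, min_samples=min_points)
labels = db.fit_predict(X)

# Label -1 marks noise; every other label is a cluster id.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

# Core points are listed in core_sample_indices_; border points are the
# remaining non-noise points.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True
print("clusters:", n_clusters, "noise points:", n_noise,
      "core points:", int(core_mask.sum()))
```

Because the two moons are non-spherical, this is exactly the kind of dataset where K-means struggles but a density-based method can separate the clusters while flagging stray points as noise.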