Clustering — Finding Natural Groups
Imagine pouring a bag of mixed marbles onto a table — red ones cluster near red ones, blue near blue, without anyone sorting them. Clustering algorithms do the computational equivalent: given a set of unlabeled data points, find a partition into groups such that points within a group are more similar to each other than to points in other groups. The goal sounds intuitive; the challenge is making it rigorous and scalable.
Formalizing Similarity and Distance
Clustering requires a notion of similarity. The most common formal choice is Euclidean distance: for two points a = (a₁, a₂, ..., aₙ) and b = (b₁, b₂, ..., bₙ) in n-dimensional space, dist(a, b) = √[(a₁−b₁)² + (a₂−b₂)² + ... + (aₙ−bₙ)²] Smaller distance means more similar. Other distance metrics — cosine distance for text, Manhattan distance, edit distance for strings — apply in different domains. Given a distance function, a clustering is good if: - Intra-cluster distance (within a group) is small — members are tightly packed. - Inter-cluster distance (between groups) is large — groups are well-separated. The ratio of these is the basis for internal evaluation metrics like the silhouette score.
A centroid is the mean position of all points in a cluster — the arithmetic average of their coordinates. In k-means, centroids are the representatives around which clusters are built. Moving the centroid is how the algorithm iteratively improves its partition.
The k-means Algorithm — a worked trace: Input: dataset X of n points, integer k (number of clusters desired). Step 1 — Initialize: randomly place k centroids c₁, c₂, ..., cₖ in the feature space. Step 2 — Assign: for every point xᵢ, compute its distance to each centroid and assign it to the nearest one. cluster(xᵢ) = argmin_j dist(xᵢ, cⱼ) Step 3 — Update: recompute each centroid as the mean of all points assigned to it. cⱼ = (1/|Sⱼ|) Σ xᵢ for all xᵢ in cluster Sⱼ Step 4 — Repeat Steps 2 and 3 until assignments stop changing (convergence). Concrete example: 6 points in 2D — (1,1), (1,2), (2,1), (8,8), (9,8), (8,9) — with k=2. Initial centroids: c₁=(1,1), c₂=(8,8). Assign: first three points → cluster 1; last three → cluster 2. Update: c₁=(4/3, 4/3)→ actually (1.33,1.33); c₂=(8.33,8.33). Iterate: assignments unchanged → converged in one step. Result: two tight clusters perfectly separated, which matches the true geometry.
Choosing k — the critical hyperparameter: k-means requires you to specify k before running. If you pick too small a k, genuinely distinct groups merge. If k is too large, real groups split artificially. Two strategies for selecting k: 1. The Elbow Method: run k-means for k = 1, 2, 3, ... and plot total within-cluster sum of squares (WCSS) against k. WCSS always decreases as k increases, but it decreases steeply at first and then flattens. The 'elbow' — where the rate of decrease sharply slows — suggests the right k. 2. The Silhouette Score: for each point, compute how similar it is to its own cluster versus the nearest other cluster. Scores range from −1 (wrong cluster) to +1 (well-separated). Average over all points; higher is better. Neither method is infallible. Domain knowledge about how many groups make sense often matters as much as any metric.
Flashcards — click each card to reveal the answer
Limitations You Must Know
k-means is elegant but has real weaknesses that practitioners must understand: 1. Assumes spherical clusters: k-means minimizes Euclidean distance, so it works best when clusters are roughly round and similarly sized. Elongated, crescent-shaped, or nested clusters fool it. 2. Sensitive to initialization: different random starting centroids can yield different final partitions. Mitigation: run k-means multiple times with different seeds and keep the result with the lowest WCSS. The k-means++ initialization strategy also reduces this problem by spreading initial centroids apart. 3. Requires specifying k: the number of clusters must be chosen before the algorithm runs — it cannot discover k on its own. 4. Outliers distort centroids: a single extreme point shifts the centroid of its cluster, potentially breaking the partition. k-medoids (which uses actual data points as representatives rather than means) is more robust. Alternative algorithms for non-spherical structure — DBSCAN (Density-Based Spatial Clustering of Applications with Noise) can discover arbitrary shapes and automatically identifies noise points that belong to no cluster.
k-means uses Euclidean distance. If one feature ranges from 0 to 1,000 and another from 0 to 1, the large-range feature will dominate all distance calculations, and the second feature will be effectively ignored. Always standardize features (zero mean, unit variance) before running k-means — or any distance-based clustering algorithm.
After one full iteration of k-means with k=3, what happens to the centroids?
A researcher applies k-means to a dataset where the true clusters are two interleaved crescent shapes. Why will k-means likely fail?
Manual k-means
- Step 1: Draw a 10x10 grid on paper. Plot these 8 points: A(1,1), B(2,2), C(1,2), D(8,8), E(9,7), F(8,9), G(5,1), H(5,2).
- Step 2: Set k=2. Place initial centroids at C1=(1,1) and C2=(8,8).
- Step 3: Assign each point to its nearest centroid. Write the assignments.
- Step 4: Recalculate each centroid as the mean of its assigned points (average x and average y separately).
- Step 5: Repeat Steps 3-4. Has the partition changed? Repeat until stable.
- Step 6: What happened to points G and H? Try k=3 — does the result change meaningfully? What does this tell you about choosing k?