K-means clustering is one way of taking some data and allowing a computer to do what you do pretty naturally with your eyes and brain—separate the data into distinguishable clusters. For example, in the graph shown below, you can very easily see two clumps of points (points A and D in a clump and points B and C in a clump). A computer, to the extent it sees anything, sees just four points and their coordinates.

Why not just use our eyes and brain? Because once we teach a computer to approximate our ability to cluster 2D or 3D data, it can cluster data with many more than just 2 or 3 components. And then its “seeing” outpaces ours by quite a lot.

Let’s take a look at the instructions a computer could follow to do k-means clustering, and then we’ll dress it all up in linear algebra symbolism some other time. To start, I’ve just made 2 clusters of points (which we know about, but the computer doesn’t), each point having 2 components (an x-component and a y-component, i.e., 2D data).

Determining Least Distances

To start, we select a \(\mathtt{k}\), the number of clusters that we want, remembering that here we happen to know how many clusters there are, but in most situations we would not. We choose \(\mathtt{k=2}\). Then we place the two cluster centers at random locations—here I’ve put \(\mathtt{\color{blue}{k_1}}\) at \(\mathtt{(2,7)}\) and \(\mathtt{\color{red}{k_2}}\) at \(\mathtt{(4,2)}\).

Next, we calculate the distance from each point to each center. This is the good ol’ Pythagorean-Theorem-ish Euclidean distance, \(\mathtt{\sqrt{(x_2-x_1)^{2}+(y_2-y_1)^{2}}}\). Each point is assigned to the cluster whose center is closest to it. You can run the code below to print out the 8 distances.
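Here is a minimal Python sketch of that distance step. The coordinates of A come from the graph, and the coordinates I’ve used for B, C, and D are inferred from the cluster means worked out later in the post, so the exact labeling of those three points is an assumption:

```python
import math

# Point coordinates: A is read off the graph; B, C, D are inferred
# from the cluster means later in the post (assumed labeling)
points = {"A": (5, 6), "B": (2, 4), "C": (2, 3), "D": (6, 5)}

# Initial (randomly chosen) center locations from the post
centers = {"k1": (2, 7), "k2": (4, 2)}

def distance(p, q):
    """Euclidean distance between two 2D points."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

# Print all 8 point-to-center distances
for name, p in points.items():
    d1 = distance(p, centers["k1"])
    d2 = distance(p, centers["k2"])
    print(f"{name}: to k1 = {d1:.2f}, to k2 = {d2:.2f}")
```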

Going by the shorter distance in each pair, our first result is to group point A in cluster \(\mathtt{\color{blue}{k_1}}\), because the first number in that pair is smaller, and points B, C, and D in cluster \(\mathtt{\color{red}{k_2}}\), because the second number in each of those pairs is the smaller one.

Moving the Centers Based on the Means

The next step—and last before repeating the process—is to move each center to the mean of the points currently assigned to it. The mean of the points is determined by calculating the mean of the components separately. So, for our current cluster \(\mathtt{\color{red}{k_2}}\), the points B, C, and D have a mean of (\(\mathtt{\frac{2 + 2 + 6}{3}}\), \(\mathtt{\frac{4 + 3 + 5}{3}}\)), or (\(\mathtt{\frac{10}{3}}\), \(\mathtt{4}\)). And since \(\mathtt{\color{blue}{k_1}}\) has just one point, A, it will move to smack dab on top of that point, at (5, 6).
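In code, the update step just averages each cluster’s components separately (a sketch, using the same assumed coordinates as before):

```python
# Points currently assigned to each cluster, from the first round
# of distance comparisons (B, C, D coordinates are assumptions)
clusters = {
    "k1": [(5, 6)],                  # point A
    "k2": [(2, 4), (2, 3), (6, 5)],  # points B, C, D
}

def mean_point(pts):
    """Componentwise mean of a list of 2D points."""
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# Move each center to the mean of its cluster:
# k1 lands on top of A; k2 moves to (10/3, 4)
centers = {k: mean_point(pts) for k, pts in clusters.items()}
print(centers)
```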

Now we can do another round of distance comparisons, given the new center locations. These calculations give us what we can see automatically—that points A and D belong to one cluster and points B and C belong to another cluster. In this case, A and D belong to cluster \(\mathtt{\color{blue}{k_1}}\) and B and C belong to cluster \(\mathtt{\color{red}{k_2}}\).
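Rerunning the same comparison against the updated centers confirms the reassignment (again a sketch, with B, C, D coordinates assumed):

```python
import math

points = {"A": (5, 6), "B": (2, 4), "C": (2, 3), "D": (6, 5)}
centers = {"k1": (5, 6), "k2": (10 / 3, 4)}  # centers after the mean step

def nearest_center(p):
    """Name of the center closest to point p."""
    return min(centers, key=lambda k: math.dist(p, centers[k]))

assignments = {name: nearest_center(p) for name, p in points.items()}
print(assignments)  # → {'A': 'k1', 'B': 'k2', 'C': 'k2', 'D': 'k1'}
```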

The cluster centers now move to the means of each pair of points, placing them where we would likely place them to begin with (directly between the two points in each cluster). Further rounds won’t change these assignments, so the k-means algorithm is done when a full pass of reassigning points and re-averaging centers no longer changes anything (or, in practice, changes things only negligibly).
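Putting the steps together, a minimal k-means loop repeats assign-then-average until the assignments stop changing. This is a sketch under the same assumed coordinates, not a library-grade implementation:

```python
import math

def kmeans(points, centers, max_iters=100):
    """Plain k-means: assign each point to its nearest center, move
    each center to the mean of its points, repeat until stable."""
    assignments = None
    for _ in range(max_iters):
        # Assignment step: index of the nearest center for each point
        new_assignments = [
            min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        if new_assignments == assignments:  # nothing changed: we're done
            break
        assignments = new_assignments
        # Update step: move each center to the mean of its cluster
        for i in range(len(centers)):
            cluster = [p for p, a in zip(points, assignments) if a == i]
            if cluster:  # leave a center in place if its cluster is empty
                centers[i] = (
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                )
    return centers, assignments

pts = [(5, 6), (2, 4), (2, 3), (6, 5)]  # A, B, C, D (assumed coordinates)
centers, labels = kmeans(pts, [(2, 7), (4, 2)])
print(centers)  # → [(5.5, 5.5), (2.0, 3.5)]
print(labels)   # → [0, 1, 1, 0]
```

The final centers land directly between the two points of each cluster, exactly as described above.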