The K-Means problem is a warmup exercise for the geometric tasks we will need to carry out later. The K-Means task starts with a collection of N items of data, and asks for the "best" way of sorting them into K groups, so that the items in each group are "close" to each other.
This means we need to
Let G(J) be the J-th of the K groups. We would expect that any item X(I) in this group is there because it is close to the other items in the group. We can measure the closeness of two items using Euclidean distance. It turns out to be expensive to compare every pair of items, so instead we represent each group by its average value, and measure closeness relative to that. Let M(J) be the average of the items in group G(J). We say the "point energy" E(I,J) associated with item X(I) in group G(J) is ||X(I)-M(J)||^2, that is, the square of the Euclidean distance from X(I) to M(J). The "group energy" E(J) is then simply the sum of the point energies of all the points in the group. One technical reason why squaring is the right thing to do is that the average M(J) is exactly the point that minimizes the sum of squared distances to the items in the group.
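The definitions of M(J), E(I,J) and E(J) can be sketched in a few lines of Python. This is only an illustration under simple assumptions: items are stored as tuples of floats, a group is a non-empty list of such tuples, and the function names group_mean, point_energy and group_energy are made up for this example.

```python
def group_mean(points):
    """M(J): the average of a non-empty list of points (tuples of equal length)."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dim))


def point_energy(x, m):
    """E(I,J): the squared Euclidean distance from item x to the group mean m."""
    return sum((xd - md) ** 2 for xd, md in zip(x, m))


def group_energy(points):
    """E(J): the sum of the point energies of all the items in the group."""
    m = group_mean(points)
    return sum(point_energy(x, m) for x in points)


# A small made-up group of three items in the plane.
group = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
print(group_mean(group))    # (1.0, 1.0)
print(group_energy(group))  # 8.0
```

Here the three point energies are 2, 2 and 4, so the group energy is 8.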
If E(J) is the energy of the J-th group, then let E, the "total energy", be the sum of the group energies E(J) for 1 <= J <= K. We will use the energy value as an indicator of the goodness of a grouping. Given two possible groupings, we will prefer the one with the smaller total energy.
The algorithm we will use is a weak version of the K-means algorithm. Begin by assigning each point to a group, in any way you want (but it is helpful to avoid having any empty groups). Now iterate the following steps:
Topics: