It is natural to think of the centroid of a geometric figure, such as a triangle, as a sort of "balance point". That is, if we had constructed a triangular shape out of cardboard, then we could keep the figure perfectly balanced and horizontal while supporting it with our finger at just that point.
Recall that, for a triangle with vertices A, B, C, the centroid is at (A+B+C)/3. For a general geometric shape, the coordinates of the centroid are determined by the following equations:
  area    = integral ( shape ) 1 dx dy
  cent(x) = integral ( shape ) x dx dy / area
  cent(y) = integral ( shape ) y dx dy / area
But now suppose that we constructed a shape that had a triangular profile in the (x,y) plane, but had a varying height in the z direction. We will call this varying height a "density", which we will usually write as rho(x,y). We assume that rho(x,y) is always nonnegative. Although this shape is more complicated, it also must have a balance point, which we call the center of mass. Without knowing more about rho(x,y), it is not possible to write down simple formulas for the center of mass, even for a triangle. So the most we can say in general is:
  mass   = integral ( shape ) rho(x,y) 1 dx dy
  com(x) = integral ( shape ) rho(x,y) x dx dy / mass
  com(y) = integral ( shape ) rho(x,y) y dx dy / mass
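Although there is no closed formula for a general density, the integrals can be approximated numerically. The following is only a minimal sketch, not a definitive implementation: the function name triangle_center_of_mass, the grid size n, and the choice of a midpoint rule on a unit square mapped onto the triangle are all assumptions made for illustration.

```python
def triangle_center_of_mass(a, b, c, rho, n=200):
    """Approximate the mass and center of mass of triangle a, b, c
    with density rho(x, y), using an n-by-n midpoint rule.
    (Hypothetical helper, written for illustration only.)"""
    # Twice the area of the triangle, from the cross product of two edges.
    area2 = abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
    mass = 0.0
    mx = 0.0
    my = 0.0
    for i in range(n):
        u = (i + 0.5) / n
        for j in range(n):
            v = (j + 0.5) / n
            # Map the point (u, v) of the unit square onto the triangle.
            x = (1 - u) * a[0] + u * ((1 - v) * b[0] + v * c[0])
            y = (1 - u) * a[1] + u * ((1 - v) * b[1] + v * c[1])
            # The Jacobian of this map is area2 * u.
            w = rho(x, y) * area2 * u / (n * n)
            mass += w
            mx += w * x
            my += w * y
    return mass, (mx / mass, my / mass)


# Constant density: the center of mass should be close to the centroid (A+B+C)/3.
mass1, com1 = triangle_center_of_mass((0, 0), (1, 0), (0, 1), lambda x, y: 1.0)

# Density rho(x,y) = x: the center of mass is pulled toward larger x.
mass2, com2 = triangle_center_of_mass((0, 0), (1, 0), (0, 1), lambda x, y: x)
```

With rho(x,y) = 1, the mass reduces to the area and the result should agree with the centroid of the triangle. With rho(x,y) = x on the same triangle, the exact center of mass is (1/2, 1/4), so the balance point shifts toward the heavier side.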
In the K-means problem, we have N data items to group. We can have a weighted K-means problem by giving each item a weight W as well. Instead of discrete centroids of groups, we compute discrete centers of mass: the center of a group is the sum of W(i) * X(i) over the items in the group, divided by the sum of the weights W(i). As an example, suppose we have data for the locations and populations of 128 cities in the United States. A weighted K-means algorithm would notice the high population density along the coast, which would attract centers in that direction. (In the notebook, we won't have time to do an exercise for this topic.)
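The weighted update can be sketched in a few lines. This is an illustrative version of the usual assign-and-update iteration with the centroid replaced by a weighted average. The function name weighted_kmeans, the fixed iteration count, and the tiny data set are assumptions for the example and are not the city data mentioned above.

```python
def weighted_kmeans(points, weights, centers, iterations=20):
    """Weighted K-means sketch: each center becomes the center of mass
    of the points assigned to it. (Hypothetical helper for illustration.)"""
    centers = [tuple(c) for c in centers]
    dim = len(centers[0])
    k = len(centers)
    for _ in range(iterations):
        wsum = [0.0] * k                         # total weight per group
        psum = [[0.0] * dim for _ in range(k)]   # weighted coordinate sums
        for p, w in zip(points, weights):
            # Assign the point to its nearest center (squared distance).
            j = min(
                range(k),
                key=lambda m: sum((p[d] - centers[m][d]) ** 2 for d in range(dim)),
            )
            wsum[j] += w
            for d in range(dim):
                psum[j][d] += w * p[d]
        # Move each center to the center of mass of its group.
        # A group with zero total weight keeps its previous center.
        centers = [
            tuple(s / wsum[j] for s in psum[j]) if wsum[j] > 0 else centers[j]
            for j in range(k)
        ]
    return centers


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
weights = [1.0, 3.0, 1.0, 1.0]
result = weighted_kmeans(points, weights, [(0.0, 0.0), (10.0, 0.0)])
```

In this small example the first group contains a point of weight 1 at x = 0 and a point of weight 3 at x = 1, so its center settles at x = 0.75 rather than at the unweighted midpoint 0.5. The second group has equal weights, so its center is the ordinary centroid at x = 10.5.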
Topics: