K-means Clustering | Introduction

K-Means clustering randomly initializes k cluster centroids in the dataset. The algorithm then measures the distance between each cluster centroid and each data point in the dataset; the distance metric used is typically Euclidean or Manhattan distance. Each data point is then assigned to the centroid that is closest to it.
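The assignment step can be sketched in a few lines of NumPy. This is a minimal illustration, not a library call: the toy data and the hand-picked starting centroids are our own assumptions (in practice the initial centroids are chosen at random from the data).

```python
import numpy as np

# Toy dataset: 6 points in 2-D, forming two visible groups.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])

# Hypothetical initial centroids for k = 2.
centroids = np.array([[1.0, 1.0], [8.0, 8.0]])

# Euclidean distance from every point to every centroid:
# a (n_points, k) matrix of distances.
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

# Each point is assigned to the index of its closest centroid.
labels = distances.argmin(axis=1)
print(labels)  # → [0 0 0 1 1 1]
```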


When all points are assigned, the new centroid of each cluster is found by averaging all the data points assigned to that cluster. The new centroids are then used to repeat the process (measure the distance to all points, assign each point to its closest centroid, compute new centroids, repeat).
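The centroid-update step is just a per-cluster mean. A small sketch, reusing the same toy data and assuming the assignments from the previous step:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])
labels = np.array([0, 0, 0, 1, 1, 1])  # assignments from the assignment step
k = 2

# New centroid of each cluster = mean of the points assigned to it.
new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
print(new_centroids)  # cluster means near (1.23, 1.27) and (8.1, 8.4)
```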


This process continues until a stopping condition is reached. The stopping condition is usually met when a maximum number of iterations has been reached, when the movement of the new centroids falls below a specified threshold, or when the centroids stop moving entirely. On data whose structure allows for clustering, k-means typically converges within a small number of iterations.
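The movement-based stopping condition can be checked by measuring how far the centroids traveled between two iterations. A sketch, where the threshold value and the example centroids are our own assumptions:

```python
import numpy as np

old_centroids = np.array([[1.0, 1.0], [8.0, 8.0]])
new_centroids = np.array([[1.23, 1.27], [8.1, 8.4]])

tol = 1e-4  # assumed movement threshold

# Total movement of the centroids in this iteration.
shift = np.linalg.norm(new_centroids - old_centroids)

# Stop iterating once the centroids barely move.
converged = shift < tol
print(shift, converged)  # the centroids moved, so not yet converged
```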


K-Means always requires declaring the number of clusters, k, up front. Below, we will discuss methods for determining an optimal k value.


The following image summarizes K-means clustering: 1a. Randomly initialize *k* cluster centroids, 1b. Measure the distance between each point and the cluster centroids, 2. Assign each data point to the nearest cluster, 3. Average the points in each cluster to find the new cluster centroid; repeat steps 1b to 3 until a stopping condition is reached.
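The full loop described above can be put together in a short function. This is a minimal sketch under our own assumptions (function name, tolerance, and toy data are illustrative), using random initialization from the data points and a centroid-movement stopping condition:

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-4, seed=0):
    """Minimal k-means sketch following steps 1a, 1b, 2, and 3 above."""
    rng = np.random.default_rng(seed)
    # 1a. Randomly pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 1b/2. Assign each point to its nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points
        # (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])
centroids, labels = kmeans(X, k=2)
print(labels)  # the two visible groups end up in different clusters
```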



