What?
K-means clustering is an unsupervised machine learning algorithm for clustering. We start by picking k centres at random in our dataset and then keep refining them until the algorithm converges or we choose to stop. (Finding the optimal clustering is an NP-hard problem, so k-means only finds an approximate solution.)
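The quantity that k-means tries to make small is usually written as the within-cluster sum of squared distances. The symbols below are notation assumed here for illustration and do not come from the text above: C_j is the set of observations assigned to cluster j, and \mu_j is the centre (mean) of that cluster.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 ,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Minimising J exactly over all possible assignments is the NP-hard part; the iterative procedure only guarantees that J does not increase from one step to the next.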
Why?
We do this to approximate, as closely as we can, the best positions for those centres, which gives a good estimate of which observations in our data are similar to each other.
How?
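Starting from randomly chosen centres, we calculate the Euclidean distance from each observation to every centre and assign the observation to the closest one. Each centre is then moved to the mean of the observations assigned to it, and the two steps are repeated until the total distance between observations and their centres stops decreasing. The method is used because it is a computationally cheap way to get an approximate solution to our problem of dividing the data into groups, or clusters.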
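The procedure can be sketched in a few lines of plain Python. This is a minimal illustration and not a reference implementation: the function names `squared_distance` and `k_means`, the `max_iters` and `seed` parameters, and the six sample points are all assumptions made for this example. Initial centres are drawn at random from the data, and the loop stops once no observation changes cluster.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iters=100, seed=0):
    """Hypothetical helper: basic k-means on a list of tuples."""
    rng = random.Random(seed)
    # Pick k distinct observations as the random starting centres.
    centres = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each observation goes to its closest centre.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        # Stop when no observation changed cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each centre to the mean of its observations.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # leave an empty cluster's centre where it is
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres, assignments


# Made-up data: two visibly separate groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centres, labels = k_means(points, k=2)
print(centres)
print(labels)
```

Minimising the squared distance picks the same closest centre as minimising the distance itself, so the square root is skipped. On data that is less cleanly separated, different seeds can end in different clusterings, which is why the algorithm is often run several times.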
Limitations
One possible outcome is that there are no organic clusters in the data; instead, all of the data fall along the continuous feature ranges within one single group.
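A small experiment can make this concrete. The sketch below is an assumption-laden illustration: the helper `k_means_1d`, its starting centres, and the evenly spaced values are invented for the example. It runs the assign-and-update loop on one-dimensional data with no gaps at all. The algorithm still returns two clusters, even though nothing in the data suggests a boundary.

```python
def k_means_1d(values, centres, max_iters=100):
    """Hypothetical helper: k-means on 1-D data from given starting centres."""
    centres = list(centres)
    labels = None
    for _ in range(max_iters):
        # Assign each value to the nearest centre.
        new_labels = [
            min(range(len(centres)), key=lambda j: abs(v - centres[j]))
            for v in values
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centre to the mean of the values assigned to it.
        for j in range(len(centres)):
            members = [v for v, a in zip(values, labels) if a == j]
            if members:
                centres[j] = sum(members) / len(members)
    return centres, labels


# Twenty evenly spaced values: one continuous range, no natural groups.
values = [i / 10 for i in range(20)]
centres, labels = k_means_1d(values, centres=[0.0, 0.1])
print(centres)
print(labels)
```

The split lands roughly in the middle of the range, which reflects the geometry of the algorithm and not any structure in the data. A result like this should be read as a partition of convenience, not as evidence of two groups.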
Additional resources
- Jupyter notebook: scikit-learn K-Means clustering
- Trevino, Andrea, ‘Introduction to K-Means Clustering’ <https://www.datascience.com/blog/introduction-to-k-means-clustering-algorithm-learn-data-science-tutorials> [accessed 8 December 2016]