How to cluster tabular data with Markov Clustering
Have you ever wondered how to cluster tabular data?
Whether you are categorizing emails or segmenting customers, clustering is often the way to go. It consists of dividing a dataset into groups (or clusters) such that objects in the same cluster are more similar to each other than to objects in other clusters. Similarity is computed with a metric chosen to suit the problem at hand.
The state of the art of clustering algorithms
Data scientists use clustering algorithms such as k-means, hierarchical clustering or DBSCAN in exploratory analysis of tabular data. Another common type of data is the network. In network analysis jargon, clustering network data is known as community detection, and the clusters represent the communities.
Refer to [1] and [2] for more details on community detection and state-of-the-art algorithms for clustering networks, respectively.
What is Markov Clustering?
In this episode I explain how a community detection algorithm known as Markov Clustering can be constructed by combining simple concepts such as random walks, graphs and similarity matrices.
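To make the idea concrete, here is a minimal sketch of Markov Clustering in the spirit of [3], written with NumPy. It is not the code from the episode or the blog. The function name `markov_clustering`, its parameters and their default values, and the toy graph (two triangles joined by one edge) are all assumptions made for illustration. The loop alternates expansion (taking several random-walk steps at once) with inflation (boosting strong transitions and weakening faint ones) until the matrix stops changing.

```python
import numpy as np


def markov_clustering(adjacency, expansion=2, inflation=2.0, max_iter=100, tol=1e-6):
    """Illustrative MCL sketch (hypothetical helper, not a library API)."""
    A = np.asarray(adjacency, dtype=float)
    # Add self-loops so a random walker may also stay where it is.
    M = A + np.eye(A.shape[0])
    # Column j now holds the transition probabilities out of node j.
    M = M / M.sum(axis=0, keepdims=True)

    for _ in range(max_iter):
        previous = M
        # Expansion: matrix power, i.e. several random-walk steps at once.
        M = np.linalg.matrix_power(M, expansion)
        # Inflation: element-wise power followed by column renormalisation.
        M = M ** inflation
        M = M / M.sum(axis=0, keepdims=True)
        if np.allclose(M, previous, atol=tol):
            break

    # Rows that keep non-zero mass belong to "attractor" nodes; the columns
    # that are non-zero in such a row are the members of that cluster.
    clusters = set()
    for row in M:
        members = tuple(np.flatnonzero(row > 1e-8).tolist())
        if members:
            clusters.add(members)
    return sorted(clusters)


# Toy graph (assumed for illustration): triangles 0-1-2 and 3-4-5, bridged by edge 2-3.
toy_graph = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
print(markov_clustering(toy_graph))
```

In this sketch a higher `inflation` value tends to produce more, smaller clusters, while a lower one tends to merge them. This simple version uses dense matrices, so it is only suitable for small graphs.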
Moreover, I show how to build a similarity graph from tabular data and then run a community detection algorithm, Markov Clustering in particular, on that graph to find the clusters.
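As a rough sketch of the first step, the snippet below turns a small table into a weighted similarity graph. The choices here are assumptions for illustration, not something stated in the episode: a Gaussian kernel on Euclidean distance, a k-nearest-neighbour sparsification, and the hypothetical names `similarity_graph`, `k` and `sigma`. In practice the features would usually be scaled first, and the metric should match the problem.

```python
import numpy as np


def similarity_graph(X, k=3, sigma=1.0):
    """Build a symmetric k-nearest-neighbour similarity matrix (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances between rows of the table.
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    # Gaussian kernel: close rows get similarity near 1, distant rows near 0.
    S = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)  # no self-similarity edges

    # Keep only the k strongest edges of each row.
    W = np.zeros_like(S)
    nearest = np.argsort(-S, axis=1)[:, :k]
    rows = np.arange(S.shape[0])[:, None]
    W[rows, nearest] = S[rows, nearest]
    # Symmetrise: keep an edge if either endpoint selected it.
    return np.maximum(W, W.T)


# Toy table (assumed for illustration): two well-separated groups of rows.
table = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])
W = similarity_graph(table, k=2)
print(np.round(W, 3))
```

The resulting matrix `W` plays the role of a weighted adjacency matrix, which is the kind of input a graph clustering algorithm such as Markov Clustering expects.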
In addition to the episode, you can find a simple hands-on code snippet to play with on the Amethix Blog.
Enjoy the show!
References
[1] S. Fortunato, “Community detection in graphs”, Physics Reports, volume 486, issues 3-5, pages 75-174, February 2010.
[2] Z. Yang, et al., “A Comparative Analysis of Community Detection Algorithms on Artificial Networks”, Scientific Reports, volume 6, article number 30750, 2016.
[3] S. van Dongen, “A cluster algorithm for graphs”, Technical Report, CWI (Centre for Mathematics and Computer Science), Amsterdam, The Netherlands, 2000.
[4] A. J. Enright, et al., “An efficient algorithm for large-scale detection of protein families”, Nucleic Acids Research, volume 30, issue 7, pages 1575-1584, 2002.
[5] Data Science at Home blog and podcast