How to cluster tabular data with Markov Clustering

Have you ever wondered how to cluster tabular data? Whether you are categorizing emails or segmenting customers, clustering is often the way to go. The approach divides a dataset into groups (or clusters) such that objects in the same cluster are more similar to each other than to objects in other clusters. Similarity is measured with a specific metric, which can change with the problem to be solved.
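To make the idea of a similarity metric concrete, here is a minimal sketch that turns the rows of a small table into a pairwise similarity matrix. The Gaussian kernel, the function name `gaussian_similarity` and the `sigma` parameter are illustrative assumptions, not something prescribed by the episode; any metric suited to the problem could take their place.

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Pairwise similarity in (0, 1] computed from squared Euclidean distances.

    The Gaussian kernel and `sigma` are assumptions chosen for illustration.
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]      # shape (n, n, d)
    sq_dist = (diff ** 2).sum(axis=-1)        # shape (n, n)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Three rows of a toy table: the first two are close, the third is far away.
rows = np.array([[1.0, 2.0], [1.1, 2.1], [8.0, 9.0]])
S = gaussian_similarity(rows)
```

With this choice, rows that are close in feature space get a similarity near 1 and distant rows get a similarity near 0.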

The state of the art in clustering algorithms

Data scientists use clustering algorithms such as k-means, hierarchical clustering or DBSCAN in exploratory data analysis of tabular data. Another common type of data is the network. In network analysis jargon, clustering network data is known as community detection, and the clusters represent the communities. Refer to [1] for more details on community detection and to [2] for a comparison of state-of-the-art algorithms for clustering networks.

What is Markov Clustering?

In this episode I explain how a community detection algorithm known as Markov Clustering can be constructed by combining simple concepts such as random walks, graphs and similarity matrices. I also show how to build a similarity graph from tabular data and then run a community detection algorithm on that graph to find clusters in the original table. In addition to the episode, you can find a simple hands-on code snippet to play with on the Amethix Blog. Enjoy the show!
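The two steps described above (build a similarity graph, then run Markov Clustering on it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the snippet from the Amethix Blog: the function names `similarity_graph` and `markov_clustering`, the Gaussian kernel, the `threshold` used to prune weak edges, and the default `expansion` and `inflation` values are all hypothetical choices. The loop alternates expansion (a matrix power, which simulates longer random walks) and inflation (an element-wise power followed by column normalization, which strengthens strong transitions and weakens weak ones).

```python
import numpy as np

def similarity_graph(X, sigma=1.0, threshold=1e-3):
    """Weighted adjacency matrix built from table rows (assumed Gaussian kernel)."""
    X = np.asarray(X, dtype=float)
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    A = np.exp(-sq_dist / (2.0 * sigma ** 2))
    A[A < threshold] = 0.0   # drop very weak edges (assumed cutoff)
    return A                 # the diagonal is 1, so every node keeps a self-loop

def markov_clustering(A, expansion=2, inflation=2.0, max_iter=100, tol=1e-6):
    """Simplified Markov Clustering on a dense adjacency matrix."""
    # Column-normalize: column j holds the transition probabilities out of node j.
    M = A / A.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        M_prev = M
        M = np.linalg.matrix_power(M, expansion)   # expansion: longer random walks
        M = M ** inflation                         # inflation: favor strong transitions
        M = M / M.sum(axis=0, keepdims=True)       # make columns sum to 1 again
        if np.abs(M - M_prev).max() < tol:
            break
    # Simplified read-out: each node joins the attractor (row) that holds most
    # of its column mass; attractor ids are then mapped to 0, 1, 2, ...
    attractors = M.argmax(axis=0)
    _, labels = np.unique(attractors, return_inverse=True)
    return labels

# Toy table with two well-separated groups of rows.
table = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
labels = markov_clustering(similarity_graph(table))
print(labels)
```

A higher `inflation` value tends to produce more, smaller clusters, while a lower one tends to merge them. The dense matrices used here are only practical for small tables; larger datasets would call for a sparse representation.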

References

[1] S. Fortunato, “Community detection in graphs”, Physics Reports, volume 486, issues 3-5, pages 75-174, 2010.
[2] Z. Yang, et al., “A Comparative Analysis of Community Detection Algorithms on Artificial Networks”, Scientific Reports, volume 6, article number 30750, 2016.
[3] S. Dongen, “A cluster algorithm for graphs”, Technical Report, CWI (Centre for Mathematics and Computer Science), Amsterdam, The Netherlands, 2000.
[4] A. J. Enright, et al., “An efficient algorithm for large-scale detection of protein families”, Nucleic Acids Research, volume 30, issue 7, pages 1575-1584, 2002.
[5] Data Science at Home blog and podcast
