Clustering

ABSTRACT: Clustering items using textual features is an important problem with many applications, such as root-cause analysis of spam campaigns, as well as identifying common topics in social media. Due to the sheer size of such data, algorithmic
scalability becomes a major concern. In this work, we present our scalable approach for text clustering that works in two phases. In the first phase, we build an approximate k-NN graph through a parallel, iterative process reminiscent of the well-known NNDescent algorithm, and use the Jaro-Winkler similarity metric. The first phase concludes with a pruning stage, which strives at eliminating spurious links between items with low
similarity. In the second phase, we use a parallel approach to identify connected components in the k-NN graph, which are a proxy for clusters of similar items. Our focus is to understand the scalability / accuracy tradeoff that underlies our method: we do
so through an extensive experimental evaluation, where we use real-life datasets including a 4 Millions SPAM emails collected by Symantec Research Labs, and a dump of 1.5 Million tweets collected using the Twitter API. Our experimental results show that even rough approximations of k-NN graphs are sufficient to identify useful clusters, which we validate both using standard metrics and with the help of domain experts. Our method –
which we implemented for the Apache Spark framework – is scalable and can be easily tuned to meet requirements stemming from different application domains

PUBLICATIONS
Alessandro Lulli, Thibault Debatty, Laura Ricci, Matteo Dell’Amico, and Pietro Michiardi, Scalable k-NN based text clustering, Accepted ad IEEE BigData 2015

CODE
The code used for the experiments is available at:
https://github.com/alessandrolulli/knnMeetsConnectedComponents

Please do not hesitate to contact me in case of any issues: