Varsha R Jenni is a second-year undergraduate student at RV College of Engineering. Her interests include machine learning and distributed computing. She has found HPCC Systems to be a great open source platform that makes data processing and analysis easier and faster.
Density-based spatial clustering of applications with noise (DBSCAN) is a popular clustering algorithm that groups data points which are close together using two parameters: eps, the radius of the neighborhood around each point, and MinPts, the minimum number of points required within that radius to form a dense region. However, the performance of DBSCAN degrades on datasets containing clusters of varying density. The poster proposes a novel distributed and adaptive DBSCAN (ADBSCAN) algorithm implemented on the HPCC Systems platform. The proposed approach uses techniques such as grid search and a Gaussian kernel to search for optimized values of the threshold density of clusters, thus eliminating the requirement for users to specify the parameters. Further, the experimental investigation suggests that the proposed ADBSCAN performs better than existing ADBSCAN implementations that use k-dist and Gaussian kernels.
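To make the role of the two parameters concrete, here is a minimal single-machine sketch of classic DBSCAN in plain Python. It is not the distributed HPCC Systems implementation described in the poster; the function names `region_query` and `dbscan` and the sample points are assumptions chosen for illustration, and the brute-force neighbor search is only practical for small datasets.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may still become a border point later
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


# Hypothetical sample data: two tight squares and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
sample_labels = dbscan(sample, eps=1.5, min_pts=3)
```

With a single global eps, a dataset that mixes tight and loose clusters forces a compromise: a small eps breaks up the loose clusters, while a large eps merges the tight ones. This is the limitation that adaptive variants aim to address.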
With the rapid advancement of technology, data is being generated faster and in larger volumes. However, it is not the data itself that is valuable; the information conveyed by the data is of greater importance. Organizations can use this information for the betterment of society. Data mining is a broad field that deals with the extraction of information from raw data, and clustering is one of its tasks. Clustering is one of the most common unsupervised learning techniques. It aims to partition a dataset into groups of similar elements such that each group differs from the others. Some of the applications of the DBSCAN algorithm are market research and analysis, image processing, and anomaly detection.
This study aims to implement an efficient, distributed, and adaptive DBSCAN (ADBSCAN) algorithm on HPCC Systems which first determines the threshold density for any given dataset, including datasets with clusters of variable density, thus eliminating the need for users to specify the parameter values. Further, the manuscript discusses other ADBSCAN implementations and compares the proposed approach with them using various open datasets.
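As a rough illustration of how a density threshold can be estimated from the data instead of being supplied by the user, the sketch below follows the general k-dist idea used by some existing adaptive approaches. It is not the grid search and Gaussian kernel method proposed in the poster. The names `k_distances` and `estimate_eps` are hypothetical, and the "largest jump" rule for locating the knee is a simplifying assumption.

```python
from math import dist  # Euclidean distance, Python 3.8+


def k_distances(points, k):
    """Ascending list of each point's distance to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])
    return sorted(result)


def estimate_eps(points, k):
    """Assumed knee rule: return the k-distance just before the largest jump."""
    kd = k_distances(points, k)
    gaps = [kd[i + 1] - kd[i] for i in range(len(kd) - 1)]
    knee = max(range(len(gaps)), key=gaps.__getitem__)
    return kd[knee]


# Hypothetical sample data: two tight squares and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
eps_guess = estimate_eps(sample, k=2)
```

The estimate is a single global value, so it inherits the weakness of plain DBSCAN on datasets whose clusters differ in density. That weakness is the motivation for searching over threshold densities.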
In this Video Recording, Varsha provides a tour and explanation of her poster content.
The original PDF version of the poster is available for download.