Bruno Moura Valle Costa is a 4th-year student at the University of São Paulo, currently studying Civil Engineering.
Cluster analysis is, now more than ever, relevant to several areas of research. As a statistical tool, it can be applied in studies such as market segmentation, policymaking and behavior research. It is a multivariate interdependence analysis technique and is considered an unsupervised learning technique when applied in machine learning. Ideally, the algorithm produces multiple clustered groups with high internal homogeneity and high external heterogeneity, creating a taxonomy based on the behavior of the sampled data.
In this project, the study focuses on a non-hierarchical method of cluster analysis: the k-means procedure. In Big Data settings, this method is less computationally demanding and requires fewer iterations to converge than the usual hierarchical alternative. However, k-means requires a pre-defined number of clusters as input. This imposes a core challenge: the data analyst must estimate upfront the number of clusters contained in the data, which isn't always possible, especially in the Big Data world, where the sheer volume of data can make trends harder to visualize. Equally important, the data analyst must be able to select the proper attributes to establish the initial clusters and to give those clusters some practical meaning. Hence, before the full potential of the k-means algorithm can be realized in Big Data, some complex pre-processing of the data to be clustered might be needed.
Previous literature defines several methods for estimating the number of clusters. One is the elbow method, which runs the k-means algorithm for multiple candidate numbers of clusters, calculates the distortion in the clusters for each run, and searches for the inflection point, i.e., where clustering of the data can be considered optimal. Other methods, such as the R² and Silhouette methods, can also be considered. However, little is known about how these methods can be optimally leveraged and compared to each other in an open-source Big Data platform technology such as HPCC Systems. Therefore, this project aims to contribute to the overall effort of providing data analysts with the optimal tool set for estimating the number of potential groups to be formed from the data residing on the HPCC Systems platform, where the k-means algorithm is already available as a machine learning plugin.
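As an illustration of the elbow method described above, the following is a minimal sketch in plain Python rather than ECL, and it is not the project's implementation. The function names, the deterministic farthest-point initialization, and the choice of the largest second difference of the distortion curve as the "inflection point" are all assumptions made for this example; the toy points are invented, not taken from any real data set.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centers(points, k):
    """Farthest-point initialization (an assumed, deterministic choice)."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point whose nearest existing center is farthest away.
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def kmeans_distortion(points, k, max_iter=100):
    """Run Lloyd's k-means and return the within-cluster sum of squares."""
    centers = init_centers(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members (keep it if empty).
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def estimate_k_elbow(points, k_max):
    """Return (k at the sharpest bend, distortions for k = 1..k_max)."""
    distortions = [kmeans_distortion(points, k) for k in range(1, k_max + 1)]
    best_k, best_bend = None, float("-inf")
    for k in range(2, k_max):
        # Discrete second difference of the distortion curve at k.
        bend = distortions[k - 2] - 2 * distortions[k - 1] + distortions[k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k, distortions


# Hypothetical toy data: three well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (0, 10), (0, 11), (1, 10), (1, 11),
]
best_k, distortions = estimate_k_elbow(points, k_max=6)
print(best_k)  # prints 3 for this toy data
```

On real data the distortion curve is rarely this clean, which is one reason comparing the elbow estimate against other criteria, such as the Silhouette method, can be useful.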
To achieve this aim, the project first develops a factor analysis routine on a pre-defined database, the Chicago Crimes Database, chosen for its multivariate composition and large number of records. The routine attempts to reduce the dimensionality and create orthogonal variables, which have no correlation with each other and make the groups easier to interpret. This factor analysis involves multiple statistical tools, such as Pearson’s Chi-squared tests for correlation, and uses the pre-existing ECL Library, possibly generating new functions and libraries that will contribute to the HPCC Systems community.
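To make the Pearson’s Chi-squared test mentioned above concrete, here is a small sketch in plain Python (not ECL, and not the project's code) that computes the test statistic for a contingency table of two categorical variables. The function name and the counts are hypothetical and are not drawn from the Chicago Crimes Database.

```python
def chi_squared_statistic(table):
    """Pearson's chi-squared statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical 2x2 table of counts for two categorical attributes.
counts = [[10, 20],
          [20, 10]]
stat, dof = chi_squared_statistic(counts)
print(round(stat, 4), dof)
```

A statistic that is large relative to the chi-squared distribution with `dof` degrees of freedom suggests the two attributes are associated. For one degree of freedom, the 5% critical value is about 3.841.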
Big Data analysis is becoming more relevant as data volumes grow day by day. In this context, the k-means procedure stands out as a computationally efficient cluster analysis method, and its biggest barrier, defining the number of clusters, could be tackled through HPCC Systems functionality. Thus, this project aspires to provide auxiliary tools and a full analysis experience for determining the number of clusters in the k-means algorithm through HPCC Systems.
In this Video Recording, Bruno provides a tour and explanation of his poster content.
Estimating the Number of Clusters for the K-means Algorithm in HPCC Systems