
Abhay Kashyap is currently in the third year of a Bachelor's degree in Computer Science and Engineering at RVCE in India.

His academic interests include Big Data Analytics and Computational Astrophysics.

Poster Abstract

Astrophysics is an ever-growing field, and the volume of data collected is increasing rapidly as instruments improve in resolution and in the span of sky they cover. Astronomical data is varied (spectroscopic data, visible-light images, radio signal data) and is collected in massive amounts, so it requires proper and timely analysis to draw meaningful conclusions. The velocity at which the data is collected, its value, and the inconsistencies in it give rise to a need for big data solutions.

In this project, we use the big data platforms HPCC Systems and Hadoop to tackle one such problem. HPCC Systems provides a truly open-source big data solution for quickly processing, analyzing, and understanding large data sets, even massive, mixed-schema data, which makes it suitable for astronomical analysis. The objectives of the project were, first, to identify and analyze a problem in astrophysics that requires an innovative big data solution; second, to explore the libraries available on each platform for solving the problem; and finally, to analyze and compare the performance of the two platforms and tabulate the results.

Galaxy Zoo is a citizen science project that aims to be the largest catalog of classified galaxy images. The data used is taken from the SDSS-II supernova survey, which operated in drift scan mode for three months of the year, and contains about 240,000 images of objects. This data is given to citizen scientists, who classify the object in each image, based on different parameters, as one of three classes: smooth, disk, or star/artifact.

To automate this process and turn it into a pipeline, a machine learning algorithm can be used to perform the classification. To compare Hadoop and HPCC Systems, we used classification speed (throughput), ease of setup, ease of programming, and scalability as parameters. On each platform, the data was first preprocessed to find and correct missing or wrong values. Image processing tools were then used to prepare the data for training. Platform-specific methods were used to classify the test data, and the results were tabulated by model type and platform. Both platforms support most ML classifiers, so KNN, Random Forest, GLM, and a generalized neural network were used for classification and comparison.
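The workflow described above (correct missing values, classify, measure throughput) can be illustrated with a minimal, platform-independent sketch in plain Python. It is not the project's code: the helper names `impute_missing`, `knn_predict`, and `measure_throughput`, the mean-imputation strategy, and the two-feature toy data are all assumptions made for illustration, and the project itself used the platform-specific ML libraries of HPCC Systems and Hadoop.

```python
import time
from collections import Counter
from math import dist


def impute_missing(rows):
    """Replace None entries with the mean of the observed values in that column.

    Mean imputation is an assumed strategy; the abstract does not say how
    missing or wrong values were corrected.
    """
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[c] if r[c] is None else r[c] for c in range(n_cols)]
            for r in rows]


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def measure_throughput(classify, samples):
    """Classify every sample and return (predictions, samples per second)."""
    start = time.perf_counter()
    predictions = [classify(s) for s in samples]
    elapsed = time.perf_counter() - start
    rate = len(samples) / elapsed if elapsed > 0 else float("inf")
    return predictions, rate


# Hypothetical toy features standing in for values extracted from images.
train_x = impute_missing([[0.0, 0.1], [0.2, None], [1.0, 0.9], [0.9, 1.0]])
train_y = ["smooth", "smooth", "disk", "disk"]
test_x = [[0.1, 0.0], [1.0, 1.0]]

predictions, rate = measure_throughput(
    lambda q: knn_predict(train_x, train_y, q, k=3), test_x
)
print(predictions, f"{rate:.0f} samples/s")
```

Timing only the classification step, as `measure_throughput` does, keeps preprocessing cost out of the throughput figure; whether the project measured it that way is not stated in the abstract.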

The results consistently show that the HPCC Systems platform performs better in most machine learning tasks. On average, the throughput of the HPCC Systems ML bundles was 15-19% better. Hadoop is preferred for more textual data, where tasks are based on aggregating and calculating. For iterative tasks and multiple queries, the results show that the multi-node architecture of HPCC Systems performs better. HPCC Systems is also easier to set up, and with it more personalized networks can be created using cloud architecture. The only disadvantage of HPCC Systems is that the tools currently available for preprocessing and post-processing data, especially images, are not sufficient and can be improved upon.
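A percentage throughput difference like the one reported above is commonly computed as a relative gain over a baseline. The sketch below assumes that definition, with Hadoop as the baseline; the abstract does not state the formula used, and the function name `relative_gain` and the sample figures are hypothetical, not measured results.

```python
def relative_gain(baseline, candidate):
    """Percentage by which candidate throughput exceeds baseline throughput."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (candidate - baseline) / baseline * 100.0


# Made-up throughputs in samples per second, for illustration only.
hadoop_rate = 1000.0
hpcc_rate = 1170.0
print(f"{relative_gain(hadoop_rate, hpcc_rate):.1f}% higher throughput")
```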

Presentation

In this Video Recording, Abhay provides a tour and explanation of his poster content.

Astronomical Analysis on HPCC Systems
