Vannel Zeufack is studying for a Master's in Computer Science at Kennesaw State University.
Vannel joined our intern program in 2020 for a second year. He has a keen interest in machine learning algorithms, having completed a project last year that involved contributing to a machine learning bundle focused on anomaly detection algorithms.
In this flourishing era of Artificial Intelligence, machine learning algorithms are having an increasingly large impact on our daily lives. They are used extensively to power applications in domains including self-driving cars, weather forecasting, marketing, robotics, anomaly detection and many more.
A machine learning project can be broadly divided into five main phases: data collection, data preprocessing, model selection and setup, inference, and evaluation. Of these phases, data preprocessing is widely regarded as the most time-consuming and can account for about 80% of the whole project.
As machine learning has shown its importance over the last ten years, HPCC Systems, the end-to-end data lake management solution, has kept up to date by providing a fully-fledged machine learning library. It contains a wide range of machine learning algorithms, both supervised and unsupervised.
However, it currently lacks a data preprocessing package to help machine learning engineers speed up the data preprocessing phase of their projects and therefore enhance their productivity. They still have to write a lot of custom-made modules and functions.
To fill that gap, we implemented a Preprocessing Bundle for the HPCC Systems Machine Learning Library. The current version includes the following modules and functions:
- LabelEncoder and OneHotEncoder: modules to process categorical features
- StandardScaler and MinMaxScaler: modules for scaling data
- MLNormalize: a function for normalizing data
- Split, StratifiedSplit, RandomSplit and StratifiedRandomSplit: functions for easily splitting datasets into training and test data
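To make the ideas behind these modules concrete, here is a minimal sketch in plain Python. It is not the bundle's ECL API: the function names and signatures below are hypothetical and only illustrate the general techniques (label encoding, one-hot encoding, min-max and standard scaling, and a stratified random split). It assumes standard scaling uses the population standard deviation, which may differ from the bundle's choice, and it does not cover MLNormalize.

```python
import random
from collections import defaultdict


def label_encode(values):
    """Map each distinct category to an integer code (sorted order)."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Turn each category into a 0/1 vector with a single 1."""
    codes, mapping = label_encode(values)
    width = len(mapping)
    return [[1 if i == code else 0 for i in range(width)] for code in codes]


def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly into the range [low, high]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [low for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


def standard_scale(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:  # constant feature
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def stratified_random_split(rows, labels, train_fraction=0.75, seed=0):
    """Shuffle within each class, then split so class proportions are kept."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row, label in zip(rows, labels):
        by_label[label].append(row)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend((row, label) for row in group[:cut])
        test.extend((row, label) for row in group[cut:])
    return train, test


# Small usage example on toy data (hypothetical values).
colors = ["red", "green", "blue", "green"]
print(label_encode(colors))
print(one_hot_encode(colors))
print(min_max_scale([10, 20, 30]))
print(stratified_random_split(list(range(8)), ["a"] * 4 + ["b"] * 4))
```

A stratified split is useful when classes are imbalanced, because a purely random split can leave a rare class under-represented in the test data.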
The Preprocessing Bundle will be included in the HPCC Systems ML_CORE Library. It comes with a tutorial showing how its services can be used in an end-to-end machine learning project to speed up the data preprocessing phase.
In this Video Recording, Vannel provides a tour and explanation of his poster content.
Poster Title: Preprocessing Bundle for HPCC Systems Machine Learning Library