About Vannel Zuefack

Robert Kennedy is a PhD Candidate at Florida Atlantic University. He is completing a PhD in Computer Science and joined the HPCC Systems Intern program for the third year running in 2020. During his internships, Robert has focused on machine learning projects relating to neural networks and has been using HPCC Systems with TensorFlow.

Poster Abstract

In this flourishing era of Artificial Intelligence, Machine Learning algorithms are having an increasingly large impact on our daily lives. They are used extensively to power applications in various domains, including self-driving cars, weather forecasting, marketing, robotics, anomaly detection and many more.

A machine learning project can be broadly divided into five main phases: data collection, data preprocessing, model selection and setup, inference, and evaluation. Of these, data preprocessing is widely regarded as the most time-consuming phase and could account for about 80% of the whole project.

As machine learning has proven its importance over the last ten years, HPCC Systems, the end-to-end data lake management solution, has kept pace by providing a full-fledged machine learning library. It contains a wide range of machine learning algorithms, both supervised and unsupervised.

However, it currently lacks a data preprocessing package to help machine learning engineers speed up the data preprocessing phase of their projects and therefore enhance their productivity. They still have to write a lot of custom-made modules and functions. 

To fill that gap, we implemented a Preprocessing Bundle for the HPCC Systems Machine Learning Library. The current version includes the following modules and functions:

  • LabelEncoder and OneHotEncoder: modules to process categorical features 
  • StandardScaler and MinMaxScaler: modules for scaling data 
  • MLNormalize: a function for normalizing data 
  • Split, StratifiedSplit, RandomSplit and StratifiedRandomSplit: functions for easily splitting datasets into training and test data 
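
To make the ideas behind these modules concrete, here is a minimal Python sketch of the underlying techniques: label and one-hot encoding, min-max and standard scaling, and a stratified split. It is not the bundle's ECL API. The function names, parameters and defaults below are hypothetical, chosen only for illustration, and the bundle's own behavior may differ in its details.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def label_encode(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Turn each category into a 0/1 vector containing a single 1."""
    codes, mapping = label_encode(values)
    return [[1 if code == j else 0 for j in range(len(mapping))] for code in codes]


def min_max_scale(xs, low=0.0, high=1.0):
    """Rescale linearly so that min(xs) maps to low and max(xs) maps to high."""
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1.0  # constant column: avoid division by zero
    return [low + (x - lo) * (high - low) / span for x in xs]


def standard_scale(xs):
    """Center to mean 0 and scale by the population standard deviation."""
    mu = mean(xs)
    sigma = pstdev(xs) or 1.0  # constant column: avoid division by zero
    return [(x - mu) / sigma for x in xs]


def stratified_split(rows, labels, train_fraction=0.75, seed=0):
    """Split rows so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train, test = [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)  # randomize within each label group
        cut = round(len(idxs) * train_fraction)
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return [rows[i] for i in sorted(train)], [rows[i] for i in sorted(test)]


# Small demonstration on toy data (made up for this example).
colors = ["red", "green", "blue", "green"]
codes, mapping = label_encode(colors)   # codes == [2, 1, 0, 1]
one_hot = one_hot_encode(colors)        # first row == [0, 0, 1]
scaled = min_max_scale([2.0, 4.0, 6.0])  # [0.0, 0.5, 1.0]
standardized = standard_scale([2.0, 4.0, 6.0])

rows = list(range(12))
labels = ["a"] * 8 + ["b"] * 4
train_rows, test_rows = stratified_split(rows, labels)
```

With a 75% training fraction, the toy split keeps 6 of the 8 "a" rows and 3 of the 4 "b" rows for training, so the 2:1 class ratio is preserved in both parts. A plain random split gives no such guarantee on small or imbalanced datasets, which is the usual reason to stratify.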

The Preprocessing Bundle will be included in the HPCC Systems ML_CORE Library. It comes with a tutorial showcasing how its services can be used in an end-to-end machine learning project to speed up the data preprocessing phase.


In this Video Recording, Vannel provides a tour and explanation of his poster content.

Poster Title: Preprocessing Bundle for HPCC Systems Machine Learning Library

Click on the poster for a larger image. The original PDF version can be found here (Available for download).