Implementation of Generative Adversarial Networks in HPCC Systems using GNN Bundle
HPCC Systems, an open source cluster computing platform for big data analytics, includes a Generalized Neural Network (GNN) bundle with a wide variety of features that can be used for various neural network applications. To enhance the functionality of the bundle, this paper proposes the design and development of Generative Adversarial Networks (GANs) on the HPCC Systems platform using ECL, the declarative language on which HPCC Systems works. GANs have been developed on the HPCC Systems platform by defining the Generator and Discriminator models separately and training them in batches within the same epoch. To make sure that they train as adversaries, a weights transfer methodology was implemented. The MNIST dataset, which was used to test the proposed approach, provided satisfactory results. The results obtained were unique images very similar to those in the MNIST dataset, as expected.
R V College of Engineering
HSQL: A SQL-like language for HPCC Systems
The use of data is increasing substantially across many fields, and this has necessitated the use of distributed systems to consume and process Big Data.
Machine learning tends to benefit from the usage of Big Data, and the models generated from such techniques tend to be more effective.
There is a steep learning curve to handling Big Data, especially in distributed systems, where the task of data processing is split among various nodes in clusters. In the proposed work, a new SQL-like language, HSQL, was developed for HPCC Systems, an open source distributed systems solution, to help new users get used to its novel microservice-based architecture and the ECL language with which it primarily operates. Additionally, a program that translates HSQL-based programs to ECL was built. HSQL was made to be completely inter-compatible with ECL programs, and it provides a compact and easy-to-comprehend SQL-like syntax for performing general data analysis and creating machine learning models and visualizations, while allowing a modular structure for such programs.
North Carolina State University
The Making of an Agriculture Data Lake
Farm and crop data is abundant. It comes from machinery, satellite and drone images, commodities exchanges, and more. Proper decisions depend on in-depth understanding and analysis of the available data. Furthermore, the data come from many diverse sources. Consequently, farmers and farm managers are overwhelmed and ill-equipped to exploit the information.
In this poster we will demonstrate how we created a data lake for agriculture data. Several streams feed the lake with raw data, such as USDA data and mercantile exchange pricing data. The raw data from the streams is assembled and organized to derive information, such as production rates. Finally, all this data, information, and knowledge can be viewed in a custom web application which farmers can use to evaluate the profit of various crop planting scenarios in order to better manage their farms.
This project is installed on an HPCC Systems server (thanks to Dan Camper). Data from the streams are fetched daily using ECL cron jobs and sprayed into the lake as they enter the landing zone. Custom ECL code flattens the new and old data and processes it to update information and knowledge. Thanks to the high processing power of HPCC Systems, all of the above tasks for a large dataset of over 39M entries are completed within 12 minutes. The data is then indexed and published in the form of REST web services using WsECL.
A custom web application written in the Flutter framework presents this information in charts and tables. The data-viewing application retrieves information from the data lake using the REST interface in WsECL. The project exercises the breadth and depth of HPCC Systems.
Jack Fields - Community Choice Award
High School Student
American Heritage School of Boca/Delray, Florida
Using HPCC Systems GNN Bundle with TensorFlow to Train a Model to Find Known Faces Leveraging the Robotics API
With the growing need for school security, we hope to develop a robot that can help address this problem. The autonomous security robot at American Heritage School will be able to recognize known faces, use an RFID scanner to collect information, recognize license plates, and store and locate the data gathered from the RFID scanner and other systems. The security robot will have a customized introductory greeting, name recognition, and a schedule locator. A mounting system was designed for the cameras so that they can be repositioned and adjusted to best capture license plates. The drivetrain is composed of two sets of six pneumatic wheels. Each set of wheels is powered by a custom-built, dual-motor, single-speed gearbox. The gearboxes use a gear ratio of 21.21:1, allowing the robot to reach a top speed of roughly 7.57 feet per second. This intern project furthered the integration of HPCC Systems with the robotics project by using the HPCC Systems GNN bundle with TensorFlow. Using a database of student information, this project has been able to train a model to recognize known faces. This project also included upgrading the ROBOT API to work with the newest versions of ROS.
Jeff Mao - Best Poster - Use Case
High School Student
Lambert High School, Suwanee, Georgia, USA
HPCC Systems on Google Anthos
Google Anthos is an application management platform that manages multi-cloud and on-premises environments. It allows HPCC Systems to be managed across separate cloud platforms through one centralized command center. Google Anthos comes with a range of options, from config management (configurations through code) to service mesh (a microservice controller/manager). The main benefit Anthos provides for HPCC Systems is the ability to manage Kubernetes environments on any cloud. With Anthos, HPCC Systems has access to a common abstraction layer that manages deployment, upgrades, configurations, networking, and scaling.
Johny Chen Jy
Universidade Federal de Santa Catarina (UFSC)
Deltabase caching solutions: An exploratory analysis for increasing ROXIE query performance
In the big data processing and querying world, a relatively common approach to providing more accurate and up-to-date information to end users is to combine an Online Analytical Processing (OLAP) solution with an Online Transactional Processing (OLTP) solution. Whereas the OLAP component is responsible for performing most of the big data processing based on read-only data extracted from different sources, the OLTP component can be used to provide write access to the result data and complement the query results with real-time data.
Similarly, in big data parallel processing and querying environments supported by the HPCC (High Performance Computing Cluster) Systems platform, Roxie query results based on data that was first extracted, transformed, and loaded by the Thor cluster can be complemented with additional data coming from an external online database. This external component, frequently referred to as the Deltabase, is an OLTP database that can be used to provide real-time data and complement query results with data that may not yet have been processed by Thor. Despite the obvious benefit of providing more accurate and up-to-date information to end users, the Deltabase can become a bottleneck for the performance and availability of the entire querying system, either when it fails or because its OLTP nature is not optimized for data reading. In such contexts, a caching solution can become attractive.
Recently, NoSQL databases have been leveraged as a caching mechanism in hybrid OLAP/OLTP solutions. The cache prevents recently executed queries from being processed again by the OLTP system, which is usually slower than a NoSQL database at providing read access to stored data. By providing an additional component optimized for real-time data access and an additional layer of resilience against failures, the inclusion of a NoSQL database as a caching solution can potentially increase the performance and availability of hybrid big data processing and querying solutions.
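The caching idea described above can be sketched as a read-through cache placed in front of the Deltabase. This is only an illustration under stated assumptions, not the design evaluated in the study: the `ReadThroughCache` class, the `fetch_from_deltabase` function, and the time-to-live value are hypothetical, and an in-memory dictionary stands in for the NoSQL store.

```python
import time
from typing import Any, Callable, Dict, Hashable, List, Tuple


class ReadThroughCache:
    """Read-through cache with a per-entry time-to-live (TTL).

    A plain dict stands in for the NoSQL store (assumption for illustration).
    """

    def __init__(self, fetch: Callable[[Hashable], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # hypothetical call into the OLTP Deltabase
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._store: Dict[Hashable, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        # Miss or expired entry: query the slower OLTP system, then remember the answer.
        self.misses += 1
        value = self._fetch(key)
        self._store[key] = (now, value)
        return value


# Hypothetical stand-in for a Deltabase lookup; records every call it receives.
deltabase_calls: List[Hashable] = []


def fetch_from_deltabase(key: Hashable) -> Dict[str, Any]:
    deltabase_calls.append(key)
    return {"id": key, "status": "pending"}


cache = ReadThroughCache(fetch_from_deltabase, ttl_seconds=30.0)
first = cache.get(42)   # miss: reaches the Deltabase
second = cache.get(42)  # hit: served from the cache
```

A short TTL keeps real-time data fresher at the cost of more Deltabase reads, so the appropriate value depends on the workload.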
The overall objective of this in-progress study is to explore the usage of a NoSQL database as a potential caching solution for the Deltabase component of the HPCC Systems platform. To this end, an experimental approach will be used. Alternative database architectures and caching algorithms will be evaluated and compared in terms of both Roxie query response time and overall system availability.
Universidade Federal de Santa Catarina (UFSC)
A cross provider assessment for HPCC Systems and Container Orchestration
The advent of containers and their associated orchestration tools in recent years has fundamentally shifted how computational workloads are built and managed in distributed computing environments. Containers offer a consistent, lightweight runtime environment through OS-level virtualization, as well as low overhead to maintain and scale applications with high efficiency, while the management of containers is handled by container orchestrators. Container orchestration tools, such as Kubernetes, have a mechanism to launch and manage containers as clusters or pods, providing automation for running service containers. Orchestration, therefore, provides a flexible way of scaling services running inside a container that require load balancing, fault tolerance, and horizontal scaling.
However, not all distributed computing environments can be easily ported to the container orchestration paradigm. This migration can be a bigger challenge for data-intensive supercomputing technologies such as the HPCC (High Performance Computing Cluster) Systems platform. This is due to the batch queuing nature of most of these platforms, which make strict assumptions about data storage persistence and host-specific shared resources, such as: each node must securely maintain its own set of data and will be reading and writing to a single shared file system. For the HPCC Systems platform in particular, which historically relies on data locality, the migration toward the container orchestration paradigm can be especially challenging.
Despite this challenging scenario, and given the push toward containerization, advances have been made to some extent to make data-intensive platforms such as HPCC Systems available in containerized environments running in public clouds. How this new platform architecture behaves from a functionality and performance standpoint across different public cloud providers, and in comparison to the original bare-metal architecture, is still largely unknown.
The overall objective of this in-progress study is to explore the usage of the first HPCC Systems version with native support for containerization. To this end, a cross-provider experiment will be executed to compare overall HPCC Systems performance and functionality among Azure Kubernetes Service (AKS), Amazon's Elastic Kubernetes Service (EKS), and bare metal. A benchmark test suite will be used to measure data transformation performance. It is expected that this study will contribute to a better understanding of how the recently released HPCC Systems version with native support for containerization behaves in terms of performance and functionality, as well as provide insights for future developments.
New College of Florida
Applying HPCC Systems TextVector to SEC Filings
The length and technical detail of SEC filings make them largely inaccessible for most investors to read or analyze, and with the growing volume of data, these filings are getting longer and more numerous. There is therefore increasing interest in automating part of this task using natural language processing (NLP).
Matthias's poster compares and contrasts approaches to sentiment analysis to address this problem and evaluates the efficacy of NLP for such a task.
Nathan Halliday - Best Poster - Platform Enhancement
High School Student
Hills Road Sixth Form College, Cambridge, UK
The Parallel Workflow Engine
The ECL language is centred around high performance. HPCC Systems focuses on parallelism to enable highly optimised dataset operations.
The parallel workflow engine increases the scope of parallel processing from within activity graphs to the entire workflow. The goal is to make workunits faster but maintain the existing behaviour of the sequential engine.
During my project, I gradually extended the parallel engine to increase support for different ECL language constructs. Regression tests covering different workflow modes in combination ensure that the engine can process diverse queries.
One major challenge for the parallel engine was implementing condition items, since only one sub-branch of dependencies is executed by the engine. The engine also has the complex task of mimicking the sequential engine if the workflow fails.
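The scheduling idea above, including the condition-item behaviour, can be sketched as follows. This is a simplified illustration and not the HPCC Systems engine: the `Item` structure, the item numbers, and the wave-by-wave scheduling are assumptions made for the example, the dependency graph is assumed to be acyclic, and failure handling is not modelled.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Set


@dataclass
class Item:
    """A workflow item (hypothetical structure for illustration)."""
    action: Callable[[], Any]
    deps: List[int] = field(default_factory=list)
    # Condition items: deps[0] is the item that evaluates the condition,
    # and only one of the two branches below is ever executed.
    true_branch: Optional[int] = None
    false_branch: Optional[int] = None


def effective_deps(item: Item, results: Dict[int, Any]) -> List[int]:
    """Dependencies known so far; a condition item gains its chosen branch
    only after the condition expression has been evaluated."""
    if item.true_branch is None:
        return item.deps
    cond_id = item.deps[0]
    if cond_id not in results:
        return [cond_id]
    return [cond_id, item.true_branch if results[cond_id] else item.false_branch]


def find_ready(wfid: int, items: Dict[int, Item], results: Dict[int, Any],
               ready: Set[int]) -> None:
    """Collect unfinished items whose dependencies are all complete."""
    if wfid in results:
        return
    pending = [d for d in effective_deps(items[wfid], results) if d not in results]
    if not pending:
        ready.add(wfid)
    for dep in pending:
        find_ready(dep, items, results, ready)


def run_workflow(items: Dict[int, Item], root: int, max_workers: int = 4) -> Dict[int, Any]:
    results: Dict[int, Any] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while root not in results:
            ready: Set[int] = set()
            find_ready(root, items, results, ready)
            wave = sorted(ready)
            # Every item in a wave is independent of the others, so run them in parallel.
            for wfid, value in zip(wave, pool.map(lambda i: items[i].action(), wave)):
                results[wfid] = value
    return results


items = {
    1: Item(action=lambda: "loaded"),
    2: Item(action=lambda: True, deps=[1]),           # condition expression
    3: Item(action=lambda: "true branch", deps=[1]),
    4: Item(action=lambda: "false branch"),
    5: Item(action=lambda: "condition done", deps=[2], true_branch=3, false_branch=4),
    7: Item(action=lambda: "independent work"),
    6: Item(action=lambda: "output", deps=[5, 7]),     # root item
}
results = run_workflow(items, root=6)
```

In this example items 1 and 7 run in the same wave, and item 4 (the branch not taken) is never executed.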
The parallel workflow algorithm is planned to become the default in version 7.12. It benefits all ECL programmers, and the speedup is achieved without altering the language functionality. For production systems, money will be saved by providing the clusters with more work sooner. For cloud environments, additional resources can be added dynamically to maximise the benefits of the faster processing.
Robert Kennedy - Best Poster - Data Analytics
PhD in Computer Science
Florida Atlantic University
Distributed GPU Accelerated Neural Networks with GNN
The HPCC Systems platform leverages many commodity computers to perform high performance computing tasks. The underlying hardware traditionally provides only a CPU for the actual computations and communicates with other member computers via networking protocols. This approach has proven to be very effective for many demanding applications. However, training large neural networks, or Deep Learning, benefits most from hardware acceleration for the bulk of the computationally expensive tasks.
This poster presents the results of my Summer 2020 internship project, which expands HPCC Systems and its Generalized Neural Network (GNN) bundle by leveraging multiple GPUs that span a cluster. Using hardware acceleration with the GNN bundle allows the ECL machine learning developer to drastically reduce training time. Further, this work is limited neither to one GPU nor to one physical computer. It demonstrates that it is now possible to spread GNN computations over multiple GPUs, either within one machine or across multiple machines.
Masters of Computer Science
Kennesaw State University
Preprocessing Bundle for the HPCC Systems Machine Learning Library
In this flourishing era of Artificial Intelligence, Machine Learning algorithms are having an increasingly large impact on our daily lives. They are extensively used to power applications in various domains including self-driving cars, weather forecasting, marketing, robotics, anomaly detection, and many more.
A machine learning project can be broadly divided into five main phases: data collection, data preprocessing, model selection and setup, inference, and evaluation. Among these phases, data preprocessing is widely regarded as the most time-consuming and could account for about 80% of the whole project.
As machine learning has shown its importance over the last ten years, HPCC Systems, the end-to-end data lake management solution, has kept up to date by providing a fully-fledged machine learning library (https://hpccsystems.com/download/free-modules/machine-learning-library). It contains a wide range of machine learning algorithms, both supervised and unsupervised.
However, it currently lacks a data preprocessing package to help machine learning engineers speed up the data preprocessing phase of their projects and therefore enhance their productivity. They still have to write a lot of custom-made modules and functions.
To fill that gap, we implemented a Preprocessing Bundle for HPCC Systems Machine Learning Library. The current version includes the following modules and functions:
The Preprocessing Bundle will be included in the HPCC Systems ML_CORE Library. It comes with a tutorial showcasing how its services can be used in an end-to-end machine learning project to speed up the data preprocessing phase.
R V College of Engineering
Hybrid Density-based Adaptive clustering using Gaussian kernel and Grid Search
Density-based spatial clustering of applications with noise (DBSCAN) is a popular clustering algorithm that groups data points which are close together using two parameters: eps, the radius of the neighborhood around each point, and MinPts, the minimum number of points required within that neighborhood to form a dense region. However, the performance of DBSCAN degrades for datasets with clusters of varying density. The poster proposes the implementation of a novel distributed and adaptive DBSCAN algorithm on the HPCC Systems platform. The proposed approach uses techniques such as grid search and a Gaussian kernel to search for optimized values of the threshold density of clusters, thus eliminating the requirement for users to specify the parameters. Further, the experimental investigation suggests that the proposed ADBSCAN performs better than existing ADBSCAN implementations using k-dist and Gaussian kernels.
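For reference, the classic (non-adaptive) DBSCAN procedure with user-supplied eps and MinPts can be sketched as below. This is a minimal single-machine illustration, not the distributed or adaptive algorithm proposed in the poster; the function names and the sample points are assumptions made for the example.

```python
import math
from typing import List, Optional, Sequence, Tuple

NOISE = -1  # label for points that belong to no cluster


def region_query(points: Sequence[Tuple[float, ...]], idx: int, eps: float) -> List[int]:
    """Indices of all points within distance eps of points[idx] (including itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points: Sequence[Tuple[float, ...]], eps: float, min_pts: int) -> List[int]:
    """Return one cluster label per point; NOISE marks outliers."""
    labels: List[Optional[int]] = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # core point: keep growing the cluster
    return [NOISE if label is None else label for label in labels]


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The result depends directly on the chosen eps and MinPts, which is the sensitivity that adaptive variants aim to remove by selecting the density threshold automatically.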
With the rapid advancement in technology, data is being generated faster and in larger volumes. However, it is not the data itself that is valuable; the information conveyed by the data is of greater importance. Organizations can use this information for the betterment of society. Data mining is a broad field that deals with the extraction of information from raw data, and clustering is one of the tasks in this field. Clustering is one of the most common unsupervised learning techniques. It aims to partition the dataset into groups of similar elements such that each group is different from the other groups. Some of the applications of the DBSCAN algorithm are market research and analysis, image processing, and anomaly detection.
This study aims to implement an efficient, distributed, and adaptive DBSCAN (ADBSCAN) algorithm on HPCC Systems which first determines the threshold density for any given dataset, including datasets with variable density clusters, thus eliminating the need for users to specify the values. Further, the manuscript discusses other ADBSCAN implementations and compares the proposed approach with them using various open datasets.
Masters in Computer Science
Leveraging and Evaluating Kubernetes support of Microsoft Azure
Deployment of HPCC Systems to commercial clouds can be done in multiple ways depending on business needs. Lift-and-shift is one approach, which involves moving unchanged application infrastructure from on-premises to the cloud using virtual machines. Containerization of the application, with the aim of going cloud-native, is another. The recent push of HPCC Systems to go cloud-native involves a containerization strategy which provides a logical packaging mechanism in which HPCC resources are abstracted from the environment in which they run, with multiple containers running directly on top of the OS kernel.
This project leverages the new Kubernetes version of HPCC Systems for the cloud by targeting Microsoft Azure. It provides steps to provision and deploy a custom HPCC Systems cluster using Kubernetes and Helm charts, based on the initial guidance on setting up a containerized version of HPCC Systems in Azure [1]. The project also examines architectural differences between Kubernetes and virtual machine environments for HPCC Systems, evaluates the performance of running jobs in the Kubernetes environment with multiple cloud configuration options, and performs a cost analysis of deploying containerized HPCC Systems, in order to understand how HPCC runs on the cloud and to identify existing or potential challenges along with potential solutions. When applications are containerized, the underlying storage also changes, and it becomes important to assess how HPCC handles storage and persists data in the Kubernetes environment [2]. To this end, two storage options in Azure are examined, along with the tradeoffs and challenges of each option. The outcome of this project also provides developers and users of HPCC Systems with an added building block for leveraging the cloud-native approach for faster deployment and a clean separation of concerns.
[1] Setting up a Default HPCC Systems Cluster on Microsoft Azure Cloud Using HPCC Systems 7.8.x and Kubernetes, Jake Smith | HPCC Systems. https://hpccsystems.com/blog/default-azure-setup.
[2] Persisting Data in an HPCC Systems Cloud Native Environment, Gavin Halliday | HPCC Systems. https://hpccsystems.com/blog/persisting-data-cloud.