HPCC Systems Intern Program - Class of 2022

Find out more about the HPCC Systems Summer Intern Program, including how to apply, and read this blog introducing the students and their projects.

Eleven students joined our intern program in 2022. They presented their projects to the team during the year, and 9 of them entered our 2022 Poster Contest, held at our virtual HPCC Systems Community Day Summit in October 2022.

Meet the Class of 2022

Name

Project Title

Description

Mentor(s)

Resources

Amy Ma
High School Student
Stoneman Douglas High School, Florida

Document Data Patterns


Data Patterns was an existing feature of HPCC Systems, but it had never been formally documented. Usage information was spread across three separate files. The purpose of this project was to gather that information and consolidate it into a book accessible to users from the Documentation area of the HPCC Systems website.

Jim DeFabia

View Poster 

Blog Journal

Ananya Gupta
PhD Human-Centered Computing
Clemson University

Nepali Wiktionary Initiative and Translation


During the first initiative, we developed a parser and analyzer, NeWiktionary, which uses HPCC Systems to build a dictionary record structure from the Wiktionary data and so create a knowledge base for words. Ultimately, this dictionary will be used for NLP in Nepali using HPCC Systems. The second initiative focused on involving the community in building a better dictionary for the Nepali language.

David Dehilster

View Poster

Blog Journal

Arun Gaonkar
Masters of Computer Science
North Carolina State University

Applying Causality Toolkit to Real World Datasets


This project focuses on the analysis of causality and causation-based inference. The main aim of the research is to understand the causal relationships between the factors involved in a real-world dataset by applying the Causality Toolkit developed by HPCC Systems.

Using the CDC dataset, which includes details from a health survey, I analyzed diabetes and the effect of a few variables on the probability of diabetes. The Because module developed by HPCC Systems makes it possible to observe and analyze the cause and effect of each variable in the data.

Roger Dev

View Poster

Blog Journal

Arya Adesh
Bachelor of Computer Science and Engineering 
RVCE

Local Outlier Factor algorithm for Anomaly detection in ECL




Local Outlier Factor (LOF) is an unsupervised anomaly detection method that identifies anomalies without training. It is a density-based algorithm that assigns a degree of outlier-ness (called the local outlier factor) to each point in the dataset. LOF can find both global outliers (points that are outlying with respect to all the points in the dataset) and local outliers (points that are outlying with respect to their neighbors). Many other anomaly detection algorithms find global anomalies accurately, but they fail to identify local outliers because they assume the data is uniformly distributed. LOF is well suited to unevenly distributed datasets because it makes no assumptions about the distribution.

Lili Xu

View Poster

Blog Journal

2022 Community Day Presentation
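
The Local Outlier Factor idea described in the project above can be illustrated with a minimal brute-force sketch. This is not the intern's ECL implementation; it is a plain Python illustration under stated assumptions: the function name `local_outlier_factors` and the sample points are hypothetical, each neighborhood is truncated to exactly k points (ties at the k-th distance are ignored), and the data contains no more than k duplicate points.

```python
import math


def local_outlier_factors(points, k):
    """Return one LOF score per point (brute force, O(n^2) distances).

    Simplifying assumptions: exactly k neighbours per point (ties at the
    k-th distance are dropped) and no more than k identical points, so the
    reachability sums below are never zero.
    """
    n = len(points)
    # Pairwise Euclidean distances between all points.
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []   # indices of the k nearest neighbours of each point
    k_distance = []  # distance from each point to its k-th nearest neighbour
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance
    # from a point to its neighbours, where the reachability distance to a
    # neighbour j is max(k_distance[j], dist[i][j]).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbours divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical sample: a tight cluster of five points plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
          (10.0, 10.0)]
scores = local_outlier_factors(points, k=2)
for point, score in zip(points, scores):
    print(point, round(score, 2))
```

Scores close to 1 mean a point is about as dense as its neighbors, while scores well above 1 mark it as an outlier. Because each score is relative to the local neighborhood, no global assumption about the data distribution is needed.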

Elizabeth Lorti
Bachelor of International Development
King's College, London

Technology Marketing and Branding


This project required a different lens for understanding and evolving the HPCC Systems community. Social media and marketing strategies are crucial to ensuring that the company and platform increase engagement and expand their reach. This project focused on completing a competitive analysis of HPCC Systems against its competitors to better understand marketing strategies and how HPCC Systems can improve and stand out, updating and collecting collateral documents for review and rebranding, and implementing new social media strategies to improve engagement.

Jessica Lorti

View Poster

Blog Journal

Jack Del Vecchio
Bachelor of Computer Engineering
Miami of Ohio University

Interfacing MongoDB into ECL

This project provides details of the API used by my plugin to communicate between MongoDB and HPCC Systems. The two systems have similar native data types, but there are some differences.

The plugin allows a wide variety of commands to be sent to the MongoDB database. MongoDB uses documents, which are essentially JSON objects, to pass commands back and forth to the server, making it adaptable to almost any kind of operation a user would want to perform.

Dan Camper

View Poster

Blog Journal

2022 Community Day Presentation

Lucas Wang
Bachelor of Electrical Engineering and Computer Science
University of California, Berkeley

NLP++ Dictionary for the Chinese Language


The industry standard for Natural Language Processing lies with machine learning. The weakness of this approach is that the overwhelming amount of quality data needed for specific applications of NLP, such as medical, legal, and academic data, is a major barrier to entry.

The NLP++ programming language approaches NLP by building up a knowledge base of words, which is then used to create trees and relationships between words. This method requires a dictionary for the language to refer to. My task was to find and filter quality dictionary data for the Chinese language, implement a basic NLP++ analyzer with our dictionary, and integrate it into HPCC Systems.

David Dehilster

Blog Journal

Noah Seligson
Bachelor of Computer Science
University of Central Florida

Provide test code for bundles with no self test


The project’s goal was to integrate each HPCC Systems Machine Learning bundle into the Overnight Bundle and Test (OBT) system. This was done for each bundle by adding a folder titled ecl containing two components: test files and key files. The test files include unit tests that run individual functions under a set of different test cases.

Additional test files are modified versions of existing test code, included so that they can be integrated into the OBT. Each key file contains XML-formatted text representing the correct result that the OBT checks for when running the corresponding test file.

Attila Vamos

View Poster

Blog Journal

Sarvesh Prabhu
High School Student
Lambert High School, GA

A comparative study of Neural network vs. Tree-based deep learning methods in the image classification of colorectal medical imagery diagnosis using HPCC Systems

The goal of my research is to reach a consistently accurate diagnosis by highlighting the areas of interest to the physician (whether a polyp, ulcer, etc.), allowing a conclusion several hours sooner than a traditional procedure while remaining non-invasive. Accuracy is evaluated by comparing the two dominant ML models: Neural Networks and Random Forest. These two models were created using two of the expansive HPCC Systems bundles: Generalized Neural Network and Learning Trees.

Bob Foreman

View Poster

Blog Journal

2022 Community Day Presentation

Shivam Singhal
Masters of Software Engineering
University of Oulu, Finland

ECL Code Documentation Generator Improvements


The ECL Doc Generator is an important project in the HPCC Systems ecosystem because it generates the documentation for the Machine Learning bundles. The package's availability as a PyPI module was updated, as were the designs. The docs were rebranded with new logos, and new features were added, for example README inclusion, along with improvements to visibility and code quality.

Lili Xu

View Poster

Blog Journal

Zheyu Shen
Master of Data Science
Columbia University

Causality Algorithm Development


This project involved designing and developing test cases to compare how different implementations perform causality tasks, in order to show which are the most widely used and which produce the best results. Zheyu also assessed our own implementations, focusing on performance, and added more causality algorithms to our Causality Toolkit.

The Direct LiNGAM Conjecture can deal with all categories except functions that cannot be fitted, which greatly expands the capability of direction detection.

Zheyu's work compares different methods using experimental results. He also integrated and tested RCoT, and implemented and tested the Machine Learning method for conditional expectation. All have been merged into the HPCC Systems Causality Framework, "Because".

Roger Dev

View Poster

Blog Journal
