The proposal period for 2022 internships is now closed
The proposal period for 2023 internships will open in November 2022
11 students joined our intern program in 2022. They presented their projects to the team during the year, and 9 of them entered our 2022 Poster Contest, held at our virtual HPCC Systems Community Day Summit in October 2022.
Meet the Class of 2022
High School Student
Stoneman Douglas High School, Florida
|Document Data Patterns||Data Patterns was an existing feature of HPCC Systems, but it had never been formally documented; usage information was spread across three separate files. The purpose of this project was to gather that information and consolidate it into a book accessible to users from the Documentation area of the HPCC Systems website.||Jim DeFabia|
PhD Human-Centered Computing
|Nepali Wiktionary Initiative and Translation||During the first initiative, we developed a parser and analyzer, NeWiktionary, which uses HPCC Systems to build a dictionary record structure from the Wiktionary data and so create a knowledge base for words. Ultimately, this dictionary will be used for NLP in Nepali using HPCC Systems. The second initiative focused on involving the community in building a better dictionary for the Nepali language.||David Dehilster|
Masters of Computer Science
North Carolina State University
|Applying Causality Toolkit to Real World Datasets|
This project focuses on the analysis of causality and causation-based inference. The main aim of the research is to understand the causal relationships between the factors in a real-world dataset by applying the Causality Toolkit developed by HPCC Systems.
Using a CDC dataset drawn from a health survey, I analyzed diabetes and the effect of several variables on the probability of diabetes. The Because module developed by HPCC Systems makes it possible to observe and analyze the cause and effect of each variable in the data.
Bachelor of Computer Science and Engineering
Local Outlier Factor algorithm for Anomaly detection in ECL
|Local Outlier Factor (LOF) is an unsupervised, density-based anomaly detection method that identifies anomalies without training. It assigns a degree of outlier-ness (called the local outlier factor) to each point in the dataset. LOF can find both global outliers (points that are outlying with respect to all the points in the dataset) and local outliers (points that are outlying with respect to their neighbors). Many other anomaly detection algorithms find global anomalies accurately but fail to identify local outliers because they assume the data is uniformly distributed. LOF is well suited to unevenly distributed datasets because it makes no assumptions about the distribution.||Lili Xu|
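The density-based scoring described above can be illustrated with a minimal sketch in Python. This is not the intern's ECL implementation: the function names are hypothetical, the points are assumed to be distinct, and each point uses exactly k neighbors (the full algorithm also keeps ties at the k-distance).

```python
import math

def k_nearest(points, i, k):
    """Return (distance, index) pairs for the k points closest to points[i]."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]

def local_outlier_factor(points, k=2):
    """Compute a simplified LOF score for every point (assumes distinct points)."""
    n = len(points)
    neighbours = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_distance = [nb[-1][0] for nb in neighbours]

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(p, o) = max(k-distance(o), dist(p, o)).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average ratio of the neighbours' density to the point's own density.
    return [
        sum(lrd[j] / lrd[i] for _, j in neighbours[i]) / len(neighbours[i])
        for i in range(n)
    ]

# Four points in a tight square plus one distant point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
for point, score in zip(sample, local_outlier_factor(sample, k=2)):
    print(point, round(score, 2))
```

Scores close to 1 indicate a point whose density matches that of its neighbors; scores well above 1 indicate an outlier relative to its neighborhood.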
2022 Community Day Presentation - Coming Soon
Bachelor of International Development
King's College, London
|Technology Marketing and Branding||Jessica Lorti|
|Jack Del Vecchio|
Bachelor of Computer Engineering
Miami of Ohio University
Interfacing MongoDB into ECL
This project documents the API that my plugin uses to communicate between MongoDB and HPCC Systems. The two systems have similar native data types, but there are some differences.
The plugin allows a wide variety of commands to be sent to the MongoDB database. MongoDB uses documents, which are essentially JSON objects, to pass commands to and from the server, which makes the plugin adaptable to almost any kind of operation a user would want to perform.
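To make the idea of commands as documents concrete, here is a small Python sketch that builds a MongoDB `find` database command as a plain dictionary and prints it as JSON. It does not use or represent the ECL plugin's API; the helper name, the collection name `customers`, and the field `state` are hypothetical.

```python
import json

def build_find_command(collection, filter_doc, limit=0):
    """Build a MongoDB `find` database command as a plain document (dict).

    The command name must be the first key; Python dicts keep insertion order.
    """
    command = {"find": collection, "filter": filter_doc}
    if limit > 0:
        command["limit"] = limit
    return command

# Hypothetical collection and field names, for illustration only.
command = build_find_command("customers", {"state": "FL"}, limit=10)
print(json.dumps(command))
```

Because the command is just a document, switching to a different operation means building a different document rather than calling a different interface.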
2022 Community Day Presentation - Coming Soon
Bachelor of Electrical Engineering and Computer Science
University of California, Berkeley
|NLP++ Dictionary for the Chinese Language|
The industry standard for Natural Language Processing is machine learning. The weakness of this approach is that the overwhelming amount of quality data needed for a specific application of NLP is a significant barrier to entry. Examples include applications involving medical, legal, and academic data.
The NLP++ programming language approaches NLP by building up a knowledge base of words, which is then used to create trees and relationships between words. This method requires a dictionary for the language to refer to. My task was to find and filter quality dictionary data for the Chinese language, implement a basic NLP++ analyzer with our dictionary, and integrate it into HPCC Systems.
|David Dehilster||Blog Journal|
Bachelor of Computer Science
University of Central Florida
|Provide test code for bundles with no self test|
The project’s goal was to integrate each HPCC Systems Machine Learning bundle into the Overnight Build and Test (OBT) system. This was done per bundle by adding a folder named ecl containing two components: test files and key files. The test files include unit tests that run individual functions under a set of different test cases.
Additional test files are modified versions of existing test code, adapted so they can be integrated into the OBT. Each key file contains XML-formatted text representing the correct result that the OBT checks for when running the corresponding test file.
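The key-file idea can be sketched in Python as a comparison between a test's XML output and the expected XML. This only illustrates the concept; the OBT's actual comparison may work differently, and the function name and sample XML below are hypothetical.

```python
import xml.etree.ElementTree as ET

def matches_key(actual_xml: str, key_xml: str) -> bool:
    """Return True when the test output equals the key, ignoring layout whitespace."""
    # Canonical XML (C14N 2.0) normalises quoting and attribute order;
    # strip_text=True also drops whitespace around text content.
    actual = ET.canonicalize(actual_xml, strip_text=True)
    expected = ET.canonicalize(key_xml, strip_text=True)
    return actual == expected

# Hypothetical key content and test output, for illustration only.
key = "<Dataset name='Result 1'><Row><Result_1>3</Result_1></Row></Dataset>"
output = """<Dataset name="Result 1">
  <Row><Result_1>3</Result_1></Row>
</Dataset>"""
print(matches_key(output, key))
```

Comparing canonical forms rather than raw text keeps a test from failing on indentation or quoting differences alone.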
High School Student
Lambert High School, GA
A comparative study of Neural network vs. Tree-based deep learning methods in the image classification of colorectal medical imagery diagnosis using HPCC Systems
|The goal of my research is a consistently accurate diagnosis, made possible by highlighting areas of interest to the physician (such as a polyp or an ulcer). This allows a conclusion to be reached several hours faster than with a traditional procedure, while remaining non-invasive. Accuracy is evaluated by comparing the two dominant ML models: Neural Networks and Random Forest. These two models were created using two of the HPCC Systems bundles: the Generalized Neural Network and Learning Trees.||Bob Foreman|
2022 Community Day Presentation - Coming Soon
Masters of Software Engineering
University of Oulu, Finland
|ECL Code Documentation Generator Improvements||The ECL Doc Generator is an important project in the HPCC Systems ecosystem because it generates the documentation for the Machine Learning bundles. The package's availability as a PyPI module was updated, as were the designs. The docs were rebranded with new logos, and new features were added, for example the addition of a readme, along with improvements to visibility and code quality.||Lili Xu|
Master of Data Science
|Causality Algorithm Development|
This project involved designing and developing test cases to compare different implementations of causality tasks, providing information about which are the most widely used and which produce the best results. Zheyu also assessed our own implementations, focusing on performance and adding more causality algorithms to our Causality Toolkit.
The Direct LiNGAM Conjecture can handle all categories except functions that cannot be fitted, which greatly expands the capability of direction detection.
Zheyu's work compares different methods using experimental results. He also integrated and tested RCoT, and implemented and tested the Machine Learning method on conditional expectation. All of this work has been merged into the HPCC Systems Causality Framework, "Because".