This project is available as a student work experience opportunity with HPCC Systems this summer. Curious about other projects we are currently offering? Take a look at our Ideas List.
Find out about the HPCC Systems Summer Internship Program.
Gradient boosting can be used to improve the classification performance of decision trees. See https://en.wikipedia.org/wiki/Gradient_boosting for a general discussion. Gradient boosting decision trees can also be used for regression (predicting a continuous value). The objective of this project is to implement an ECL Bundle for the Machine Learning Library that employs gradient boosted decision trees.
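To make the idea concrete, here is a minimal sketch of gradient boosting for regression with a least squares loss. It is written in plain Python purely for illustration and is not the ECL bundle this project asks for. It assumes depth-1 trees (stumps) as the weak learner, and all function names (`fit_stump`, `fit_gbt`, `predict_gbt`) are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree to the residuals.

    Returns (feature, threshold, left_value, right_value),
    or None if no feature has two distinct values to split between.
    """
    n = len(xs)
    total = sum(residuals)
    best = None
    for j in range(len(xs[0])):
        order = sorted(range(n), key=lambda i: xs[i][j])
        left_sum = 0.0
        for pos in range(1, n):
            prev, cur = order[pos - 1], order[pos]
            left_sum += residuals[prev]
            if xs[prev][j] == xs[cur][j]:
                continue  # cannot split between equal feature values
            right_sum = total - left_sum
            # Maximising this score is equivalent to minimising squared error
            score = left_sum ** 2 / pos + right_sum ** 2 / (n - pos)
            if best is None or score > best[0]:
                threshold = (xs[prev][j] + xs[cur][j]) / 2
                best = (score, j, threshold,
                        left_sum / pos, right_sum / (n - pos))
    return None if best is None else best[1:]


def stump_predict(stump, x):
    feature, threshold, left_value, right_value = stump
    return left_value if x[feature] <= threshold else right_value


def fit_gbt(xs, ys, n_trees=50, learning_rate=0.1):
    """Boost stumps on the residuals, which are the negative
    gradient of the least squares loss."""
    init = sum(ys) / len(ys)  # start from the mean prediction
    preds = [init] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        if stump is None:
            break
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return init, learning_rate, stumps


def predict_gbt(model, x):
    init, learning_rate, stumps = model
    return init + learning_rate * sum(stump_predict(s, x) for s in stumps)
```

Each round fits a new tree to the residuals left by the trees before it, and the learning rate shrinks each tree's contribution. A real implementation would use deeper trees and the other loss functions listed for this project.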
The loss functions to be supported are: Least Squares; Least Absolute Deviation; Binomial Deviance; Multinomial Deviance.
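As a reference point, the four losses and their negative gradients (the pseudo-residuals that each new tree is fitted to) can be sketched in Python as follows. These are the common textbook forms. The function names are hypothetical, and a particular library may scale a loss by a constant factor, so values should be checked against the benchmark implementation before comparing numbers.

```python
import math


def least_squares(y, f):
    """Loss 0.5 * (y - f)^2; negative gradient is the residual y - f."""
    return 0.5 * (y - f) ** 2, y - f


def least_absolute_deviation(y, f):
    """Loss |y - f|; negative gradient is sign(y - f)."""
    r = y - f
    return abs(r), (r > 0) - (r < 0)


def binomial_deviance(y, f):
    """y is 0 or 1 and f is the predicted log-odds.

    Loss is the negative log-likelihood log(1 + exp(f)) - y * f;
    the negative gradient is y - sigmoid(f).
    Note: math.exp(f) overflows for very large f; a production
    version would need a numerically stable form.
    """
    p = 1.0 / (1.0 + math.exp(-f))
    return math.log1p(math.exp(f)) - y * f, y - p


def multinomial_deviance(k, scores):
    """k is the true class index and scores holds one raw score per class.

    Loss is -log(softmax(scores)[k]); the negative gradient for
    class j is 1[j == k] - p_j.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    grads = [(1.0 if j == k else 0.0) - p for j, p in enumerate(probs)]
    return -math.log(probs[k]), grads
```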
The Python scikit-learn implementation will be used for benchmarking and QA.
The Machine Learning Library employs three performance profiles: correct without any processing speed guarantees; suitable for concurrent processing of many small problems; suitable for a large problem. The exact boundary between small and large varies between problem domains. The general difference, though, is that a small problem can be solved on a single machine in the required time frame, whereas a large problem requires more resources than are available from a single machine.
Step 0: Install the HPCC Platform on your system (if Linux) or on a virtual machine running Linux.
Step 1: Test data generation. Generate test data suitable for classification and test data suitable for regression. The classification data will come in a binary version and a multinomial (5 levels) version. Each test dataset should be 100K records in size.
Step 2: Use the platform's Python support to build classification and regression models with the Gradient Boosting Classifier and the Gradient Boosting Regressor found in scikit-learn.
Step 3: Implement an ECL version of gradient boosted trees for classification and validate results against the Gradient Boosting Classifier.
Step 4: Implement an ECL version of gradient boosted trees for regression and validate results against the Gradient Boosting Regressor.
Step 5: Perform processing speed tests for the Myriad performance profile.
Step 6: Determine an approach that leverages a shared nothing cluster, and test an implementation of that approach.
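For Step 1, the test data could be generated along the following lines. This is only a sketch using the Python standard library. The function names, the number of features, and the way the target and class centres are constructed are all assumptions, not project requirements.

```python
import math
import random


def gen_regression(n, n_features=5, noise=0.1, seed=0):
    """Return n (features, target) pairs with a mildly non-linear target."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1, 1) for _ in range(n_features)]
    rows = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(n_features)]
        y = (math.sin(x[0])
             + sum(w * v for w, v in zip(weights, x))
             + rng.gauss(0, noise))
        rows.append((x, y))
    return rows


def gen_classification(n, n_classes=2, n_features=5, seed=0):
    """Return n (features, label) pairs with balanced classes.

    Each class has a random centre; a record is its class centre
    plus unit Gaussian noise.
    """
    rng = random.Random(seed)
    centres = [[rng.uniform(-3, 3) for _ in range(n_features)]
               for _ in range(n_classes)]
    rows = []
    for i in range(n):
        label = i % n_classes
        x = [c + rng.gauss(0, 1) for c in centres[label]]
        rows.append((x, label))
    rng.shuffle(rows)
    return rows
```

Calling `gen_classification(100_000, n_classes=2)` and `gen_classification(100_000, n_classes=5)` would give the binary and multinomial datasets, and `gen_regression(100_000)` the regression dataset. Fixing the seed keeps the data reproducible, so the same records can be fed to both the benchmark and the ECL implementation.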
Code will be committed to GitHub at least weekly.
By the mid-term evaluation, we would expect you to have completed the following:
Steps 0, 1, and 2 will be completed and demonstrated (via my WebEx). The binomial classifier will be running and validated against the Gradient Boosting Classifier.
Please add details below including the JIRA ticket details: