This project was completed by Sarthak Jain as part of the 2015 GSoC Program. Some of the new statistics have already been added to the HPCC Systems® Machine Learning Library and others will be available as part of the HPCC Systems 6.0.0 release in 2016. Machine learning statistics are important to the big data world: they provide a way to drill down into data using complex queries and produce meaningful results that help businesses maintain their competitive edge in the marketplace. The HPCC Systems® Machine Learning Library has been around for a while now and we are always looking for ways to improve it.
The statistics Sarthak added provide metrics that indicate the ‘goodness’ of the model created. He completed the tasks associated with these statistics in very good time.
So when one of our modelling groups asked for some additional statistics to be added, Sarthak agreed to add those too.
He added 3 stepwise functions to the same modules which find the best model by adding or taking away independent variables. A ‘goodness’ metric was also added to select which independent variables are added to or taken away from the model. The 3 functions he added were Forward, Backward and Bidirectional.
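To illustrate the general idea of stepwise selection, here is a minimal Python sketch of the forward variant. It is not the ECL code in the library. The helper names (`fit_sse`, `adjusted_r2`, `forward_select`) are hypothetical, and adjusted R-squared is used as a stand-in ‘goodness’ metric purely as an assumption, since the metric used in the library is not named here. The sketch also assumes the candidate variables are not perfectly collinear and that there are more observations than variables.

```python
def fit_sse(columns, y):
    """Least-squares fit with an intercept; returns the residual sum of squares."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented normal equations: (X^T X | X^T y)
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [v - f * w for v, w in zip(a[r], a[c])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def adjusted_r2(columns, y):
    """Assumed goodness metric: adjusted R-squared of the fitted model."""
    n, p = len(y), len(columns)
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - (fit_sse(columns, y) / (n - p - 1)) / (sst / (n - 1))


def forward_select(candidates, y):
    """Start from the intercept-only model and repeatedly add the variable
    that improves the metric most; stop when no addition improves it."""
    selected = []
    best = adjusted_r2([], y)
    while True:
        remaining = [name for name in candidates if name not in selected]
        scored = [(adjusted_r2([candidates[s] for s in selected + [name]], y), name)
                  for name in remaining]
        if not scored:
            break
        score, name = max(scored)
        if score <= best:
            break
        selected.append(name)
        best = score
    return selected, best


# Hypothetical data: y depends on x1; x2 is unrelated.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [5, 3, 6, 2, 7, 1, 8, 4]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05]
y = [2 * a + 1 + e for a, e in zip(x1, noise)]
chosen, score = forward_select({"x1": x1, "x2": x2}, y)
print(chosen, score)
```

A backward variant would start from the full model and remove variables instead, and a bidirectional variant would consider both additions and removals at each step.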
Sarthak has made a valuable contribution to our Machine Learning Library. It is of direct benefit to one of our own teams, and it also provides everyone who uses the Linear and Logistic Regression Modules with a solid set of statistics that give much better information about the models created.
Review Sarthak's commits for more information on his contribution to this project. Below are the details of the original project description.
We would like to add the following to the Linear Regression Module:
- T statistic and P Value for each beta
- Adjusted R-squared
- P Value for the F statistic
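As a rough illustration of the quantities listed above, the following Python sketch computes them for a simple (one variable) linear regression using the standard textbook formulas. It is not the ECL implementation; the function name `simple_linear_stats` and the sample data are hypothetical. The P values are left out because they require the t and F distribution functions, which would normally come from a statistics package such as R.

```python
import math


def simple_linear_stats(x, y):
    """Fit y = b0 + b1*x by ordinary least squares and return summary statistics."""
    n = len(x)
    p = 1  # number of independent variables
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x

    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual SS
    sst = sum((yi - mean_y) ** 2 for yi in y)                      # total SS
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

    mse = sse / (n - p - 1)                      # residual variance estimate
    se_b1 = math.sqrt(mse / sxx)                 # standard error of the slope
    se_b0 = math.sqrt(mse * (1 / n + mean_x ** 2 / sxx))
    f_stat = ((sst - sse) / p) / mse             # overall F statistic

    return {
        "b0": b0, "b1": b1,
        "t_b0": b0 / se_b0, "t_b1": b1 / se_b1,  # t statistic for each beta
        "r2": r2, "adj_r2": adj_r2,
        "f_stat": f_stat,
        "df": n - p - 1,                         # degrees of freedom for the P values
    }


stats = simple_linear_stats([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
for name, value in stats.items():
    print(name, value)
```

With a single independent variable the F statistic equals the square of the slope's t statistic, which gives a quick sanity check on the output.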
We would also like to add the following to the Logistic Regression Module:
- P Values
- Standard Error
- Confidence levels for each beta
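For a fitted logistic regression coefficient, these three quantities are commonly derived together using the Wald approach. The Python sketch below assumes the beta and its standard error are already known (the standard error is usually taken from the inverse of the information matrix of the fitted model), and it reads "confidence levels" as confidence intervals, which is an interpretation rather than something stated in the project description. The function name `wald_summary` and the example numbers are hypothetical.

```python
from statistics import NormalDist


def wald_summary(beta, std_err, confidence=0.95):
    """Return the z statistic, two-sided P value and confidence interval for one beta."""
    norm = NormalDist()  # standard normal distribution
    z = beta / std_err
    p_value = 2 * (1 - norm.cdf(abs(z)))
    z_crit = norm.inv_cdf(1 - (1 - confidence) / 2)
    interval = (beta - z_crit * std_err, beta + z_crit * std_err)
    return z, p_value, interval


# Hypothetical coefficient and standard error
z, p_value, (low, high) = wald_summary(beta=1.0, std_err=0.5)
print(z, p_value, low, high)
```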
The implementation of these statistics will be done completely in the ECL language, so a knowledge of ECL and distributed computing techniques is required. We will provide the formulae you need to get started with this project, but you need to be able to write simple test code in at least one other statistical language, preferably R. We have an ML repository on GitHub where you will check in the development code that you have completed. While knowledge of Git and GitHub would be useful, we will provide guidance and training on our preferred method of usage to help you work in a complementary way with others also working on the ML Libraries.
Completion of this project involves:
- Learning about the structure of the ML Linear and Logistic Regression modules
- Determining where the statistics should be added
- Adding code into the module for each statistic
- Creating test code that builds a model from an input training set and generates values for each newly developed statistic
- Creating test code in another statistical language that uses the same training set and generates the same values for each newly developed statistic
- Comparing the values produced by the 2 versions of test code. The values must match before you can consider the development of the new statistic to be complete. When this is true, you will submit changes to the ML repository for the test code and the module containing the statistics that you have developed.
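The comparison step can be sketched in Python as follows. This is only an illustration: the function name `compare_statistics`, the statistic names and the tolerances are assumptions. Floating point results from two different systems rarely agree to the last digit, so the sketch treats values as matching when they agree within a small tolerance rather than exactly.

```python
import math


def compare_statistics(ecl_values, reference_values, rel_tol=1e-6, abs_tol=1e-9):
    """Return the names of statistics that are missing from either side
    or whose two values differ by more than the tolerance."""
    mismatches = []
    for name in sorted(set(ecl_values) | set(reference_values)):
        a = ecl_values.get(name)
        b = reference_values.get(name)
        if a is None or b is None:
            mismatches.append(name)
        elif not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append(name)
    return mismatches


# Hypothetical values from the ECL test code and from the reference (for example R) test code
ecl = {"adj_r2": 0.9812345, "t_b1": 12.5}
ref = {"adj_r2": 0.9812346, "t_b1": 12.6}
print(compare_statistics(ecl, ref))
```

An empty list means every statistic matched within the chosen tolerance.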
By the GSoC mid-term review we would expect you to have completed one table in Microsoft Word for either the Linear or Logistic Regression statistics, comparing the values generated by the code for each statistic. You will also be expected to have created test code that generates those values (including the dataset used). Your GitHub pull requests for both the test code and the regression module containing the new statistics need to have been accepted by this point.
Backup mentor: John Holt