This project was completed by Sarthak Jain as part of the 2015 GSoC Program. Some of the new statistics have already been added to the HPCC Systems® Machine Learning Library and others will be available as part of the HPCC Systems 6.0.0 release in 2016. Machine learning statistics matter in the big data world because they let users drill down into data with complex queries and produce meaningful results that help businesses maintain their competitive edge in the marketplace. The HPCC Systems® Machine Learning Library has been around for a while now and we are always looking for ways to improve it.

...

By the GSoC mid-term review, we expect you to have completed one table in Microsoft Word, for either the Linear or the Logistic Regression statistics, comparing the value of each statistic generated by the code with the same statistic calculated by R. You will also be expected to have created the test code that generates those values (including the dataset used). Your GitHub pull requests for both the test code and the regression module containing the new statistics need to have been accepted by this point.

Mentor

Tim Humphrey
Contact Details

Backup mentor: John Holt
Contact Details

Skills needed
  • Knowledge of ECL. Training manuals and online courses are available on the HPCC Systems website.
  • Knowledge of distributed computing techniques.
  • Ability to write simple test code in one other statistical language, preferably R.
  • Ability to use Git and GitHub to clone the ML repository and submit pull requests for completed development (guidance available).
Deliverables

Mid-term

  • One table done in Microsoft Word, for either Linear Regression or Logistic Regression. The table will compare the value of each statistic generated by the code with the same statistic calculated by R.
  • Test code that generates the above-mentioned values of each newly implemented statistic (along with the dataset used).
  • An accepted GitHub pull request for both the test code and the regression module containing the newly developed statistics.

Final

  • Two tables done in Microsoft Word, one for Linear Regression and one for Logistic Regression. Each table will compare the value of each statistic generated by the code with the same statistic calculated by R.
  • Test code that generates the above-mentioned values of each newly implemented statistic (along with the dataset used).
  • An accepted GitHub pull request for both the test code and the regression modules containing the newly developed statistics.
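
The comparison tables described in the deliverables could be produced along the following lines. This is only a minimal sketch in Python, not the project's ECL or R code: the function names `simple_linear_regression` and `compare_statistics`, the toy dataset, and the tolerance are all assumptions made for illustration. The reference values are worked out by hand for the toy dataset and stand in for the figures that would, in practice, be copied from R output.

```python
import math


def simple_linear_regression(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Hypothetical helper for illustration; returns a dict of statistics.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares give R-squared.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return {
        "intercept": intercept,
        "slope": slope,
        "r_squared": 1.0 - ss_res / ss_tot,
    }


def compare_statistics(computed, reference, rel_tol=1e-6):
    """Build one table row per statistic: (name, computed, reference, match)."""
    rows = []
    for name, ref_value in reference.items():
        value = computed[name]
        match = math.isclose(value, ref_value, rel_tol=rel_tol, abs_tol=1e-9)
        rows.append((name, value, ref_value, match))
    return rows


# Toy dataset (an assumption, not the dataset used in the project).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# Placeholder reference values, derived by hand for this dataset;
# in the real table these would come from R.
reference = {"intercept": 2.2, "slope": 0.6, "r_squared": 0.6}

stats = simple_linear_regression(xs, ys)
rows = compare_statistics(stats, reference)

for name, value, ref_value, match in rows:
    print(f"{name:<10} {value:>12.6f} {ref_value:>12.6f} {'OK' if match else 'DIFF'}")
```

Each printed row corresponds to one row of the Word table: the statistic's name, the value produced by the code under test, the reference value, and whether the two agree within the chosen tolerance.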
Other resources