Predictive Model Markup Language (PMML) Processor

This project was completed by a student accepted on to the 2021 HPCC Systems Intern Program.

This project was a student work experience opportunity with HPCC Systems. Curious about other projects we are offering? Take a look at our Ideas List.

Find out about the HPCC Systems Summer Internship Program.

Project Description



PMML is the leading standard for representing statistical and data mining models. It makes machine learning models portable, so they can easily be shared between different applications and systems. In this project, the student is expected to prototype an ECL bundle that uses previously trained models stored as PMML to make predictions in parallel on an HPCC Systems cluster. The bundle should also be capable of packaging HPCC Systems machine learning models as PMML so that they can be fed into another system. The student will assess the approach and build a prototype package that parses PMML and renders the models, as appropriate.
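
To make the "parse PMML and form predictions" step concrete, here is a minimal single-process sketch in Python (one of the languages the project allows for embedding). It is not the project's implementation: the PMML document is a hand-written, hypothetical linear regression model, and the helper names `load_linear_model` and `predict` are illustrative assumptions. It only handles a `RegressionModel` with `NumericPredictor` terms of exponent 1, and it does not address parallel scoring on a cluster.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML document, written by hand for illustration only.
PMML_TEXT = """<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header/>
  <DataDictionary numberOfFields="3">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x2" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="x2"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.0">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# PMML elements live in a versioned namespace; this sketch assumes PMML 4.4.
NS = {"pmml": "http://www.dmg.org/PMML-4_4"}


def load_linear_model(pmml_text):
    """Extract (intercept, {field: coefficient}) from a PMML regression model."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    if table is None:
        raise ValueError("no RegressionModel/RegressionTable found")
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def predict(model, record):
    """Score one record (a dict of field name to value) with the linear model."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


if __name__ == "__main__":
    model = load_linear_model(PMML_TEXT)
    for row in [{"x1": 1.0, "x2": 2.0}, {"x1": 0.0, "x2": 4.0}]:
        print(row, predict(model, row))
```

Because each record is scored independently once the model has been parsed, the same scoring step could in principle be applied to every record of a distributed dataset, which is the property the bundle would rely on for parallel prediction.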

This project will be implemented in Enterprise Control Language (ECL) on HPCC Systems, a big data processing platform. The expected result is a PMML bundle with a user-friendly interface. Free online ECL training is available for students who have no previous ECL programming experience, and we recommend completing it before the internship starts for a better internship experience. The solution may embed existing open source C++, Java, or Python code if necessary. After 12 weeks of hard work, we would like to have your work summarized in a white paper. In previous internships, many project results have been turned into papers published in conferences and journals, including top venues. Although it is not required, we suggest that the student aim for this.



Mentor

Dan Camper
Contact Details

Backup Mentor: Roger Dev
Contact Details

Skills needed
  • Knowledge of PMML

  • Knowledge of distributed computing techniques

  • Knowledge of HPCC Systems

  • Knowledge of ECL

  • Familiarity with GitHub

Deliverables
  • Midterm

    • Implementation design

    • Test datasets ready

    • Complete 70% of the implementation

    • Code check-in on GitHub

    • Documentation

  • End of project

    • Complete 100% of the implementation

    • Performance testing

    • Documentation

    • Complete code check-in and code review on GitHub

    • White paper/Publication

Other resources

All pages in this wiki are subject to our site usage guidelines.