Implement the Yinyang K-Means Clustering Algorithm

This project was completed by Lily Xu, a PhD student studying Computer Science at Clemson University. Lily joined the HPCC Systems intern program as a returning student (3rd time) in 2018. She has since completed her studies and has joined LexisNexis as an employee.

Find out more about the HPCC Summer Internship Program.
Curious about other projects we are offering? Take a look at our Ideas List.

Project Description

The Yingyang algorithm is an exact and thus 'drop-in' replacement for the standard k-means clustering measure. It boasts order of magnitude performance gains with proof of an exact and convergent result.

The key factor in its deliverance of such gains lies in the principle of conducting an initial clustering of centers that thus allow for certain subsequent distance computations to be avoided.

The abstract from the principle paper by Ding, Zhao, Shen, et al. follows:

"This paper presents Yinyang K-means, a new algorithm for K-means clustering. By clustering the centers in the initial stage, and leveraging efficiently maintained lower and upper bounds between a point and centers, it more effectively avoids unnecessary distance calculations than prior algorithms. It significantly outperforms prior K-means algorithms consistently across all experimented data sets, cluster numbers, and machine configurations. The consistent, superior performance—plus its simplicity, user-control of overheads, and guarantee in producing the same clustering results as the standard K-means does—makes Yinyang K-means a drop-in replacement of the classic K-means with an order of magnitude higher performance."

The implementation is to be coded completely in ECL and added to the HPCC-Systems Machine Learning Library. It is not essential for the candidate to possess prior knowledge or experience of ECL. However, it is expected that the candidate be proficient in the other necessary areas required to grasp the mathematics of the algorithm and parallel distributed computing.

To better gauge the scope of this project, regarding its size, detail, and general form, please review the following:

The existing k-means ECL implementation within the ML library can be found on GitHub here.
2015's addition of the CONCORD algorithm to the ML library found here and its accompanying test.

Completion of this project involves:

Implementation of the algorithm in ECL.
Testing the implementation for correctness and performance. This will involve comparing the new implementation, across a spectrum of data and boundary conditions, to those solutions given from the existing k-means functionality within the ML library. Testing for correctness is an obvious must, however; the majority of such testing is to ensure adequate performance gains are obtained.
Adequate documentation of the new functionality, its intended use and expected performance gains. This need not be a final form merged into our actual documentation however, it must be to a level that such a transition can be easily conducted by our documentation team.

By the start of the GSoC coding period we expect you to have used the Community Bonding Period to:

Review and complete all of the relevant ECL training material. For example, documentation and tutorials, video tutorials, and free online training classes.
Be proficient in building, installing, and running both the HPCC-Platform and the relevant integration of the ML library. Help can be found using platform wiki. Support will also be available from the project mentor.
Own an account on GitHub and adequate command of Git. In addition, you will be expected to have an account on our issue tracking system JIRA. Further information on general code contributions can be found here.
Setup a complete and working environment to start the project.

By the GSoC mid-term review we would expect you to have:

Acquired the skills and knowledge in using ECL, specifically in that of the existing k-means functionality as a basis, and to build and run test cases to aid in the actual implementation development.
The first code-complete and working draft should be complete by the time the mid-term evaluation is due. The remainder of the project should be spent reviewing, amending, and testing.

Mentor	John Holt Contact Details Backup Mentor: Edin Muharemagic Contact Details
Skills needed	Knowledge of distributed computing techniques Mathematics and algorithm design
Deliverables	Complete, correct, and working implementation Test code demonstrating the correctness and performance of the algorithm Supporting documentation
Other resources	HPCC Systems website JIRA issue for this project: https://track.hpccsystems.com/browse/ML-270 Learning ECL documentation and on-line training courses. Examples of existing code HPCC Systems Machine Learning documentation Wikipedia: K-means http://people.engr.ncsu.edu/xshen5/Publications/icml15.pdf Instructions for Students GSoC Student Manual

Project Description

Mentor

Skills needed

Deliverables

Other resources