This project was completed as an intern opportunity with HPCC Systems in 2016. Curious about projects we are offering for future internships? Take a look at our Ideas List.
Find out about the HPCC Systems Summer Internship Program.
The project proposal application period for 2020 summer internships is now open. Please see our list of Available Projects. Contact the project mentor for more information and to discuss your ideas. You may suggest a project idea of your own, but it must leverage HPCC Systems in some way. Contact us for support from an HPCC Systems mentor with experience in your chosen project area.
The implementation can be done entirely in the ECL language, so only knowledge of ECL and distributed computing techniques is required. Knowledge of linear algebra will be helpful.
Completion of this project involves:
- Selection of the test data. The test data will be a collection of open data text documents. The collection must have an open data license or be completely free of copyright restrictions. The most important aspect of the collection is that you are familiar with its subjects, so that you can judge the effectiveness of your implementation. The test text collection should be composed of 1,000 to 10,000 documents.
- Development of the algorithm using ECL.
- Testing the algorithm for correctness and performance.
By the GSoC midterm review we would expect you to have:
- Written the ECL needed to process the text documents into a dataset of term vectors.
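To make the midterm deliverable concrete, here is a minimal sketch of what turning text documents into term vectors can look like. It is written in Python purely for illustration; the project itself is expected to be done in ECL. The function names (`tokenize`, `build_term_vectors`), the simple alphabetic tokenizer, and the use of raw term counts are all assumptions made for this example, not requirements of the project.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def build_term_vectors(documents):
    """Return (vocabulary, vectors); each vector holds raw term counts per document."""
    token_lists = [tokenize(doc) for doc in documents]
    # The vocabulary is the sorted set of all terms seen in the collection.
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    index = {term: i for i, term in enumerate(vocabulary)}
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vec = [0] * len(vocabulary)
        for term, n in counts.items():
            vec[index[term]] = n
        vectors.append(vec)
    return vocabulary, vectors

# Tiny hypothetical collection, used only to show the shape of the output.
docs = ["The cat sat on the mat.", "The dog sat."]
vocabulary, vectors = build_term_vectors(docs)
print(vocabulary)
print(vectors)
```

In a real collection of thousands of documents the vectors are very sparse, so a distributed implementation would more likely store one record per (document, term, count) than a dense list per document.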
Backup Mentor: Edin Muharemagic