
This project is available as a student work experience opportunity with HPCC Systems this summer. Curious about other projects we are offering? Take a look at our Ideas List.

Find out about the HPCC Systems Summer Internship Program.

The project proposal application period for 2020 summer internships is now open. Please see our list of Available Projects. Contact the project mentor for more information and to discuss your ideas. You may suggest a project idea of your own but it must leverage HPCC Systems in some way. Contact us for support from an HPCC Systems mentor with experience in your chosen project area.

Background

We face exponentially growing volumes of data collected from interconnected services, the Internet of Things (IoT sensor arrays, etc.) and many other sources. HPCC Systems is a powerful asset for processing vast quantities of data. Fortunately, a significant amount of this data is stored in character-based formats such as XML, CSV and plain text, which are highly compressible. It is usually compressed at source (for example, at the generation or collection point) with a widespread compression tool such as ZIP or GZIP to save storage space and transfer bandwidth. However, this creates a bottleneck at the point where large compressed input data is fed in, because the usual approach is to decompress the data first and then spray it. Decompression takes time and resources, bearing in mind that these files and datasets can be tens or hundreds of GB in size, or more.

Project Description
This project requires the creation of a plug-in or plug-in architecture for spraying from a ZIP/GZIP archive without decompressing the content into a local directory or HPCC Systems landing zone.
The project is divided into 4 sub-projects.
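
The core idea, reading records from a compressed archive without first writing the decompressed content to disk, can be sketched as below. This is only an illustration in Python using the standard `zlib` module, not the HPCC Systems implementation (which would be written in C++). The function name `iter_gzip_lines` and the `CHUNK_SIZE` value are hypothetical, and the sketch assumes a single-member GZIP stream with newline-terminated records.

```python
import gzip
import io
import zlib
from typing import BinaryIO, Iterator

CHUNK_SIZE = 64 * 1024  # hypothetical read size; tune for real workloads


def iter_gzip_lines(source: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield records from a gzip stream while reading it chunk by chunk.

    Only one compressed chunk and the current partial record are held in
    memory; nothing is written to a temporary file.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pending = b""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        pending += decompressor.decompress(chunk)
        # Everything before the last newline is complete; keep the rest.
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line + b"\n"
    pending += decompressor.flush()
    if not decompressor.eof:
        # The gzip trailer was never reached, so the input is incomplete.
        raise ValueError("gzip stream ended before its trailer (truncated input)")
    *complete, tail = pending.split(b"\n")
    for line in complete:
        yield line + b"\n"
    if tail:
        yield tail  # final record without a trailing newline


if __name__ == "__main__":
    archive = io.BytesIO(gzip.compress(b"id,name\n1,alpha\n2,bravo\n"))
    for record in iter_gzip_lines(archive, chunk_size=8):
        print(record)
```

A real plug-in would also need to decide how records are distributed to the cluster nodes, which this sketch does not address.
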

Sub-project 1: Designing (using UML)

  • Designing it to handle single or multiple data files and single or multiple archives, including the special case where a single data file spans multiple archives.
  • Designing error handling/fault tolerance
  • Considering worst case scenarios
  • Creating the architecture and interface documentation
  • Planning the necessary activities, connections and cooperation between the (sub-)systems
  • Creating the process sequence design among the (sub-)systems (when, who, what)
  • Creating unit, regression and performance test plans for all sub-projects.

Sub-project 2: Handling single compressed data files

  • Implementing the plan created in Sub-project 1.
  • Generating correct and bad single-file archives of different sizes to test the implementation.
  • Generating unit test case(s).
  • Generating regression test cases (in ECL).
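
As an illustration of the test data described above, the following sketch generates correct and deliberately broken single-file GZIP archives of different sizes. It uses Python's standard `gzip` module purely for demonstration; all function names and the record layout are hypothetical, and the two failure modes shown (truncation and a damaged checksum) are only examples of what "bad" content might mean.

```python
import gzip
import zlib


def make_payload(n_records: int) -> bytes:
    """Build a deterministic CSV-like payload with n_records lines."""
    return b"".join(b"%d,value-%d\n" % (i, i) for i in range(n_records))


def make_good_archive(n_records: int) -> bytes:
    """Return a valid single-file gzip archive."""
    return gzip.compress(make_payload(n_records))


def make_truncated_archive(n_records: int) -> bytes:
    """Drop the second half of a valid archive (e.g. an interrupted transfer)."""
    good = make_good_archive(n_records)
    return good[: len(good) // 2]


def make_corrupted_archive(n_records: int) -> bytes:
    """Damage the stored CRC32, which occupies bytes -8..-5 of the gzip trailer."""
    damaged = bytearray(make_good_archive(n_records))
    damaged[-8] ^= 0xFF
    return bytes(damaged)


def is_valid_gzip(data: bytes) -> bool:
    """Return True only if the whole archive decompresses and passes its checks."""
    try:
        gzip.decompress(data)
        return True
    except (OSError, EOFError, zlib.error):
        return False


if __name__ == "__main__":
    for size in (1, 1_000, 100_000):
        print(
            size,
            is_valid_gzip(make_good_archive(size)),
            is_valid_gzip(make_truncated_archive(size)),
            is_valid_gzip(make_corrupted_archive(size)),
        )
```

Because the payload is deterministic, the same archives can be regenerated for unit tests and for regression tests.
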

Sub-project 3: Handling multiple compressed data files in a single archive

  • Extending the result of Sub-project 2 to handle multiple files in one archive.
  • Generating correct and bad multi-file archives of different sizes to test the implementation.
  • Generating unit test case(s).
  • Generating regression test cases (in ECL).
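
Handling several data files inside one archive can be sketched as below, using Python's standard `zipfile` module as a stand-in for whatever library the real C++ implementation would use. The function name `iter_zip_records` is hypothetical. Note one design consequence: a ZIP archive keeps its central directory at the end of the file, so this approach assumes the archive is seekable rather than a forward-only stream.

```python
import io
import zipfile
from typing import BinaryIO, Iterator, Tuple


def iter_zip_records(archive: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (member_name, line) pairs for every file in a ZIP archive.

    Each member is decompressed incrementally as it is read; nothing is
    extracted to disk.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            # ZipFile.open returns a file-like object that decompresses on read.
            with zf.open(info) as member:
                for line in member:
                    yield info.filename, line


if __name__ == "__main__":
    # Build a small two-file archive in memory for demonstration.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("people.csv", "1,alpha\n2,bravo\n")
        zf.writestr("places.csv", "10,york\n")
    buffer.seek(0)
    for name, record in iter_zip_records(buffer):
        print(name, record)
```
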

Sub-project 4: Handling multiple archives

  • Extending the result of Sub-project 3 to handle multiple archives.
  • Generating correct and bad archive sets of different sizes to test the implementation.
  • Generating unit test case(s).
  • Generating regression test cases (in ECL).
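
The special case of one data file spread over several archives can be sketched as follows. This sketch assumes, purely for illustration, that each archive is an independently compressed GZIP file holding a consecutive slice of the data; it is not the spanned/split ZIP format, and the project design may choose a different layout. The names `iter_chunks` and `iter_records` are hypothetical. The point shown is that a record cut in two by an archive boundary has to be reassembled.

```python
import gzip
import io
from typing import BinaryIO, Iterable, Iterator


def iter_chunks(parts: Iterable[BinaryIO], chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield decompressed chunks from each gzip part, in the order given."""
    for part in parts:
        with gzip.GzipFile(fileobj=part, mode="rb") as stream:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                yield chunk


def iter_records(parts: Iterable[BinaryIO], chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield newline-terminated records, joining records that span two parts."""
    pending = b""
    for chunk in iter_chunks(parts, chunk_size):
        pending += chunk
        # The text after the last newline may continue in the next chunk
        # or in the next archive, so it is carried over.
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line + b"\n"
    if pending:
        yield pending  # final record without a trailing newline


if __name__ == "__main__":
    payload = b"alpha,1\nbravo,2\ncharlie,3\n"
    cut = 11  # falls in the middle of the "bravo" record
    archives = [
        io.BytesIO(gzip.compress(payload[:cut])),
        io.BytesIO(gzip.compress(payload[cut:])),
    ]
    for record in iter_records(archives):
        print(record)
```
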

Completion of this project involves:

  • Producing design documentation
  • Producing the source code for the implemented solutions and submitting pull requests to the HPCC Systems GitHub repository
  • Development of unit and regression test cases
  • From Sub-project 2 onwards, clean* unit and regression test reports (*clean: the code produced in this activity doesn't introduce new warnings and/or errors compared to the results of our Automated Overnight Build and Test system.)

By the mid term review we would expect you to have:

  • Completed sub-projects 1 and 2.
  • Produced and submitted pull requests for the source code and test cases which have been reviewed and accepted by the project mentor.
  • Produced proven regression test results.

Mentor

Attila Vamos
Contact Details

Backup Mentor: Gavin Halliday
Contact Details

Skills needed
  • Ability to code in C++.
  • Ability to build and test the HPCC system (guidance will be provided).
  • Ability to understand the architecture, functions and cooperation among the sub-systems.
  • Ability to write test code.
Deliverables

All deliverables should be stored in the HPCC Systems GitHub repository.

Acceptance criteria

  • Reviewed and accepted/merged pull requests for all documentation, source code, test cases and test data. (It is highly recommended to decompose every sub-project into separate tasks and create a pull request for each task.)
  • Clean test reports

Midterm

  • Completion of sub-projects 1 and 2.
  • Accepted pull requests for the source code and test cases.
  • Proven regression test results.

End of project

  • Completion of sub-projects 3 and 4.
  • Accepted pull requests for the source code and test cases.
  • Proven regression test results.
Other resources