Provide Unicode implementations for HPCC Systems standard library functions

This project was completed as an intern opportunity with HPCC Systems in 2019. Curious about projects we are offering for future internships? Take a look at our Ideas List.

Find out about the HPCC Systems Summer Internship Program.

Project Description

As part of the drive to make improvements in the way HPCC Systems handles unstructured text, we need to ensure that all standard library functions have Unicode implementations. While a number of functions already have Unicode implementations, the following need to be provided:

ExcludeFirstWord
ExcludeLastWord
ExcludeNthWord

For the functions below, the definition in Str.ecl can probably be copied to Uni.ecl – although see the note below about normalization. The main complication might be extending the test cases.

EndsWith (*)
StartsWith (*)
RemoveSuffix (*,**)

It is possible that most of the functions below will not need to call any special ICU functions. They are likely to be similar to the string implementations.

CountWords (aka count delimited tokens)
FindCount (*)
Repeat (**)
SplitWords
Translate (**)

The following are no longer required for this project:

FromHexPairs (no longer required)
ToHexPairs (no longer required)

Extra work - Optimize the way break iterators are created

The current String version can serve as a sufficient specification except for:

Functions marked (*) will need an additional optional parameter to indicate whether a normalization is to be performed and, if so, which one. The default will be to perform no normalization. The normalization will be one of: NFC, NFKC, NFD, or NFKD. The parameter will be a string whose value names the normalization form. See http://unicode.org/reports/tr15/ for the details concerning the normalizations. The ICU Normalizer2 class is to be used.
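To illustrate why the optional normalization parameter matters, here is a minimal Python sketch using the standard unicodedata module rather than ICU Normalizer2. The names ends_with and _prepare, and the use of an empty string to mean "no normalization", are assumptions made for this example only; they are not the HPCC Systems API.

```python
import unicodedata

# Hypothetical helper names for illustration; not the HPCC Systems API.
VALID_FORMS = ("NFC", "NFKC", "NFD", "NFKD")

def _prepare(text, form=""):
    """Return text unchanged when form is empty, otherwise normalize it."""
    if form == "":
        return text  # assumed default: no normalization is performed
    if form not in VALID_FORMS:
        raise ValueError("unknown normalization form: " + form)
    return unicodedata.normalize(form, text)

def ends_with(source, suffix, form=""):
    """Suffix test that optionally normalizes both arguments first."""
    return _prepare(source, form).endswith(_prepare(suffix, form))

precomposed = "caf\u00e9"   # 'e with acute' as one code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(ends_with(precomposed, "\u00e9"))          # True
print(ends_with(decomposed, "\u00e9"))           # False: forms differ
print(ends_with(decomposed, "\u00e9", "NFC"))    # True after normalizing
print(ends_with(precomposed, "e\u0301", "NFD"))  # True after normalizing
```

The second call is the kind of case a regression test would need to cover: the two strings render identically, but compare as different unless a normalization form is requested.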

You can assume the strings coming in are normalized, but the translation may result in an unnormalized string. Look at unicodeEnsureIsNormalized() in rtl/eclrtl/eclrtl.cpp and the linked reference above. The test cases will need to include examples where the normalization is required. There may be scope for another function which explicitly normalizes a Unicode string to a specific normal form.

Functions marked (**) must verify that unpaired surrogates cannot be created.

For Repeat(), if the input is a well-formed Unicode string then the output cannot contain any unpaired surrogates. For Translate, the function needs to make sure that it maps code points rather than UTF-16 code units, otherwise it would be possible to create unpaired surrogates.
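The difference between mapping code units and mapping code points can be sketched in Python as follows. All function names here are hypothetical and chosen for this example; the real implementation would be written in C++ against ICU. The sketch assumes a naive translate that pairs the search and replacement strings one 16-bit unit at a time.

```python
import struct

def to_utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def has_unpaired_surrogate(units):
    """True if any surrogate code unit is not part of a valid pair."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high surrogate: must be followed by a low one
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return True
            i += 2
            continue
        if 0xDC00 <= u <= 0xDFFF:  # low surrogate with no high surrogate before it
            return True
        i += 1
    return False

def translate_code_units(units, search_units, replace_units):
    """Unsafe variant: maps individual 16-bit code units."""
    table = dict(zip(search_units, replace_units))
    return [table.get(u, u) for u in units]

def translate_code_points(text, search, replace):
    """Safer variant: maps whole code points (Python str is code-point indexed)."""
    table = {ord(s): r for s, r in zip(search, replace)}
    return text.translate(table)

# U+1F600 is D83D DE00 in UTF-16; U+1F601 is D83D DE01 (same high surrogate).
bad = translate_code_units(
    to_utf16_units("\U0001F601"),
    to_utf16_units("\U0001F600"),  # search: two code units
    to_utf16_units("ab"),          # replacement: two code units
)
print([hex(u) for u in bad])        # ['0x61', '0xde01']
print(has_unpaired_surrogate(bad))  # True: the low surrogate lost its partner

good = translate_code_points("\U0001F601", "\U0001F600", "a")
print(has_unpaired_surrogate(to_utf16_units(good)))                # False
print(has_unpaired_surrogate(to_utf16_units("\U0001F600" * 3)))    # False (Repeat case)
```

In the unsafe variant, a character that merely shares a high surrogate with the search character is corrupted. Mapping by code point leaves it untouched, and repeating a well-formed string only ever concatenates complete pairs.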

Completion of this project involves:

C/C++ implementation of the function
Code usage examples to be added into the HPCC Systems regression suite
Documentation update to add the function name to the Standard Library Reference page for the function. In the 4 cases where a parameter is added, the parameter description table will be updated.
An accepted pull request for the above three deliverables for each function.

By the GSoC mid-term review we would expect you to have:

Accepted pull requests for 6 of the 13 functions listed above.

Mentor: John Holt (Contact Details). Backup Mentor: Gavin Halliday (Contact Details).
Skills needed: Ability to code in C++. Ability to build and test the HPCC system (guidance will be provided). Ability to write test code.
Deliverables: Accepted pull requests for each function. Test cases demonstrating the correct behaviour and performance. Documentation.
Other resources: HPCC Systems website. JIRA issue for this project: https://track.hpccsystems.com/browse/HPCC-12922. Learning ECL documentation and on-line training courses.
