Text Search Bundle - Implementation of Equivalence Terms

This project was completed by Farah Al Shanik, a PhD student studying Computer Science at Clemson University. Farah joined the HPCC Systems intern program in 2018.

This project has to account for many kinds of variants: matching plural and singular forms, language variants, punctuation in acronyms and initials, and alternative spellings such as color with and without the ‘u’.
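
As a rough illustration of the kinds of variants listed above, the following Python sketch maps a term to a canonical equivalence key so that simple variants compare equal. It is not the project's ECL implementation. The function name `equivalence_key`, the tiny `SPELLING_VARIANTS` table, and the naive plural rule are all assumptions made for illustration only.

```python
import re

# Hypothetical table of alternative spellings; a real system would
# load a much larger list from a resource file.
SPELLING_VARIANTS = {"colour": "color", "centre": "center"}


def equivalence_key(term: str) -> str:
    """Map a term to a canonical key so that simple variants compare equal."""
    key = term.lower()

    # Initialisms written with periods: "U.S.A." or "U.S.A" -> "usa".
    if re.fullmatch(r"[a-z](?:\.[a-z])+\.?", key):
        return key.replace(".", "")

    # Strip punctuation that clings to the start or end of a word.
    key = key.strip(".,;:!?\"'")

    # Alternative spellings: "colour" -> "color".
    key = SPELLING_VARIANTS.get(key, key)

    # Naive plural handling (an assumption, not real stemming):
    # drop a trailing "s" from longer words that do not end in "ss".
    if len(key) > 3 and key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
        key = SPELLING_VARIANTS.get(key, key)

    return key
```

With this sketch, "U.S.A." and "USA" share a key, as do "Colours" and "color". Real language equivalence needs a proper stemmer and curated variant lists; the point here is only that each variant class becomes a normalization step applied to both indexed terms and search terms.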

Find out about the HPCC Systems Summer Internship Program.

Project Description

There is a detailed description of the work in the JIRA issue TS1, which includes an attachment to the Open Source Text Search document. This JIRA issue also details a series of sub-tasks describing the work.

There is a preliminary collection of ECL attributes that were drawn from several earlier proprietary text search applications. The intent is to provide a framework for building generally useful text search applications that search XML text documents.

The sub-projects are:

  1. Initial build version.  Build the inversion datasets.

  2. Initial search version.  Search the initial inversions.

  3. Regression tests.  Regressions for search request parsing, inversion builds, and search resolution.

  4. Document add, replace, and delete.  Attributes to maintain the inversion.

  5. Slice Rollup.  Automation to roll up the incremental data.

  6. Wildcard processing.  Alter the wildcard processing to work with large numbers of terms that match a pattern.

  7. Retrieval application.  An application to retrieve documents from the search resolve hit lists.

  8. Equivalence terms.  Language equivalence (like stemming) and ad hoc phrase equivalencing.
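
To make the vocabulary in the list above concrete, here is a minimal Python sketch of an inversion (inverted index), wildcard expansion against its terms, and a search that resolves to a hit list of document ids. The actual framework is written in ECL and works on distributed datasets; the function names and the in-memory dictionary used here are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def build_inversion(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    inversion = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            inversion[term].add(doc_id)
    return inversion


def expand_wildcard(inversion, pattern):
    """Return the indexed terms that match a pattern such as 'colo*'."""
    return sorted(t for t in inversion if fnmatchcase(t, pattern))


def search_all(inversion, patterns):
    """Return ids of documents that match every pattern.

    Each pattern is expanded to its matching terms (an OR over those
    terms), and the per-pattern hit sets are intersected (an AND).
    """
    result = None
    for pattern in patterns:
        hits = set()
        for term in expand_wildcard(inversion, pattern.lower()):
            hits |= inversion[term]
        result = hits if result is None else result & hits
    return result or set()


docs = {1: "the color red", 2: "the colour blue", 3: "red and blue"}
inversion = build_inversion(docs)
print(expand_wildcard(inversion, "colo*"))
print(search_all(inversion, ["colo*", "red"]))
```

Scanning every indexed term for each wildcard, as this sketch does, is the kind of approach that stops scaling when a pattern matches a large number of terms, which is the concern behind the wildcard processing sub-project.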

There is enough work that it is unlikely that a single intern would be able to complete all of the sub-projects in a single period.  

Completion of this project involves:

Code checkin will be done weekly, and the commit will be pushed.  The developer can determine whether to amend a single commit or to provide a sequence of weekly commits.

Each sub-project will be done in sequence, and each sub-project will have a separate pull request.

The attribute exports intended to be used by an application developer using the framework will be documented using Javadoc-style comments.

By the midterm review we would expect you to have completed:

  1. Initial build version: See https://track.hpccsystems.com/browse/TS-2

  2. Initial search version: See https://track.hpccsystems.com/browse/TS-3

  3. Regression tests: See https://track.hpccsystems.com/browse/TS-4

Mentor

John Holt
Contact details

Backup Mentor: Roger Dev
Contact Details 

Skills needed
  • Ability to code in ECL.

  • Knowledge of regular expression parsing.

  • Ability to build and test the HPCC system (guidance will be provided).

  • Ability to write test code.

Deliverables
  • Improve an algorithm to handle initialisms with punctuation in the search request and equivalence terms for state names

  • Include equivalences mined from Moby Thesaurus

  • Improve an algorithm to find the synonyms of the terms that appear in the search request.

  • Test cases demonstrating the correct behaviour and performance

  • Documentation
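
The synonym-related deliverables can be pictured with a small Python sketch that loads thesaurus entries and expands each search term with its listed synonyms. This assumes a comma-separated layout with the root word first on each line, which is how the Moby Thesaurus file is commonly distributed; that layout should be verified against the actual file. The function names are hypothetical and this is not the ECL code delivered by the project.

```python
def load_thesaurus(lines):
    """Parse lines of the form 'root,syn1,syn2,...' into a dict of sets."""
    thesaurus = {}
    for line in lines:
        words = [w.strip().lower() for w in line.split(",") if w.strip()]
        if words:
            # Merge entries if the same root word appears more than once.
            thesaurus.setdefault(words[0], set()).update(words[1:])
    return thesaurus


def expand_query(terms, thesaurus):
    """Return each query term together with its listed synonyms."""
    return {t: {t.lower()} | thesaurus.get(t.lower(), set()) for t in terms}
```

Each expanded set can then be treated as a group of equivalent terms when the search request is resolved, so a document containing any member of the set counts as a hit for the original term.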

Other resources
