Find out about the HPCC Systems Summer Internship Program.

Project Description

To eventually create digital human readers for Spanish, a dictionary must first be established. This project will use the Spanish dictionary from Wiktionary. One interesting aspect of this project is the Spanish verb system, which has a rich morphology.
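
To give a sense of what "rich morphology" means in practice, here is a minimal Python sketch that generates the six present-indicative forms of a regular Spanish verb. It is only an illustration: the project itself is to be written in NLP++, the function name `conjugate_present` is hypothetical, and irregular or stem-changing verbs (which a real dictionary must also cover) are not handled.

```python
from typing import List

# Present-indicative endings for regular verbs, keyed by infinitive ending.
# Order: yo, tú, él/ella, nosotros, vosotros, ellos/ellas.
PRESENT_ENDINGS = {
    "ar": ["o", "as", "a", "amos", "áis", "an"],
    "er": ["o", "es", "e", "emos", "éis", "en"],
    "ir": ["o", "es", "e", "imos", "ís", "en"],
}


def conjugate_present(infinitive: str) -> List[str]:
    """Return the six present-indicative forms of a REGULAR verb (hypothetical helper)."""
    stem, ending = infinitive[:-2], infinitive[-2:]
    if ending not in PRESENT_ENDINGS:
        raise ValueError(f"not a recognizable infinitive: {infinitive!r}")
    return [stem + suffix for suffix in PRESENT_ENDINGS[ending]]


if __name__ == "__main__":
    for verb in ("hablar", "comer", "vivir"):
        print(verb, "->", ", ".join(conjugate_present(verb)))
```

A single infinitive already yields six surface forms in one tense, and a full paradigm spans many tenses and moods, which is why the dictionary and the part-of-speech tagger need to relate inflected forms back to their lemma.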

If you are interested in this project, please contact David Dehilster.

Completion of this project involves:

  • Download the Spanish dictionary from Wiktionary
  • Write an NLP++ parser to extract the vocabulary from the Wiktionary files into text files
  • Write an NLP++ parser to transform the text files into knowledge base files
  • Create Spanish test files for part-of-speech tagging
  • Write an NLP++ part-of-speech tagger
  • Run the tests using the NLP++ Plugin in ECL to show enhancements
  • Create an NLP++ repository for the Spanish dictionary and analyzers
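
The extraction step in the list above can be pictured with a short Python sketch. This is not the NLP++ parser the project calls for; it only illustrates the idea, assuming the English Wiktionary convention that a language section starts with a level-2 heading such as `==Spanish==` and that part-of-speech headings sit at a deeper level beneath it. The `SAMPLE` entry is hand-written for illustration, and the function names and the tab-separated output format are hypothetical.

```python
import re
from typing import List

# Part-of-speech heading titles to look for (an illustrative subset, not exhaustive).
POS_NAMES = {
    "Noun", "Verb", "Adjective", "Adverb", "Pronoun",
    "Preposition", "Conjunction", "Interjection", "Determiner",
}

# A wikitext heading: the same run of '=' on both sides of the title.
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")

# Hand-written, simplified entry for "casa" (NOT copied from Wiktionary).
SAMPLE = """\
==Italian==
===Noun===
# house

==Spanish==
===Etymology 1===
====Noun====
# house

===Etymology 2===
====Verb====
# inflection of casar
"""


def spanish_parts_of_speech(wikitext: str) -> List[str]:
    """Collect part-of-speech headings found inside the Spanish section only."""
    in_spanish = False
    found: List[str] = []
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        if level == 2:
            # A level-2 heading opens a new language section.
            in_spanish = title == "Spanish"
        elif in_spanish and title in POS_NAMES and title not in found:
            found.append(title)
    return found


def to_text_lines(word: str, tags: List[str]) -> List[str]:
    """Format one 'word<TAB>part-of-speech' line per tag (hypothetical output format)."""
    return [f"{word}\t{tag}" for tag in tags]


if __name__ == "__main__":
    print("\n".join(to_text_lines("casa", spanish_parts_of_speech(SAMPLE))))
```

Text files in this kind of one-entry-per-line shape would then be the input for the second parser, which turns them into knowledge base files.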

By the midterm review, we would expect you to have:

  • More details coming soon

Mentor

David Dehilster

Skills needed

  • Keen interest in natural language
  • Ability to learn and program in NLP++
  • Ability to create test cases
  • Ability to write test code in ECL using the NLP++ plugin to test the enhanced dictionary

Deliverables

Midterm

  • Parts-of-speech text files

End of project

  • A Spanish dictionary repository in the VisualText open source GitHub, including the dictionary files and NLP++ analyzers

Other resources