
The proposal period for 2022 internships is now closed
The proposal period for 2023 internships will open in November 2022

This is a new project; more information is coming soon. If you are interested in this project, contact Lorraine Chapman.

Find out about the HPCC Systems Summer Internship Program.

Project Description

In order to eventually create digital human readers in the Kurdish language, a dictionary must be established. This project will use the Kurdish dictionary from Wiktionary.

Two Kurdish dialects, Zazaki and Hawrami, are considered endangered by UNESCO. The major Kurdish dialect, Kurmanji, is not among those UNESCO considers endangered, but it has experienced varying levels of state suppression in Turkey. If Kurds do nothing to preserve their mother tongue, it is not hard to imagine that Kurmanji will also be considered an endangered dialect in 10-15 years. But why is it important to preserve a language? Every language expresses a distinct worldview, with its own value systems, philosophy, and cultural characteristics. The extinction of a language results in the irreversible loss of unique cultural information that has been embedded in it for generations, including historical, spiritual, and ecological knowledge that may be critical for the survival of not just its speakers, but also countless others. When we lose a language, we lose the worldview, philosophy, culture, and knowledge of the people who spoke it, which is a loss to all humanity.

As languages are becoming extinct at an alarming rate, speakers of endangered languages are turning to technology in a race against time to pass on their unique languages and cultures to the next generation. Thanks to the benefits of artificial intelligence for language documentation and learning, NLP is becoming more important than ever in the fight to save endangered languages and has already been used to preserve them. For that purpose, I would like to propose a project for creating an NLP dictionary for the Kurdish language. The idea is to use the Wiktionary Kurdish dictionary and make it into the first open-source Kurdish computer dictionary.

First, this project will increase interest in the Kurdish language among Kurds, since Kurds become more motivated to learn and speak their mother tongue when they see work of this kind done to preserve it, including writing books and publishing magazines in Kurdish. As a result, the project will have a positive effect on interest in the Kurdish language and on its daily use. Secondly, the Kurdish language owes its existence to the volunteer work done by individuals and non-profit organizations over the last century. There is no doubt that this kind of work requires a high level of motivation and passion, since it brings no financial return to the organizations or individuals involved. One of the main sources of their motivation is seeing other projects implemented to preserve the language. Finally, this project would spark interest among others in computer science in the subsequent creation of a Kurdish parser.

At the end of the project, the dictionary will be ready to be used by programmers around the world. Programmers will be able to develop chatbots in Kurdish, and people will be able to practice with a chatbot to develop their Kurdish skills. Furthermore, the project can be taken to another level by developing a language generator such as GPT-3. The language generator would then be able to write any kind of article in Kurdish. With this technology, even if there is no one left in the world who speaks Kurdish, the language will still not be extinct.

Completion of this project involves:

  • Download the Kurdish dictionary from Wiktionary
  • Write an NLP++ parser to extract the vocabulary from the Wiktionary files into text files
  • Write an NLP++ parser to transform the text files into knowledge base files
  • Create Kurdish test files for part-of-speech tagging
  • Write an NLP++ part-of-speech tagger
  • Run the tests using the NLP++ Plugin in ECL to show enhancements
  • Create an NLP++ repository for the Kurdish dictionary and analyzers
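
The project itself calls for NLP++ parsers, so the following Python snippet is only a rough sketch of the data flow in the extraction step: taking dictionary pages and writing one line per headword with its part-of-speech labels. The sample pages, the `=== Noun ===` style headings, and the names `SAMPLE_PAGES`, `extract_pos`, and `to_lines` are all assumptions made for illustration; the actual Kurdish Wiktionary files may mark parts of speech differently.

```python
import re
from collections import defaultdict

# Hypothetical sample input: (headword, wikitext) pairs. The heading layout
# below is an assumption, not the confirmed Kurdish Wiktionary format.
SAMPLE_PAGES = [
    ("av", "== Kurdish ==\n=== Noun ===\n# water\n"),
    ("çûn", "== Kurdish ==\n=== Verb ===\n# to go\n"),
    ("sor", "== Kurdish ==\n=== Adjective ===\n# red\n"),
]

# Matches level-3 headings only, for example "=== Noun ===".
HEADING_L3 = re.compile(r"^===\s*([^=]+?)\s*===\s*$", re.MULTILINE)


def extract_pos(pages):
    """Map each headword to the level-3 heading labels found on its page."""
    vocab = defaultdict(list)
    for title, text in pages:
        for label in HEADING_L3.findall(text):
            if label not in vocab[title]:  # skip repeated labels
                vocab[title].append(label)
    return dict(vocab)


def to_lines(vocab):
    """Render the vocabulary as tab-separated lines, one headword per line."""
    return [f"{word}\t{','.join(labels)}" for word, labels in sorted(vocab.items())]


if __name__ == "__main__":
    for line in to_lines(extract_pos(SAMPLE_PAGES)):
        print(line)
```

Text files in a shape like this could then serve as input to the later steps (knowledge base files and part-of-speech test files), although the exact file layout expected by the NLP++ analyzers is not specified here.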

By the midterm review we would expect you to have:

  • <What must be completed to pass the evaluation and continue on to complete the project>
Mentor

David de Hilster
david.dehilster@lexisnexisrisk.com 

Backup Mentor: Add Backup Mentor Name
Add link to Email Address 

Skills needed
  • Keen interest in natural language
  • Ability to learn and program in NLP++
  • Ability to create test cases
  • Ability to write test code in ECL using the NLP++ plugin to test the enhanced dictionary
Deliverables

Midterm

  • Parts-of-speech text files

End of project

  • A Kurdish dictionary repository in the VisualText open-source GitHub, including the dictionary files and NLP++ analyzers
Other resources