Archives

Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki

One of the major challenges that under-represented and endangered language communities face in language technology is the lack or paucity of language data. This is also the case for the Southern varieties of Kurdish and for Laki, for which very limited resources are available and little progress has been made in tools. To tackle this, we present several approaches that rely on the content of local news websites and of a local radio station that broadcasts in Southern Kurdish, and on fieldwork for Laki. In this paper, we describe some of the challenges facing such under-represented languages, particularly in writing and standardization, as well as in retrieving sources of data and retro-digitizing handwritten content to create a corpus for Southern Kurdish and Laki. In addition, we study the task of language identification in light of the other variants of Kurdish and the Zaza-Gorani languages.


The Dialects of Kurdish

The project aims to provide a comparative structural and typological survey of the dialect continuum of Kurdish, covering sample locations across the major Kurdish-speaking regions, from the eastern Anatolian regions of Turkey through northern Syria and Iraq to north-eastern Iran. The varieties covered are primarily those known as Kurmanji-Bahdini (Northern Kurdish) and Sorani (Central Kurdish), with some limited coverage of varieties belonging to the group known as Southern Kurdish.

The survey covers selected structures in lexicon, phonology and lexical phonology, morphology, and morpho-syntax, with a strong focus on the interaction of morphological alignment with verb semantics.

The data obtained through the survey’s questionnaire elicitation are presented in a Database that can be searched by location, structural tag, English translation of the elicitation phrase, and Kurdish word forms. For more information on the elicitation method, please consult the Pilot and extended survey page.
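As a rough illustration of the four search dimensions named above, the following minimal sketch filters a handful of records by any combination of location, structural tag, English translation, and word form. It is not the project's actual Database or schema: the `Entry` class, the `search` function, and all sample records (including the placeholder word forms) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    """One elicited item; field names are assumptions, not the project's schema."""
    location: str   # sample location of the speaker
    tag: str        # structural tag
    english: str    # English translation of the elicitation phrase
    form: str       # Kurdish word form (placeholders only in this sketch)


def search(entries: List[Entry],
           location: Optional[str] = None,
           tag: Optional[str] = None,
           english: Optional[str] = None,
           form: Optional[str] = None) -> List[Entry]:
    """Return the entries that match every criterion that was supplied.

    Matching is a case-insensitive substring test; a criterion left as None
    is ignored.
    """
    def matches(value: str, query: Optional[str]) -> bool:
        return query is None or query.lower() in value.lower()

    return [
        e for e in entries
        if matches(e.location, location)
        and matches(e.tag, tag)
        and matches(e.english, english)
        and matches(e.form, form)
    ]


# Hypothetical placeholder records; they contain no real survey data.
ENTRIES = [
    Entry("Location A", "alignment", "I saw the man", "form-1"),
    Entry("Location B", "alignment", "I saw the man", "form-2"),
    Entry("Location A", "lexicon", "water", "form-3"),
]
```

For example, `search(ENTRIES, tag="alignment", location="Location B")` returns only the second placeholder record, while calling `search(ENTRIES)` with no criteria returns all of them.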

A collection of Maps presents the geographical distribution of selected variants extracted from the questionnaire database.

A set of Free Speech Samples is presented in the form of audio files accompanied by a transliteration and an English translation, and linked to the Database entries that document the results of questionnaire elicitation with the same speakers. These are short samples, typically around 5 minutes long, extracted from longer recordings of connected speech. The topics covered include biographical narration about village life, customs and traditions, and local history, as well as traditional tales. Together they provide a rich resource, so far unparalleled online, documenting Kurdish cultural traditions as presented by ordinary people from across the Kurdish-speaking regions in their own local dialects.

The academic evaluation of the project data is currently underway (2017), led by the project’s Principal Investigator, Professor Yaron Matras, with the participation of a group of leading international researchers in Kurdish linguistics.

WOWA — Word Order in Western Asia

The focus on Western Asia is motivated by an overarching research interest in the areal diffusion of word order regularities. Specifically, we investigate the respective impact of inheritance (the genetic affiliation of the languages concerned, e.g. Turkic or Semitic) and of neighbouring languages, related or not, in shaping word order in usage. In addition, we address the issue of which aspects of word order are stable within a particular doculect, and which display corpus-internal variability.

More generally, this is connected to the issue of integrating variation into typology. Finally, WOWA is the only cross-linguistic database of its type that includes exclusively spoken language, and thus provides an important corrective to much ongoing work in corpus-based typology, which is still largely based on written language.

Atlas of the Languages of Iran (ALI)

The Atlas of the Languages of Iran (ALI) brings together insights from linguists in Iran and internationally, statistical and demographic publications by national agencies, and, foundationally, speakers of the many languages and dialects of the country. Rather than communicating a single view of Iran’s languages and dialects, the Atlas allows users to enrich their own perspectives on language distribution with location-based language data.

The searchable maps highlight patterns in the phonology (the sounds of language), morphosyntax (grammar) and lexicon (words) of Iran’s languages. Users can access, contribute and comment on language data, which are organized by reference to each of the country’s roughly 60,000 towns and cities.

Language planning in the diaspora: Corpus and prestige planning for Kurdish

The socio-political situation of Kurdish in the Middle East has been largely unfavourable to the development of a standard language and the prestige associated with it. This, in turn, has led members of the Kurdish diaspora in European countries to take charge of language policy and planning for their language without relying on governmental support. As examples of diasporic language planning achievements, I will describe and discuss two initiatives, based in France and Sweden respectively. Through their contributions, they have been carrying out work that is normally done by language academies, i.e. institutions with state support. In the case of a people without a state, the task of standardising and promoting a language is even more complex. My paper will provide some insight into this huge enterprise.