Archives

Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki

One of the major challenges that under-represented and endangered language communities face in language technology is the lack or paucity of language data. This is also the case for Southern Kurdish and Laki, for which very few resources are available and little progress has been made on tools. To tackle this, we provide a few approaches that rely on the content of local news websites and of a local radio station that broadcasts in Southern Kurdish, and on fieldwork for Laki. In this paper, we describe some of the challenges of such under-represented languages, particularly in writing and standardization, as well as in retrieving sources of data and retro-digitizing handwritten content to create a corpus for Southern Kurdish and Laki. In addition, we study the task of language identification in light of the other variants of Kurdish and the Zaza-Gorani languages.

‘Gan qey bedenî yeno çi mana’ (What the Soul Means for the Body)

Folklore-collecting initiatives in Turkey and Iran have become increasingly popular over the past decade. In this article we present a historical overview of folklore-collecting practices and focus on more recent developments in this field. While Kurdish folklore has been perceived as a cornerstone of Kurdish national identity and as a source of information on Kurdish history, today’s collectors in Turkey and Iran understand its role in a wider context of language revitalization and indigenous knowledge production. Collecting oral traditions in the Kurdish dialects of Kurmanji, Sorani, and Zazaki is appreciated as a step towards protecting and developing the Kurdish language, which is endangered by language assimilation policies in both countries. By reviving folkloric vocabulary, stories, and traditional knowledge practices such as agricultural teachings, folklore collectors promote indigenous knowledge production and enrich education and research. Drawing on language revitalization theories and indigenous knowledge production, this article offers insights into unexplored aspects of collecting, archiving, and publishing Kurdish folklore in recent years.

On the linguistic history of Kurdish

Historical linguistic sources of Kurdish date back just a few hundred years, so the profound grammatical changes of Western Iranian languages cannot be tracked directly in Kurdish. Through a comparison with attested languages of the Middle Iranian period, this paper provides a hypothetical chronology of grammatical changes. The chronology allows us to estimate, tentatively, when the modern varieties diverged with respect to each grammatical change. In order to represent the types of linguistic relationship involved, distinct models of language contact and language continua are set up.

The indivisibility of the nation and its linguistic divisions

Kurdish has four “geographical” dialects divided arbitrarily and forcibly among the five neighboring countries of Turkey, Iran, Iraq, Syria, and Armenia. It has three literary dialects, two standardizing varieties, numerous norms, and three alphabets. Further complicating this linguistic landscape since 1918 are the crisscrossing of dialect areas by international borders and their subjection to state policies ranging from linguicide (Turkey, Iran, Syria) to officialization on the local (Iraq before 2005; USSR) and national (Iraq since 2005) levels. Under these conditions, dialect divisions were overshadowed by the linguicidal situation, which threatened the survival of the language. The formation of the Kurdistan Regional Government in 1991 and the officialization of Kurdish as one of the two state languages of Iraq in 2005 have removed the external (state) threat and raised, once more, the question of the dialect base of the standard language. While Iraqi rulers had in the past used dialect pluralism as justification for denying Kurdish official status, now the Kurds themselves have to cope with the linguistic fragmentation of their nation. This article examines the conflict over the adoption of one or two of the major dialects, Sorani and Kurmanji, as the official standard language in Iraq.

Grammatica e vocabolario della lingua kurda

The earliest scientific European studies on the Kurdish language and civilization, which date back to the late 18th century, were carried out by missionaries (first by Italian Catholics and later by Anglo-Saxon Protestants). The pioneer of European Kurdish studies was Maurizio Garzoni (1734-1804), a member of the Order of Black Friars, who reached the region of Mosul (Mowsel) in 1762. Two years later he settled in ʿAmādiya, the capital of the principality of Bahdinān, to the northeast of Mosul. There he collected materials for his Grammatica e vocabolario della lingua Kurda, which was published in Rome in 1787. The first of its kind, it remained an important source of information on the Kurdish language until the end of the 19th century.

Calibrating Kurmanji and Sorani

This chapter focuses primarily on Kurmanji and Sorani, the two major dialects of the Kurdish language. In Kurmanji, the infinitive of the verb ‘to say’ is gotin. In Sorani, the same infinitive has multiple forms: wutin/witin in Sulaimania and Kirkuk, kutin in Mukriyan, and gotin in Erbil. Behdini (also Behdinani or Badinani), the southern dialect cluster of Kurmanji, can be seen as a bridge between Kurmanji and Sorani. Seeing Kurmanji and Sorani as equal partners is not merely a linguistic matter: it is also a social matter. During the mandate period in Iraq, the British insisted that Sorani be the only Kurdish dialect taught in Kurdish schools, which engendered a fair amount of resentment among Kurmanji speakers. Choosing one dialect over another (or, to put it differently, imposing one dialect on a population that speaks another) is guaranteed to cause dissent.

The Dialects of Kurdish

The project aims to provide a comparative structural and typological survey of the dialect continuum of Kurdish, covering sample locations across the major Kurdish-speaking regions, from the eastern Anatolian regions of Turkey through northern Syria and Iraq to north-eastern Iran. The varieties covered include primarily those known as Kurmanji-Bahdini (Northern Kurdish) and Sorani (Central Kurdish), with some limited coverage of varieties belonging to the group known as Southern Kurdish.

The survey covers selected structures in lexicon, phonology and lexical phonology, morphology, and morpho-syntax, with a strong focus on the interaction of morphological alignment with verb semantics.

The data obtained through the survey’s questionnaire elicitation are presented in a Database that can be searched by location, structural tag, English translation of the elicitation phrase, and Kurdish word forms. For more information on the elicitation method please consult the Pilot and extended survey page.
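As an illustration of the four search dimensions named above (location, structural tag, English translation, Kurdish word form), a minimal Python sketch of such a lookup might look as follows. The `Entry` record, the `search` function, and the sample records are all hypothetical; they are not the project's actual schema, software, or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One elicited item; the field names are assumptions, not the real schema."""
    location: str  # sample location where the item was elicited
    tag: str       # structural tag assigned to the item
    english: str   # English translation of the elicitation phrase
    form: str      # Kurdish word form as recorded


def search(entries, location=None, tag=None, english=None, form=None):
    """Return the entries that match every criterion supplied.

    Matching is a case-insensitive substring test; a criterion left as
    None is ignored.
    """
    def matches(value, query):
        return query is None or query.lower() in value.lower()

    return [
        e for e in entries
        if matches(e.location, location)
        and matches(e.tag, tag)
        and matches(e.english, english)
        and matches(e.form, form)
    ]


# Placeholder records for demonstration only, not data from the Database.
SAMPLE = [
    Entry("Location A", "infinitive", "to say", "gotin"),
    Entry("Location B", "infinitive", "to say", "wutin"),
    Entry("Location B", "infinitive", "to say", "kutin"),
]

by_place = search(SAMPLE, location="location b")
by_form = search(SAMPLE, form="got")
```

Combining criteria narrows the result, so `search(SAMPLE, location="Location B", form="wut")` would return only the second placeholder record.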

A collection of Maps presents the geographical distribution of selected variants that have been extracted from the questionnaire database.

A set of Free Speech Samples is presented in the form of audio files accompanied by a transliteration and English translation, and linked to the Database entries that document the results of questionnaire elicitation with the same speakers. These are short samples, typically around 5 minutes long, that have been extracted from longer recordings of connected speech. The topics covered include biographical narration about village life, customs and traditions, and local history, as well as traditional tales. Together they provide a rich resource, so far unparalleled online, documenting Kurdish cultural traditions as presented by ordinary people from across the Kurdish-speaking regions in their own local dialects.

The academic evaluation of the project data is currently underway (2017), led by the project’s Principal Investigator, Professor Yaron Matras, with the participation of a group of leading international researchers in Kurdish linguistics.

Phonological Variation in Kurdish

Kurdish is often portrayed as a linguistic unity, but an examination of phonological structures in the language reveals substantial internal variation. In this study, we examine the geographic distribution of vowels and consonants in the phonological inventories of 125 Northern Kurdish (Kurmanji) and Central Kurdish (Sorani) varieties in the Database of Kurdish Dialects, and their patterning in individual words from all of these data sets. The data reveal a stable set of core vowels and consonants, along with peripheral phonemes of both types that demonstrate a high level of variation in geographic distribution and frequency. Segments with significant distributional restrictions include front rounded vowels, uvular consonants, a contrastive aspirated stop series, emphatic alveolar obstruents, and pharyngeals ʕ and ħ. An analysis of these patterns gives modest confirmation of the well-known Northern vs. Central Kurdish dialect division, but shows that the phonological distinction between the two is best characterized in terms of tendencies rather than exact, regular correspondences. Beyond many other individual isoglosses in the data that cross-cut one another, there is a weak pattern of transition between the two major dialect areas; limited diffusion of phonological innovations to varieties at the geographic periphery of the language; and more direct influence of language contact on the phonological structures in certain regions. Alongside these various configurations of areal distribution, and in contrast to them, there is a strong, overarching pattern of non-directional phonological variability among varieties, which points to the local nature of phonological changes across the language area.
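The distinction between core and peripheral segments can be illustrated with a small Python sketch that measures, for each segment, the share of varieties whose inventory contains it. This is not the study's method: the function names, the 0.9 threshold, and the toy inventories are assumptions chosen purely for demonstration.

```python
from collections import Counter


def segment_coverage(inventories):
    """Map each segment to the share of varieties whose inventory contains it."""
    counts = Counter(seg for inv in inventories.values() for seg in set(inv))
    total = len(inventories)
    return {seg: n / total for seg, n in counts.items()}


def split_core_peripheral(inventories, threshold=0.9):
    """Split segments into core (share >= threshold) and peripheral (the rest).

    The threshold is an arbitrary illustrative cut-off, not a value taken
    from the study.
    """
    coverage = segment_coverage(inventories)
    core = {seg for seg, share in coverage.items() if share >= threshold}
    return core, set(coverage) - core


# Toy inventories for three invented varieties; not real survey data.
TOY = {
    "variety_1": {"a", "i", "k", "ħ"},
    "variety_2": {"a", "i", "k"},
    "variety_3": {"a", "i", "k", "ʕ"},
}

core, peripheral = split_core_peripheral(TOY)
```

With these toy inventories the segments shared by all three varieties fall in the core set, while the two pharyngeals, each found in only one variety, fall in the peripheral set.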