Archives

Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki

One of the major challenges that under-represented and endangered language communities face in language technology is the lack or paucity of language data. This is also the case for Southern Kurdish and Laki, for which very limited resources are available and little progress has been made on tools. To tackle this, we present several approaches that rely on the content of local news websites, a local radio station that broadcasts content in Southern Kurdish, and fieldwork for Laki. In this paper, we describe some of the challenges facing such under-represented languages, particularly in writing and standardization, as well as in retrieving sources of data and retro-digitizing handwritten content to create a corpus for Southern Kurdish and Laki. In addition, we study the task of language identification in light of the other variants of Kurdish and the Zaza-Gorani languages.

Constructing ditransitivity in literary Kurmanji

This study takes a corpus-driven approach based on a collection of contemporary novels and short stories in order to explore various options for realising ditransitive constructions in Kurmanji, discussing some phenomena that pose a challenge to clear categorisation. Semantically, “ditransitive constructions” can be defined as constructions expressing “three-participant events”, involving verbs with three participants, as often referred to in typological literature: an agent, a theme and a recipient (or recipient-like) participant. Cross-linguistically typical instances are verbs of giving (e.g. dan in Kurmanji), showing (nîşan dan) and saying (gotin), as well as their contraries (pirsîn ‘ask’), and other semantically related verbs. In an interplay between flagging, indexing and word order, Kurmanji reveals a rich formal repertoire that presents a number of challenges to systematisation. It makes use of several morpho-syntactic devices, applied alternatively and generally in combination with oblique case: a postpredicative position, adpositional constructions, a verbal suffix indicating the presence of an indirect object, and light verb ezafe constructions that link an indirect object to the lexical nominal. The study aims at uncovering factors which determine the choice of a construction. The use of formally identifiable ditransitive constructions, on the other hand, clearly transcends the original concept of a “physical transfer”, extending into non-animate, abstract and metaphorical contexts. Depending on the construction at hand, cognitive contents, images, landscapes, sounds, and other non-human core arguments may end up in an agentive role, while humans are frequently expressed as verb complements, particularly undergoers of a self-caused movement. Recipients, on the other hand, can be inanimate entities and even abstract ideas.

The Dialects of Kurdish

The project aims to provide a comparative structural and typological survey of the dialect continuum of Kurdish, covering sample locations from across the major Kurdish-speaking regions, from the eastern Anatolian regions of Turkey, through northern Syria and Iraq, and on to north-eastern Iran. The varieties covered include primarily those known as Kurmanji-Bahdini (Northern Kurdish) and Sorani (Central Kurdish), with some limited coverage of varieties belonging to the group known as Southern Kurdish.

The survey covers selected structures in lexicon, phonology and lexical phonology, morphology, and morpho-syntax, with a strong focus on the interaction of morphological alignment with verb semantics.

The data obtained through the survey’s questionnaire elicitation are presented in a Database that can be searched by location, structural tag, English translation of the elicitation phrase, and Kurdish word forms. For more information on the elicitation method please consult the Pilot and extended survey page.

A collection of Maps presents the geographical distribution of selected variants that have been extracted from the questionnaire database.

A set of Free Speech Samples is presented in the form of audio files accompanied by a transliteration and English translation, and linked to the Database entries that document the results of questionnaire elicitation with the same speakers. These are short samples, typically around 5 minutes long, that have been extracted from longer recordings of connected speech. The topics covered include biographical narration about village life, customs and traditions, and local history, as well as traditional tales. Together they provide a rich resource, so far unparalleled online, documenting Kurdish cultural traditions as presented by ordinary people from across the Kurdish-speaking regions, in their own local dialects.

The academic evaluation of the project data is currently underway (2017), led by the project’s Principal Investigator, Professor Yaron Matras, with the participation of a group of leading international researchers in Kurdish linguistics.

Phonological Variation in Kurdish

Kurdish is often portrayed as a linguistic unity, but an examination of phonological structures in the language reveals substantial internal variation. In this study, we examine the geographic distribution of vowels and consonants in the phonological inventories of 125 Northern Kurdish (Kurmanji) and Central Kurdish (Sorani) varieties in the Database of Kurdish Dialects, and their patterning in individual words from all of these data sets. The data reveal a stable set of core vowels and consonants, along with peripheral phonemes of both types that demonstrate a high level of variation in geographic distribution and frequency. Segments with significant distributional restrictions include front rounded vowels, uvular consonants, a contrastive aspirated stop series, emphatic alveolar obstruents, and pharyngeals ʕ and ħ. An analysis of these patterns gives modest confirmation of the well-known Northern vs. Central Kurdish dialect division, but shows that the phonological distinction between the two is best characterized in terms of tendencies rather than exact, regular correspondences. Beyond many other individual isoglosses in the data that cross-cut one another, there is a weak pattern of transition between the two major dialect areas; limited diffusion of phonological innovations to varieties at the geographic periphery of the language; and more direct influence of language contact on the phonological structures in certain regions. Alongside these various configurations of areal distribution, and in contrast to them, there is a strong, overarching pattern of non-directional phonological variability among varieties, which points to the local nature of phonological changes across the language area.

WOWA — Word Order in Western Asia

The focus on Western Asia is motivated by an overarching research interest in the areal diffusion of word order regularities; specifically, we investigate the respective impact of inheritance (the genetic affiliation of the languages concerned, e.g. Turkic, Semitic, etc.) and the impact of neighbouring languages, related or not, in shaping word order in usage. In addition, we address the issue of which aspects of word order are stable within a particular doculect, and which display corpus-internal variability.

More generally, this is connected to the issue of integrating variation into typology. Finally, WOWA is the only cross-linguistic database of its type that includes exclusively spoken language, and thus provides an important corrective to much ongoing work in corpus-based typology, which is still largely based on written language.