Faruk Büyüktekin, Reference Selection in Turkish: A Corpus-Based Approach

Ph.D. Candidate: Faruk Büyüktekin
Program: Cognitive Science
Date: 18.06.2025 / 10:00
Place: A-212

Abstract: This thesis investigates reference selection in natural language, focusing on the mechanisms that shape the form of referring expressions. Drawing on linguistic theory as well as data-driven and computational methods, the study seeks to uncover how grammatical, discourse, and cognitive factors jointly influence referential form. Turkish is chosen as the target language because of its typologically distinct characteristics, specifically its rich morphology, frequent use of null pronouns (pro-drop), and flexible word order. These features offer a testing ground for exploring referential choices beyond the patterns observed in well-studied languages. A central contribution of this work is the creation of a novel coreference corpus based on spontaneous, goal-directed dialog. Unlike existing research, which has typically relied on isolated sentences or written texts, this study uses situated, task-based interaction, capturing reference in real-time, naturalistic speech. To facilitate this, a new annotation scheme was developed to represent the full range of referential forms, including full noun phrases, overt pronouns, and null pronouns, together with their contextual and grammatical properties. The resulting corpus enables systematic and computational analyses of referential phenomena. Building on this resource, the thesis conducts statistical analyses and employs machine learning to evaluate the effects and interactions of multiple features on referential form. These features include speaker role, turn-taking, grammatical function, competition, distance/recency, topicality, and sentence position. Among the findings, competition and distance emerged as the most predictive features in the models, while grammatical role, speaker role, and turn-taking showed significant but weaker effects. The results demonstrate that many of these factors significantly influence form choice to varying degrees, supporting and extending theoretical predictions from models of reference production and comprehension.