METU CORPUS

METU Corpora Research Group

The research group develops electronic corpora of modern Turkish, with an emphasis on capabilities that go beyond individual words and individual sentences.

Completed Projects

  • METU Turkish Corpus (MTC)

Principal Investigator: Bilge Say

Funded by: METU-BAP project no: 99060402

METU Turkish Corpus is a collection of 2 million words of post-1990 written Turkish samples. A subset of the corpus is used in METU-Sabanci Turkish Treebank. METU Turkish Corpus is XCES tagged at the typographical level. The distribution of the corpus also includes a workbench and related publications.

The words of METU Turkish Corpus were taken from 10 different genres. At most 2 samples are used from any one source; each sample is 2000 words long, or runs on until the current sentence ends.
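
The sampling rule above can be sketched as follows. This is only one illustrative reading of the rule, not the project's actual tooling: the function name `take_sample`, the whitespace tokenization and the punctuation-based sentence-end test are all assumptions made for the example.

```python
# Assumption: a word ending in one of these marks closes a sentence.
SENTENCE_END = (".", "!", "?", "…")


def take_sample(text, limit=2000):
    """Return the first `limit` words of `text`, extended to the end of the
    sentence that contains the last of those words.

    Hypothetical helper: whitespace tokenization, crude boundary detection.
    """
    words = text.split()
    if len(words) <= limit:
        return words
    end = limit
    # Keep adding words until the last word taken closes a sentence.
    while end < len(words) and not words[end - 1].endswith(SENTENCE_END):
        end += 1
    return words[:end]


# Tiny demonstration with a limit of 4 words instead of 2000.
sample = take_sample("Bir iki üç. Dört beş altı. Yedi sekiz.", limit=4)
print(" ".join(sample))  # prints "Bir iki üç. Dört beş altı."
```

With a limit of 4, the fourth word falls in the middle of the second sentence, so the sample is extended to the end of that sentence and contains 6 words.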

The complete METU Turkish Corpus is available to researchers around the world free of charge, for research purposes only. The distribution of the corpus also includes a query workbench and related publications.

As part of a separate project (METU-Turkish Discourse Bank Project), discourse annotation has been done on a part of the corpus. The METU-Turkish Discourse Bank Project site can be found here.

  • METU-Sabanci Turkish Treebank

Co-investigators: Bilge Say, Kemal Oflazer

METU-Sabanci Turkish Treebank is a morphologically and syntactically annotated treebank corpus of 7262 grammatical sentences. The sentences are taken from METU Turkish Corpus. The percentages of different genres in METU-Sabanci Turkish Treebank and METU Turkish Corpus were kept similar. The structure of METU-Sabanci Turkish Treebank is based on XML. The distribution of the treebank also includes a user guide, a display program and related publications.

Turkish is an agglutinative language with free word order. Therefore, a dependency scheme was chosen to handle such a structure. Dependency links are put from words to inflectional groups of words.


The structure of METU-Sabanci Turkish Treebank is based on XML. Paragraphs, sentences and words are tagged by <Set>, <S> and <W> tags respectively. Each tag has its own attributes, which hold information about the number of sentences, the number of words, morphological analyses, and dependency relations (for detailed information, see the user guide).
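
As a rough illustration of that layout, the sketch below walks a small hand-written fragment with Python's standard `xml.etree.ElementTree` module. Only the tag names <Set>, <S> and <W> come from the description above. The fragment itself, the function name `summarize` and the attribute name `ID` are invented for the example; the real attribute names are documented in the user guide.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only. The attribute name "ID" is
# hypothetical; the real treebank files carry different, richer attributes.
FRAGMENT = """
<Set>
  <S>
    <W ID="1">Kitabı</W>
    <W ID="2">okudum</W>
    <W ID="3">.</W>
  </S>
  <S>
    <W ID="1">Geldi</W>
    <W ID="2">.</W>
  </S>
</Set>
"""


def summarize(xml_text):
    """Return (sentence_count, word_count, sentences) for one <Set> element."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("S"):
        # Each <W> holds one token; its attributes are kept as a plain dict.
        sentences.append([(w.text, dict(w.attrib)) for w in s.iter("W")])
    word_count = sum(len(tokens) for tokens in sentences)
    return len(sentences), word_count, sentences


n_sentences, n_words, sentences = summarize(FRAGMENT)
print(n_sentences, n_words)  # prints "2 5"
```

Keeping each word's attributes as a dictionary avoids committing to any particular attribute names, so the same loop should carry over once the actual names from the user guide are substituted.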

Alumni

  • Sedef Akgul
  • Filiz Yilmaz Bican
  • Aysenur Birturk
  • Aygün Boduroğlu
  • Cem Bozsahin
  • Deniz Cantürk
  • Ruket Çakıcı
  • Şükrü Barış Demiral
  • Rabia Ergin
  • Gülşen Eryiğit
  • Barış Çağrı Genç
  • Irfan Nuri Karaca
  • Çağrı Kayadelen
  • Wolf Konig
  • Mine Mısırlısoy
  • Baris Sara
  • Devrim Saran
  • Ümit Deniz Turan
  • Barçın Uluışık
  • Hacer Üke
  • Deniz Zeyrek

We also thank the publishers and newspapers:

  • Adam Yayınevi
  • Atlas Dergisi
  • Bilgi Yayınevi
  • Bilim ve Ütopya Dergisi
  • Bütün Dünya Dergisi
  • Can Yayınları
  • Cumhuriyet Gazetesi
  • Doğu-Batı Dergisi
  • İletişim Yayınları
  • İş Bankası Kültür Yayınları
  • Kuraldışı Yayınevi
  • Milliyet Gazetesi
  • Radikal Gazetesi
  • Yapı Kredi Yayınları

Links

To obtain the METU Turkish Corpus, please fill in the METU Turkish Corpus user agreement form (click for Turkish version), sign it, scan it and e-mail it to corpora@metu.edu.tr. If you are unable to scan the form, you may instead fax the signed form to +90 312 210 3745 and send a notice to corpora@metu.edu.tr at the same time. We prefer the first method and will be able to reply faster in that case.

Likewise, the complete METU-Sabanci Turkish Treebank is available to researchers around the world free of charge, for research purposes only. The distribution of the treebank also includes a user guide, a display program and related publications. To obtain the treebank, fill in the METU-Sabanci Turkish Treebank user agreement form (click for the Turkish version), sign it, scan it and e-mail it to corpora@metu.edu.tr. If you are unable to scan the form, you may instead fax the signed form to +90 312 210 3745 and send a notice to corpora@metu.edu.tr at the same time. We prefer the first method and will be able to reply faster in that case.

  • METU Turkish Discourse Bank (METU-TDB)

  • Principal Investigator: Deniz Zeyrek
  • Funded by: TÜBİTAK 1001 (Project no. 107E156) 2008-2011

The METU Turkish Discourse Bank (METU-TDB) project aims to develop a corpus annotated with information about the discourse structure of Turkish. As part of this project, the team investigates the nature of Turkish discourse structure to the extent that it is represented by connectives such as "çünkü" (because), "ama" (but) and "aksi halde" (otherwise). Over the lifetime of the project, a 500,000-word subcorpus of METU Turkish Corpus is annotated with respect to connectives, their senses and arguments. As no resource annotated with discourse structure information is currently available for Turkish, the resulting annotated data is expected to become an important resource for future studies on Turkish discourse structure.

In TDB 1.0, explicit discourse connectives are annotated together with both of their arguments across the whole corpus. TDB 1.1, in which 10% of the whole corpus is annotated for various discourse relations as well as their senses, will be available later in 2017.

Group members

  • Deniz Zeyrek
  • Cem Bozşahin
  • Ruket Çakıcı

Alumni

  • Ayışığı B. Sevdik-Çallı
  • Murathan Kurfalı (currently PhD student at Stockholm University)
  • Ege Saygıner
  • Deniz Hande Çakmak
  • Fikret Arslan
  • Ahmet Faruk Acar

•ENRICHING TURKISH DISCOURSE BANK: ANNOTATING IMPLICIT CONNECTIVES (01.01.2015 - 31.12.2015)

Principal Investigator: Deniz Zeyrek   
Funded by: METU BAP GRANT (BAP-07-04-2015-004)

Group members

  • Murathan Kurfalı
  • Ege Say
  • Serkan Kumyol
  • Faruk Büyüktekin
  • Tuğçe Nur Bozkurt

•METU TURKISH PSYCHOLINGUISTIC DATABASE (Based on Child Literature Corpus developed by Elif Ahsen Acar (Tolgay))

Master’s Thesis by: Elif Ahsen Acar (Tolgay) 

Thesis Title: A TURKISH DATABASE FOR PSYCHOLINGUISTIC STUDIES: A CORPUS BASED STUDY ON FREQUENCY, AGE OF ACQUISITION, AND IMAGEABILITY

Supervisor: Deniz Zeyrek

Contributors:

•Murathan Kurfalı
•Cem Bozşahin

To have access to this database, please fill in the form provided at this link:

•TEXTLINK (2014-2018)

http://www.textlink.ii.metu.edu.tr/

Principal Investigator: Liesbeth Degand   
Funded by: (ISCH COST Action 1312)
Web manager team: METU Informatics Institute
• Deniz Zeyrek
• Işın Demirşahin
• Murathan Kurfalı
• Ahmet Üstün
• Ege Saygıner