METU CORPUS

METU Turkish Corpus (MTC)

The METU Turkish Corpus is a collection of 2 million words of post-1990 written Turkish samples. A subset of the corpus is used in the METU-Sabanci Turkish Treebank. The corpus is XCES tagged at the typographical level. The distribution of the corpus also includes a workbench and related publications.

The words of the METU Turkish Corpus were taken from 10 different genres. At most 2 samples are used from any one source; each sample is about 2000 words long and ends at the close of the sentence in which the 2000-word mark falls.
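The sampling rule above can be illustrated with a minimal sketch. It is not the procedure the corpus builders used; the function name `take_sample`, the whitespace word count, and the naive sentence splitter are all assumptions made for illustration.

```python
import re

def take_sample(text, target_words=2000):
    """Return a prefix of `text` that reaches `target_words` words and then
    runs on to the end of the sentence in progress.

    Hypothetical helper: words are counted by whitespace, and sentences are
    split naively after '.', '!' or '?'. Both are simplifying assumptions.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sample, count = [], 0
    for sentence in sentences:
        if count >= target_words:
            break  # the target was reached at the previous sentence end
        sample.append(sentence)
        count += len(sentence.split())
    return " ".join(sample)
```

With a target of 4 words, the text "a b c. d e f. g h i." yields the first two sentences: the fourth word falls inside the second sentence, so the sample runs to the end of that sentence.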

The complete METU Turkish Corpus is available to researchers around the world free of charge, for research purposes only. The distribution of the corpus also includes a query workbench and related publications.

As part of a separate project (the METU Turkish Discourse Bank Project), discourse annotation has been done on part of the corpus. The METU Turkish Discourse Bank Project site can be found here.

METU-Sabanci Turkish Treebank

The METU-Sabanci Turkish Treebank is a morphologically and syntactically annotated treebank corpus of 7262 grammatical sentences. The sentences are taken from the METU Turkish Corpus. The percentages of the different genres in the treebank were kept similar to those in the METU Turkish Corpus. The structure of the treebank is based on XML. The distribution of the treebank also includes a user guide, a display program and related publications.

Turkish is an agglutinative language with free word order. Therefore, a dependency scheme was chosen to handle such a structure. Dependency links are put from words to inflectional groups of words.
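One way to picture links that run from words to inflectional groups is the small data structure below. It is a sketch only: the class name `Word`, its fields, and the labels "IG1" and "IG2" are placeholders, not the treebank's actual representation (see the user guide for that).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Word:
    form: str
    igs: List[str]                          # inflectional groups, in order
    head: Optional[Tuple[int, int]] = None  # (head word index, head IG index)
    relation: Optional[str] = None          # dependency label, if any

def head_ig(sentence: List[Word], word: Word) -> Optional[str]:
    """Return the inflectional group that `word` depends on, or None."""
    if word.head is None:
        return None
    word_index, ig_index = word.head
    return sentence[word_index].igs[ig_index]

# Placeholder sentence: the first word links to the second IG of the second word.
sentence = [
    Word("w1", ["IG1"], head=(1, 1), relation="MODIFIER"),
    Word("w2", ["IG1", "IG2"]),
]
```

The point of the design is that a link targets a specific inflectional group inside the head word rather than the head word as a whole.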


The structure of the METU-Sabanci Turkish Treebank is based on XML. Paragraphs, sentences and words are tagged with <Set>, <S> and <W> tags respectively. Each tag has its own attributes, which hold information such as the number of sentences, the number of words, morphological analyses and dependency relations. For detailed information, see the user guide.
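A file laid out this way could be read with Python's standard library along the following lines. The tag names <Set>, <S> and <W> come from the description above; the sample fragment, every attribute name and value in it, and the function name `read_sentences` are hypothetical placeholders, so consult the user guide for the real attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the tag names are taken from the description;
# the attribute names and values are placeholders.
SAMPLE = """
<Set sentences="1">
  <S words="2">
    <W index="1" head="2">Kitap</W>
    <W index="2" head="0">geldi</W>
  </S>
</Set>
"""

def read_sentences(xml_text):
    """Return one list per <S>, holding (word form, attribute dict) pairs."""
    root = ET.fromstring(xml_text.strip())
    sentences = []
    for s in root.iter("S"):
        words = [((w.text or "").strip(), dict(w.attrib)) for w in s.iter("W")]
        sentences.append(words)
    return sentences

for sentence in read_sentences(SAMPLE):
    for form, attrs in sentence:
        print(form, attrs)
```

Because the attributes are returned as plain dictionaries, the sketch does not depend on any particular attribute name.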

Alumni

  • Sedef Akgul
  • Filiz Yilmaz Bican
  • Aysenur Birturk
  • Aygun Boduroglu
  • Cem Bozsahin
  • Deniz Canturk
  • Ruken Cakici
  • Sukru Baris Demiral
  • Rabia Ergin
  • Gulsen Eryigit
  • Baris Cagri Genc
  • Irfan Nuri Karaca
  • Cagri Kayadelen
  • Wolf Konig
  • Mine Misirlisoy
  • Baris Sara
  • Devrim Saran
  • Ümit Deniz Turan
  • Barcin Uluisik
  • Hacer Üke
  • Deniz Zeyrek

We also thank the publishers and newspapers:

  • Adam Yayinevi
  • Atlas Dergisi
  • Bilgi Yayinevi
  • Bilim ve Utopya Dergisi
  • Butun Dunya Dergisi
  • Can Yayinlari
  • Cumhuriyet Gazetesi
  • Dogu-Bati Dergisi
  • Iletisim Yayinlari
  • Is Bankasi Kultur Yayinlari
  • Kuraldisi Yayinevi
  • Milliyet Gazetesi
  • Radikal Gazetesi
  • Yapi Kredi Yayinlari

Projects

The METU Turkish Corpus project was funded by METU-BAP (project no: 99060402).

Links

To obtain the METU Turkish Corpus, please fill in the METU Turkish Corpus user agreement form (click for Turkish version), sign it, scan it and e-mail it to corpora@metu.edu.tr. If you are unable to scan the form, you may instead fax the signed form to +90 312 210 3745 and send a notice to corpora@metu.edu.tr at the same time. We prefer the first method and will be able to reply faster in that case.

Likewise, the complete METU-Sabanci Turkish Treebank is available to researchers around the world free of charge, for research purposes only. The distribution of the treebank also includes a user guide, a display program and related publications. To obtain the treebank, fill in the METU-Sabanci Turkish Treebank user agreement form (click for the Turkish version), sign it, scan it and e-mail it to corpora@metu.edu.tr. If you are unable to scan the form, you may instead fax the signed form to +90 312 210 3745 and send a notice to corpora@metu.edu.tr at the same time. We prefer the first method and will be able to reply faster in that case.

METU Turkish Discourse Bank (METU-TDB)

The METU Turkish Discourse Bank (METU-TDB) project aims to develop a corpus annotated with information on the discourse structure of Turkish. As part of this project, the team investigates the nature of Turkish discourse structure to the extent that it is represented by connectives such as "çünkü" (because), "ama" (but) and "aksi halde" (otherwise). Over the lifetime of the project, a 500,000-word subcorpus of the METU Turkish Corpus is annotated for connectives, their senses and their arguments. As no resource annotated with discourse structure information is currently available for Turkish, the resulting annotated data is expected to become an important resource for future studies on Turkish discourse structure.

In TDB 1.0, explicit discourse connectives are annotated together with both of their arguments across the whole corpus. TDB 1.1, in which 10% of the corpus is annotated for various discourse relations and their senses, will be available later in 2017.

Group members

Principal Investigator

  • Deniz Zeyrek

Students

  • Murathan Kurfalı
  • Ege Saygıner

Projects

METU BAP GRANT (BAP-07-04-2015-004) ENRICHING TURKISH DISCOURSE BANK: ANNOTATING IMPLICIT CONNECTIVES (01.01.2015 - 31.12.2015)

External Collaborators

  • Ruket Çakıcı, Dr., Middle East Technical University, Turkey
  • Işın Demirşahin, Google Inc., London
  • Ayışığı Sevdik-Çallı