GRADUATE SCHOOL OF INFORMATICS


Corpus

METU Turkish Corpus

 

The METU Turkish Corpus is a collection of 2 million words of written Turkish samples dating from after 1990. A subset of the corpus is used in the METU-Sabanci Turkish Treebank. The corpus is XCES-tagged at the typographical level.

The words of the METU Turkish Corpus were taken from 10 different genres. At most two samples from any one source are used. Each sample is 2000 words long or, where the 2000th word falls in the middle of a sentence, runs on until that sentence ends.
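The sampling rule above can be illustrated with a minimal sketch. It is not the procedure the project actually used: the function name `take_sample`, the whitespace tokenization and the treatment of `.`, `!` and `?` as sentence ends are all assumptions made for illustration.

```python
import re

def take_sample(text, target_words=2000):
    """Return roughly target_words words, extended to the end of the current sentence.

    Assumptions (not from the corpus documentation): words are
    whitespace-separated, and a sentence ends at a word whose last
    character is '.', '!' or '?'.
    """
    words_seen = 0
    for match in re.finditer(r"\S+", text):
        words_seen += 1
        # Once the word target is reached, stop at the first sentence end.
        if words_seen >= target_words and match.group().endswith((".", "!", "?")):
            return text[:match.end()]
    # The source text ran out before the target was reached.
    return text
```

With `target_words=4`, the text `"a b c. d e f. g h."` yields `"a b c. d e f."`: the fourth word falls inside the second sentence, so the sample continues to the end of that sentence.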

The complete METU Turkish Corpus is available to researchers around the world free of charge, for research purposes only. The distribution of the corpus also includes a query workbench and related publications. To obtain the METU Turkish Corpus, fill in the METU Turkish Corpus user agreement form (a Turkish version is also available), sign it, scan it and e-mail it to corpora@metu.edu.tr. If you are unable to scan the form, you may instead fax the signed form to +90 312 210 3745 and, at the same time, send a notice to corpora@metu.edu.tr. We prefer the first method and will be able to reply faster in that case.

As part of a separate project, the METU-Turkish Discourse Bank Project, part of the corpus has been annotated for discourse. The METU-Turkish Discourse Bank Project has its own site.

METU-Sabanci Turkish Treebank

 

The METU-Sabanci Turkish Treebank is a morphologically and syntactically annotated treebank corpus of 7262 grammatical sentences. The sentences are taken from the METU Turkish Corpus. The percentages of the different genres in the METU-Sabanci Turkish Treebank were kept similar to those in the METU Turkish Corpus.

Turkish is an agglutinative language with free word order. A dependency scheme was therefore chosen to handle this structure. Dependency links run from words to inflectional groups of words.

The structure of the METU-Sabanci Turkish Treebank is based on XML. Paragraphs, sentences and words are tagged with <Set>, <S> and <W> tags respectively. Each tag has its own attributes, which hold information about the number of sentences, the number of words, morphological analyses and dependency relations (for details, see the user guide).
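As a rough illustration of this layout, the sketch below walks a <Set> / <S> / <W> structure with Python's standard library. Only the three tag names come from the description above. The sample fragment, the function name `sentences` and the placement of the word form as element text are assumptions, and the attribute names are deliberately left out because they are defined in the user guide.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the tag names follow the description above,
# but the content is invented and the real files carry attributes.
SAMPLE = """<Set>
  <S><W>Kitap</W><W>okudum</W><W>.</W></S>
  <S><W>Geldi</W><W>.</W></S>
</Set>"""

def sentences(xml_text):
    """Return one list of word strings per <S> element."""
    root = ET.fromstring(xml_text)
    result = []
    for s in root.iter("S"):
        # Assumes the word form is stored as the text of each <W>.
        result.append([(w.text or "").strip() for w in s.iter("W")])
    return result

for words in sentences(SAMPLE):
    print(len(words), " ".join(words))
```

Whatever attributes an element carries can be inspected through its `attrib` dictionary (for example `w.attrib` on a <W> element) without knowing their names in advance.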

The complete METU-Sabanci Turkish Treebank is available to researchers around the world free of charge, for research purposes only. The distribution of the treebank also includes a user guide, a display program and related publications. To obtain the treebank, fill in the METU-Sabanci Turkish Treebank user agreement form (a Turkish version is also available), sign it, scan it and e-mail it to corpora@metu.edu.tr. If you are unable to scan the form, you may instead fax the signed form to +90 312 210 3745 and, at the same time, send a notice to corpora@metu.edu.tr. We prefer the first method and will be able to reply faster in that case.

Acknowledgements

 

The METU Turkish Corpus project was funded by METU-BAP (project no: 99060402) and the METU-Sabanci Turkish Treebank project by TUBITAK (project no: EEEAG 199E026). The main investigator of both projects was Bilge Say; the co-investigator of the Treebank project was Kemal Oflazer. Umut Ozge and Nart Bedin Atalay were the main project assistants of the Corpus and Treebank projects, respectively.

Many people contributed by making annotations, writing software and giving ideas. In alphabetical order: Sedef Akgul, Filiz Yilmaz Bican, Aysenur Birturk, Aygun Boduroglu, Cem Bozsahin, Deniz Canturk, Ruken Cakici, Sukru Baris Demiral, Rabia Ergin, Gulsen Eryigit, Baris Cagri Genc, Irfan Nuri Karaca, Cagri Kayadelen, Wolf Konig, Mine Misirlisoy, Baris Sara, Devrim Saran, Ümit Deniz Turan, Barcin Uluisik, Hacer Üke, Deniz Zeyrek.

We also thank the following publishers and newspapers for generously giving us permission to use text samples: Adam Yayinevi, Atlas Dergisi, Bilgi Yayinevi, Bilim ve Utopya Dergisi, Butun Dunya Dergisi, Can Yayinlari, Cumhuriyet Gazetesi, Dogu-Bati Dergisi, Iletisim Yayinlari, Is Bankasi Kultur Yayinlari, Kuraldisi Yayinevi, Milliyet Gazetesi, Radikal Gazetesi and Yapi Kredi Yayinlari.