Submissions/OpenCorpora.org: wiki-principles + Wikimedians + Wikimedia projects

From Wikimania 2013 • Hong Kong

After careful consideration, the programme committee has decided not to accept the submission below at this time. Thank you to the author(s) for participating in the Wikimania 2013 programme submission; we hope to still see you at Wikimania this August.

Submission no.
2022
Subject no.
B10
Title of the submission

OpenCorpora.org: wiki-principles + Wikimedians + Wikimedia projects

Type of submission

presentation

Author of the submission

Lvova (talk), Victor Bocharov

Country of origin

Russian Federation

Affiliation

OpenCorpora.org

E-mail address

lvova@wikimedia.ru

Username

Lvova

Personal homepage or blog

http://lvova.livejournal.com

Abstract

OpenCorpora is a community-driven project that aims to create a corpus of texts, starting with Russian, with linguistic annotation. The corpus is fully accessible to researchers under the terms of the Creative Commons Attribution/Share-Alike License. OpenCorpora includes texts taken from Wikimedia projects and uses a wiki approach to the annotation process.

Detailed proposal

An annotated corpus is a database that contains natural language text together with its linguistic annotation: part-of-speech tags, syntax, semantics and so on. Open-access annotated corpora are important tools for researchers working in natural language processing, since they can be used in a wide range of computational experiments. A number of such open-access corpora exist for English, Spanish and other languages, but there have been no open-access annotated corpora for Russian. This is due both to the high cost of annotation and to the poor understanding of open licenses in Russia. OpenCorpora addresses both problems: all our data can be downloaded and used under the terms of CC-BY-SA, and the annotation work is performed by volunteers. Importantly, Wikimedians are a really useful source of texts (from Wikipedia, Wiktionary, Wikibooks, Wikisource and their own blogs) and of help with annotating those texts.

OpenCorpora takes many core principles of wiki-style content development and applies them to corpus annotation:

  • participants can edit annotation at any moment
  • all changes are recorded in the history and can be cancelled
  • low entry threshold for volunteers

This wiki style is combined with a pipeline of both human and automatic data verification in order to maintain high data quality, suitable for assessing linguistic software.
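The wiki principles listed above can be illustrated with a minimal sketch. This is not OpenCorpora's actual data model: the class names `Revision` and `TokenAnnotation`, the tag labels and the usernames are all hypothetical, chosen only to show how an annotation can stay editable at any moment while every change remains in the history and can be cancelled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Revision:
    author: str
    tag: str  # hypothetical label, e.g. a part-of-speech tag such as "NOUN"


@dataclass
class TokenAnnotation:
    token: str
    history: list = field(default_factory=list)  # list of Revision, oldest first

    def edit(self, author: str, tag: str) -> None:
        """Record a new revision; earlier revisions are never overwritten."""
        self.history.append(Revision(author, tag))

    def current(self) -> Optional[str]:
        """Return the latest tag, or None if the token has no annotation yet."""
        return self.history[-1].tag if self.history else None

    def revert(self, author: str) -> None:
        """Cancel the latest change by re-applying the previous revision's tag."""
        if len(self.history) < 2:
            raise ValueError("nothing to revert to")
        self.edit(author, self.history[-2].tag)


# "стали" is ambiguous in Russian: a form of the noun "сталь" or of the verb "стать".
ann = TokenAnnotation("стали")
ann.edit("volunteer_a", "NOUN")
ann.edit("volunteer_b", "VERB")
ann.revert("moderator")   # the cancelled edit stays visible in the history
print(ann.current())      # NOUN
print(len(ann.history))   # 3
```

In this sketch a revert is itself stored as a new revision, as in a wiki page history, so nothing is ever deleted.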

Within the OpenCorpora project we are developing infrastructure that makes it possible to involve any native speaker of Russian in corpus annotation work. This is feasible because a large number of annotation tasks do not require a deep understanding of linguistic theory; a native speaker's intuition is enough.
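One common way to rely on non-expert intuition is to give the same simple task to several people and accept an answer only when enough of them agree. The sketch below assumes such a scheme purely for illustration; the proposal does not say that OpenCorpora works this way, and the function name `aggregate` and the 0.75 threshold are hypothetical.

```python
from collections import Counter
from typing import Optional


def aggregate(answers: list, min_agreement: float = 0.75) -> Optional[str]:
    """Return the most frequent answer if enough annotators agree, else None.

    None means the task is left unresolved and would need further review
    (a hypothetical policy, not one described in the proposal).
    """
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes / len(answers) >= min_agreement else None


print(aggregate(["NOUN", "NOUN", "NOUN", "VERB"]))  # NOUN
print(aggregate(["NOUN", "VERB"]))                  # None
```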

In this lecture we will explain how the corpus annotation process works, the methods we use to cope with mistakes, and the results we are delivering.

Track
  • Cultural and Educational Outreach
Length of presentation/talk
25 Minutes
Language of presentation/talk
English
Will you attend Wikimania if your submission is not accepted?
probably yes
Slides or further information (optional)

Some publications in Russian are available here: http://www.opencorpora.org/?page=publications

Special requests


Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers to decide which sessions are of high interest. Sign with four tildes. (~~~~).

  1. Yes, i'm interested. --Rave (talk) 12:05, 19 March 2013 (UTC)
  2. It would be also great to learn if the project will get new languages. Wizardist (talk) 17:21, 27 March 2013 (UTC)
  3. Slashme (talk) 18:08, 7 April 2013 (UTC)
  4. អមីរ ឯ. អហរោណិ 06:52, 12 April 2013 (UTC)
  5. Jtmorgan (talk) 23:38, 17 April 2013 (UTC)
  6. CT Cooper · talk 00:06, 29 April 2013 (UTC)
  7. Amakuha (talk) 19:14, 13 May 2013 (UTC)
  8. Ijon (talk) 05:36, 16 May 2013 (UTC)
  9. Your name here