The Bulgarian National Corpus : Theory and Practice in Corpus Design

Koeva, S.; Stoyanova, I.; Leseva, S.; Dimitrova, T.; Dekova, R.; Tarpomanova, E.

doi:10.15398/jlm.v0i1.33

Article - details

Article title

The Bulgarian National Corpus : Theory and Practice in Corpus Design

Authors

Koeva S. , Stoyanova I. , Leseva S. , Dimitrova T. , Dekova R. , Tarpomanova E.

Identifiers

DOI

10.15398/jlm.v0i1.33

Abstracts

The paper discusses several key concepts related to the development of corpora and reconsiders them in light of recent developments in NLP. On the basis of an overview of present-day corpora, we conclude that the dominant practices of corpus design do not utilise the technologies adequately and, as a result, fail to meet the demands of corpus linguistics, computational lexicology and computational linguistics alike. We proceed to lay out a data-driven approach to corpus design, which integrates the best practices of traditional corpus linguistics with the potential of the latest technologies allowing fast collection, automatic metadata description and annotation of large amounts of data. Thus, the gist of the approach we propose is that corpus design should be centred on amassing large amounts of mono- and multilingual texts and on providing them with a detailed metadata description and high-quality multi-level annotation. We go on to illustrate this concept with a description of the compilation, structuring, documentation, and annotation of the Bulgarian National Corpus (BulNC). At present it consists of a Bulgarian part of 979.6 million words, constituting the corpus kernel, and 33 Bulgarian-X language corpora, totalling 972.3 million words, 1.95 billion words (1.95×10⁹) altogether. The BulNC is supplied with a comprehensive metadata description, which allows us to organise the texts according to different principles. The Bulgarian part of the BulNC is automatically processed (tokenised and sentence split) and annotated at several levels: morphosyntactic tagging, lemmatisation, word-sense annotation, annotation of noun phrases and named entities. Some levels of annotation are also applied to the Bulgarian-English parallel corpus with the prospect of expanding multilingual annotation both in terms of linguistic levels and the number of languages for which it is available.
We conclude with a brief evaluation of the quality of the corpus and an outline of its applications in NLP and linguistic research.

Keywords

corpus design; Bulgarian National Corpus; computational linguistics

Publisher

Instytut Podstaw Informatyki PAN

Journal

Journal of Language Modelling

Year

2012

Volume

Vol. 0, No. 1

Pages

65-110

Physical description

Bibliography: 79 items, photographs, figures, tables, charts.

Contributors

author

Koeva S.

Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences, Sofia, Bulgaria

author

Stoyanova I.

Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences, Sofia, Bulgaria

author

Leseva S.

Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences, Sofia, Bulgaria

author

Dimitrova T.

Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences, Sofia, Bulgaria

author

Dekova R.

Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences, Sofia, Bulgaria

author

Tarpomanova E.

Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences, Sofia, Bulgaria

Bibliography

[1] Krasimira Aleksova (2000), Ezikat i semeystvoto. Kam metodikata za prouchvane na rechta v mikroobshtnostite (Language and the family. Towards a methodology for analysis of speech in micro social environments), Intervyu Press, Sofia.
[2] Beryl Sue Atkins (1992), Theoretical lexicography and its relation to dictionary-making, Dictionaries: Journal of the Dictionary Society of North America, 14: 4-43.
[3] Beryl Sue Atkins, Jeremy Clear, and Nicholas Ostler (1991), Corpus design criteria, http://www.natcorp.ox.ac.uk/archive/vault/tgaw02.pdf.
[4] Michele Banko and Eric Brill (2001), Scaling to very very large corpora for natural language disambiguation, in Proceedings of ACL 2001, pp. 26-33.
[5] Piotr Bański, Peter M. Fischer, Elena Frick, Erik Ketzan, Marc Kupietz, Carsten Schnober, Oliver Schonefeld, and Andreas Witt (2012), The New IDS Corpus Analysis Platform: challenges and prospects, in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 2905-2911.
[6] Piotr Bański and Adam Przepiórkowski (2010), The TEI and the NCP: the model and its application, in Proceedings of LREC 2010 Workshop on Language Resources: From Storyboard to Sustainability and LR Lifecycle Management (LRSLM2010), pp. 34-38.
[7] Marco Baroni and Adam Kilgarriff (2006), Large linguistically-processed web corpora for multiple languages, in Proceedings of European ACL, Trento, Italy, pp. 87-90.
[8] Božo Bekavac, Petya Osenova, Kiril Simov, and Marko Tadić (2004), Making monolingual corpora comparable: a case study of Bulgarian and Croatian, in M. T. Lino, M. F. Xavier, F. Ferreira, R. Costa, R. Silva, C. Pereira, F. Carvalho, M. Lopes, M. Catarino, and S. Barros, editors, Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, volume IV, pp. 1187-1190.
[9] Douglas Biber (1989), A typology of English texts, Linguistics, 27: 3-43.
[10] Douglas Biber (1993), Representativeness in corpus design, Literary and Linguistic Computing, 8 (4): 243-258.
[11] Ondřej Bojar, Zdeněk Žabokrtský, Ondřej Dušek, Petra Galuščáková, Martin Majliš, David Mareček, Jiří Maršík, Michal Novák, Martin Popel, and Aleš Tamchyna (2012), The joy of parallelism with CzEng 1.0, in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).
[12] Ted Briscoe (2006), An introduction to tag sequence grammars and the RASP system parser, Technical report, University of Cambridge, Computer Laboratory Technical Report.
[13] Lou Burnard (2005), Developing Linguistic Corpora: a Guide to Good Practice, chapter Metadata for corpus work, Oxford: Oxbow Books, http://ota.ahds.ac.uk/documents/creating/dlc/index.htm.
[14] František Čermák and Věra Schmiedtová (2003), The Czech National Corpus Project and lexicography, in M. Murata, S. Yamada, and Y. Tono, editors, Asialex ’03 Tokyo Proceedings: Dictionaries and Language Learning: How Can Dictionaries Help Human and Machine Learning?, pp. 74-80.
[15] Atanas Chanev, Kiril Simov, Petya Osenova, and Svetoslav Marinov (2006), Dependency conversion and parsing of the BulTreeBank, in Proceedings of the LREC Workshop Merging and Layering Linguistic Information, Genoa, Italy, 2006, pp. 16-23.
[16] Jonathan Chevelu, Nelly Barbot, Oliver Boeffard, and Arnaud Delhay (2007), Lagrangian relaxation for optimal corpus design, in Proceedings of the 6th ISCA Tutorial and Research Workshop on Speech Synthesis (SSW6), pp. 211-216.
[17] Christian Chiarcos and Tomaž Erjavec (2011), OWL/DL formalization of the MULTEXT-East morphosyntactic specifications, in Proceedings of the Linguistic Annotation Workshop 2011, pp. 11-20.
[18] Oliver Christ and Bruno M. Schulze (1994), The IMS Corpus Workbench: Corpus Query Processor (CQP) User’s Manual, University of Stuttgart, Germany.
[19] Jeremy Clear (1992), Corpus sampling, in G. Leitner, editor, New directions in English language corpora, Berlin: Mouton de Gruyter.
[20] David Crystal (1969), What is Linguistics?, London: Edward Arnold.
[21] David Crystal (1991), A Dictionary of Linguistics and Phonetics, Cambridge, MA: Basil Blackwell.
[22] James R. Curran and Miles Osborne (2002), A very very large corpus doesn’t always yield reliable estimates, in Proceedings of the 6th Conference on Natural Language Learning (CoNLL), pp. 126-131.
[23] Mark Davies (2010), The corpus of contemporary American English as the First Reliable Monitor Corpus of English, Literary and Linguistic Computing, 25 (4): 447-465.
[24] Ludmila Dimitrova, Tomaž Erjavec, Nancy Ide, Heiki-Jan Kaalep, Vladimir Petkevic, and Dan Tufiş (1998), Multext-East: parallel and comparable corpora and lexicons for six Central and Eastern European languages, in C. Boitet and P. Whitelock, editors, Proceedings of COLING-ACL’98: 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics, Montréal, Québec, Canada, pp. 315-319, San Francisco, Calif.: Morgan Kaufmann.
[25] Ludmila Dimitrova, Violetta Koseska, Danuta Roszko, and Roman Roszko (2009), Bulgarian-Polish-Lithuanian Corpus – current development, in Proceedings of the International Workshop Multilingual resources, technologies and evaluation for Central and Eastern European languages in conjunction with International Conference RANLP 2009, Borovec, Bulgaria, 17 September 2009, pp. 1-8.
[26] Tomaž Erjavec (2004), MULTEXT-East Version 3: multilingual morphosyntactic specifications, lexicons and corpora, in M. T. Lino, M. F. Xavier, F. Ferreira, R. Costa, R. Silva, C. Pereira, F. Carvalho, M. Lopes, M. Catarino, and S. Barros, editors, Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, pp. 1535-1538.
[27] Winthrop Nelson Francis and Henry Kučera (1964), Brown Corpus Manual, http://icame.uib.no/brown/bcm.html.
[28] Roger Garside, Geoffrey Leech, and Tony McEnery (1997), Corpus Annotation: Linguistic Information from Computer Text Corpora, London: Longman.
[29] Voula Giouli, Nikos Glaros, Kiril Simov, and Petya Osenova (2009), A web-enabled and speech-enhanced parallel corpus of Greek-Bulgarian cultural texts, in Proceedings of Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009), pp. 35-41.
[30] Abduelbaset Goweder and Anne De Roeck (2001), Assessment of a significant Arabic corpus, in Proceedings Workshop on Arabic Language Processing, 39th ACL, Toulouse.
[31] Atle Grønn and Irena Marijanovic (2010), Russian in contrast: Form, meaning and parallel corpora, Oslo Studies in Language (OSLa), 2 (1): 1-24.
[32] Michael Halliday (1985), An Introduction to Functional Grammar, Melbourne: Edward Arnold.
[33] Frank Keller and Mirella Lapata (2003), Using the web to obtain frequencies for unseen bigrams, Computational Linguistics, 29 (3): 459-484.
[34] Adam Kilgarriff and Gregory Grefenstette (2003), Introduction to the special Issue on Web as Corpus, Computational Linguistics, 29 (3): 333-347.
[35] Adam Kilgarriff, Vojtěch Kovář, and Pavel Rychlý (2009), Tickbox Lexicography, in eLexicography in the 21st century: new challenges, new applications, pp. 411-418, Brussels: Presses universitaires de Louvain.
[36] Jan Kocek, Marie Kopřivová, and Věra Schmiedtová (2000), The Czech National Corpus, in Proceedings of EURALEX 2000, pp. 127-132.
[37] Philipp Koehn (2005), Europarl: A parallel corpus for statistical machine translation, in Proceedings of MT Summit, pp. 79-86.
[38] Philipp Koehn and Hieu Hoang (2007), Factored Translation Models, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Prague, Czech Republic, June 2007.
[39] Svetla Koeva (2006), Inflection morphology of Bulgarian multiword expressions, in Computer Applications in Slavic Studies, pp. 201-216, Boyan Penev Publishing Center.
[40] Svetla Koeva (2010), Balgarskiyat semantichno anotiran korpus – teoretichni postanovki (Bulgarian semantically annotated corpus – theoretical concepts), in Balgarskiyat semantichno anotiran korpus (Bulgarian semantically annotated corpus), IBL.
[41] Svetla Koeva, Diana Blagoeva, and Sia Kolkovska (2010), Bulgarian National Corpus Project, in N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, and D. Tapias, editors, Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010), pp. 3678-3684.
[42] Svetla Koeva and Angel Genov (2011), Bulgarian language processing chain, in Proceedings of Integration of multilingual resources and tools in Web applications. Workshop in conjunction with GSCL 2011, 26 September 2011, University of Hamburg.
[43] Svetla Koeva, Svetlozara Leseva, Borislav Rizov, Ekaterina Tarpomanova, Tsvetana Dimitrova, Hristina Kukova, and Maria Todorova (2011), Design and development of the Bulgarian sense-annotated corpus, in Las tecnologías de la información y las comunicaciones: Presente y futuro en el análisis de córpora. Actas del III Congreso Internacional de Lingüística de Corpus. Valencia: Universitat Politècnica de València, pp. 143-150.
[44] Svetla Koeva, Svetlozara Leseva, Ivelina Stoyanova, Ekaterina Tarpomanova, and Maria Todorova (2006), Bulgarian tagged corpora, in Proceedings of the Fifth International Conference Formal Approaches to South Slavic and Balkan Languages, Sofia, Bulgaria, pp. 78-86.
[45] Svetla Koeva, Borislav Rizov, Ekaterina Tarpomanova, Tsvetana Dimitrova, Rositsa Dekova, Ivelina Stoyanova, Svetlozara Leseva, Hristina Kukova, and Angel Genov (2012a), Application of clause alignment for statistical machine translation, in Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-6), 12 July 2012, Jeju, Korea.
[46] Svetla Koeva, Ivelina Stoyanova, Rositsa Dekova, Borislav Rizov, and Angel Genov (2012b), Bulgarian X-language parallel corpus, in N. Calzolari, K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 2480-2486, http://www.lrec-conf.org/proceedings/lrec2012/pdf/587_Paper.pdf.
[47] Gunther Kress (1993), Against arbitrariness, Discourse and Society, 4 (2): 169-191.
[48] Marc Kupietz, Cyril Belica, Holger Keibel, and Andreas Witt (2010), The German Reference Corpus DEREKO: A Primordial sample for linguistic research, in N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, and D. Tapias, editors, Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC 2010), pp. 1848-1854.
[49] David Lee (2001), Genres, registers, text types, domains and style: clarifying the concepts and navigating a path through BNC jungle, Language Learning & Technology, 5 (3): 37-72.
[50] Geoffrey Leech (1991), The state of the art in corpus linguistics, in English Corpus Linguistics: Linguistic Studies in Honour of Jan Svartvik, pp. 8-29, London: Longman.
[51] Christopher Manning and Hinrich Schütze (1999), Foundations of Statistical Natural Language Processing, MIT Press.
[52] Tony McEnery, Richard Xiao, and Yukio Tono (2006), Corpus-Based Language Studies. An Advanced Resource Book, Routledge.
[53] Charles F. Meyer (2002), English Corpus Linguistics. An Introduction, Cambridge University Press.
[54] Robert C. Moore (2002), Fast and accurate sentence alignment of bilingual corpora, in AMTA’02: Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users, pp. 135-144, London, UK: Springer-Verlag.
[55] Graham Neubig (2011), The Kyoto Free Translation Task, http://www.phontron.com/kftt.
[56] Cvetanka Nikolova (1987), Chestoten rechnik na balgarskata razgovorna rech (A Frequency Dictionary of Colloquial Bulgarian), Sofia: Nauka i izkustvo.
[57] Monica Paramita, Ahmet Aker, Robert Gaizauskas, Paul Clough, Emma Barker, Nikos Mastropavlos, and Dan Tufiş (2011), Report on methods for collection of comparable corpora, ACCURAT – Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation.
[58] Brian Pinkerton (1994), Finding what people want: Experiences with the WebCrawler, in Proceedings of the First World Wide Web Conference, Geneva, Switzerland, http://thinkpink.com/bp/webcrawler/www94.html.
[59] Jan Pomikálek, Miloš Jakubíček, and Pavel Rychlý (2012), Building a 70 billion word corpus of English from ClueWeb, in N. Calzolari, K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 502-506.
[60] Jan Pomikálek, Pavel Rychlý, and Adam Kilgarriff (2009), Scaling to billion-plus word corpora, Advances in Computational Linguistics. Special Issue of Research in Computing Science, 41: 3-14.
[61] Adam Przepiórkowski (2011), Linguistic annotation of the National Corpus of Polish, FDSL 9, http://www.uni-goettingen.de/de/document/download/cbcf2e9ded91b3c41d0c460c31d1d9bb.pdf/nkjp.pdf.
[62] Adam Przepiórkowski, Marek Łaziński, Rafał L. Górski, and Barbara Lewandowska-Tomaszczyk (2010), Recent developments in the National Corpus of Polish, in N. Calzolari, K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis, editors, Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC 2010), pp. 994-997.
[63] Adam Przepiórkowski and Marcin Woliński (2003), The unbearable lightness of tagging: A case study in morphosyntactic tagging of Polish, in Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC 2003), EACL 2003, pp. 109-116.
[64] Kiril Simov and Petya Osenova (2004), A hybrid strategy for regular grammar parsing, in M. T. Lino, M. F. Xavier, F. Ferreira, R. Costa, R. Silva, C. Pereira, F. Carvalho, M. Lopes, M. Catarino, and S. Barros, editors, Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, pp. 431-434.
[65] Kiril Simov, Petya Osenova, Sia Kolkovska, Elisaveta Balabanova, Dimitar Doikoff, Krassimira Ivanova, Alexander Simov, and Milen Kouylekov (2002), Building a linguistically interpreted corpus of Bulgarian: the BulTreeBank, in Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Canary Islands, Spain, pp. 1729-1736.
[66] John Sinclair (2005), Developing Linguistic Corpora: a Guide to Good Practice, chapter Corpus and text – basic principles, Oxford: Oxbow Books, http://ahds.ac.uk/linguistic-corpora/.
[67] Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, and Daniel Varga (2006), The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages, in Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), pp. 2142-2147.
[68] Marko Tadić (2002), Building the Croatian National Corpus, in Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Canary Islands, Spain, pp. 441-446.
[69] Jörg Tiedemann (2009), News from OPUS – A collection of multilingual parallel corpora with tools and interfaces, in N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V, pp. 237-248, Amsterdam/Philadelphia: John Benjamins.
[70] Tinko Tinchev, Svetla Koeva, Borislav Rizov, and Nikola Obreshkov (2007), System for advanced search in corpora, in Literature and Writing in Internet, pp. 92-111, Sofia: St. Kliment Ohridski University Press.
[71] Tzvetan Todorov (1984), Mikhail Bakhtin: The Dialogical Principle, Minneapolis: University of Minnesota Press.
[72] Peter Trudgill (1992), Introducing Language and Society, London: Penguin.
[73] Dan Tufiş, Svetla Koeva, Tomaž Erjavec, Maria Gavrilidou, and Cvetana Krstev (2009), ID10503 Building Language Resources and Translation Models for Machine Translation focused on South Slavic and Balkan Languages, in Scientific results of the SEE-ERA.NET Pilot Joint Call, Vienna, pp. 37-48.
[74] Dániel Varga, László Németh, Péter Halácsy, András Kornai, Viktor Trón, and Viktor Nagy (2005), Parallel corpora for medium density languages, in Proceedings of the RANLP 2005, pp. 590-596.
[75] Ruprecht von Waldenfels (2006), Compiling a parallel corpus of Slavic languages. Text strategies, tools and the question of lemmatization in alignment, in B. Brehmer, V. Zhdanova, and R. Zimny, editors, Beiträge der Europäischen Slavistischen Linguistik (POLYSLAV) 9, pp. 123-138, München: Sagner.
[76] Ruprecht von Waldenfels (2011), Recent developments in ParaSol: Breadth for depth and XSLT based web concordancing with CWB, in Daniela M. and R. Garabík, editors, Natural Language Processing, Multilinguality. Proceedings of Slovko 2011, Modra, Slovakia, 20-21 October 2011, pp. 156-162.
[77] Richard Xiao (2010), Corpus creation, in The Handbook of Natural Language Processing, pp. 147-165.
[78] Jia Xu and Weiwei Sun (2011), Generating virtual parallel corpus: a compatibility centric method, in Proceedings of the Machine Translation Summit XIII.
[79] Dan-Hee Yang, Pascual Cantos Gomez, and Mansuk Song (2000), An Algorithm for Predicting the Relation between Lemmas and Corpus Size, ETRI Journal, 22 (2): 20-31.

Notes

Record prepared with funding from MNiSW, agreement No. 461252, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science) - module: Popularisation of science and promotion of sport (2020).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-3bd3f4ab-ed9e-4a6f-9206-ebee76aa12b2