Article title

An Efficient Framework for Extracting Parallel Sentences from Non-Parallel Corpora

Publication languages
EN
Abstract
EN
Automatically building a large bilingual corpus containing millions of words is a challenging task. For low-resource languages in particular, it is difficult to find an existing parallel corpus large enough to build a practical statistical machine translation system. However, comparable non-parallel corpora are abundant on the Internet, for example in Wikipedia, and valuable parallel texts can be extracted from them. This work presents a framework for effectively extracting parallel sentences from such resources, which significantly improves the performance of statistical machine translation systems. Our framework is a bootstrapping-based method strengthened by a new measure for estimating the similarity between two sentences in different languages. We conduct experiments on the English-Vietnamese language pair and obtain promising results both in constructing parallel corpora and in improving the accuracy of machine translation from English to Vietnamese.
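The abstract does not specify the similarity measure or the bootstrapping procedure, so the following is only a generic sketch of the idea, not the authors' method. It assumes a toy lexicon-coverage score (the hypothetical `similarity` function) and a deliberately naive lexicon update in place of retraining a translation model; the names `bootstrap`, `covered_targets`, the threshold, and the example sentences are all illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple

Sentence = List[str]
Pair = Tuple[Sentence, Sentence]
Lexicon = Dict[str, Set[str]]


def covered_targets(src: Sentence, tgt: Sentence, lexicon: Lexicon) -> Set[str]:
    """Target words that are a lexicon translation of some source word."""
    tgt_words = set(tgt)
    covered: Set[str] = set()
    for word in src:
        covered |= lexicon.get(word, set()) & tgt_words
    return covered


def similarity(src: Sentence, tgt: Sentence, lexicon: Lexicon) -> float:
    """Toy score (an assumption): share of words on both sides covered by the lexicon."""
    if not src or not tgt:
        return 0.0
    tgt_words = set(tgt)
    src_hits = sum(1 for w in src if lexicon.get(w, set()) & tgt_words)
    covered = covered_targets(src, tgt, lexicon)
    tgt_hits = sum(1 for w in tgt if w in covered)
    return (src_hits + tgt_hits) / (len(src) + len(tgt))


def bootstrap(candidates: List[Pair], seed: Lexicon,
              threshold: float = 0.5, max_rounds: int = 5) -> Tuple[List[Pair], Lexicon]:
    """Accept pairs above the threshold, grow the lexicon, and repeat."""
    lexicon = {w: set(ts) for w, ts in seed.items()}  # copy; the seed is not mutated
    accepted: List[Pair] = []
    pending = list(candidates)
    for _ in range(max_rounds):
        newly, rest = [], []
        for pair in pending:
            score = similarity(pair[0], pair[1], lexicon)
            (newly if score >= threshold else rest).append(pair)
        if not newly:
            break  # nothing new was extracted, so further rounds cannot help
        accepted.extend(newly)
        pending = rest
        # Naive stand-in for retraining: if exactly one word is unexplained
        # on each side of an accepted pair, record them as translations.
        for src, tgt in newly:
            covered = covered_targets(src, tgt, lexicon)
            tgt_words = set(tgt)
            free_src = [w for w in src if not lexicon.get(w, set()) & tgt_words]
            free_tgt = [w for w in tgt if w not in covered]
            if len(free_src) == 1 and len(free_tgt) == 1:
                lexicon.setdefault(free_src[0], set()).add(free_tgt[0])
    return accepted, lexicon


seed_lexicon = {"i": {"tôi"}, "love": {"yêu"}}
candidate_pairs = [
    (["i", "love", "cats"], ["tôi", "yêu", "mèo"]),
    (["cats", "sleep"], ["mèo", "ngủ"]),
    (["dogs", "bark"], ["hôm", "nay"]),  # not a translation pair
]
accepted, learned = bootstrap(candidate_pairs, seed_lexicon)
```

In this toy run the second pair is rejected in the first round and accepted only after the first pair has extended the lexicon, which is the effect bootstrapping is meant to produce. A real system would replace the lexicon update with retraining of the translation or alignment model on the extracted pairs.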
Pages
179–199
Physical description
Bibliography: 31 items, figures, tables.
Authors
author
  • University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
author
  • University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
author
  • University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
author
  • University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
author
  • Japan Advanced Institute of Science and Technology, Japan and John von Neumann Institute, Vietnam National University at Ho Chi Minh City, Vietnam
Bibliography
  • [1] Abdul-Rauf, S., Schwenk, H.: On the use of comparable corpora to improve SMT performance, Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’09, Association for Computational Linguistics, Stroudsburg, PA, USA, 2009.
  • [2] Abdul-Rauf, S., Schwenk, H.: Exploiting comparable corpora with TER and TERp, Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora, BUCC ’09, Association for Computational Linguistics, Stroudsburg, PA, USA, 2009, ISBN 978-1-932432-53-4.
  • [3] Abdul-Rauf, S., Schwenk, H.: Parallel sentence generation from comparable corpora for improved SMT, Machine Translation, 25(4), December 2011, 341–375, ISSN 0922-6567.
  • [4] Achananuparp, P., Hu, X., Shen, X.: The Evaluation of Sentence Similarity Measures, Proceedings of the 10th international conference on Data Warehousing and Knowledge Discovery, DaWaK ’08, Springer-Verlag, Berlin, Heidelberg, 2008, ISBN 978-3-540-85835-5.
  • [5] Adafre, S. F., de Rijke, M.: Finding Similar Sentences across Multiple Languages in Wikipedia, Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, 2006, 62–69.
  • [6] Banerjee, S., Pedersen, T.: Extended gloss overlaps as a measure of semantic relatedness, Proceedings of the 18th international joint conference on Artificial intelligence, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
  • [7] Brown, P. F., deSouza, P. V., Mercer, R. L., Pietra, V. J. D., Lai, J. C.: Class-Based n-gram Models of Natural Language, Computational Linguistics, 18, 1992, 467–479.
  • [8] Brown, P. F., Lai, J. C., Mercer, R. L.: Aligning sentences in parallel corpora, Proceedings of the 29th annual meeting on Association for Computational Linguistics, ACL ’91, Association for Computational Linguistics, Stroudsburg, PA, USA, 1991.
  • [9] Brown, P. F., Pietra, V. J. D., Pietra, S. A. D., Mercer, R. L.: The mathematics of statistical machine translation: parameter estimation, Comput. Linguist., 19, June 1993, 263–311, ISSN 0891-2017.
  • [10] Chen, S. F.: Aligning sentences in bilingual corpora using lexical information, Proceedings of the 31st annual meeting on Association for Computational Linguistics, ACL ’93, Association for Computational Linguistics, Stroudsburg, PA, USA, 1993.
  • [11] Chen, S. F., Goodman, J.: An empirical study of smoothing techniques for language modeling, Proceedings of the 34th annual meeting on Association for Computational Linguistics, ACL ’96, Association for Computational Linguistics, Stroudsburg, PA, USA, 1996.
  • [12] Do, T. N., Besacier, L., Castelli, E.: A Fully Unsupervised Approach for Mining Parallel Data from Comparable Corpora, European Conference on Machine Translation (EAMT) 2010, Saint-Raphael (France), June 2010.
  • [13] Do, T. N., Besacier, L., Castelli, E.: Unsupervised SMT for a Low-Resourced Language Pair, 2nd Workshop on Spoken Languages Technologies for Under-Resourced Languages (SLTU 2010), Penang (Malaysia), May 2010.
  • [14] Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics, Proceedings of the second international conference on Human Language Technology Research, HLT ’02, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002.
  • [15] Erdmann, M., Nakayama, K., Hara, T., Nishio, S.: Improving the extraction of bilingual terminology from Wikipedia, ACM Trans. Multimedia Comput. Commun. Appl., 5(4), November 2009, 31:1–31:17, ISSN 1551-6857.
  • [16] Gale, W. A., Church, K. W.: A program for aligning sentences in bilingual corpora, Comput. Linguist., 19, March 1993, 75–102, ISSN 0891-2017.
  • [17] Hoang, C., Cuong, L. A., Thai, N. P., Bao, H. T.: Exploiting Non-Parallel Corpora for Statistical Machine Translation, Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2012 IEEE RIVF International Conference on, 2012.
  • [18] Hoang, C., Le, C.-A., Pham, S.-B.: Improving the Quality of Word Alignment by Integrating Pearson’s Chi-Square Test Information, Proceedings of the 2012 International Conference on Asian Language Processing, IALP ’12, IEEE Computer Society, Washington, DC, USA, 2012, ISBN 978-0-7695-4886-9.
  • [19] Hoang, C., Le, C. A., Pham, S. B.: Refining lexical translation training scheme for improving the quality of statistical phrase-based translation, Proceedings of the Third Symposium on Information and Communication Technology, SoICT ’12, ACM, New York, NY, USA, 2012, ISBN 978-1-4503-1232-5.
  • [20] Hoang, C., Le, C. A., Pham, S. B.: A Systematic Comparison between Various Statistical Alignment Models for Statistical English-Vietnamese Phrase-Based Translation, Proceedings of the 2012 Fourth International Conference on Knowledge and Systems Engineering, KSE ’12, IEEE Computer Society, Washington, DC, USA, 2012, ISBN 978-0-7695-4760-2.
  • [21] Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: open source toolkit for statistical machine translation, Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, Association for Computational Linguistics, Stroudsburg, PA, USA, 2007.
  • [22] Koehn, P., Och, F. J., Marcu, D.: Statistical phrase-based translation, Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, Association for Computational Linguistics, Stroudsburg, PA, USA, 2003.
  • [23] Lau, R., Rosenfeld, R., Roukos, S.: Adaptive language modeling using the maximum entropy principle, Proceedings of the workshop on Human Language Technology, HLT ’93, Association for Computational Linguistics, Stroudsburg, PA, USA, 1993, ISBN 1-55860-324-7.
  • [24] Lopez, A.: Statistical machine translation, ACM Comput. Surv., 40(3), 2008.
  • [25] Munteanu, D. S., Marcu, D.: Improving Machine Translation Performance by Exploiting Non-Parallel Corpora, Comput. Linguist., 31, December 2005, 477–504, ISSN 0891-2017.
  • [26] Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: BLEU: a method for automatic evaluation of machine translation, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, Association for Computational Linguistics, Stroudsburg, PA, USA, 2002.
  • [27] Ponzetto, S. P., Strube, M.: Knowledge derived from Wikipedia for computing semantic relatedness, J. Artif. Int. Res., 30, October 2007, 181–212, ISSN 1076-9757.
  • [28] Simard, M., Foster, G. F., Isabelle, P.: Using cognates to align sentences in bilingual corpora, Proceedings of the 1993 conference of the Centre for Advanced Studies on Collaborative research: distributed computing - Volume 2, CASCON ’93, IBM Press, 1993.
  • [29] Smith, J. R., Quirk, C., Toutanova, K.: Extracting parallel sentences from comparable corpora using document level alignment, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, Association for Computational Linguistics, Stroudsburg, PA, USA, 2010, ISBN 1-932432-65-5.
  • [30] Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J.: A study of translation edit rate with targeted human annotation, Proceedings of Association for Machine Translation in the Americas, 2006.
  • [31] Tyers, F., Pienaar, J.: Extracting Bilingual Word Pairs from Wikipedia, Proceedings of the SALTMIL Workshop at Language Resources and Evaluation Conference, LREC08, 2008.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-4d36a729-c240-4e4a-b807-63df242bb331