Abstract
The paper focuses on adapting NLP tools for Polish (e.g., morphological analyzers and parsers) to user-generated content (UGC). The authors discuss two rule-based techniques applied to improve the tools' efficiency: pre-processing (text normalization) and parser adaptation (modified segmentation and parsing rules). A new solution for handling out-of-vocabulary words (OOVs), based on inflectional translation, is also offered.
Pages
23–44
Physical description
Bibliography: 38 items, figures, tables.
Authors
author
- Institute of Slavic Studies, Polish Academy of Sciences, Warsaw, Poland; Fido Intelligence, Gdansk, Poland
author
- AGH University of Science and Technology, Faculty of Computer Science, Electronics and Telecommunications, Department of Computer Science, Krakow, Poland
author
- AGH University of Science and Technology, Faculty of Computer Science, Electronics and Telecommunications, Department of Computer Science, Krakow, Poland
Bibliography
- 1. Aw A., Zhang M., Xiao J., Su J.: A phrase-based statistical model for SMS text normalization. In: Proceedings of the COLING/ACL on Main conference poster sessions, pp. 33–40, Association for Computational Linguistics, 2006.
- 2. Beaufort R., Roekhaut S., Cougnon L.A., Fairon C.: A hybrid rule/model-based finite-state framework for normalizing SMS messages. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 770–779, Association for Computational Linguistics, 2010.
- 3. Buczynski A., Wawer A.: Shallow parsing in sentiment analysis of product reviews. In: Proceedings of the Partial Parsing workshop at LREC, vol. 2008, pp. 14–18, 2008.
- 4. Chiticariu L., Li Y., Reiss F.R.: Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems! In: EMNLP, pp. 827–832, 2013.
- 5. Choudhury M., Saraf R., Jain V., Mukherjee A., Sarkar S., Basu A.: Investigation and modeling of the structure of texting language. International Journal of Document Analysis and Recognition (IJDAR), vol. 10(3–4), pp. 157–174, 2007.
- 6. Cook P., Stevenson S.: An unsupervised model for text message normalization. In: Proceedings of the workshop on computational approaches to linguistic creativity, pp. 71–78, Association for Computational Linguistics, 2009.
- 7. Graliński F.: Formalizacja nieciągłości zdań przy zastosowaniu rozszerzonej gramatyki bezkontekstowej. Ph.D. thesis, Adam Mickiewicz University, Faculty of Mathematics and Computer Science, Poznań, 2007.
- 8. Graliński F., Jassem K., Junczys-Dowmunt M.: PSI-toolkit: A natural language processing pipeline. In: A. Przepiórkowski, M. Piasecki, K. Jassem, P. Fuglewicz, eds., Computational Linguistics, Studies in Computational Intelligence, vol. 458, pp. 27–39, Springer, 2013.
- 9. Grzenia J.: Komunikacja językowa w Internecie. Wydawnictwo Naukowe PWN, Warszawa, 2006.
- 10. Gunelius S.: The data explosion in 2014 minute by minute – Infographic. Newstex, vol. 12(07), 2014.
- 11. Haniewicz K., Kaczmarek M., Adamczyk M., Rutkowski W.: Polarity lexicon for the Polish language: Design and extension with random walk algorithm. In: Advances in Systems Science, pp. 173–182, Springer, 2014.
- 12. Hu M., Liu B.: Mining opinion features in customer reviews. In: AAAI, vol. 4, pp. 755–760, 2004.
- 13. Hwa R.: Sample selection for statistical parsing. Computational Linguistics, vol. 30(3), pp. 253–276, 2004.
- 14. Kędzia P., Piasecki M., Orlińska M.: Word Sense Disambiguation Based on Large Scale Polish CLARIN Heterogeneous Lexical Resources. Cognitive Studies, (15), pp. 269–292, 2015, http://dx.doi.org/10.11649/cs.2015.019.
- 15. Kobus C., Yvon F., Damnati G.: Normalizing SMS: are two metaphors better than one? In: Proceedings of the 22nd International Conference on Computational Linguistics, vol. 1, pp. 441–448, Association for Computational Linguistics, 2008.
- 16. Kopeć M.: Polski Korpus Koreferencyjny – wersja 0.85. 2013, http://zil.ipipan.waw.pl/PolishCoreferenceCorpus.
- 17. Krupa T.: Studium przypadku – system ISPAD. In: B. Wiszniewski, ed., Inteligentne wydobywanie informacji ze społecznościowych serwisów internetowych, Automatyka i Informatyka. Technologie Informacyjne. Internet i Sieci Semantyczne, pp. 121–139, PWNT, 2011.
- 18. Liu B.: Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, vol. 5(1), pp. 1–167, 2012.
- 19. Luo W., Litman D.J., Chan J.: Reducing Annotation Effort on Unbalanced Corpus based on Cost Matrix. In: HLT-NAACL, pp. 8–15, 2013.
- 20. Manning C.: Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In: Computational Linguistics and Intelligent Text Processing, pp. 171–189, Springer, 2011.
- 21. Manning C.: Evaluation of Constituency Parsers. Stanford lectures online. 2012, http://www.youtube.com/watch?v=mMXgbrts82M.
- 22. Manning C., Schütze H.: Foundations of statistical natural language processing. MIT press, 1999.
- 23. Martínez P., Segura I., Declerck T., Martínez J.L.: TrendMiner: Large-scale Crosslingual Trend Mining Summarization of Real-time Media Streams. Procesamiento del Lenguaje Natural, vol. 53, pp. 163–166, 2014.
- 24. Nagarajan M., Gamon M.: Workshop on Language and Social Media – Introduction. In: Proceedings of LSM 2011, 2011.
- 25. Ong W.J.: Orality and literacy. Routledge, 2013.
- 26. Piasecki M.: Polish tagger TaKIPI: Rule based construction and optimisation. Task Quarterly, vol. 11(1–2), pp. 151–167, 2007.
- 27. Przepiórkowski A., Bańko M., Górski R.L., Lewandowska-Tomaszczyk B.: Narodowy Korpus Języka Polskiego. 2012, www.nkjp.pl.
- 28. Raghunathan K., Lee H., Rangarajan S., Chambers N., Surdeanu M., Jurafsky D., Manning C.: A multi-pass sieve for coreference resolution. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 492–501, Association for Computational Linguistics, 2010.
- 29. Ratuszniak B.: Monitoring social media. Co oferują firmy? [online], 2012, http://goo.gl/8p7mGp, accessed: 25.04.2012.
- 30. Ray A.: Customer Affinity Meets Brand Vectors: Sentiment that Matters. 2013, Sentiment Analysis Symposium, New York.
- 31. Read J., Flickinger D., Dridan R., Oepen S., Øvrelid L.: The WeSearch Corpus, Treebank, and Treecache. A comprehensive sample of user-generated content. In: Proceedings of the 8th International Conference on Language Resources and Evaluation, Citeseer, 2012.
- 32. Rodríguez-Penagos C., Atserias J., Codina-Filba J., García-Narbona D., Grivolla J., Lambert P., Saurí R.: Combining lexicon-based ML and heuristics. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), vol. 2, pp. 483–489, Association for Computational Linguistics, 2013.
- 33. Searle J.R.: Speech acts: An essay in the philosophy of language, vol. 626. Cambridge University Press, 1969.
- 34. Skórzewski P.: Gobio and PSI-Toolkit: Adapting a deep parser to an NLP toolkit. In: Z. Vetulani, H. Uszkoreit, eds., Proceedings of the 6th Language and Technology Conference, pp. 523–526, Fundacja UAM, Poznań, 2013.
- 35. Świdziński M.: Gramatyka formalna języka polskiego. Wydawnictwo Uniwersytetu Warszawskiego, 1992.
- 36. Van Hee C., Van de Kauter M., De Clercq O., Lefever E., Hoste V.: LT3: Sentiment Classification in User-Generated Content Using a Rich Feature Set. In: Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 406–410, Association for Computational Linguistics and Dublin City University, Dublin, Ireland, 2014.
- 37. Zdunkiewicz-Jedynak D., Ciunovič M.: Ćwiczenia ze stylistyki. Wydawnictwo Naukowe PWN, 2010.
- 38. Xue Z., Yin D., Davison B.D.: Normalizing microtext. In: Proceedings of the AAAI-11 Workshop on Analyzing Microtext, pp. 74–79, AAAI, San Francisco, 2011.
Notes
PL
Prepared with funds from the Ministry of Science and Higher Education (MNiSW) under agreement 812/P-DUN/2016 for science dissemination activities.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-646d39ba-7fd8-4136-b71d-03a059c47332