Reprezentacje dokumentów tekstowych w kontekście wykrywania niechcianych wiadomości pocztowych w języku polskim z wtrąceniami w języku angielskim

Andruszkiewicz, P.

Article - details

Title variants

Representations of text documents in context of spam detection in Polish with English phrases

Abstracts

Classifying text documents requires building a representation of them. The large number of documents encourages the use of the most compact representations possible. This paper presents possible representations of text documents and ways of reducing them, in the context of detecting unwanted e-mail messages in Polish with English insertions.

A representation of text documents should be as compact as possible while still yielding high classification accuracy. This paper presents representations of text documents and ways of reducing them for spam detection in Polish messages with English phrases.

Keywords

text document representation; attribute significance functions; TF-IDF; representation reduction; classification; unwanted e-mail detection

text document representation; term weighting functions; TF-IDF; reduction of text document representation; classification; SPAM detection

Publisher

Wydawnictwo Politechniki Śląskiej

Journal

Studia Informatica

Year

2009

Volume

Vol. 30, no. 2A

Pages

273-286

Physical description

Bibliography: 31 items.

Contributors

author

Andruszkiewicz P.

Instytut Informatyki, Politechnika Warszawska, 00-665 Warszawa, ul. Nowowiejska 15/19, tel. (022) 234-77-15, P.Andruszkiewicz@ii.pw.edu.pl

Bibliography

1. Barnbrook G.: Defining Language: A Local Grammar of Definition Sentences. John Benjamins 2002.
2. Bolc L., Cytowski J.: Modern Search Methods. Instytut Podstaw Informatyki PAN, Warszawa 1992.
3. Boratyński D.: Metody klasyfikacji dokumentów tekstowych w języku polskim. In: Wyzwania gospodarki elektronicznej - stan i perspektywy, Ed. Tadeusz Grabiński, WSPiM, Chrzanów 2005.
4. Chakrabarti S.: Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Francisco 2003.
5. Chrabąszcz M., Gołębski M., Bembenik R.: Metody klasyfikacji dokumentów tekstowych. Informatyka Teoretyczna i Stosowana, Wydawnictwo Politechniki Częstochowskiej, no. 3, Częstochowa 2002, pp. 89-100.
6. Church K. W., Gale W. A.: Inverse document frequency (IDF): A measure of deviations from Poisson. Proceedings of the Third Workshop on Very Large Corpora (WVLC), pp. 121-130, 1995.
7. Fang H., Tao T., Zhai C.: A formal study of information retrieval heuristics. Proceedings of SIGIR, 2004, pp. 49-56.
8. Fawcett T.: In vivo spam filtering: a challenge problem for KDD. SIGKDD Explorations 5(2), 2003, pp. 140-148.
9. Feldman R., Sanger J.: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, New York 2007.
10. Gawrysiak P.: Automatyczna klasyfikacja dokumentów, PhD thesis, 2001.
11. Hastie T., Tibshirani R., Friedman J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York 2001.
12. http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml
13. Jain A., Murty M., Flynn P.: Data Clustering: A Review. ACM Computing Surveys, 31(3), September 1999.
14. Kroon de H., Mitchell T., Kerckhoffs E.: Improving learning accuracy in information filtering. International Conference on Machine Learning - Workshop on Machine Learning Meets HCI (ICML-96), 1996.
15. Liu R., Lu Y.: Incremental context mining for adaptive document classification. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002, pp. 599-604.
16. Mani I., Maybury M. T.: Advances in Automatic Text Summarization. MIT Press 2001.
17. Ponte J. M., Croft W. B.: A Language Modeling Approach to Information Retrieval. Research and Development in Information Retrieval, 1998, pp. 275-281.
18. Porter M. F.: An Algorithm for Suffix Stripping, Program, 14(3), 1980, pp. 130-137.
19. Rijsbergen Van C. J.: Information Retrieval. Dept. of Computer Science, University of Glasgow, 1979.
20. Robertson S. E.: Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5), 2004, pp. 503-520.
21. Robertson S. E., Jones K. S.: Simple proven approaches to text retrieval. Tech. Rep. TR356, Cambridge University Computer Laboratory, 1997.
22. Salib M., Sheer M.: Spam Classification with Naive Bayes and Smart Heuristics, 2002.
23. Saul L., Pereira F.: Aggregate and Mixed-Order Markov Models for Statistical Language Processing, Association for Computational Linguistics, New Jersey 1997.
24. Scime A.: Web Mining: Applications and Techniques. Idea Group Inc (IGI) 2005.
25. Shakeri S., Rosso P.: Spam Detection and Email Classification. Information Assurance and Computer Security, IOS Press, 2006.
26. Song F., Croft W. B.: A General Language Model for Information Retrieval (poster abstract). Eighth International Conference on Information and Knowledge Management (CIKM'99), 1999.
27. Sparck Jones K.: IDF term weighting and IR research lessons. Journal of Documentation 60, 2004, pp. 521-523.
28. Stefanowski J., Zienkowicz M.: Classification of Polish Email Messages: Experiments with Various Data Representations. ISMIS 2006, pp. 723-728.
29. Willett P.: Recent Trends in Hierarchic Document Clustering: A Critical Review. Information Processing and Management, 24(5), 1988, pp. 577-597.
30. Youn S., McLeod D.: Efficient Spam Email Filtering using Adaptive Ontology. International Conference on Information Technology (ITNG'07), 2007, pp. 249-254.
31. Zorkadis V., Karras D. A., Panayotou M.: Efficient information theoretic strategies for classifier combination, feature extraction and performance evaluation in improving false positives and false negatives for spam e-mail filtering. Neural Networks 18(5-6), 2005, pp. 799-807.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BSL9-0027-0025