The Concept Of Topological Information In Text Representation

Susmaga, R.; Masłowska, I.; Budzyńska, L.

Abstract

This paper studies the possibility of processing text documents using topological information on keywords, by which we mean the internal positions of the keywords in the text. While word counts are independent of the sequence of words in the text, topological (i.e. position-related) information depends directly on that sequence. As a result, the presented method stops treating texts as amorphous collections of words and starts treating them as linearly ordered sequences of words. The introduced topological approach thus operates at a higher level than the popular bag-of-words approaches, and its advantage should become apparent when it is applied to texts with similar themes: because such texts have similar keyword counts, the topological information may prove indispensable. It should also require significantly smaller sets of keywords than the bag-of-words approaches.
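
The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method, which the abstract does not specify; the function names, the normalisation of positions to the range 0..1, and the two sample sentences are all assumptions made for illustration. The sketch only shows that two texts can have identical keyword counts while their keyword positions differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_counts(tokens, keywords):
    """Bag-of-words view: occurrences of each keyword, word order ignored."""
    counts = Counter(t for t in tokens if t in keywords)
    return {k: counts[k] for k in sorted(keywords)}


def keyword_positions(tokens, keywords):
    """Position-based view: relative position (0..1) of every keyword occurrence.

    Normalising by text length is an illustrative assumption, not taken
    from the paper.
    """
    n = len(tokens)
    positions = {k: [] for k in sorted(keywords)}
    for i, token in enumerate(tokens):
        if token in keywords:
            positions[token].append(i / (n - 1) if n > 1 else 0.0)
    return positions


# Hypothetical keywords and documents, chosen so that the counts coincide.
keywords = {"storm", "harbour"}
doc_a = "The storm hit early and the ships reached the harbour late"
doc_b = "The harbour was calm at dawn but a storm came by night"

tokens_a, tokens_b = tokenize(doc_a), tokenize(doc_b)
counts_a = keyword_counts(tokens_a, keywords)
counts_b = keyword_counts(tokens_b, keywords)
positions_a = keyword_positions(tokens_a, keywords)
positions_b = keyword_positions(tokens_b, keywords)

print("same counts:   ", counts_a == counts_b)
print("same positions:", positions_a == positions_b)
```

In this toy case the count vectors are equal, so a bag-of-words representation cannot separate the two documents, whereas the position lists show that "storm" precedes "harbour" in one text and follows it in the other.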

Keywords

text mining; modelling textual data; text representation; topological information

Publisher

Wydawnictwo Politechniki Poznańskiej

Journal

Foundations of Computing and Decision Sciences

Year

2011

Volume

Vol. 36, No. 1

Pages

57-78

Physical description

Bibliography: 27 items.

Contributors

author

Susmaga R.

author

Masłowska I.

author

Budzyńska L.

Institute of Computing Science, Poznań University of Technology, Piotrowo 2, Poznań, Poland, robert.susmaga@cs.put.poznan.pl

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BPP2-0019-0057