

Article title

The Concept Of Topological Information In Text Representation

Identifiers
Title variants
Publication languages
EN
Abstracts
EN
This paper studies the possibility of processing text documents using topological information about keywords, by which we mean the internal positions of the keywords in the text. While word counts are independent of the order of words in the text, topological (i.e. position-related) information depends directly on that order. As a result, the presented method stops treating texts as amorphous collections of words and starts treating them as linearly ordered sequences of words. The topological approach thus operates at a higher level than the popular bag-of-words approaches, and its advantage should become apparent for texts on similar themes: because such texts have similar keyword counts, the topological information may prove indispensable. It should also require significantly smaller keyword sets than the bag-of-words approaches.
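The contrast between keyword counts and keyword positions can be illustrated with a minimal Python sketch. It is not the paper's method: the whitespace tokenizer, the normalization of positions by text length, the function names `keyword_counts` and `keyword_positions`, and the two sample sentences are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption, not the paper's).
    return text.lower().split()


def keyword_counts(text, keywords):
    # Bag-of-words view: how often each keyword occurs, order ignored.
    counts = Counter(tokenize(text))
    return {k: counts[k] for k in keywords}


def keyword_positions(text, keywords):
    # Position-based view: where each keyword occurs, as a fraction of
    # the text length (0.0 = first token). Normalization is an assumption.
    tokens = tokenize(text)
    n = len(tokens)
    positions = {k: [] for k in keywords}
    for i, tok in enumerate(tokens):
        if tok in positions:
            positions[tok].append(i / n)
    return positions


# Two hypothetical texts with identical word counts but different order.
doc_a = "storm hits coast then rescue begins"
doc_b = "rescue begins then storm hits coast"
keywords = ["storm", "rescue"]

print(keyword_counts(doc_a, keywords) == keyword_counts(doc_b, keywords))
print(keyword_positions(doc_a, keywords))
print(keyword_positions(doc_b, keywords))
```

Both sample texts yield the same keyword counts, so a bag-of-words representation cannot tell them apart, while the position lists differ.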
Year
Pages
57--78
Physical description
Bibliography: 27 items.
Authors
author
Bibliography
  • [1] Al-Halimi, R.K.: Mining Topic Signals from Text, Doctoral Thesis, University of Waterloo, Waterloo, Ontario, 2003.
  • [2] Baeza-Yates, R. and Ribeiro-Neto, B.: Modern Information Retrieval, Addison-Wesley, 1999.
  • [3] Beil, F., Ester, M. and Xu, X.: Frequent term-based text clustering. Proceedings of KDD-2002, 8th International Conference on Knowledge Discovery and Data Mining (KDD) Edmonton, Alberta, 2002.
  • [4] Brin, S. and Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of the 7th International World Wide Web Conference. Computer Networks and ISDN Systems, 30(1-7), 1998, 107-117.
  • [5] Caropreso, M.F., Matwin, S. and Sebastiani, F.: A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. In Amita G. Chin (ed.), Text Databases and Document Management: Theory and Practice, Idea Group Publishing, Hershey, US, 2001, 78-102.
  • [6] Cox, T.F., Cox, M.A.A.: Multidimensional Scaling, Chapman & Hall/CRC, Boca Raton, FL, 2nd edition, 2001.
  • [7] Fagan, J. L.: The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science, 40(2), 1989, 115-132.
  • [8] Fürnkranz, J., Mitchell, T. and Riloff, E.: A case study in using linguistic phrases for text categorization on the WWW. In AAAI/ICML Workshop on Learning for Text Categorization, 1998.
  • [9] Gannett, J.C.M.: On certain independent factors in mental measurement, Proc. Roy. Soc. Lon., 96, 1919-1920, 91-111.
  • [10] Gispert, A., Marino, J.B.: Linguistic knowledge in statistical phrase-based word alignment, Natural Language Engineering, 12(1), 2006, 91-108.
  • [11] Hearst, M.A.: Untangling text data mining, Proc. ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, 1999.
  • [12] Hotelling, H.: Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology, 24, 1933, 417-441.
  • [13] Kruskal, J.B., Wish, M.: Multidimensional Scaling, Sage Publications, Newbury Park, CA, 1978.
  • [14] Lewis, D.: An evaluation of phrasal and clustered representations on a text categorization task. In Nicholas J. Belkin, Peter Ingwersen, and Annelise Mark Pejtersen (eds.), Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval, Kobenhavn, DK, 1992. ACM Press, New York, US, 37-50.
  • [15] Liu, B.: Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data, Springer, 2007.
  • [16] Maron, M. E. and Kuhns, J. L.: On Relevance, Probabilistic Indexing and Information Retrieval. Journal of the ACM, 7(3), 1960, 216-244.
  • [17] Pearson, K.: On lines and planes of closest fit to systems of points in space, Phil. Mag., 6, 1901, 559-572.
  • [18] Prickett, S., Carroll, R.P. (eds.): The Bible: Authorized King James Version, Oxford University Press, USA, 2008.
  • [19] Robertson, S.E. and Spärck Jones, K.: Relevance Weighting of Search Terms. Journal of the American Society for Information Science, 27, 1976, 129-146.
  • [20] Salton, G. and Buckley, C.: Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24(5), 1988, 513-523.
  • [21] Scott, S. and Matwin, S.: Feature engineering for text classification. In Ivan Bratko and Saso Dzeroski (eds.), Proceedings of ICML-99, 16th International Conference on Machine Learning, Bled, SL, 1999. Morgan Kaufmann Publishers, San Francisco, 379-388.
  • [22] Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 2002, 1-47.
  • [23] Shafiei, M. et al.: Document Representation and Dimension Reduction for Text Clustering. Proceedings of Data Engineering Workshop, IEEE 23rd International Conference on Data Engineering, Istanbul, 2007, 770-779.
  • [24] Simard, M., Goutte, C., Isabelle, P.: Statistical Phrase-based Post-editing. Proc. North American Chapter of the Association for Computational Linguistics - Human Language Technologies 2007, Rochester, NY, 2007, 508-515.
  • [25] Spärck Jones, K., Walker, S. and Robertson, S.E.: A probabilistic model of information retrieval: development and comparative experiments. Parts 1 and 2. Information Processing and Management, 36(6), 2000, 779-808 and 809-840.
  • [26] Spearman, C.: General intelligence, objectively determined and measured, American Journal of Psychology, 15, 1904, 201-293.
  • [27] Thurstone, L.L.: Multiple factor analysis, Psych. Rev., 38, 1931, 406-427.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-article-BPP2-0019-0057