Article title

Similarity-based web clip matching

Abstract
Research on web data extraction and integration aims to deliver tools and methods for extracting pieces of information from third-party web sites and integrating them into profiled, domain-specific, custom web pages. Existing solutions rely on specialized APIs or XPath querying tools and are therefore not easily accessible to non-technical end users. In this paper we describe our new, comprehensive, non-XPath integration platform, which allows end users to extract web page fragments using a simple query-by-example approach and then combine these fragments into custom, integrated web pages. We focus on our two novel similarity-based web clip matching algorithms: Attribute Weights Tree Matching and Edit Distance Tree Matching.
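The abstract does not specify how either algorithm works, so the following is only an illustrative sketch of the general idea behind similarity-based clip matching: given an example fragment chosen by the user, score every node of a page tree by a weighted comparison of its attributes and return the closest one. The `Node` class, the `WEIGHTS` values, and the functions `similarity` and `best_match` are all assumptions made for this sketch; they are not taken from the paper and should not be read as the authors' Attribute Weights Tree Matching algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A simplified DOM element (hypothetical structure for this sketch)."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


# Hypothetical weights; a real system would tune or learn these.
WEIGHTS = {"tag": 0.4, "id": 0.3, "class": 0.2, "children": 0.1}


def similarity(a: Node, b: Node) -> float:
    """Return a weighted similarity score in [0, 1] between two nodes."""
    score = 0.0
    if a.tag == b.tag:
        score += WEIGHTS["tag"]
    id_a = a.attrs.get("id")
    if id_a is not None and id_a == b.attrs.get("id"):
        score += WEIGHTS["id"]
    # Jaccard overlap of the class lists.
    classes_a = set(a.attrs.get("class", "").split())
    classes_b = set(b.attrs.get("class", "").split())
    union = classes_a | classes_b
    if union:
        score += WEIGHTS["class"] * len(classes_a & classes_b) / len(union)
    # Crude structural check: identical sequence of child tags.
    if [c.tag for c in a.children] == [c.tag for c in b.children]:
        score += WEIGHTS["children"]
    return score


def walk(node: Node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def best_match(example: Node, page_root: Node) -> Node:
    """Find the node in the page tree most similar to the example clip."""
    return max(walk(page_root), key=lambda n: similarity(example, n))


# The clip the user selected earlier (query by example).
example = Node("div", {"id": "news", "class": "box top"}, [Node("h2"), Node("p")])

# A later version of the page, where the clip has changed slightly.
page = Node("body", {}, [
    Node("div", {"id": "nav", "class": "box"}),
    Node("div", {"id": "news", "class": "box wide"},
         [Node("h2"), Node("p"), Node("p")]),
])

match = best_match(example, page)
print(match.attrs["id"])
```

Because the score is a similarity rather than an exact path, the clip can still be located after the page changes its class names or adds a paragraph, which is the kind of robustness an XPath expression tied to a fixed location would lack.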
Physical description
Bibliography: 22 items, illustrations, charts.