Similarity-based web clip matching

Baczkiewicz, M.; Łuczak, D.; Zakrzewicz, M.

Content

Full texts:

httpwww_bg_utp_edu_plartcc2011baczkiewicz-et-al.pdf

Abstract

Research on web data extraction and integration aims to deliver tools and methods for extracting pieces of information from third-party web sites and integrating them into profiled, domain-specific, custom web pages. Existing solutions rely on specialized APIs or XPath querying tools and are therefore not easily accessible to non-technical end users. In this paper we describe our new comprehensive, non-XPath integration platform, which allows end users to extract web page fragments using a simple query-by-example approach and then to combine these fragments into custom, integrated web pages. We focus on our two novel similarity-based web clip matching algorithms: Attribute Weights Tree Matching and Edit Distance Tree Matching.

Keywords

information extraction, web, web content integration

Publisher

Systems Research Institute, Polish Academy of Sciences

Journal

Control and Cybernetics

Year

2011

Volume

Vol. 40, no. 3

Pages

715-730

Physical description

Bibliography: 22 items, illustrations, charts.

Contributors

author

Baczkiewicz M.

author

Łuczak D.

author

Zakrzewicz M.

Poznań University of Technology, Institute of Computing Science, Poznań, Poland, Malgorzata.Baczkiewicz@cs.put.poznan.pl

Bibliography

Adelberg, B. (1998) NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents. SIGMOD Record 27,2, 283-294.
Arocena, G.O. and Mendelzon, A.O. (1998) WebOQL: Restructuring Documents, Databases, and Webs. 14th IEEE International Conference on Data Engineering. IEEE Computer Society, 24-33.
Baczkiewicz, M., Kaleta, P., Luczak, D. and Zakrzewicz, M. (2010) Extraction and Integration of Dynamic Heterogeneous Web Resources. Proc. of TPD 2010 Conference. Wydawnictwa Naukowo-Techniczne, 101-110.
Chakrabarti, D. and Mehta, R.R. (2010) The paths more taken: matching DOM trees to search logs for accurate webpage clustering. Proc. of WWW 2010 Conference. ACM Press, 24-33.
Clipmarks http://clipmarks.com/
Califf, M.E. and Mooney, R.J. (2003) Bottom-Up Relational Learning of Pattern Matching Rules for Information Extraction. Journal of Machine Learning Research, 4, 539-565.
Chun-Nan, H. and Ming-Tzung, D. (1998) Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web. Information Systems, 23 (9), 521-538.
Crescenzi, V. and Mecca, G. (1998) Grammars Have Exceptions. Information Systems, 23 (8), 539-565.
Crescenzi, V., Mecca, G. and Merialdo, P. (2001) RoadRunner: Towards Automatic Data Extraction from Large Web Sites. Proc. of the 26th International Conference on Very Large Database Systems. Morgan Kaufmann, 109-118.
Embley, D.W., Campbell, D.M., Jiang, Y.S., Liddle, S.W., Yiu-Kai, N., Quass, D. and Smith, R.D. (1999) Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages. Data and Knowledge Engineering, 31 (3), 227-251.
Freitag, D. (2000) Machine Learning for Information Extraction in Informal Domains. Machine Learning, 39 (2/3), 169-202.
Hammer, J., Garcia-Molina, H., Nestorov, S., Yerneni, R., Breunig, M.M. and Vassalos, V. (1997) Template-Based Wrappers in the TSIMMIS System. SIGMOD Record, 26 (2), 532-535.
Jie, H., Dingyi, H., Chenxi, L., Hua-Jun, Z., Zheng, C. and Yong, Y. (2007) Homepage live: automatic block tracing for web personalization. 16th International World Wide Web Conference (WWW2007). ACM Press, 1-10.
Kowalkiewicz, M., Orlowska, M.E., Kaczmarek, T. and Abramowicz, W. (2006) Towards More Personalized Web: Extraction and Integration of Dynamic Content from the Web. Proc. of APWeb Conference. Springer, 668-679.
Kulathuramaiyer, N. (2007) Mashups: Emerging Application Development Paradigm for a Digital Journal. Journal of Universal Science, 13 (4), 531-542.
Kushmerick, N. (2000) Wrapper induction: Efficiency and expressiveness. Artificial Intelligence Journal, 118 (1-2), 15-68.
Liu, L., Pu, C. and Han, W. (2000) XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources. Proc. of the 16th IEEE International Conference on Data Engineering. IEEE Computer Society, 611-621.
Jindal, N. and Liu, B. (2010) A Generalized Tree Matching Algorithm Considering Nested Lists for Web Data Extraction. Proc. of SDM 2010 Conference. SIAM, 930-941.
Muslea, I., Minton, S. and Knoblock, C.A. (2001) Hierarchical Wrapper Induction for Semistructured Information Sources. Autonomous Agents and Multi-Agent Systems, 4 (1/2), 93-114.
ProgrammableWeb (2007) ProgrammableWeb Homepage, http://www.programmableweb.com/
Ribeiro-Neto, B.A., Laender, A.H.F. and Silva, A.S. (1999) Extracting Semi-Structured Data Through Examples. Proc. of the 8th ACM International Conference on Information and Knowledge Management. ACM Press, 94-101.
Sahuguet, A. and Azavant, F. (2001) Building intelligent Web applications using lightweight wrappers. Data and Knowledge Engineering, 36 (3), 283-316.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BATC-0009-0007