Article title

Automatic wrapper generation and generalization for social media websites

Publication languages
EN
Abstracts
EN
The data contained within user-generated content websites prove valuable in many applications, for example social media monitoring or the acquisition of training sets for machine learning algorithms. Mining such data is especially difficult in the case of web forums, because hundreds of different forum engines are in use. We propose an algorithm capable of unsupervised extraction of posts from social websites, without the need to analyse more than one page in advance. Our method localizes potential data regions by repetition analysis within the document structure and by filtering potential results. Subsequently, the fields of data records are found using key characteristics and series-wide dependencies. We managed to achieve 85% precision of extraction and 79% recall in experiments on single pages taken from 258 websites. Our solution is characterized by high computing efficiency, thus enabling wide applications.
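To make the idea of localizing data regions by repetition analysis more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that a candidate region is any element whose direct children repeat the same (tag, class) signature at least `min_repeats` times. The names `Node`, `TreeBuilder`, `candidate_regions` and the `min_repeats` threshold are hypothetical, introduced only for illustration; the filtering and field-detection steps mentioned in the abstract are not covered.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, enough for the sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of a simplified DOM tree (hypothetical helper)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.cls = dict(attrs).get("class") or ""
        self.parent = parent
        self.children = []

    def signature(self):
        # Assumption: tag name plus class attribute is enough to tell
        # whether two sibling elements look structurally alike.
        return (self.tag, self.cls)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the matching open element; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def candidate_regions(html, min_repeats=3):
    """Return (parent tag, parent class, child signature, count) tuples
    for every element whose children repeat one signature often enough."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    regions = []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        counts = Counter(child.signature() for child in node.children)
        if counts:
            sig, n = counts.most_common(1)[0]
            if n >= min_repeats:
                regions.append((node.tag, node.cls, sig, n))
        stack.extend(node.children)
    return regions


if __name__ == "__main__":
    page = """
    <html><body>
    <div id="nav"><a href="/">Home</a></div>
    <div class="thread">
      <div class="post"><span class="author">ann</span><p>Hello</p></div>
      <div class="post"><span class="author">bob</span><p>Hi</p></div>
      <div class="post"><span class="author">cy</span><p>Hey</p></div>
    </div>
    </body></html>
    """
    print(candidate_regions(page))
```

On a single forum-like page this flags the container holding the repeated post elements using one page only. A real extractor would still need to filter false positives such as menus and pagination lists.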
Year
Pages
817--834
Physical description
Bibliography: 17 items, charts.
Authors
author
  • Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, 80-233 Gdańsk, Poland
author
  • Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, 80-233 Gdańsk, Poland
Bibliography
  • 1. Chang, C. & Lui, S. (2001) IEPAD: information extraction based on pattern discovery. In: Proceedings of the 10th International Conference on World Wide Web. ACM, New York, NY, USA, 681–688.
  • 2. Cong, G., Wang, L., Lin, C., Song, Y. & Sun, Y. (2008) Finding question-answer pairs from online forums. In: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, New York, NY, USA, 467–474.
  • 3. Crescenzi, V., Mecca, G. & Merialdo, P. (2002) RoadRunner: automatic data extraction from data-intensive web sites. In: Proceedings of the 2002 ACM SIGMOD international conference on Management of data. ACM, New York, NY, USA, 624–624.
  • 4. Freitag, D. & Kushmerick, N. (2000) Boosted wrapper induction. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence. AAAI Press, Menlo Park, CA, USA, 577–583.
  • 5. Glance, N., Hurst, M., Nigam, K., Siegler, M., Stockton, R. & Tomokiyo, T. (2005) Deriving marketing intelligence from online discussion. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, New York, NY, USA, 419–428.
  • 6. Hong, J. & Fauzi, F. (2010) TreeWrap-Data Extraction Using Tree Matching Algorithm. In: Majlesi Journal of Electrical Engineering. Islamic Azad University, Majlesi, 4 (2), 43–55.
  • 7. Kim, P. (2006, Q3) The forrester wave: Brand monitoring (white paper). Forrester Wave, Cambridge, MA, USA.
  • 8. Kushmerick, N. (1997) Wrapper induction for information extraction (doctoral dissertation). University of Washington.
  • 9. Lerman, K., Getoor, L., Minton, S. & Knoblock, C. (2004) Using the structure of Web sites for automatic segmentation of tables. In: Proceedings of the 2004 ACM SIGMOD international conference on Management of data. ACM, New York, NY, USA, 119–130.
  • 10. Lerman, K., Minton, S. & Knoblock, C. (2003) Wrapper maintenance: a machine learning approach. In: Journal of Artificial Intelligence Research. AI Access Foundation, El Segundo, CA, USA, 18 (1), 149–181.
  • 11. Levenshtein, V. (1966) Binary codes capable of correcting deletions, insertions and reversals. In: Soviet physics-doklady. American Institute of Physics, New York, NY, USA, 10 (8), 707–710.
  • 12. Li, S., Tang, L., Hu, J. & Chen, Z. (2009) Automatic Data Extraction from Web Discussion Forums. In: Proceedings of the 2009 Fourth International Conference on Frontier of Computer Science and Technology. IEEE Computer Society, Washington, DC, USA, 219–225.
  • 13. Liu, B., Grossman, R. & Zhai, Y. (2003) Mining data records in Web pages. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, NY, USA, 601–606.
  • 14. Muslea, I., Minton, S. & Knoblock, C. (1998) Stalker: Learning extraction rules for semistructured, web-based information sources. In: Proceedings of AAAI-98 Workshop on AI and Information Integration. AAAI Press, Menlo Park, CA, USA, 74–81.
  • 15. Pang, B. & Lee, L. (2008) Opinion mining and sentiment analysis. Now Publishers Inc., Boston-Delft.
  • 16. Papadakis, N., Skoutas, D., Raftopoulos, K. & Varvarigou, T. (2005) An automatic web wrapper for extracting information from web sources, using clustering techniques. In: Proceedings of the 2005 Symposium on Applications and the Internet. IEEE Computer Society, Washington, DC, USA, 24–30.
  • 17. Satpal, S., Bhadra, S., Sellamanickam, S., Rastogi, R. & Sen, P. (2011) Web information extraction using markov logic networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, NY, USA, 1406–1414.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-5f6fa270-749e-4ec5-b5cd-f8df2a2a7ca2