Finding similar documents in web search results

Kużelewska, U.

Full text

httpwww_wi_pb_edu_plplikinaukazeszytyz9kuzelewska-full.pdf

Title variants

Identyfikowanie dokumentów podobnych w wynikach wyszukiwania w sieci WWW

Abstract

Searching the Web is a challenging task. In the words of Zamir and Etzioni, the Internet is an “unorganized, unstructured and decentralized place”. Powerful search engines are available, but using them becomes harder over time, because the number of indexed web pages exceeds 1 trillion [20] and is still growing. Most search engines return a list of documents from their databases, sorted by relevance to the search query. This approach is not ideal, because the returned list is very long and may contain documents unrelated to the query. To make searching more efficient, one can identify groups of similar documents in the result list. Traditional clustering algorithms are one tool for doing so. The article presents the clustering of Web search results taken directly from a search engine, as well as of document sets assembled from the results of several different queries. The documents were grouped using two methods: EM and XMeans.
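The idea of grouping similar result snippets can be illustrated with a minimal sketch. It is not the article's implementation: it substitutes plain k-means over bag-of-words vectors (with cosine similarity) for the EM and XMeans methods named in the abstract, and the snippets, the function names `vectorize`, `cosine` and `kmeans`, and the choice of `k=2` are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector for one snippet."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def kmeans(vectors, k, iterations=10):
    """Plain k-means; the first k vectors seed the centroids (an assumption)."""
    centroids = [Counter(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each snippet joins its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Update step: a centroid becomes the summed term counts of its members
        # (the scale does not matter for cosine similarity).
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = sum(members, Counter())
    return labels

# Hypothetical snippets for an ambiguous query; not data from the article.
snippets = [
    "aida opera by giuseppe verdi",
    "cruise ship aida holidays",
    "verdi opera aida tickets",
    "aida cruise ship deck plan",
]
labels = kmeans([vectorize(s) for s in snippets], k=2)
print(labels)
```

With these toy snippets the two opera snippets receive one label and the two cruise snippets the other. Unlike this sketch, XMeans estimates the number of clusters itself rather than taking `k` as an input.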

Keywords

web search results clustering, documents similarity, snippets clustering

Publisher

Oficyna Wydawnicza Politechniki Białostockiej

Journal

Zeszyty Naukowe Politechniki Białostockiej. Informatyka

Year

2012

Volume

Z. 9

Pages

61–76

Physical description

22 references, tables

Authors

author

Kużelewska U.

Bialystok University of Technology, Faculty of Computer Science, Białystok, Poland

References

[1] R. Campos, G. Dias, C. Nunes, WISE: Hierarchical Soft Clustering of Web Page Search Results based on Web Content Mining Techniques, Proceedings of IEEE/WIC/ACM International Conference on Web Intelligence, 2006, pp. 301–304
[2] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 39, 1977, pp. 1–38
[3] M. Halkidi, Y. Batistakis, M. Vazirgiannis, On clustering validation techniques, Journal of Intelligent Information Systems, 17(2/3), 2001, pp. 107–145
[4] S. Lawrence, C.L. Giles, Accessibility and Distribution of Information on the Web, Nature (400), 1999, pp. 107–109
[5] M. Mahdavi, H. Abolhassani, Harmony K-means algorithm for document clustering, Data Mining and Knowledge Discovery (18), 2009, pp. 370–391
[6] S. Osinski, An algorithm for clustering of web search results, Master Thesis, Poznan University of Technology, 2003
[7] D. Pelleg, A. Moore, X-means: Extending K-means with Efficient Estimation of the Number of Clusters, Proceedings of International Conference on Machine Learning, 2000, pp. 727–734
[8] G. Salton, A Vector Space Model for Automatic Indexing, Communications of the ACM, 18(11), 1975, pp. 613–620
[9] M. Sathya, J. Jayanthi, N. Basker, Link Based K-Means Clustering Algorithm for Information Retrieval, Proceedings of IEEE-International Conference on Recent Trends in Information Technology, 2011, pp. 1111–1115
[10] W. Rakowski, An intelligent search engine using clustering methods to optimize search results, Master Thesis (in Polish), Bialystok University of Technology, 2011
[11] D. Weiss, A Clustering Interface for Web Search Results in Polish and English, Master Thesis, Poznan University of Technology, 2001
[12] D. Weiss, The search for meaning in a haystack, Seminar of Institute of Linguistics, Polish Academy of Science (in Polish), 2003
[13] I.H. Witten, E. Frank, M.A. Hall, Weka: data mining software in Java, [http://www.cs.waikato.ac.nz/ml/weka/] (26.06.2012)
[14] O. Zamir, O. Etzioni, Grouper: A Dynamic Clustering Interface to Web Search Results, WWW Computer Networks 31(11-16), 1999, pp. 1361–1374
[15] D. Zhang, Y. Dong, Semantic, Hierarchical, Online Clustering of Web Search Results, APWeb’2004, 2004, pp. 69–78
[16] Bing Search Engine, [http://www.bing.com] (10.09.2011)
[17] Google Search Engine, [http://www.google.pl] (15.10.2012)
[18] Carrot2 Clustering Engine, [http://search.carrot2.org/stable/search]
[19] Web search results datasets, [http://credo.fub.it/] (26.06.2012)
[20] Google Blog, [http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html] (21.06.2012)
[21] Wikipedia description of Aida expression, [http://en.wikipedia.org/wiki/Aida_%28disambiguation%29]
[22] Vivisimo company, [http://vivisimo.com] (10.09.2011)

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BPBC-0005-0009