Text comparison using data compression
Similarity detection is important in spam detection, plagiarism detection, and topic detection. The main algorithm for comparing text documents is based on Kolmogorov complexity, which is, in theory, an ideal measure of the similarity of two strings over a defined alphabet. Unfortunately, this measure is incomputable, so several approximations must be defined. These approximations are not true metrics, but in some circumstances they behave closely enough to one to be used in practice.
The article discusses methods for recognizing text similarity. The most commonly used algorithm is based on Kolmogorov complexity. Its main limitation is that the algorithm's output is difficult to process further numerically, so a number of approximations are proposed.
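As an illustration of such an approximation, the sketch below computes the normalized compression distance (NCD), which replaces the incomputable Kolmogorov complexity with the output size of a real compressor. This is only a minimal sketch: the choice of NCD, the use of Python's zlib as the compressor, and the helper names `compressed_size` and `ncd` are assumptions made for illustration, not necessarily the variant or compressor used in the article.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for Kolmogorov complexity)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size. Lower values suggest more shared content.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical sample texts, chosen only to demonstrate the call.
    original = b"the quick brown fox jumps over the lazy dog " * 20
    near_copy = b"the quick brown fox jumps over the lazy cat " * 20
    unrelated = b"lorem ipsum dolor sit amet consectetur adipiscing elit " * 20

    print("original vs near copy:", ncd(original, near_copy))
    print("original vs unrelated:", ncd(original, unrelated))
    # With a real compressor the distance of a text to itself is usually
    # small but not exactly zero, one reason such approximations are not metrics.
    print("original vs itself:   ", ncd(original, original))
```

Because a real compressor adds header overhead and has a limited match window, the values it yields only approximate the theoretical distance, which is consistent with the abstract's remark that these approximations are usable in practice only under some circumstances.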
Bibliography: 30 items.