Article title

Semi-supervised approach to handle sudden concept drift in Enron data

Publication languages
EN
Abstracts
EN
This paper studies the detection of concept changes in incremental learning from data streams, together with classifier adaptation. It is often assumed that all processed learning examples are labeled, i.e. that the class label is available for each example. As this assumption may be difficult to satisfy in practice, in particular for data streams, we introduce an approach that detects concept drift in unlabeled data and retrains the classifier using a limited number of additionally labeled examples. The usefulness of this partly supervised approach is evaluated in an experimental study with the Enron data. This real-life data set concerns the classification of users' emails into multiple folders. First, we show that the Enron data are characterized by frequent sudden changes of concepts. We also demonstrate that our approach can precisely detect these changes. Results of a subsequent comparative study demonstrate that our approach achieves classification accuracy comparable to that of two fully supervised methods: periodic retraining of the classifier based on windowing, and the trigger approach with DDM supervised drift detection. However, our approach reduces the number of examples to be labeled. Furthermore, it requires fewer classifier retraining updates than windowing.
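For readers unfamiliar with the supervised baseline named in the abstract, the following is a minimal sketch of DDM as it is commonly described after Gama et al. (2004). It is not the semi-supervised method proposed in the paper. DDM tracks the running error rate p and its standard deviation s = sqrt(p(1 - p)/n), remembers the lowest p + s seen so far, and signals a warning or a drift when the current p + s exceeds that minimum by 2 or 3 standard deviations. The class name, the default of 30 warm-up examples, and the synthetic error stream are assumptions made for illustration; the stream is not the Enron data.

```python
import math


class DDM:
    """Sketch of the Drift Detection Method (after Gama et al., 2004)."""

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples      # warm-up before any decision
        self.warning_level = warning_level  # multiples of s_min for a warning
        self.drift_level = drift_level      # multiples of s_min for a drift
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """Feed 1 for a misclassified example, 0 otherwise; return the state."""
        self.n += 1
        self.errors += int(error)
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        # Remember the best (lowest) p + s observed for the current concept.
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()  # statistics restart for the new concept
            return "drift"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


# Synthetic, deterministic error stream (illustrative only):
# about 10% errors for 500 examples, then about 50% errors.
errors = [1 if i % 10 == 0 else 0 for i in range(500)]
errors += [1 if i % 2 == 0 else 0 for i in range(500, 1000)]

detector = DDM()
drift_points = []
for i, e in enumerate(errors):
    if detector.update(e) == "drift":
        drift_points.append(i)  # a trigger approach would retrain here

print(drift_points)
```

Because every call to update needs to know whether the prediction was correct, this detector consumes a true label for each example, which is the labeling cost the abstract's approach aims to reduce.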
Year
Pages
667-695
Physical description
Bibliography: 39 items, charts.
Authors
  • Institute of Computing Sciences, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
Bibliography
  • Baena-Garcia, M., Campo-Avila, J., Fidalgo, R., Bifet, A., Gavalda, R. and Morales-Bueno, R. (2006) Early drift detection method. In: ECML PKDD 2006 Workshop on Knowledge Discovery from Data Streams, 77-86. eprints.pascal-network.org/archive/00002509/01/EDDM.pdf
  • Baeza-Yates, R.A. and Ribeiro-Neto, B. (1999) Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston.
  • Bekkerman, R., McCallum, A. and Huang, G. (2004) Automatic categorization of email into folders: Benchmark experiments on Enron and SRI corpora. Technical Report IR-418, Center of Intelligent Information Retrieval, UMass Amherst.
  • Bifet, A. and Kirkby, R. (2009) Data stream mining: A practical approach. MOA manual. Technical report, The University of Waikato.
  • Bifet, A., Kirkby, R., Holmes, G., Gavalda, R. and Pfahringer, B. (2009) New ensemble methods for evolving data streams. In: 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 139-148.
  • Bishop, C.M. (2007) Pattern Recognition and Machine Learning. Springer, Berlin.
  • Brzezinski, D. and Stefanowski, J. (2011) Accuracy updated ensemble for data streams with concept drift. In: Proceedings of the HAIS 2011 Conference. LNAI 6679, Springer, 155-163.
  • Clark, J., Koprinska, I. and Poon, J. (2003) Linger - a smart personal assistant for e-mail classification. In: O. Kaynak et al., eds., ICANN/ICONIP 2003 Proc. of the 13th Intern. Conf. on Artificial Neural Networks. LNCS 2714, Springer, 274-277.
  • Deckert, M. (2011) Batch weighted ensemble for mining data streams with concept drift. In: M. Kryszkiewicz et al., eds., Proc. IMIS 2011. LNAI 6804, Springer, 290-299.
  • Domingos, P. and Hulten, G. (2000) Mining high-speed data streams. In: KDD ‘00: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, 71-80.
  • Fan, W., Huang, Y., Wang, H. and Yu, P.S. (2004) Active mining of data streams. In: Proceedings of the 2004 SIAM International Conference on Data Mining. SIAM, 457-461, www.siam.org/proceedings/datamining/2004/dm04.php
  • Gama, J. (2010) Knowledge Discovery from Data Streams. CRC Press, Boca Raton.
  • Gama, J. and Gaber, M.M. (2007) Learning From Data Streams: Processing Techniques in Sensor Networks. Springer, Berlin.
  • Gama, J., Medas, P., Castillo, G. and Rodrigues, P. (2004) Learning with drift detection. In: SBIA Brazilian Symposium on Artificial Intelligence. Springer Verlag, 286-295.
  • Han, J. and Kamber, M. (2006) Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco.
  • Huang, S. (2008) An active learning method for mining time-changing data streams. In: Proceedings of the 2008 Int. Symposium on Intelligent Information Technology Application, IITA’08. IEEE Computer Society, Washington, 548-552.
  • Hulten, G., Spencer, L. and Domingos, P. (2001) Mining time-changing data streams. In: KDD ‘01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, 97-106.
  • Kifer, D., Ben-David, S. and Gehrke, J. (2004) Detecting change in data streams. In: VLDB ‘04: Proceedings of the Thirtieth international conference on Very large data bases. VLDB Endowment, 180-191.
  • Klimt, B. and Yang, Y. (2004) The Enron corpus: A new dataset for email classification research. In: J.-F. Boulicaut et al., eds., Proceedings of the ECML 2004 Conference. LNCS 3201, Springer, 217-226.
  • Klinkenberg, R. and Renz, I. (1998) Adaptive information filtering: Learning in the presence of concept drifts. In: Workshop Notes of the ICML/AAAI-98 Workshop Learning for Text Categorization. AAAI Press, 33-40.
  • Klosgen, W. and Zytkow, J.M. (2002) Handbook of Data Mining and Knowledge Discovery. Oxford Press, Oxford.
  • Kmieciak, M.R. (2009) Learning Multiple Classifiers from Text Streams. Master’s thesis, Poznan University of Technology, Poznań, Poland.
  • Kubat, M. (1989) Floating approximation in time-varying knowledge bases. Pattern Recognition Letters 10 (4), 223-227.
  • Kuncheva, L.I. (2004) Classifier ensembles for changing environments. In: Proc. of the 5th Workshop on Multiple Classifier Systems. LNCS 3077, Springer, 1-15.
  • Lewis, D.D. (1995) A sequential algorithm for training text classifiers: corrigendum and additional data. SIGIR Forum 29 (2), 13-19.
  • Masud, M., Gao, J., Khan, L., Han, J. and Thuraisingham, B. (2008) A practical approach to classify evolving data streams: Training with limited amount of labeled data. In: Proc. ICDM ‘08. Eighth IEEE International Conference on Data Mining. IEEE Press, 929-934.
  • Masud, M., Gao, J., Khan, L., Han, J. and Thuraisingham, B. (2009) Integrating novel class detection with classification for concept-drifting data streams. In: Proc. ECML PKDD ‘09. Volume II. Springer-Verlag, 79-94.
  • Mitchell, T. (1997) Machine Learning. McGraw-Hill Education (ISE Editions), New York.
  • Nishida, K. (2008) Learning and Detecting Concept Drift. Ph.D. thesis, Graduate School of Information Science and Technology, Hokkaido University.
  • Quinlan, J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco.
  • Schlimmer, J. and Granger, R. (1986) Incremental learning from noisy data. Machine Learning 1 (3), 317-354.
  • Sebastiani, F. (2002) Machine learning in automated text categorization. ACM Computing Surveys 34 (1), 1-47.
  • Stefanowski, J. and Zienkowicz, M. (2006) Classification of Polish email messages: Experiments with various data representations. In: F. Esposito et al., eds., Foundations of Intelligent Systems. 16th International Symposium ISMIS 2006. Proceedings. LNAI 4203, Springer, 723-728.
  • Szopka, B. (2007) Machine Learning and Text Processing Methods for Classification of Emails (in Polish). Master’s thesis, Poznan University of Technology, Poznan, Poland (supervisor J. Stefanowski).
  • Tsymbal, A. (2004) The problem of concept drift: definitions and related works. Technical report, Dept. of Computer Science, Trinity College Dublin.
  • Widmer, G. and Kubat, M. (1996) Learning in the presence of concept drift and hidden contexts. Machine Learning 23 (1), 69-101.
  • Woolam, C., Masud, M. and Khan, L. (2009) Lacking labels in the stream: Classifying evolving stream data with few labels. In: Proceedings of the ISMIS 2009 Conference. Springer Verlag, 552-562.
  • Yang, Y. and Liu, X. (1999) A re-examination of text categorization methods. In: Proc. of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 42-49.
  • Zliobaite, I. (2009) Learning under concept drift: an overview. Technical report, Faculty of Mathematics and Informatics, Vilnius University.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-article-BATC-0009-0005