Discretization of data using Boolean transformations and information theory based evaluation criteria

Jankowski, C.; Reda, D.; Mańkowski, M.; Borowik, G.

doi:10.1515/bpasts-2015-0105

Artykuł - szczegóły

Tytuł artykułu

Discretization of data using Boolean transformations and information theory based evaluation criteria

Autorzy

Jankowski C. , Reda D. , Mańkowski M. , Borowik G.

Treść / Zawartość

Pełne teksty:

Pobierz

Identyfikatory

DOI

10.1515/bpasts-2015-0105

Warianty tytułu

Języki publikacji

Abstrakty

Discretization is one of the most important parts of decision table preprocessing. Transforming continuous values of attributes into discrete intervals influences further analysis using data mining methods. In particular, the accuracy of generated predictions is highly dependent on the quality of discretization. The paper contains a description of three new heuristic algorithms for discretization of numeric data, based on Boolean reasoning. Additionally, an entropy-based evaluation of discretization is introduced to compare the results of the proposed algorithms with the results of leading university software for data analysis. Considering the discretization as a data compression method, the average compression ratio achieved for databases examined in the paper is 8.02 while maintaining the consistency of databases at 100%.

Słowa kluczowe

machine learning discretization discernibility function logic minimization information theory entropy

nauczanie maszynowe dyskretyzacja minimalizacja funkcji logicznych teoria informacji entropia

Wydawca

Polska Akademia Nauk, Wydział IV Nauk Technicznych

Czasopismo

Bulletin of the Polish Academy of Sciences. Technical Sciences

Rocznik

2015

Tom

Vol. 63, nr 4

Strony

923--932

Opis fizyczny

Bibliogr. 40 poz., tab.

Twórcy

autor

Jankowski C.

C.Jankowski@stud.elka.pw.edu.pl

Institute of Telecommunications, Warsaw University of Technology, 15/19 Nowowiejska St., 00-665 Warszawa, Poland

autor

Reda D.

Institute of Telecommunications, Warsaw University of Technology, 15/19 Nowowiejska St., 00-665 Warszawa, Poland

autor

Mańkowski M.

Institute of Radioelectronics and Multimedia Technology, Warsaw University of Technology, 15/19 Nowowiejska St., 00-665 Warszawa, Poland

autor

Borowik G.

Institute of Telecommunications, Warsaw University of Technology, 15/19 Nowowiejska St., 00-665 Warszawa, Poland

Bibliografia

[1] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From data mining to knowledge discovery in databases”, AI Magazine 17, 37-54 (1996).
[2] C. Moraga, “Design of neural networks”, 11th Int. Conf. Knowledge-Based Intelligent Informational and Engineering Systems, Lecture Notes in Computer Science 4692, 26-33 (2007), DOI: 10.1007/978-3-540-74819-9 4.
[3] A. Raghuvanshi and M.A. Perkowski, “Image processing and machine learning for the diagnosis of melanoma cancer”, BIODEVICES 2011 - Proc. Int. Conf. on Biomedical Electronics and Devices 1, 405-410 (2011).
[4] S. Hui and S. Żak, “Discrete Fourier transform based pattern classifiers”. Bull. Pol. Ac.: Tech. 62 (1), 15-22 (2014), DOI: 10.2478/bpasts-2014-0002.
[5] A. Jastriebow and K. Poczęta, “Analysis of multi-step algorithms for cognitive maps learning”, Bull. Pol. Ac.: Tech. 62 (4), 735-741 (2014), DOI: 10.2478/bpasts-2014-0079.
[6] M.G.M. Hunink, P.P. Glasziou, J.E. Siegel, J.C. Weeks, J.S. Pliskin, A.S. Elstein, and M.C. Weinstein, Decision Making in Health and Medicine: Integrating Evidence and Values, University Press, Cambridge, 2001.
[7] R. Cuingnet, E. Gerardin, J. Tessieras, G. Auzias, S. Lehericy, M.O. Habert, M. Chupin, H. Benali, and O. Colliot, “Automatic classification of patients with Alzheimer’s disease from structural MRI: a comparison of ten methods using the ADNI database”, NeuroImage 56 (2), 766-781 (2011), DOI: 10.1016/j.neuroimage.2010.06.013.
[8] J. Komorowski, Z. Pawlak, L. Polkowski, and A. Skowron, Rough Sets: A Tutorial, Springer, Singapore, 1998.
[9] G. Borowik, “Boolean function complementation based algorithm for data discretization”, Computer Aided Systems Theory - EUROCAST 2013, Lecture Notes in Computer Science 8112, 218-225 (2013), DOI: 10.1007/978-3-642-53862-9 28.
[10] M. Žádník and Z. Michlovský, “Is spam visible in flow-level statistics?”, CESNET National Research and Education Network, Tech. Rep., CESNET 1, Prague, 2009.
[11] O.L. Mangasarian and W.H. Wolberg, “Cancer diagnosis via linear programming”, SIAM News 23 (5), 1-18 (1990).
[12] K. Bache and M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml (2013).
[13] M.R. Chmielewski and J.W. Grzymala-Busse, “Global discretization of continuous attributes as preprocessing for machine learning”, Int. J. Approximate Reasoning 1, 294-301 (1996).
[14] A. Ekbal, “Improvement of prediction accuracy using discretization and voting classifier”, 18th Int. Conf. on Pattern Recognition 2, 695-698 (2006), DOI: 10.1109/ICPR.2006.698.
[15] E. Frank and I.H. Witten, “Making better use of global discretization”, Proc. Sixteenth Int. Conf. on Machine Learning 1, 115-123 (1999).
[16] H.-V. Nguyen, E. Müller, J. Vreeken, and K. B¨ohm, “Unsupervised interaction-preserving discretization of multivariate data”, Data Mining and Knowledge Discovery 28 (5-6), 1366-1397 (2014), DOI: 10.1007/s10618-014-0350-5.
[17] P. Chaudhari, D.P. Rana, R.G. Mehta, N.J. Mistry, and M.M. Raghuwanshi, “Discretization of temporal data: a survey”, Int. J. Computer Science and Information Security 11 (2), 66-69 (2014).
[18] D. Farid and C. Rahman, “Mining complex data streams: discretization, attribute selection and classification”, J. Advances in Information Technology 4 (3), 2013.
[19] J.W. Grzymala-Busse, “Discretization based on entropy and multiple scanning”, Entropy 15 (5), 1486-1502 (2013), DOI: 10.3390/e15051486.
[20] L. Zou, D. Yan, H.R. Karimi, and P. Shi, “An algorithm for discretization of real value attributes based on interval similarity”, J. Applied Mathematics, Article ID 350123 (2013), DOI: 10.1155/2013/350123.
[21] K. Shehzad, “Edisc: a class-tailored discretization technique for rule-based classification”, IEEE Trans. Knowledge and Data Engineering 24 (8), 1435-1447 (2012), DOI: 10.1109/TKDE.2011.101.]
[22] D.M. Maslove, T. Podchiyska, and H.J. Lowe, “Discretization of continuous features in clinical datasets”, J. American Medical Informatics Association 20 (3), 544-553 (2013), DOI: 10.1136/amiajnl-2012-000929.
[23] M.G. Augasta and T. Kathirvalavakumar, “A new discretization algorithm based on range coefficient of dispersion and skewness for neural networks classifier”, Applied Soft Computing 12 (2), 619-625 (2012), DOI: 10.1016/j.asoc.2011.11.001.
[24] J.L. Lustgarten, S. Visweswaran, V. Gopalakrishnan, and G.F. Cooper, “Application of an efficient Bayesian discretization method to biomedical data”, BMC Bioinformatics 12 (1), 2011, DOI: 10.1186/1471-2105-12-309.
[25] T.W. Rondeau and C.W. Bostian, Artificial Intelligence in Wireless Communications, Artech House, London, 2009.
[26] G. Borowik and T. Łuba, “Fast algorithm of attribute reduction based on the complementation of Boolean function”, Advanced Methods and Applications in Computational Intelligence 1, 25-41 (2014), DOI: 10.1007/978-3-319-01436-4 2.
[27] Z. Pawlak, Rough Sets. Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, London, 1991, DOI: 10.1007/978-94-011-3534-4.
[28] H. Liu, F. Hussain, C. Tan, and M. Dash, “Discretization: An enabling technique”, Data Mining and Knowledge Discovery 6 (4), 393-423 (2002), DOI: 10.1023/A:1016304305535.
[29] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, 1993.
[30] C.H. Papadimitriou, Computational Complexity, Academic Internet Publ., London, 2007.
[31] S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani, Algorithms, McGraw-Hill, London, 2008.
[32] B. Steinbach and C. Posthoff, “Improvements of the construction of exact minimal covers of Boolean functions”, Computer Aided Systems Theory - EUROCAST 2011, Lecture Notes in Computer Science 6928, 272-279 (2012), DOI: 10.1007/978-3-642-27579-1 35.
[33] R.K. Brayton, G.D. Hachtel, C.T. McMullen, and A. Sangiovanni-Vincentelli, Logic Minimization Algorithms for VLSI Synthesis, Kluwer Academic Publishers, Berlin, 1984, DOI: 10.1007/978-1-4613-2821-6.
[34] C.E. Shannon, “A mathematical theory of communication”, Bell System Technical J. 27 (3), 379-423 (1948), DOI: 10.1002/j.1538-7305.1948.tb01338.x.
[35] G. Holmes, A. Donkin, and I.H. Witten, “Weka: a machine learning workbench”, Proc. 1994 Second Australian and New Zealand Conf. Intelligent Information Systems 1, 357-361 (1994), DOI: 10.1109/ANZIIS.1994.396988.
[36] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, 1993.
[37] R. Kohavi, “The power of decision tables”, Machine Learning: ECML-95 912, 174-189 (1995), DOI: 10.1007/3-540-59286-5 57.
[38] L.S. Cessie and J.C. van Houwelingen, “Ridge estimators in logistic regression”, Applied Statistics 41 (1), 191-201 (1992).
[39] J.C. Platt, “Fast training of support vector machines using sequential minimal optimization”, Advances in Kernel Methods 1, 185-208 (1999).
[40] L. Breiman, “Random forests,” Machine Learning 45 (1), 5-32 (2001), DOI: 10.1023/A:1010933404324.

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-dde4352e-0f96-486a-825f-60e163480f9d