Filtering Decision Rules Driven by Sequential Forward and Backward Selection of Attributes: An Illustrative Example in Stylometric Domain

Zielosko, Beata; Stańczyk, Urszula; Jabloński, Kamil

doi:10.15439/2023F7295

Article details

Selected full texts from this journal

http://annals-csis.org

Abstract

The paper investigates a decision rule filtering process controlled by the estimated relevance of the available attributes. Two search directions were used, sequential forward selection and sequential backward elimination, applied after the knowledge discovery step to the rule sets inferred from a dataset. The steps of the sequential search, along with two different strategies of rule selection, were governed by three rankings of the variables, all related to characteristics of the data and of the rules that can be induced: (i) a ranking based on a weighting factor that reflects the occurrence of attributes in generated decision reducts, (ii) the OneR ranking, which exploits properties of short rules, and (iii) a proposed ranking defined through the operation of a greedy algorithm for rule induction. The three rankings were compared with respect to their usefulness for rule selection in the two directions. The resulting rule sets were analysed with respect to the properties of their constituent decision rules and the performance of all constructed rule-based classifiers. Extensive experiments were carried out in the stylometric domain, treating authorship attribution as a classification task. The results indicate that, for all three rankings and both search paths, it was possible to reduce the number of attributes noticeably while at least maintaining the power of the inducers and, at the same time, improving the characteristics of the rule sets.
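The idea of ranking-driven rule filtering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the two rule selection strategies, so the "all" and "any" strategies below are assumptions, as are the names `Rule`, `keep_rule`, `forward_filter`, and `backward_filter` and the toy attributes and rules. The ranking is taken as given, ordered from most to least relevant.

```python
from typing import FrozenSet, List, NamedTuple, Sequence


class Rule(NamedTuple):
    """A decision rule reduced to what filtering needs (hypothetical structure)."""
    attributes: FrozenSet[str]  # attributes appearing in the rule's conditions
    decision: str               # class label assigned by the rule


def keep_rule(rule: Rule, selected: FrozenSet[str], strategy: str) -> bool:
    """Decide whether a rule survives for a given attribute subset.

    Both strategies are assumptions made for this sketch:
    'all' keeps a rule only if every condition attribute is selected,
    'any' keeps a rule if at least one condition attribute is selected.
    """
    if strategy == "all":
        return rule.attributes <= selected
    if strategy == "any":
        return bool(rule.attributes & selected)
    raise ValueError(f"unknown strategy: {strategy}")


def forward_filter(rules: Sequence[Rule], ranking: Sequence[str],
                   strategy: str = "all") -> List[List[Rule]]:
    """Forward selection: grow the attribute subset from the top of the ranking."""
    steps = []
    for k in range(1, len(ranking) + 1):
        selected = frozenset(ranking[:k])
        steps.append([r for r in rules if keep_rule(r, selected, strategy)])
    return steps


def backward_filter(rules: Sequence[Rule], ranking: Sequence[str],
                    strategy: str = "all") -> List[List[Rule]]:
    """Backward elimination: start from all attributes, drop the lowest ranked first."""
    steps = []
    for k in range(len(ranking), 0, -1):
        selected = frozenset(ranking[:k])
        steps.append([r for r in rules if keep_rule(r, selected, strategy)])
    return steps


# Toy data (hypothetical): attributes ordered from most to least relevant.
ranking = ["the", "and", "of", "but"]
rules = [
    Rule(frozenset({"the"}), "A"),
    Rule(frozenset({"the", "of"}), "B"),
    Rule(frozenset({"but"}), "A"),
    Rule(frozenset({"and", "but"}), "B"),
]

# Number of rules retained at each step of the two search directions.
print([len(s) for s in forward_filter(rules, ranking)])
print([len(s) for s in backward_filter(rules, ranking)])
```

In a fuller experiment, each filtered rule set would be evaluated as a classifier, and the search would stop at the smallest attribute subset that does not degrade performance; that evaluation step is omitted here.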

Keywords

greedy algorithm, computer science, filtering, process control, feature extraction, labeling, task analysis

Publisher

Polskie Towarzystwo Informatyczne

Journal

Annals of Computer Science and Information Systems

Year

2023

Volume

Vol. 35

Pages

833–842

Physical description

Bibliography: 34 items, formulas, tables.

Authors

author

Zielosko Beata

beata.zielosko@us.edu.pl

University of Silesia in Katowice, Institute of Computer Science, Będzińska 39, 41-200 Sosnowiec, Poland

author

Stańczyk Urszula

urszula.stanczyk@polsl.pl

Silesian University of Technology, Department of Graphics, Computer Vision and Digital Systems, Akademicka 2A, 44-100 Gliwice, Poland

author

Jabloński Kamil

kjablonski1@us.edu.pl

University of Silesia in Katowice, Institute of Computer Science, Będzińska 39, 41-200 Sosnowiec, Poland

References

1. J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu, “Feature selection: A data perspective,” ACM Comput. Surv., vol. 50, no. 6, 2017.
2. I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, Eds., Feature Extraction: Foundations and Applications, ser. Studies in Fuzziness and Soft Computing. Physica-Verlag, Springer, 2006, vol. 207.
3. A. L. Blum and P. Langley, “Selection of relevant features and examples in machine learning,” Artificial Intelligence, vol. 97, no. 1, pp. 245–271, 1997.
4. U. Stańczyk, “Weighting of features by sequential selection,” in Feature Selection for Data and Pattern Recognition, ser. Studies in Computational Intelligence, U. Stańczyk and L. Jain, Eds. Berlin, Germany: Springer-Verlag, 2015, vol. 584, pp. 71–90.
5. I. Witten, E. Frank, and M. Hall, Data Mining. Practical Machine Learning Tools and Techniques, 3rd ed. Morgan Kaufmann, 2011.
6. B. Zielosko and U. Stańczyk, “Reduct-based ranking of attributes,” in Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 24th International Conference KES-2020, Virtual Event, 16-18 September 2020, ser. Procedia Computer Science, M. Cristani, C. Toro, C. Zanni-Merk, R. J. Howlett, and L. C. Jain, Eds., vol. 176. Elsevier, 2020, pp. 2576–2585.
7. H. Liu and H. Motoda, Computational Methods of Feature Selection. CRC Press, 2007.
8. E. Amaldi and V. Kann, “On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems,” Theoretical Computer Science, vol. 209, no. 1, pp. 237–260, 1998.
9. B. Zielosko and M. Piliszczuk, “Greedy algorithm for attribute reduction,” Fundam. Informaticae, vol. 85, no. 1-4, pp. 549–561, 2008.
10. M. M. Mafarja and S. Mirjalili, “Hybrid whale optimization algorithm with simulated annealing for feature selection,” Neurocomputing, vol. 260, pp. 302–312, 2017.
11. P. Pudil, J. Novovičová, and J. Kittler, “Floating search methods in feature selection,” Pattern Recognition Letters, vol. 15, no. 11, pp. 1119–1125, 1994.
12. I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” J. Mach. Learn. Res., vol. 3, pp. 1157–1182, 2003.
13. U. Stańczyk, B. Zielosko, and L. C. Jain, “Advances in feature selection for data and pattern recognition: An introduction,” in Advances in Feature Selection for Data and Pattern Recognition, ser. Intelligent Systems Reference Library, U. Stańczyk, B. Zielosko, and L. C. Jain, Eds. Springer, 2018, vol. 138, pp. 1–9.
14. I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,” Machine Learning, vol. 46, no. 1-3, pp. 389–422, 2002.
15. W. Altidor, T. M. Khoshgoftaar, and J. V. Hulse, “An empirical study on wrapper-based feature ranking,” in 2009 21st IEEE International Conference on Tools with Artificial Intelligence, 2009, pp. 75–82.
16. Z. Pawlak and A. Skowron, “Rough sets and boolean reasoning,” Information Sciences, vol. 177, no. 1, pp. 41–73, 2007.
17. A. Janusz and D. Ślęzak, “Utilization of attribute clustering methods for scalable computation of reducts from high-dimensional data,” in 2012 Federated Conference on Computer Science and Information Systems (FedCSIS), 2012, pp. 295–302.
18. Y. Yang, D. Chen, H. Wang, E. C. Tsang, and D. Zhang, “Fuzzy rough set based incremental attribute reduction from dynamic data with sample arriving,” Fuzzy Sets and Systems, vol. 312, pp. 66–86, 2017.
19. Y. Liu, L. Zheng, Y. Xiu, H. Yin, S. Zhao, X. Wang, H. Chen, and C. Li, “Discernibility matrix based incremental feature selection on fused decision tables,” International Journal of Approximate Reasoning, vol. 118, pp. 1–26, 2020.
20. J. Henzel, A. Janusz, M. Sikora, and D. Ślęzak, “On positive-correlation-promoting reducts,” in Rough Sets, R. Bello, D. Miao, R. Falcon, M. Nakata, A. Rosete, and D. Ciucci, Eds. Springer International Publishing, 2020, pp. 213–221.
21. J. Wróblewski, “Ensembles of classifiers based on approximate reducts,” Fundam. Informaticae, vol. 47, no. 3–4, pp. 351–360, 2001.
22. J. Bazan and M. Szczuka, “The rough set exploration system,” in Transactions on Rough Sets III, ser. Lecture Notes in Computer Science, J. F. Peters and A. Skowron, Eds. Berlin, Heidelberg: Springer, 2005, vol. 3400, pp. 37–56.
23. J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14, no. 5, pp. 465–471, 1978.
24. R. Holte, “Very simple classification rules perform well on most commonly used datasets,” Machine Learning, vol. 11, pp. 63–91, 1993.
25. S. Ali and K. A. Smith, “On learning algorithm selection for classification,” Applied Soft Computing, vol. 6, no. 2, pp. 119–138, 2006.
26. M. J. Moshkov, M. Piliszczuk, and B. Zielosko, “Greedy algorithm for construction of partial association rules,” Fundam. Informaticae, vol. 92, no. 3, pp. 259–277, 2009.
27. M. J. Moshkov, M. Piliszczuk, and B. Zielosko, “On construction of partial reducts and irreducible partial decision rules,” Fundam. Informaticae, vol. 75, no. 1-4, pp. 357–374, 2007.
28. B. Zielosko, “Sequential optimization of γ-decision rules,” in Federated Conference on Computer Science and Information Systems - FedCSIS 2012, Wroclaw, Poland, 9-12 September 2012, Proceedings, M. Ganzha, L. A. Maciaszek, and M. Paprzycki, Eds., 2012, pp. 339–346.
29. E. Stamatatos, “A survey of modern authorship attribution methods,” Journal of the American Society for Information Science and Technology, vol. 60, no. 3, pp. 538–556, 2009.
30. M. Eder, “Style-markers in authorship attribution: A cross-language study of the authorial fingerprint,” Studies in Polish Linguistics, vol. 6, no. 1, pp. 99–114, 2011.
31. H. Wu, Z. Zhang, and Q. Wu, “Exploring syntactic and semantic features for authorship attribution,” Applied Soft Computing, vol. 111, p. 107815, 2021.
32. S. G. Weidman and J. O’Sullivan, “The limits of distinctive words: Reevaluating literature’s gender marker debate,” Digital Scholarship in the Humanities, vol. 33, pp. 374–390, 2018.
33. U. Stańczyk and G. Baron, “On heterogeneity or sub-classes aspect in construction of stylometric input datasets,” in Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 26th International Conference KES-2022, Verona, Italy, 7-9 September 2022, ser. Procedia Computer Science, M. Cristani, C. Toro, C. Zanni-Merk, R. J. Howlett, and L. C. Jain, Eds. Elsevier, 2022, vol. 207, pp. 2526–2535.
34. U. M. Fayyad and K. B. Irani, “Multi-interval discretization of continuous-valued attributes for classification learning,” in 13th International Joint Conference on Artificial Intelligence, vol. 2. Morgan Kaufmann Publishers, 1993, pp. 1022–1027.

Notes

1. The research works presented in the article were carried out at the Institute of Computer Science, University of Silesia in Katowice, Sosnowiec, Poland, and within the statutory project of the Department of Graphics, Computer Vision and Digital Systems (RAU-6, 2023), at the Silesian University of Technology, Gliwice, Poland.

2. Thematic Tracks: Regular Papers

3. Record prepared with funds from MEiN, agreement no. SONP/SP/546092/2022, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science), module: Popularisation of science and promotion of sport (2024).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-039f49a8-3f32-4161-9108-2265efd16b3c