Article title

Generalizing clustering inferences with ML augmentation of ordinal survey data

Publication languages
EN
Abstracts
EN
In this paper, we attempt to generalize inferences drawn from survey data to a larger population through data augmentation and unification. Data augmentation techniques have proven effective in enhancing models’ performance by expanding the size of the dataset. We employ ML data augmentation, unification, and clustering techniques. First, we enlarge the limited survey data using data augmentation technique(s). Second, we carry out data unification, followed by clustering for inferencing. We use two benchmark survey datasets to demonstrate the effectiveness of augmentation and unification. The first dataset contains information on the characteristics of aspiring student entrepreneurs, while the second comprises survey data related to breast cancer. We compare the inferences drawn from the original survey data with those derived from the transformed data using the proposed scheme. The results indicate that the machine learning approach, data augmentation with unification of the data followed by clustering, can be beneficial for generalizing the inferences drawn from survey data.
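The three stages named in the abstract (augmentation, unification, clustering) can be illustrated with a minimal sketch. The abstract does not say which augmentation, unification, or clustering algorithms the authors use, so everything below is an assumption made for illustration: augmentation is a simplified interpolation between randomly paired responses (in the spirit of SMOTE, but without its nearest-neighbour step), unification is min-max scaling of a 5-point ordinal scale to [0, 1], and clustering is a plain k-means written in NumPy. The toy responses and all function names are hypothetical.

```python
import numpy as np

def augment_ordinal(X, n_new, low, high, rng):
    """Assumed augmentation: interpolate between random pairs of rows,
    then round back onto the ordinal scale [low, high]."""
    i = rng.integers(0, len(X), size=n_new)
    j = rng.integers(0, len(X), size=n_new)
    t = rng.random((n_new, 1))                      # interpolation weights in [0, 1)
    synthetic = X[i] + t * (X[j] - X[i])
    synthetic = np.clip(np.rint(synthetic), low, high)
    return np.vstack([X, synthetic])

def unify(X, low, high):
    """Assumed unification: map every ordinal answer onto [0, 1]."""
    return (X - low) / (high - low)

def assign(X, centers):
    """Index of the nearest center (squared Euclidean distance) for each row."""
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd-style k-means; an empty cluster keeps its previous center."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        updated = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return assign(X, centers), centers

# Hypothetical toy survey: 6 respondents, 3 questions on a 1..5 Likert scale.
responses = np.array([
    [1, 2, 1],
    [2, 1, 2],
    [1, 1, 2],
    [5, 4, 5],
    [4, 5, 4],
    [5, 5, 4],
])

rng = np.random.default_rng(0)
augmented = augment_ordinal(responses, n_new=20, low=1, high=5, rng=rng)
unified = unify(augmented, low=1, high=5)
labels, centers = kmeans(unified, k=2, rng=rng)
print(augmented.shape, np.bincount(labels, minlength=2))
```

Comparing the cluster assignments of the six original rows with those obtained by clustering `responses` alone would mirror, on a toy scale, the comparison between original and transformed data that the abstract describes.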
Publisher
Journal
Year
Volume
Pages
63–93
Physical description
Bibliography: 51 items, figures, tables, charts.
Authors
  • Jawaharlal Nehru University, Data to Knowledge (D2K) Lab, School of Computer & Systems Sciences, New Delhi 110 067, India
author
  • Jawaharlal Nehru University, Data to Knowledge (D2K) Lab, School of Computer & Systems Sciences, New Delhi 110 067, India
Notes
Record prepared with funds from MNiSW, agreement no. SONP/SP/546092/2022, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science) - module: Popularisation of science and promotion of sport (2024).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-a2ad6dda-ccc5-4fc4-905a-929357d6a0a8