Journal
2024 | Vol. 25 (1) | 63–93
Article title

Generalizing clustering inferences with ML augmentation of ordinal survey data

Title variants
Publication languages
EN
Abstracts
EN
In this paper, we attempt to generalize the ability to achieve quality inferences of survey data for a larger population through data augmentation and unification. Data augmentation techniques have proven effective in enhancing models' performance by expanding the dataset's size. We employ ML data augmentation, unification, and clustering techniques. First, we augment the limited survey data size using data augmentation technique(s). Second, we carry out data unification, followed by clustering for inferencing. We took two benchmark survey datasets to demonstrate the effectiveness of augmentation and unification. The first dataset contains information on aspiring student entrepreneurs' characteristics, while the second dataset comprises survey data related to breast cancer. We compare the inferences drawn from the original survey data with those derived from the transformed data using the proposed scheme. The results of this study indicate that the machine learning approach, data augmentation with the unification of data followed by clustering, can be beneficial for generalizing the inferences drawn from the survey data.
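The abstract outlines a three-step scheme: augment the limited survey sample, unify the responses onto a common numerical scale, and cluster the unified data to draw inferences. The short Python sketch below only illustrates that flow; its concrete choices are assumptions, not the authors' implementation: SMOTE (ref. [12]) over a provisional quartile label as the augmenter, min-max scaling as the unification step, k-means with the silhouette index (ref. [39]) for clustering and validation, and purely synthetic toy data.

# Minimal, hypothetical sketch of an augment / unify / cluster pipeline.
import numpy as np
from imblearn.over_sampling import SMOTE          # SMOTE as in ref. [12]
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score      # silhouette index, ref. [39]

rng = np.random.default_rng(0)

# Toy ordinal survey data: 60 respondents answering 8 Likert-type items (1..5).
X = rng.integers(1, 6, size=(60, 8)).astype(float)

# Step 1 -- augmentation. SMOTE needs a class label, so a provisional
# "high scorer" label (top quartile of total score) is used purely so that
# synthetic respondents can be generated for the smaller group.
total = X.sum(axis=1)
y = (total >= np.quantile(total, 0.75)).astype(int)
X_aug, _ = SMOTE(random_state=0).fit_resample(X, y)

# Step 2 -- unification: map the augmented ordinal codes onto [0, 1].
X_uni = MinMaxScaler().fit_transform(X_aug)

# Step 3 -- clustering and internal validation of the resulting partition.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_uni)
print("cluster sizes:", np.bincount(labels),
      "silhouette:", round(silhouette_score(X_uni, labels), 3))

Swapping SMOTE for another augmenter (e.g. ADASYN, ref. [23]) or k-means for a different clusterer changes only steps 1 and 3; the augment-then-unify-then-cluster structure described in the abstract stays the same.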
Publisher

Journal
Year
2024
Volume
25 (1)
Pages
63–93
Physical description
Bibliography: 51 items, figures, tables, charts.
Authors
  • Jawaharlal Nehru University, Data to Knowledge (D2K) Lab, School of Computer & Systems Sciences, New Delhi 110 067, India, bkchauhan86@gmail.com
  • Jawaharlal Nehru University, Data to Knowledge (D2K) Lab, School of Computer & Systems Sciences, New Delhi 110 067, India, rajeevkumar.cse@gmail.com
Bibliography
  • [1] Aggarwal C.C.: An introduction to cluster analysis. In: C.C. Aggarwal, C.K. Reddy (eds.), Data clustering, chapter 1, pp. 1–28, Chapman and Hall/CRC, 2018. doi: 10.1201/9781315373515-1.
  • [2] Agrawal T., Choudhary P.: Segmentation and classification on chest radiography: a systematic survey, The Visual Computer, vol. 39(3), pp. 875–913, 2023.
  • [3] Ahmad A., Khan S.S.: Survey of State-of-the-Art Mixed Data Clustering Algorithms, IEEE Access, vol. 7, pp. 31883–31902, 2019. doi: 10.1109/access.2019.2903568.
  • [4] Back B., Sere K., Vanharanta H.: Managing complexity in large databases using self-organizing maps, Accounting, Management and Information Technologies, vol. 8(4), pp. 191–210, 1998. doi: 10.1016/s0959-8022(98)00009-5.
  • [5] Behrend T.S., Sharek D.J., Meade A.W., Wiebe E.N.: The viability of crowdsourcing for survey research, Behavior Research Methods, vol. 43, pp. 800–813, 2011. doi: 10.3758/s13428-011-0081-0.
  • [6] Belloni A., Chernozhukov V., Hansen C.: Inference on treatment effects after selection among high-dimensional controls, The Review of Economic Studies, vol. 81(2), pp. 608–650, 2014.
  • [7] Bowles C., Chen L., Guerrero R., Bentley P., Gunn R., Hammers A., Dickie D.A., Hernández M.V., Wardlaw J., Rueckert D.: GAN augmentation: Augmenting training data using generative adversarial networks, arXiv:1810.10863, 2018.
  • [8] Buskirk T.D., Kirchner A., Eck A., Signorino C.S.: An Introduction to Machine Learning Methods for Survey Researchers, 2018. doi: 10.29115/sp-2018-0004.
  • [9] Bzdok D., Altman N., Krzywinski M.: Statistics versus machine learning, Nature Methods, vol. 15, pp. 233–234, 2018. doi: 10.1038/nmeth.4642.
  • [10] Caliński T., Harabasz J.: A Dendrite Method for Cluster Analysis, Communications in Statistics – Theory & Methods, vol. 3(1), pp. 1–27, 1974. doi: 10.1080/03610927408827101.
  • [11] Cameron A.C., Miller D.L.: A Practitioner's Guide to Cluster-Robust Inference, Journal of Human Resources, vol. 50(2), pp. 317–372, 2015. doi: 10.3368/jhr.50.2.317.
  • [12] Chawla N.V., Bowyer K.W., Hall L.O., Kegelmeyer W.P.: SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002. doi: 10.1613/jair.953.
  • [13] Chen C., Wang Y., Hu W., Zheng Z.: Robust multi-view K-means clustering with outlier removal, Knowledge-Based Systems, vol. 210, 106518, 2020.
  • [14] Chen Y., Tang S., Bouguila N., Wang C., Du J., Li H.: A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data, Pattern Recognition, vol. 83, pp. 375–387, 2018. doi: 10.1016/j.patcog.2018.05.030.
  • [15] Church A.H., Waclawski J.: Designing and using organizational surveys: A seven-step process, John Wiley & Sons, 2001.
  • [16] Dempster A.P., Laird N.M., Rubin D.B.: Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society: Series B (Methodological), vol. 39(1), pp. 1–22, 1977.
  • [17] van Dyk D.A., Meng X.L.: The Art of Data Augmentation, Journal of Computational and Graphical Statistics, vol. 10(1), pp. 1–50, 2001. doi: 10.1198/10618600152418584.
  • [18] Firdaus S., Uddin M.A.: A survey on clustering algorithms and complexity analysis, International Journal of Computer Science Issues, vol. 12(2), 62, 2015.
  • [19] García-Jara G., Protopapas P., Estévez P.A.: Improving astronomical time-series classification via data augmentation with generative adversarial networks, The Astrophysical Journal, vol. 935(1), 23, 2022.
  • [20] Giordan M., Diana G.: A clustering method for categorical ordinal data, Communications in Statistics – Theory & Methods, vol. 40(7), pp. 1315–1334, 2011. doi: 10.1080/03610920903581010.
  • [21] Golinko E., Sonderman T., Zhu X.: CNFL: Categorical to Numerical Feature Learning for Clustering and Classification. In: 2017 IEEE Second International Conference on Data Science in Cyberspace (DSC), Shenzhen, China, pp. 585–594, 2017. doi: 10.1109/DSC.2017.87.
  • [22] Graubard B.I., Korn E.L.: Inference for Superpopulation Parameters using Sample Surveys, Statistical Science, vol. 17(1), pp. 73–96, 2002. doi: 10.1214/ss/1023798999.
  • [23] He H., Bai Y., Garcia E.A., Li S.: ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, pp. 1322–1328, 2008. doi: 10.1109/IJCNN.2008.4633969.
  • [24] Kern C., Klausch T., Kreuter F.: Tree-based machine learning methods for survey research, Survey Research Methods, vol. 13(1), pp. 73–93, 2019.
  • [25] Kim K., Hong J.S.: A hybrid decision tree algorithm for mixed numeric and categorical data in regression analysis, Pattern Recognition Letters, vol. 98, pp. 39–45, 2017. doi: 10.1016/j.patrec.2017.08.011.
  • [26] Kumar B., Kumar R.: Difference-Attribute-Based Clustering for Ordinal Survey Data. In: A.K. Dubey, V. Sugumaran, P.H.J. Chong (eds.), Advanced IoT Sensors, Networks and Systems. SPIN 2022, pp. 17–27, Springer, 2022. doi: 10.1007/978-981-99-1312-12.
  • [27] Kumar B., Kumar R.: Entropy-based clustering for subspace pattern discovery in ordinal survey data. In: V. Bhateja, X.S. Yang, J. Chun-Wei Lin, R. Das (eds.), Intelligent Data Engineering and Analytics. FICTA 2022. Smart Innovation, Systems and Technologies, pp. 509–519, Springer, 2022. doi: 10.1007/978-981-19-7524-045.
  • [28] Kumar B., Kumar R.: Unification of Numerical and Ordinal Survey Data for Clustering-based Inferencing, INFOCOMP Journal of Computer Science, vol. 22(1), 2023. https://infocomp.dcc.ufla.br/index.php/infocomp/article/view/2492.
  • [29] Kumar R., Rockett P.: Multiobjective genetic algorithm partitioning for hierarchical learning of high-dimensional pattern spaces: a learning-follows-decomposition strategy, IEEE Transactions on Neural Networks, vol. 9(5), pp. 822–830, 1998. doi: 10.1109/72.712155.
  • [30] Ley C., Martin R.K., Pareek A., Groll A., Seil R., Tischer T.: Machine learning and conventional statistics: making sense of the differences, Knee Surgery, Sports Traumatology, Arthroscopy, vol. 30(3), pp. 753–757, 2022. doi: 10.1007/s00167-022-06896-6.
  • [31] Luchi D., Rodrigues A.L., Varejão F.M.: Sampling approaches for applying DBSCAN to large datasets, Pattern Recognition Letters, vol. 117, pp. 90–96, 2019. doi: 10.1016/j.patrec.2018.12.010.
  • [32] Mamabolo M.A., Myres K.: A detailed guide on converting qualitative data into quantitative entrepreneurial skills survey instrument, The Electronic Journal of Business Research Methods, vol. 17(3), pp. 102–117, 2019. doi: 10.34190/JBRM.17.3.001.
  • [33] Mason M.: Sample size and saturation in PhD studies using qualitative interviews, Forum Qualitative Sozialforschung/Forum: Qualitative Social Research, vol. 11(3), 2010. doi: 10.17169/fqs-11.3.1428.
  • [34] Nardo M.: The quantification of qualitative survey data: a critical assessment, Journal of Economic Surveys, vol. 17(5), pp. 645–668, 2003.
  • [35] Pakhira M.K.: A Linear Time-Complexity k-Means Algorithm Using Cluster Shifting. In: 2014 International Conference on Computational Intelligence and Communication Networks, CICN'2014, pp. 1047–1051, 2014. doi: 10.1109/CICN.2014.220.
  • [36] Rastogi R., Mondal P., Agarwal K., Gupta R., Jain S.: GA based clustering of mixed data type of attributes (numeric, categorical, ordinal, binary, and ratio-scaled), BIJIT – BVICAM's International Journal of Information Technology, vol. 7(2), pp. 861–866, 2015.
  • [37] Rich T.S.: South Korean perceptions of unification: Evidence from an experimental survey, Georgetown Journal of International Affairs, vol. 20, pp. 142–149, 2019. doi: 10.1353/gia.2019.0022.
  • [38] Rodriguez M.Z., Comin C.H., Casanova D., Bruno O.M., Amancio D.R., Costa L.d.F., Rodrigues F.A.: Clustering algorithms: A comparative approach, PLoS ONE, vol. 14(1), e0210236, 2019. doi: 10.1371/journal.pone.0210236.
  • [39] Rousseeuw P.J.: Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis, Journal of Computational & Applied Mathematics, vol. 20, pp. 53–65, 1987. doi: 10.1016/0377-0427(87)90125-7.
  • [40] Sadh R., Kumar R.: Clustering of Quantitative Survey Data based on Marking Patterns, INFOCOMP Journal of Computer Science, vol. 19(2), pp. 109–119, 2020.
  • [41] Sadh R., Kumar R.: Transformation and classification of ordinal survey data, Computer Science, vol. 24(2), 2023. doi: 10.7494/csci.2023.24.2.4871.
  • [42] Schliep E.M., Hoeting J.A.: Data augmentation and parameter expansion for independent or spatially correlated ordinal data, Computational Statistics & Data Analysis, vol. 90, pp. 1–14, 2015. doi: 10.1016/j.csda.2015.03.020.
  • [43] Stevens S.S.: On the theory of scales of measurement, Science, vol. 103(2684), pp. 677–680, 1946. doi: 10.1126/science.103.2684.677.
  • [44] Taylor L., Nitschke G.: Improving Deep Learning with Generic Data Augmentation. In: 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India, pp. 1542–1547, 2018. doi: 10.1109/SSCI.2018.8628742.
  • [45] Temraz M., Keane M.T.: Solving the class imbalance problem using a counterfactual method for data augmentation, Machine Learning with Applications, vol. 9, 100375, 2022.
  • [46] Tourangeau R.: Cognitive aspects of survey measurement and mismeasurement, International Journal of Public Opinion Research, vol. 15(1), pp. 3–7, 2003. doi: 10.1093/ijpor/15.1.3.
  • [47] Valsiner J., Molenaar P.C., Lyra M.C.D.P., Chaudhary N.: Dynamic Process Methodology in the Social and Developmental Sciences, Springer, 2009.
  • [48] Van Hulse J., Khoshgoftaar T.M., Napolitano A.: Experimental perspectives on learning from imbalanced data. In: ICML '07: Proceedings of the 24th International Conference on Machine Learning, pp. 935–942, 2007. doi: 10.1145/1273496.1273614.
  • [49] Velleman P.F., Wilkinson L.: Nominal, ordinal, interval, and ratio typologies are misleading, The American Statistician, vol. 47(1), pp. 65–72, 1993. doi: 10.1515/9783110887617.161.
  • [50] Zhang Y., Cheung Y.M.: Learnable Weighting of Intra-Attribute Distances for Categorical Data Clustering with Nominal and Ordinal Attributes, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44(7), pp. 3560–3576, 2021. doi: 10.1109/TPAMI.2021.3056510.
  • [51] Zhang Y., Cheung Y.M., Tan K.C.: A Unified Entropy-Based Distance Metric for Ordinal-and-Nominal-Attribute Data Clustering, IEEE Transactions on Neural Networks and Learning Systems, vol. 31(1), pp. 39–52, 2019. doi: 10.1109/TNNLS.2019.2899381.
Remarks
Record developed with MNiSW funds, agreement no. SONP/SP/546092/2022, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science), module: Popularisation of Science and Promotion of Sport (2024).
Document type
Bibliography
Identifiers
YADDA identifier
bwmeta1.element.baztech-a2ad6dda-ccc5-4fc4-905a-929357d6a0a8