

Article title

Enhancing naive classifier for positive unlabeled data based on logistic regression approach

Identifiers
Title variants
Publication languages
EN
Abstracts
EN
It is argued that for the analysis of Positive Unlabeled (PU) data under the Selected Completely At Random (SCAR) assumption it is fruitful to view the problem as the fitting of a misspecified model to the data. Namely, it is shown that results on misspecified fits imply that, when the posterior probability of the response is modelled by logistic regression, fitting logistic regression to the observable PU data, which does not follow this model, still yields a vector of estimated parameters that is approximately collinear with the true parameter vector. This observation, together with choosing the intercept of the classifier by optimising an analogue of the F1 measure, yields a classifier which performs on par with or better than its competitors on several real data sets considered.
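A minimal sketch of the procedure described in the abstract is given below; it is not the authors' implementation. It assumes that ordinary logistic regression (scikit-learn) is fitted to the observed PU labels s (1 = labeled positive, 0 = unlabeled), and that only the decision threshold on the linear score (equivalently, the intercept) is then tuned by maximising a PU-estimable analogue of the F1 measure. The Lee-Liu criterion recall^2 / Pr(predicted positive) is used here as that analogue, and the threshold grid is an arbitrary illustrative choice.

import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_pu_classifier(X, s, n_thresholds=200):
    """Fit logistic regression to PU labels and tune the decision threshold.

    X : feature matrix, shape (n_samples, n_features)
    s : observed PU labels, 1 = labeled positive, 0 = unlabeled
    """
    s = np.asarray(s)

    # Step 1: naive fit to the observed PU labels. By the misspecification
    # argument the fitted coefficient vector is approximately collinear with
    # the true one, so only the intercept/threshold needs correction.
    clf = LogisticRegression(max_iter=1000).fit(X, s)
    scores = clf.decision_function(X)          # linear scores w^T x + b

    # Step 2: scan thresholds and keep the one maximising the PU analogue of
    # F1: recall(t)^2 / Pr(score > t), both estimable without true labels.
    labeled = s == 1
    best_t, best_val = None, -np.inf
    for t in np.quantile(scores, np.linspace(0.01, 0.99, n_thresholds)):
        pred = scores > t
        recall = pred[labeled].mean()          # recall on labeled positives
        pred_rate = pred.mean()                # Pr(predicted positive)
        if pred_rate > 0 and recall ** 2 / pred_rate > best_val:
            best_t, best_val = t, recall ** 2 / pred_rate
    return clf, best_t


# Usage on new data: predict positive when clf.decision_function(X_new) > best_t.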
Year
Volume
Pages
225--233
Physical description
Bibliography: 14 items, formulas, tables
Authors
  • Warsaw University of Technology, Faculty of Mathematics and Information Science, Koszykowa 75, 00-662 Warsaw, Poland
  • Institute of Computer Science, Polish Academy of Sciences, Jana Kazimierza 5, 01-248 Warsaw, Poland
  • Warsaw University of Technology, Faculty of Mathematics and Information Science, Koszykowa 75, 00-662 Warsaw, Poland
Bibliography
  • 1. J. Bekker and J. Davis. Learning from positive and unlabeled data: a survey. Machine Learning, 109(4):719–760, April 2020. http://dx.doi.org/10.1007/S10994-020-05877-5.
  • 2. J. Bekker and J. Davis. Estimating the class prior in positive and unlabeled data through decision tree induction. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1):2712–2719, April 2018. https://doi.org/10.1609/aaai.v32i1.11715.
  • 3. T. Cover and J. Thomas. Elements of Information Theory. Wiley, New York, NY, 1991. http://dx.doi.org/10.1002/047174882X.
  • 4. C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 213–220, August 2008. http://dx.doi.org/10.1145/1401890.1401920.
  • 5. E. Fowlkes and C. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78:573–586, 1981. https://doi.org/10.2307/2288117.
  • 6. M. Łazecka, J. Mielniczuk, and P. Teisseyre. Estimating the class prior for positive and unlabelled data via logistic regression. Advances in Data Analysis and Classification, 15(4):1039–1068, June 2021. http://dx.doi.org/10.1007/S11634-021-00444-9.
  • 7. W. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic regression. In Proceedings of the Twentieth International Conference on Machine Learning, ICML ’03, pages 448–455, San Francisco, CA, USA, 2003. Morgan Kaufmann Publishers Inc.
  • 8. K-C. Li and N. Duan. Regression analysis under link violation. The Annals of Statistics, 17(3):1009–1052, 1989. http://dx.doi.org/10.1214/aos/1176347254.
  • 9. P. Ruud. Sufficient conditions for the consistency of maximum likelihood estimation despite misspecification of distribution in multinomial discrete choice models. Econometrica, 51:225–228, 1983. http://dx.doi.org/10.2307/1912257.
  • 10. S. Tabatabaei, J. Klein, and M. Hoogendoorn. Estimating the F1 score for learning from positive and unlabeled examples. In LOD 2020. Springer, Cham, 2020. https://doi.org/10.1007/978-3-030-64583-0_15.
  • 11. P. Teisseyre, J. Mielniczuk, and M. Łazecka. Different strategies of fitting logistic regression for positive and unlabeled data. In Proceedings of the International Conference on Computational Science ICCS’20, pages 3–17, Cham, 2020. Springer International Publishing. https://doi.org/10.1007/978-3-030-50423-6_1.
  • 12. Q. Vuong. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57:307–333, 1989. https://doi.org/10.2307/1912557.
  • 13. A. Wawrzenczyk and J. Mielniczuk. Strategies for fitting logistic regression for positive and unlabeled data revisited. Int. J. Appl. Math. Comp. Sci., pages 299–309, 2022. https://doi.org/10.34768/amcs-2022-0022.
  • 14. H. White. Maximum likelihood estimation of misspecified models. Econometrica, 50(1):1–25, 1982. https://doi.org/10.2307/1912526.
Notes
1. Main Track Regular Papers
2. Record developed with funds from the Ministry of Education and Science (MEiN), agreement no. SONP/SP/546092/2022, under the programme "Społeczna odpowiedzialność nauki" ("Social Responsibility of Science") - module: popularisation of science and promotion of sport (2024).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-e1364cc5-43e1-48dd-a218-26bca656a441