Abstract
The main aim of this paper is to evaluate crawlers that collect job offers from websites. In particular, the research focuses on the effectiveness of ensemble machine learning methods in checking the validity of the position extracted from job ads. Moreover, to reduce the training time of the algorithms (Random Forests and XGBoost), granulation methods were tested as a way to significantly shrink the input training dataset. Both methods achieved satisfactory results, with accuracy and F1 measures exceeding 96%. In addition, granulation reduced the input dataset by more than 99%, while the results obtained were only slightly worse (accuracy lower by between 1% and 5%, F1 lower by between 3% and 8%). Thus, it can be concluded that the considered methods can be used in the evaluation of job web crawlers.
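The abstract reports results as accuracy and F1. As a minimal sketch of how these two measures are commonly computed for a binary "valid / invalid position" label, the following Python may help; the function name, the label encoding (1 = valid, 0 = invalid), and the sample labels are illustrative assumptions and are not taken from the paper, which may use a different F1 variant (for example macro-averaged).

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for two equal-length label sequences.

    F1 is computed for the `positive` class only (binary F1).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical labels: 1 = extracted position is valid, 0 = invalid.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

Reporting both measures is useful when valid and invalid positions are unevenly represented, since accuracy alone can look high even if one class is handled poorly.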
Pages
125-140
Physical description
Bibliography: 22 items, figures, tables, charts.
Authors
author
- Katedra Metod Matematycznych Informatyki, Wydział Matematyki i Informatyki, ul. Słoneczna 54, 10-710 Olsztyn
- Emplocity SA, Warszawa
author
- Faculty of Mathematics and Computer Science, University of Warmia and Mazury in Olsztyn
author
- Faculty of Mathematics and Computer Science, University of Warmia and Mazury in Olsztyn
- Emplocity SA, Warszawa
author
- Emplocity SA, Warszawa
author
- Emplocity SA, Warszawa
Bibliography
- ARTIEMJEW P., ROPIAK K. 2021. A Novel Ensemble Model – The Random Granular Reflections. Fundam. Informaticae, 179(2): 183-203.
- CHANG Y.J., TSAI K.L., JIANG W.C., LIU M.K. 2023. Content-aware malicious webpage detection using convolutional neural network. In Multimedia Tools and Applications, p. 1-19. https://doi.org/10.1007/s11042-023-15559-8
- CHEN T., GUESTRIN C.E. 2016. XGBoost: A Scalable Tree Boosting System. In: KDD’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p. 785-794. https://doi.org/10.1145/2939672.2939785
- DROZDA P., TALUN A., BUKOWSKI L. 2019. Emplobot – design of the system. In Proceedings of the 28th International Workshop on Concurrency, Specification and Programming.
- FINN A., KUSHMERICK N., SMYTH B. 2001. Fact or fiction: Content classification for digital libraries. In Proc. Joint DELOS-NSF Workshop, Personalization Recommender Syst. Digit. Libraries.
- HASHEMI M. 2020. Web page classification: a survey of perspectives, gaps, and future directions. Multimed Tools Appl, 79: 11921-11945. https://doi.org/10.1007/s11042-019-08373-8
- HO T.K. 1995. Random decision forests. Proceedings of 3rd International Conference on Document Analysis and Recognition, 1: 278-282. https://doi.org/10.1109/ICDAR.1995.598994
- KAO A., POTEET S. 2006. Natural Language Processing and Text Mining. Springer, Berlin.
- KIM Y.S., LEE C.K. 2016. An Empirical Evaluation of Job Classification Using Online Job Advertisements. In AI 2016: Advances in Artificial Intelligence. LNCS, 9992. https://doi.org/10.1007/978-3-319-50127-7_65
- LEŚNIEWSKI S. 1916. Podstawy ogólnej teoryi mnogości. I. Prace Polskiego Koła Naukowego w Moskwie, Sekcya Matematyczno-Przyrodnicza, No. 2, Zakład Wyd. Popławski. Eng. tr. in S. Leśniewski. 1992. Collected Works. Kluwer, Dordrecht, p. 129-173.
- LOTFI C., SRINIVASAN S., ERTZ M., LATROUS I. 2021. Web Scraping Techniques and Applications: A Literature Review. In R. Pal, P.K. Shukla (eds), SCRS Conference Proceedings on Intelligent Systems. SCRS, India, p. 381-394. https://doi.org/10.52458/978-93-91842-08-6-38
- NOWICKI R.K., STARCZEWSKI J.T. 2017. A new method for classification of imprecise data using fuzzy rough fuzzification. Information Sciences, 414. https://doi.org/10.1016/j.ins.2017.05.049
- PARVEZ M.S., TASNEEM K.S.A., RAJENDRA S.S., BODKE K.R. 2018. Analysis of Different Web Data Extraction Techniques. International Conference on Smart City and Emerging Technology (ICSCET), p. 1-7. https://doi.org/10.1109/ICSCET.2018.8537333
- PAWLAK Z. 1982. Rough sets. International Journal of Computer & Information Sciences, 11: 341-356.
- POLKOWSKI L. 2007. Granulation of knowledge in decision systems: The approach based on rough inclusions. The method and its applications. LNAI, 4585, Proceedings of RSEISP 2007: Rough Sets and Intelligent Systems Paradigms, p. 69-79.
- QI J. 2012. Random Forest for Bioinformatics. In: Ensemble Machine Learning. Springer, New York. https://doi.org/10.1007/978-1-4419-9326-7_1
- RABBI J. 2021. How long does it take to land a new job and how to reduce this time. Retrieved from https://www.linkedin.com/pulse/how-long-does-take-land-new-job-reduce-time-juliana (2.03.2021).
- ROPIAK K., ARTIEMJEW P. 2018. A Study in Granular Computing: Homogenous Granulation. 24th International Conference, ICIST 2018, Vilnius, Lithuania, October 4-6, p. 336-346. Proceedings. https://doi.org/10.1007/978-3-319-99972-2_27
- SHETE D., BOJEWAR S., SANGHVI A. 2021. Survey Paper on Web Content Extraction & Classification. 6th International Conference for Convergence in Technology (I2CT), p. 1-6. https://doi.org/10.1109/I2CT51068.2021.9417947
- TALUN A., DROZDA P., BUKOWSKI L., SCHERER R. 2020. FastText and XGBoost Content-Based Classification for Employment Web Scraping. In: Artificial Intelligence and Soft Computing, ICAISC 2020. https://doi.org/10.1007/978-3-030-61534-5_39
- TREVISO M., LEE J.-U., JI T., VAN AKEN B., CAO Q., CIOSICI M.R., HASSID M., HEAFIELD K., HOOKER S., RAFFEL C., MARTINS P.H., MARTINS A.F.T., FORDE J.Z., MILDER P., SIMPSON E., SLONIM N., DODGE J., STRUBELL E., BALASUBRAMANIAN N., DERCZYNSKI L., GUREVYCH I., SCHWARTZ R. 2023. Efficient Methods for Natural Language Processing: A Survey. Transactions of the Association for Computational Linguistics, 11: 826-860. https://doi.org/10.1162/tacl_a_00577
- ZOU X.-Q., ZHANG P., HUANG C.-Y., BAO X.-G. 2019. Malicious Websites Identification Based on Active-Passive Method. CNCERT 2018. Communications in Computer and Information Science, 970. https://doi.org/10.1007/978-981-13-6621-5_9
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-1d6e8529-d479-46a4-b438-ca2b12394d3f