Reranking for a Polish Medical Search Engine

Pokrywka, Jakub; Jassem, Krzysztof; Wierzchoń, Piotr; Badylak, Piotr; Kurzyp, Grzegorz

doi:10.15439/2023F1627

Artykuł - szczegóły

Tytuł artykułu

Reranking for a Polish Medical Search Engine

Autorzy

Pokrywka Jakub , Jassem Krzysztof , Wierzchoń Piotr , Badylak Piotr , Kurzyp Grzegorz

Wybrane pełne teksty z tego czasopisma

http://annals-csis.org

Identyfikatory

DOI

10.15439/2023F1627

Warianty tytułu

Języki publikacji

Abstrakty

Healthcare professionals are often overworked, which may impair their efficacy. Text search engines may facilitate their work. However, before making health decisions, it is important for a medical professional to consult verified sources rather than unknown web pages. In this work, we present our approach for creating a text search engine based on verified resources in the Polish language, dedicated to medical workers. This consists of collecting and comprehensively analyzing texts annotated by medical professionals and evaluating various neural reranking models. During the annotation process, we differentiate between an abstract information need and a search query. Our study shows that even within a group of trained medical specialists there is extensive disagreement on the relevance of a document to the information need. We prove that available multilingual rerankers trained in the zero-shot setup are effective for the Polish language in searches initiated by both natural language expressions and keyword search queries.

Słowa kluczowe

computer science analytical models annotation computational modeling natural languages keyword search web page

informatyka modele analityczne modelowanie obliczeniowe języki naturalne wyszukiwanie słów kluczowych strona internetowa

Wydawca

Polskie Towarzystwo Informatyczne

Czasopismo

Annals of Computer Science and Information Systems

Rocznik

2023

Tom

Vol. 35

Strony

297--302

Opis fizyczny

Bibliogr. 27 poz., il., tab.

Twórcy

autor

Pokrywka Jakub

jakub.pokrywka@amu.edu.pl

Adam Mickiewicz University Faculty of Mathematics and Computer Science

autor

Jassem Krzysztof

krzysztof.jassem@amu.edu.pl

Adam Mickiewicz University Faculty of Mathematics and Computer Science

autor

Wierzchoń Piotr

piotr.wierzchon@amu.edu.pl

Adam Mickiewicz University Faculty of Mathematics and Computer Science

autor

Badylak Piotr

piotr.badylak@pwn.pl

WN PWN

autor

Kurzyp Grzegorz

grzegorz.kurzyp@pwn.pl

WN PWN

Bibliografia

1. I. Portoghese, M. Galletta, R. C. Coppola, G. Finco, and M. Campagna, “Burnout and workload among health care workers: the moderating role of job control,” Safety and Health at Work, vol. 5, no. 3, pp. 152–157, 2014.
2. P. Lewis, M. Ott, J. Du, and V. Stoyanov, “Pretrained language models for biomedical and clinical tasks: Understanding and extending the state-of-the-art,” in Proceedings of the 3rd Clinical Natural Language Processing Workshop, (Online), pp. 146–157, Association for Computational Linguistics, Nov. 2020.
3. H.-C. Shin, Y. Zhang, E. Bakhturina, R. Puri, M. Patwary, M. Shoeybi, and R. Mani, “BioMegatron: Larger biomedical domain language model,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), (Online), pp. 4700–4706, Association for Computational Linguistics, Nov. 2020.
4. J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, “BioBERT: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, pp. 1234–1240, 09 2019.
5. Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon, “Domain-specific language model pretraining for biomedical natural language processing,” 2020.
6. R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, and T.-Y. Liu, “Biogpt: generative pre-trained transformer for biomedical text generation and mining,” Briefings in Bioinformatics, vol. 23, no. 6, 2022.
7. K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, P. Payne, M. Seneviratne, P. Gamble, C. Kelly, N. Scharli, A. Chowdhery, P. Mansfield, B. A. y. Arcas, D. Webster, G. S. Corrado, Y. Matias, K. Chou, J. Gottweis, N. Tomasev, Y. Liu, A. Rajkomar, J. Barral, C. Semturs, A. Karthikesalingam, and V. Natarajan, “Large language models encode clinical knowledge,” 2022.
8. Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon, “Domain-specific language model pretraining for biomedical natural language processing,” ACM Trans. Comput. Healthcare, vol. 3, oct 2021.
9. Q. Jin, Z. Yuan, G. Xiong, Q. Yu, H. Ying, C. Tan, M. Chen, S. Huang, X. Liu, and S. Yu, “Biomedical question answering: A survey of approaches and challenges,” ACM Comput. Surv., vol. 55, jan 2022.
10. Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Comput. Surv., nov 2022. Just Accepted.
11. Y. Xiao and W. Y. Wang, “On hallucination and predictive uncertainty in conditional language generation,” arXiv preprint https://arxiv.org/abs/2103.15025, 2021.
12. N. Dziri, S. Milton, M. Yu, O. Zaiane, and S. Reddy, “On the origin of hallucinations in conversational models: Is it the datasets or the models?,” arXiv preprint https://arxiv.org/abs/2204.07931, 2022.
13. N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, and I. Gurevych, “Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models,” 2021.
14. L. Bonifacio, V. Jeronymo, H. Q. Abonizio, I. Campiotti, M. Fadaee, R. Lotufo, and R. Nogueira, “mmarco: A multilingual version of the ms marco passage ranking dataset,” 2021.
15. P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, et al., “Ms marco: A human generated machine reading comprehension dataset,” arXiv preprint https://arxiv.org/abs/1611.09268, 2016.
16. A. Bondarenko, E. Shirshakova, M. Driker, M. Hagen, and P. Braslavski, “Misbeliefs and biases in health-related searches,” in Proceedings of the 30th ACM International Conference on Information Knowledge Management, CIKM ’21, (New York, NY, USA), p. 2894–2899, Association for Computing Machinery, 2021.
17. D. Cohen, K. Du, B. Mitra, L. Mercurio, N. Rekabsaz, and C. Eickhoff, “Inconsistent ranking assumptions in medical search and their downstream consequences,” in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, (New York, NY, USA), p. 2572–2577, Association for Computing Machinery, 2022.
18. N. Rekabsaz, O. Lesota, M. Schedl, J. Brassey, and C. Eickhoff, “Tripclick: The log files of a large health web search engine,” in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, (New York, NY, USA), p. 2507–2513, Association for Computing Machinery, 2021.
19. J. Jimmy, G. Zuccon, J. Palotti, L. Goeuriot, and L. Kelly, “Overview of the clef 2018 consumer health search task,” International Conference of the Cross-Language Evaluation Forum for European Languages, vol. 2125, 2018.
20. K. Roberts, D. Demner-Fushman, E. M. Voorhees, W. R. Hersh, S. Bedrick, A. J. Lazar, S. Pant, and F. Meric-Bernstam, “Overview of the trec 2019 precision medicine track,” in Proceedings of the Text Retrieval Conference (TREC), vol. 1250, NIH Public Access, 2019.
21. M. Miłkowski and P. IFiS, “Morfologik,” Web document: http://morfologik. blogspot. com, 2007.
22. Y. Wang, L. Wang, Y. Li, D. He, and T.-Y. Liu, “A theoretical analysis of ndcg type ranking measures,” in Conference on learning theory, pp. 25–54, PMLR, 2013.
23. R. Mroczkowski, P. Rybak, A. Wróblewska, and I. Gawlik, “HerBERT: Efficiently pretrained transformer-based language model for Polish,” in Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, (Kiyv, Ukraine), pp. 1–10, Association for Computational Linguistics, Apr. 2021.
24. P. Rybak, R. Mroczkowski, J. Tracz, and I. Gawlik, “Klej: Comprehensive benchmark for polish language understanding,” arXiv preprint https://arxiv.org/abs/2005.00630, 2020.
25. L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, “mT5: A massively multilingual pre-trained text-to-text transformer,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (Online), pp. 483–498, Association for Computational Linguistics, June 2021.
26. N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 11 2019.
27. O. Khattab and M. Zaharia, “Colbert: Efficient and effective passage search via contextualized late interaction over bert,” in Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pp. 39–48, 2020.

Uwagi

1. Main Track Short Papers

2. Opracowanie rekordu ze środków MEiN, umowa nr SONP/SP/546092/2022 w ramach programu "Społeczna odpowiedzialność nauki" - moduł: Popularyzacja nauki i promocja sportu (2024).

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-84672127-d0c6-4aff-bedf-7b051188f7fa