Enhanced Cluster Merging and Deep Learning Techniques for Entity Name Identification from Biomedical Corpus

Das, Nilanjana; Dutta, Rakesh; Mondal, Uttam Kumar; Majumder, Mukta; Mandal, Jyotsna Kumar

doi:10.7494/csci.2025.26.1.5600

Article - details

Article title

Enhanced Cluster Merging and Deep Learning Techniques for Entity Name Identification from Biomedical Corpus

Authors

Das Nilanjana, Dutta Rakesh, Mondal Uttam Kumar, Majumder Mukta, Mandal Jyotsna Kumar

Identifiers

DOI

10.7494/csci.2025.26.1.5600

Abstracts

For mining biomedical information, identifying names is the primary task. The complex and inconsistent naming styles of biomedical entities are the major obstacles here. Consequently, state-of-the-art accuracy in biomedical name identification is noticeably lower than in the general domain. This study applies Machine Learning and Deep Learning techniques to recognize names in a biomedical corpus. In supervised classification, a classifier is built by deriving the required statistics from a training corpus. Accordingly, the performance of the system depends primarily on the quantity and quality of the training corpus. However, manually preparing a large training dataset with feature-rich samples is laborious and time-consuming. Therefore, various techniques have been adopted in the literature to make effective use of raw corpora. We incorporate a novel Cluster Merging technique and an Attention Mechanism with BERT embedding to boost the Machine Learning and Deep Learning classifiers, respectively. The results indicate that the proposed techniques are effective and deliver a significant improvement over existing methods.

Keywords

bidirectional GRU; BERT; cluster merging; conditional random field

Publisher

Wydawnictwa AGH

Journal

Computer Science

Year

2025

Volume

Vol. 26 (1)

Pages

49–75

Physical description

Bibliography: 69 items, figures, tables, charts.

Contributors

author

Das Nilanjana

Department of Computer Science, Vidyasagar University, Midnapore, West Bengal, India

author

Dutta Rakesh

Department of Computer Science and Application, Hijli College, Kharagpur, India

author

Mondal Uttam Kumar

Department of Computer Science, Vidyasagar University, Midnapore, West Bengal, India

author

Majumder Mukta

Department of Computer Science & Technology, University of North Bengal, Siliguri, India

author

Mandal Jyotsna Kumar

Department of Computer Science & Engineering, University of Kalyani, Kalyani, West Bengal, India

Bibliography

[1] Balabantaray R.C., Sarma C., Jha M.: Document Clustering using K-Means and K-Medoids, arXiv preprint arXiv:1502.07938, 2015. doi: 10.48550/arXiv.1502.07938.
[2] Biemann C.: Chinese whispers: an efficient graph clustering algorithm and its application to natural language processing problems. In: Proceedings of TextGraphs: the first workshop on graph based methods for natural language processing, pp. 73–80, 2006. doi: 10.3115/1654758.1654774.
[3] Brown P.F., Della Pietra V.J., de Souza P.V., Lai J.C., Mercer R.L.: Class-based n-gram models of natural language, Computational Linguistics, vol. 18(4), pp. 467–480, 1992.
[4] Cherry C., Guo H.: The unreasonable effectiveness of word representations for twitter named entity recognition. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 735–745, 2015. doi: 10.3115/v1/n15-1075.
[5] Chieu H.L., Ng H.T.: Named entity recognition: a maximum entropy approach using global information. In: COLING 2002: Proceedings of the 19th International Conference on Computational Linguistics, 2002. doi: 10.3115/1072228.1072253.
[6] Collobert R., Weston J., Bottou L., Karlen M., Kavukcuoglu K., Kuksa P.: Natural language processing (almost) from scratch, Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011. http://jmlr.org/papers/v12/collobert11a.html.
[7] Deng J., Cheng L., Wang Z.: Self-attention-based BiGRU and capsule network for named entity recognition, arXiv preprint arXiv:2002.00735, 2020.
[8] Dutta R., Majumder M.: Attention-based bidirectional LSTM with embedding technique for classification of COVID-19 articles, Intelligent Decision Technologies, vol. 16(1), pp. 205–215, 2022. doi: 10.3233/idt-210058.
[9] Ekbal A., Saha S., Sikdar U.K.: Biomedical named entity extraction: some issues of corpus compatibilities, SpringerPlus, vol. 2, 601, 2013. doi: 10.1186/2193-1801-2-601.
[10] Estivill-Castro V., Yang J.: Fast and robust general purpose clustering algorithms. In: R. Mizoguchi, J. Slaney (eds.), PRICAI 2000. Topics in artificial intelligence. 6th Pacific Rim International Conference on Artificial Intelligence, Melbourne, Australia, August/September 2000. Proceedings, Lecture Notes in Artificial Intelligence. Subseries of Lecture Notes in Computer Science, vol. 1886, pp. 208–218, 2000. doi: 10.1007/3-540-44533-1_24.
[11] Finkel J.R., Dingare S., Nguyen H., Nissim M., Manning C.D., Sinclair G.: Exploiting context for biomedical entity recognition: From syntax to the web. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP), pp. 88–91, 2004. doi: 10.3115/1567594.1567614.
[12] Fraley C., Raftery A.E.: How many clusters? Which clustering method? Answers via model-based cluster analysis, The Computer Journal, vol. 41(8), pp. 578–588, 1998. doi: 10.1093/comjnl/41.8.578.
[13] Fukuda K., Tsunoda T., Tamura A., Takagi T.: Toward information extraction: identifying protein names from biological papers. In: Pacific Symposium on Biocomputing, pp. 707–718, 1998. https://psb.stanford.edu/psb-online/proceedings/psb98/abstracts/p707.html.
[14] Grishman R.: The NYU system for MUC-6 or where’s the syntax? In: Proceedings of the 6th Conference on Message Understanding, pp. 167–175, 1995. doi: 10.3115/1072399.1072415.
[15] Hammerton J.: Named entity recognition with long short-term memory. In: CONLL ’03: Proceedings of the seventh conference on natural language learning at HLT-NAACL 2003, vol. 4, pp. 172–175, 2003. doi: 10.3115/1119176.1119202.
[16] Han J., Pei J., Tong H.: Data mining: concepts and techniques, Morgan Kaufmann, 4th ed., 2022.
[17] He K., Zhang X., Ren S., Sun J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. doi: 10.1109/cvpr.2016.90.
[18] Hinton G., Deng L., Yu D., Dahl G.E., Mohamed A.r., Jaitly N., Senior A., et al.: Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Processing Magazine, vol. 29(6), pp. 82–97, 2012. doi: 10.1109/msp.2012.2205597.
[19] Huang Z., Xu W., Yu K.: Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint arXiv:1508.01991, 2015. doi: 10.48550/arXiv.1508.01991.
[20] Jain A., Yadav D., Arora A., Tayal D.K.: Named-entity recognition for Hindi language using context pattern-based maximum entropy, Computer Science, vol. 23(1), 2022. doi: 10.7494/csci.2022.23.1.3977.
[21] Katiyar A., Cardie C.: Nested named entity recognition revisited. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 861–871, 2018. doi: 10.18653/v1/n18-1079.
[22] Kazama J., Makino T., Ohta Y., Tsujii J.: Tuning support vector machines for biomedical named entity recognition. In: BioMed ’02: Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain, vol. 3, pp. 1–8, 2002. doi: 10.3115/1118149.1118150.
[23] Kim J.-D., Ohta T., Tateisi Y., Tsujii J.: GENIA corpus – a semantically annotated corpus for bio-textmining, Bioinformatics, vol. 19(suppl_1), pp. i180–i182, 2003. doi: 10.1093/bioinformatics/btg1023.
[24] Kim J.-D., Ohta T., Tsuruoka Y., Tateisi Y., Collier N.: Introduction to the bio-entity recognition task at JNLPBA. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP), pp. 70–75, 2004. doi: 10.3115/1567594.1567610.
[25] Kingma D.P., Ba J.: Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014. doi: 10.48550/arXiv.1412.6980.
[26] Kong J., Zhang L., Jiang M., Liu T.: Incorporating multi-level CNN and attention mechanism for Chinese clinical named entity recognition, Journal of Biomedical Informatics, vol. 116, p. 103737, 2021. doi: 10.1016/j.jbi.2021.103737.
[27] Lafferty J., McCallum A., Pereira F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: ICML ’01: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289, Morgan Kaufmann, 2001. https://www.cs.columbia.edu/~jebara/6772/papers/crf.pdf.
[28] Lample G., Ballesteros M., Subramanian S., Kawakami K., Dyer C.: Neural architectures for named entity recognition. In: K. Knight, A. Nenkova, O. Rambow (eds.), Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270, 2016. doi: 10.18653/v1/n16-1030.
[29] Leaman R., Gonzalez G.: BANNER: an executable survey of advances in biomedical named entity recognition. In: R.B. Altman, A.K. Dunker, L. Hunter, T. Murray, T.E. Klein (eds.), Pacific Symposium on Biocomputing 2008. Kohala Coast, Hawaii, USA, 4–8 January 2008, pp. 652–663, World Scientific, 2008.
[30] LeCun Y., Bengio Y., Hinton G.: Deep learning, Nature, vol. 521(7553), pp. 436–444, 2015. doi: 10.1038/nature14539.
[31] Li Y., Qi H., Dai J., Ji X., Wei Y.: Fully convolutional instance-aware semantic segmentation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4438–4446, 2017. doi: 10.1109/cvpr.2017.472.
[32] Liang P.: Semi-supervised learning for natural language, Ph.D. thesis, Massachusetts Institute of Technology, 2005.
[33] Lin H., Lu Y., Han X., Sun L.: Sequence-to-nuggets: Nested entity mention detection via anchor-region networks. In: A. Korhonen, D. Traum, L. Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5182–5192, 2019. doi: 10.18653/v1/p19-1511.
[34] Lin Y.F., Tsai T.H., Chou W.C., Wu K.P., Sung T.Y., Hsu W.L.: A maximum entropy approach to biomedical named entity recognition. In: BIOKDD’04: Proceedings of the 4th International Conference on Data Mining in Bioinformatics, pp. 56–61, Citeseer, 2004.
[35] Ma X., Hovy E.: End-to-end sequence labeling via Bi-directional LSTM-CNNs-CRF. In: K. Erk, N.A. Smith (eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1064–1074, 2016. doi: 10.18653/v1/p16-1101.
[36] Madhulatha T.S.: Comparison between K-Means and K-Medoids Clustering Algorithms. In: D.C. Wyld, M. Wozniak, N. Chaki, N. Meghanathan, D. Nagamalai (eds.), Advances in Computing and Information Technology: First International Conference, ACITY 2011, Chennai, India, July 15–17, 2011. Proceedings, pp. 472–481, Springer, 2011. doi: 10.1007/978-3-642-22555-0_48.
[37] Maity S., Das N., Majumder M., Dasadhikary D.R.: Word Embedding and String-Matching Techniques for Automobile Entity Name Identification from Web Reviews, EAI Endorsed Transactions on Scalable Information Systems, vol. 8(33), pp. 1–11, 2021. doi: 10.4108/eai.14-5-2021.169918.
[38] Matsuo Y., Sakaki T., Uchiyama K., Ishizuka M.: Graph-based word clustering using a web search engine. In: EMNLP ’06: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 542–550, 2006. doi: 10.3115/1610075.1610150.
[39] Miller S., Guinness J., Zamanian A.: Name tagging with word clusters and discriminative training. In: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pp. 337–342, 2004.
[40] Patra R., Saha S.K.: A kernel-based approach for biomedical named entity recognition, The Scientific World Journal, vol. 2013, 2013. doi: 10.1155/2013/950796.
[41] Patra R., Saha S.K.: A novel word clustering and cluster merging technique for named entity recognition, Journal of Intelligent Systems, vol. 28(1), pp. 15–30, 2019. doi: 10.1515/jisys-2016-0074.
[42] Pereira F., Tishby N., Lee L.: Distributional clustering of English words. In: 31st Annual Meeting of the Association for Computational Linguistics, pp. 183–190, Columbus, Ohio, USA, 1994. doi: 10.3115/981574.981598.
[43] Ponomareva N., Pla F., Molina A., Rosso P.: Biomedical named entity recognition: A poor knowledge HMM-based approach. In: Z. Kedad, N. Lammari, E. Métais, F. Meziane, Y. Rezgui (eds.), Natural Language Processing and Information Systems: 12th International Conference on Applications of Natural Language to Information Systems, NLDB 2007, Paris, France, June 27–29, 2007. Proceedings, Lecture Notes in Computer Science, vol. 4582, pp. 382–387, Springer, 2007. doi: 10.1007/978-3-540-73351-5_34.
[44] Pyysalo S., Ginter F., Heimonen J., Björne J., Boberg J., Järvinen J., Salakoski T.: BioInfer: a corpus for information extraction in the biomedical domain, BMC Bioinformatics, vol. 8, pp. 1–24, 2007. doi: 10.1186/1471-2105-8-50.
[45] Ratinov L., Roth D.: Design challenges and misconceptions in named entity recognition. In: CoNLL ’09: Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pp. 147–155, 2009. doi: 10.3115/1596374.1596399.
[46] Ritter A., Clark S., Mausam, Etzioni O.: Named entity recognition in tweets: an experimental study. In: EMNLP ’11: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1524–1534, 2011. https://aclanthology.org/D11-1141.pdf.
[47] Rössler M.: Adapting an NER-system for German to the biomedical domain. In: JNLPBA ’04: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 92–95, 2004. doi: 10.3115/1567594.1567615.
[48] Saha S., Ekbal A., Sikdar U.K.: Named entity recognition and classification in biomedical text using classifier ensemble, International Journal of Data Mining and Bioinformatics, vol. 11(4), pp. 365–391, 2015. doi: 10.1504/ijdmb.2015.067954.
[49] Saha S.K., Mitra P., Sarkar S.: A comparative study on feature reduction approaches in Hindi and Bengali named entity recognition, Knowledge-Based Systems, vol. 27, pp. 322–332, 2012. doi: 10.1016/j.knosys.2011.09.015.
[50] Saha S.K., Sarkar S., Mitra P.: Feature selection techniques for maximum entropy based biomedical named entity recognition, Journal of Biomedical Informatics, vol. 42(5), pp. 905–911, 2009. doi: 10.1016/j.jbi.2008.12.012.
[51] Salton G., Wong A., Yang C.S.: A vector space model for automatic indexing, Communications of the ACM, vol. 18(11), pp. 613–620, 1975. doi: 10.1145/361219.361220.
[52] Settles B.: Biomedical named entity recognition using conditional random fields and rich feature sets. In: JNLPBA ’04: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 107–110, 2004. doi: 10.3115/1567594.1567618.
[53] Shen D., Zhang J., Zhou G., Su J., Tan C.L.: Effective adaptation of hidden Markov model-based named entity recognizer for biomedical domain. In: BioMed ’03: Proceedings of the ACL 2003 workshop on natural language processing in biomedicine, vol. 13, pp. 49–56, 2003. doi: 10.3115/1118958.1118965.
[54] Soni K.G., Patel A.: Comparative Analysis of K-means and K-medoids Algorithm on IRIS Data, International Journal of Computational Intelligence Research, vol. 13(5), pp. 899–906, 2017. https://www.ripublication.com/ijcir17/ijcirv13n5_21.pdf.
[55] Tang B., Cao H., Wang X., Chen Q., Xu H.: Evaluating word representation features in biomedical named entity recognition tasks, BioMed Research International, vol. 2014, 2014. doi: 10.1155/2014/240403.
[56] Toh Z., Chen B., Su J.: Improving twitter named entity recognition using word representations. In: Proceedings of the Workshop on Noisy User-generated Text, pp. 141–145, 2015. doi: 10.18653/v1/w15-4321.
[57] Tsai T.h., Chou W.C., Wu S.H., Sung T.Y., Hsiang J., Hsu W.L.: Integrating linguistic knowledge into a conditional random field framework to identify biomedical named entities, Expert Systems with Applications, vol. 30(1), pp. 117–128, 2006. doi: 10.1016/j.eswa.2005.09.072.
[58] Turian J., Ratinov L.A., Bengio Y.: Word representations: A simple and general method for semi-supervised learning. In: Proceedings of the 48th annual meeting of the association for computational linguistics, pp. 384–394, 2010.
[59] Ushioda A.: Hierarchical clustering of words and application to NLP tasks. In: E. Ejerhed, I. Dagan (eds.), Fourth Workshop on Very Large Corpora, University of Copenhagen, Copenhagen, 1996. https://aclanthology.org/W96-0103.pdf.
[60] Uszkoreit J., Brants T.: Distributed word clustering for large scale class-based language modeling in machine translation. In: ACL 2008. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15–20, 2008, Columbus, Ohio, USA, pp. 755–762, 2008.
[61] Vapnik V.N.: The nature of statistical learning theory, Springer, New York, NY, 1999. doi: 10.1007/978-1-4757-3264-1.
[62] Wan J., Ru D., Zhang W., Yu Y.: Nested named entity recognition with span-level graphs. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 892–903, 2022. doi: 10.18653/v1/2022.acl-long.63.
[63] Xia C., Zhang C., Yang T., Li Y., Du N., Wu X., Fan W., et al.: Multi-grained named entity recognition. In: A. Korhonen, D. Traum, L. Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1430–1440, 2019. doi: 10.18653/v1/p19-1138.
[64] Yan H., Gui T., Dai J., Guo Q., Zhang Z., Qiu X.: A unified generative framework for various NER subtasks. In: C. Zong, F. Xia, W. Li, R. Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5808–5822, 2021. doi: 10.18653/v1/2021.acl-long.451.
[65] Yeh A., Morgan A., Colosimo M., Hirschman L.: BioCreAtIvE task 1A: gene mention finding evaluation, BMC Bioinformatics, vol. 6, S2 (2005), 2005. doi: 10.1186/1471-2105-6-s1-s2.
[66] Yu J., Bohnet B., Poesio M.: Named entity recognition as dependency parsing. In: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6470–6476, 2020. doi: 10.18653/v1/2020.acl-main.577.
[67] Zhou G., Su J.: Named entity recognition using an HMM-based chunk tagger. In: P. Isabelle, E. Charniak, D. Lin (eds.), ACL ’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 473–480, 2002. doi: 10.3115/1073083.1073163.
[68] Zhou G., Su J.: Exploring deep knowledge resources in biomedical name recognition. In: JNLPBA ’04: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP), pp. 96–99, 2004. doi: 10.3115/1567594.1567616.
[69] Zhou H., Ning S., Liu Z., Lang C., Liu Z., Lei B.: Knowledge-enhanced biomedical named entity recognition and normalization: application to proteins and genes, BMC Bioinformatics, vol. 21(1), 35, 2020. doi: 10.1186/s12859-020-3375-3.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-5a30e2e8-a427-4d08-a32d-d8926160cb83