Exploiting BERT for malformed segmentation detection to improve scientific writings

Halawa, Abdelrahman; Gamalel-Din, Shehab; Nasr, Abdurrahman

doi:10.35784/acs-2023-20

Abstract

Writing well-structured scientific documents, such as articles and theses, is vital for readers to comprehend a document's argumentation and understand its messages. Structure also affects the efficiency and time required to study the document. In addition, proper document segmentation yields better results when applying automated Natural Language Processing (NLP) algorithms, including summarization and other information retrieval and analysis functions. Unfortunately, inexperienced writers, such as young researchers and graduate students, often struggle to produce well-structured professional documents. Their writing frequently exhibits improper segmentation or lacks semantically coherent segments, a phenomenon referred to as "mal-segmentation." Examples of mal-segmentation include improper paragraph or section divisions and abrupt transitions between sentences and paragraphs. This research addresses mal-segmentation in scientific writing by introducing an automated method for detecting mal-segmentations that uses Sentence Bidirectional Encoder Representations from Transformers (sBERT) as an encoding mechanism. The experimental results are promising for the detection of mal-segmentation using the sBERT technique.

Keywords

NLP; text segmentation; mal-segmentation; BERT

Publisher

Polskie Towarzystwo Promocji Wiedzy
Lublin University of Technology

Journal

Applied Computer Science

Year

2023

Volume

Vol. 19, no. 2

Pages

126-141

Physical description

Bibliography: 27 items, figures, tables.

Contributors

author

Halawa Abdelrahman

ahalawa@azhar.edu.eg

Al-Azhar University, Faculty of Engineering, Systems and Computer, Egypt

https://orcid.org/0009-0004-7107-1049

author

Gamalel-Din Shehab

drshehabg@yahoo.com

Al-Azhar University, Faculty of Engineering, Systems and Computer, Egypt

author

Nasr Abdurrahman

anasr@azhar.edu.eg

Al-Azhar University, Faculty of Engineering, Systems and Computer, Egypt

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-b1428454-1e68-4a22-86c0-d208fb66a632