Search results - BazTech

1

Exploiting BERT for malformed segmentation detection to improve scientific writings

Halawa Abdelrahman, Gamalel-Din Shehab, Nasr Abdurrahman

Applied Computer Science | 2023 | Vol. 19, no. 2 | 126-141

EN

Writing well-structured scientific documents, such as articles and theses, is vital for comprehending a document's argumentation and understanding its message. It also affects the efficiency of, and the time required for, studying the document. Proper document segmentation also yields better results when automated Natural Language Processing (NLP) algorithms are applied, including summarization and other information retrieval and analysis functions. Unfortunately, inexperienced writers, such as young researchers and graduate students, often struggle to produce well-structured professional documents. Their writing frequently exhibits improper segmentation or lacks semantically coherent segments, a phenomenon referred to as "mal-segmentation." Examples of mal-segmentation include improper paragraph or section divisions and abrupt transitions between sentences and paragraphs. This research addresses mal-segmentation in scientific writing by introducing an automated method for detecting it, using Sentence Bidirectional Encoder Representations from Transformers (sBERT) as the encoding mechanism. The experimental results are promising for the detection of mal-segmentation with the sBERT technique.
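The abstract does not describe the detection procedure itself, so the following is only a minimal sketch of one plausible reading: encode each sentence, compare neighbouring sentences by cosine similarity, and flag weak transitions. The `embed` function here is a bag-of-words stand-in chosen so the example runs without external models; it is an assumption, not sBERT, and the names `flag_weak_transitions` and `threshold` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Stand-in encoder (assumption): bag-of-words counts.
    A sentence-embedding model such as sBERT would replace this."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def flag_weak_transitions(sentences, threshold=0.1):
    """Return each index i where sentences i and i+1 share little content."""
    vectors = [embed(s) for s in sentences]
    return [i for i in range(len(vectors) - 1)
            if cosine(vectors[i], vectors[i + 1]) < threshold]


paragraph = [
    "The encoder maps each sentence to a vector.",
    "Each vector is compared with the vector of the next sentence.",
    "Bananas are rich in potassium.",
]
weak = flag_weak_transitions(paragraph)  # the last sentence breaks the flow
```

With a real sentence encoder the threshold would need to be tuned on annotated data; the value used here is arbitrary.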

2

Design and analysis of a lean interface for Sanskrit corpus annotation

Goyal P., Huet G.

Journal of Language Modelling | 2016 | Vol. 4, no. 2 | 145-182

EN

We describe an innovative computer interface designed to assist annotators in the efficient selection of segmentation solutions for proper tagging of Sanskrit corpora. The proposed solution uses a compact representation of the shared forest of all segmentations. The main idea is to represent the union of all segmentations, abstracting from the sandhi rules used and aligning with the input sentence. We show that this representation provides an exponential saving in both space and time. The segmentation methodology is lexicon-directed. When the lexicon does not fully cover the corpus vocabulary, some chunks of the input may fail to be recognized. We designed a lexicon acquisition facility, which remedies this incompleteness and makes the interface more robust. The interface has been implemented and is currently being applied to the annotation of the Sanskrit Library corpus. Evaluation over 1,500 sentences from the Pañcatantra text shows the effectiveness of the proposed interface on real corpus data.
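The exponential saving mentioned above can be illustrated with a toy example. The sketch below is not the authors' data structure: it ignores sandhi entirely (an assumption) and uses a hypothetical two-word lexicon. It stores, for each input offset, the lexicon words that match there, and counts all full segmentations by dynamic programming without listing them.

```python
def segmentation_edges(text, lexicon):
    """Shared graph: for each start offset, the lexicon words matching there."""
    return {i: [w for w in lexicon if text.startswith(w, i)]
            for i in range(len(text))}


def count_segmentations(text, lexicon):
    """Count full segmentations by dynamic programming over the shared graph."""
    edges = segmentation_edges(text, lexicon)
    ways = [0] * (len(text) + 1)
    ways[len(text)] = 1  # the empty suffix has exactly one segmentation
    for i in range(len(text) - 1, -1, -1):
        # every word matching at offset i continues with any segmentation
        # of the remaining suffix
        ways[i] = sum(ways[i + len(w)] for w in edges[i])
    return ways[0]


lexicon = {"a", "aa"}  # hypothetical toy lexicon, no sandhi
text = "a" * 20
edge_count = sum(len(ws) for ws in segmentation_edges(text, lexicon).values())
total = count_segmentations(text, lexicon)
```

For this input the shared graph holds 39 word occurrences while representing 10,946 distinct segmentations, which is the kind of gap a compact, input-aligned representation exploits.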

3

Methodology for the evaluation of the algorithms for text segmentation based on errors type

Brodic D.

Przegląd Elektrotechniczny | 2012 | Vol. 88, no. 1b | 259-263

EN

Text segmentation is a key element of the optical character recognition process. Hence, the testing procedure for text segmentation algorithms is of significant importance. Previous work relies mainly on text databases as templates, which are used both for testing and for evaluating text segmentation algorithms. However, because of inconsistencies in this process, a methodology for the experiments is required. This manuscript proposes a methodology for evaluating text segmentation algorithms based on error types. It is built on various multiline text samples related to text segmentation. The final result is obtained by comparative analysis of cross-linked data. Its main advantage is its suitability for different types of scripts.

PL (translated)

Text segmentation is a key element of the optical character recognition process. All previous work deals mainly with a text database as a template. Such databases are used both for testing and for evaluating the text segmentation algorithm. However, inconsistencies occur in such an algorithm. The paper presents a methodology for evaluating a text segmentation algorithm based on error types. The experiments were carried out on various multiline text samples. The final result is obtained by comparative analysis of the data.

4

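Information system on the hydrant network for the national rescue and firefighting system: a text segmentation method and its evaluation [System informacyjny na temat sieci hydrantów dla krajowego systemu ratowniczo-gaśniczego: metoda segmentacji tekstu i jej ocena]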

Mirończuk M., Maciak T.

Metody Informatyki Stosowanej | 2011 | no. 4 | 45-66

EN

This article describes the design process of a rule-based segmenter. It also describes the procedure for designing a reference set of segments. The numerical experiment and the reference collection of segments that was created are described. The designed segmenter was used to extract segments from fire service reports. The segmentation results were compared with those of other text segmentation solutions.

5

Multistage semi-automatic text image segmentation for training set acquisition in handwriting recognition

Sas J.

Systems Science | 2008 | Vol. 34, no. 1 | 107-126

EN

The paper proposes a complete method for segmenting a text image into images of individual characters. The ultimate aim of the segmentation process is to prepare a set of correctly labeled character samples that can be used to train the character classifier applied as a component of a handwritten word recognizer. The proposed method consists of two stages. In the first stage, the text image is divided into lines, and the lines are then segmented into words. In this phase, the known spelling of the text in the image is used so as to obtain as many segments as there are words in the text. Information about the expected width of known words is also utilized. In the second stage, the obtained images of known words are segmented into individual characters. A multiphase procedure is applied. It first segments individual words independently, using estimates of character widths obtained by analysis of the complete text corpus. Then a global text segmentation is elaborated, which maximizes the similarity measures of the samples extracted for all alphabet characters. A genetic algorithm is applied in this phase. Finally, the segmentation variants represented by chromosomes in the terminal population of the genetic algorithm are locally refined, and the most dissimilar samples in the sets corresponding to the alphabet characters are rejected. The experiments showed that the accuracy of handwriting recognition achieved by recognizers trained with the training set obtained with the proposed method is close to the accuracy achievable with a training set prepared by a human expert.
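The first phase of the second stage, segmenting a known word using character width estimates, might look roughly like the sketch below. The abstract does not give the actual procedure, so this is an assumption: boundaries are placed in proportion to the expected character widths, scaled to the measured word width. The function name `initial_cuts` and the width values are hypothetical, and the later genetic-algorithm and local refinement phases are not shown.

```python
def initial_cuts(word, word_width, char_widths):
    """Place character boundaries in proportion to expected character widths.

    word        -- known spelling of the word in the image
    word_width  -- measured width of the word image, in pixels
    char_widths -- estimated typical width of each character (assumed input)
    """
    expected = [char_widths[c] for c in word]
    scale = word_width / sum(expected)  # stretch estimates to the real width
    cuts, x = [], 0.0
    for w in expected[:-1]:  # no cut is needed after the last character
        x += w * scale
        cuts.append(round(x))
    return cuts


widths = {"i": 4, "m": 12, "n": 8}  # hypothetical width estimates
cuts = initial_cuts("min", 48, widths)
```

Here the expected widths sum to 24 pixels and the word image is 48 pixels wide, so each estimate is doubled and the boundaries fall at x = 24 and x = 32.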