Wyniki wyszukiwania - BazTech

1

Bag of words and embedding text representation methods for medical article classification

Cichosz Paweł

International Journal of Applied Mathematics and Computer Science

|

2023

|

Vol. 33, no. 4

603--621

EN

Text classification has become a standard component of automated systematic literature review (SLR) solutions, where articles are classified as relevant or irrelevant to a particular literature study topic. Conventional machine learning algorithms for tabular data which can learn quickly from not necessarily large and usually imbalanced data with low computational demands are well suited to this application, but they require that the text data be transformed to a vector representation. This work investigates the utility of different types of text representations for this purpose. Experiments are presented using the bag of words representation and selected representations based on word or text embeddings: word2vec, doc2vec, GloVe, fastText, Flair, and BioBERT. Four classification algorithms are used with these representations: a naive Bayes classifier, logistic regression, support vector machines, and random forest. They are applied to datasets consisting of scientific article abstracts from systematic literature review studies in the medical domain and compared with the pre-trained BioBERT model fine-tuned for classification. The obtained results confirm that the choice of text representation is essential for successful text classification. It turns out that, while the standard bag of words representation is hard to beat, fastText word embeddings make it possible to achieve roughly the same level of classification quality with the added benefit of much lower dimensionality and capability of handling out-of-vocabulary words. More refined embeddings methods based on deep neural networks, while much more demanding computationally, do not appear to offer substantial advantages for the classification task. The fine-tuned BioBERT classification model performs on par with conventional algorithms when they are coupled with their best text representation methods.

2

A contemporary multi-objective feature selection model for depression detection using a hybrid pBGSK optimization algorithm

Kavi Priya Santhosam, Pon Karthika Kasirajan

International Journal of Applied Mathematics and Computer Science

|

2023

|

Vol. 33, no. 1

117--131

EN

Depression is one of the primary causes of global mental illnesses and an underlying reason for suicide. The user generated text content available in social media forums offers an opportunity to build automatic and reliable depression detection models. The core objective of this work is to select an optimal set of features that may help in classifying depressive contents posted on social media. To this end, a novel multi-objective feature selection technique (EFS-pBGSK) and machine learning algorithms are employed to train the proposed model. The novel feature selection technique incorporates a binary gaining-sharing knowledge-based optimization algorithm with population reduction (pBGSK) to obtain the optimized features from the original feature space. The extensive feature selector (EFS) is used to filter out the excessive features based on their ranking. Two text depression datasets collected from Twitter and Reddit forums are used for the evaluation of the proposed feature selection model. The experimentation is carried out using naive Bayes (NB) and support vector machine (SVM) classifiers for five different feature subset sizes (10, 50, 100, 300 and 500). The experimental outcome indicates that the proposed model can achieve superior performance scores. The top results are obtained using the SVM classifier for the SDD dataset with 0.962 accuracy, 0.929 F1 score, 0.0809 log-loss and 0.0717 mean absolute error (MAE). As a result, the optimal combination of features selected by the proposed hybrid model significantly improves the performance of the depression detection system.

3

The impact of green finance development on ecological protection based on machine learning

Zhang Ting

Ecological Chemistry and Engineering. S

|

2023

|

Vol. 30, nr 1

103--110

EN

In the context of today’s green development, it is the core task of the financial sector at all levels to enhance the utilisation of resources and to guide the high-quality development of industries, especially to channel funds originally gathered in high-pollution and energy-intensive industries to sectors with green and high-technology, to achieve the harmonious development of the economy and the resources and environment. This paper proposes a green financial text classification model based on machine learning. The model consists of four modules: the input module, the data analysis module, the data category module, and the classification module. Among them, the data analysis module and the data category module extract the data information of the input information and the green financial category information respectively, and the two types of information are finally fused by the attention mechanism to achieve the classification of green financial data in financial data. Extensive experiments are conducted on financial text datasets collected from the Internet to demonstrate the superiority of the proposed green financial text classification method.

4

Impact of n-stage latent Dirichlet allocation on analysis of headline classification

Guven Zekeriya Anil, Diri Banu, Cakaloglu Tolgahan

Computer Science

|

2022

|

T. 23 (3)

375--394

EN

Data analysis becomes difficult when the amount of the data increases. More specifically, extracting meaningful insights from this vast amount of data and grouping it based on its shared features without human intervention requires advanced methodologies. There are topic-modeling methods that help overcome this problem in text analyses for downstream tasks (such as sentiment analysis, spam detection, and news classification). In this research, we benchmark several classifiers (namely, random forest, AdaBoost, naive Bayes, and logistic regression) using the classical latent Dirichlet allocation (LDA) and n-stage LDA topic-modeling methods for feature extraction in headline classification. We ran our experiments on three and five classes of publicly available Turkish and English datasets. We have demonstrated that, as a feature extractor, n-stage LDA obtains state-of-the-art performance for any downstream classifier. It should also be noted that random forest was the most successful algorithm for both datasets.

5

Detection of source code in internet texts using automatically generated machine learning models

Badurowicz Marcin

Applied Computer Science

|

2022

|

Vol. 18, no 1

89--98

EN

In the paper, the authors are presenting the outcome of web scraping software allowing for the automated classification of source code. The software system was prepared for a discussion forum for software developers to find fragments of source code that were published without marking them as code snippets. The analyzer software is using a Machine Learning binary classification model for differentiating between a programming language source code and highly technical text about software. The analyzer model was prepared using the AutoML subsystem without human intervention and fine-tuning and its accuracy in a described problem exceeds 95%. The analyzer based on the automatically generated model has been deployed and after the first year of continuous operation, its False Positive Rate is less than 3%. The similar process may be introduced in document management in software development process, where automatic tagging and search for code or pseudo-code may be useful for archiving purposes.

6

K-Graph: knowledgeable graph for text documents

Mittal Varsha, Gangodkar Durgaprasad, Pant Bhaskar

Journal of KONBiN

|

2021

|

Vol. 51, iss. 1

73--89

EN

Graph databases are applied in many applications, including science and business, due to their low-complexity, low-overheads, and lower time-complexity. The graph-based storage offers the advantage of capturing the semantic and structural information rather than simply using the Bag-of-Words technique. An approach called Knowledgeable graphs (K-Graph) is proposed to capture semantic knowledge. Documents are stored using graph nodes. Thanks to weighted subgraphs, the frequent subgraphs are extracted and stored in the Fast Embedding Referral Table (FERT). The table is maintained at different levels according to the headings and subheadings of the documents. It reduces the memory overhead, retrieval, and access time of the subgraph needed. The authors propose an approach that will reduce the data redundancy to a larger extent. With realworld datasets, K-graph’s performance and power usage are threefold greater than the current methods. Ninety-nine per cent accuracy demonstrates the robustness of the proposed algorithm.

7

Model of the text classification system using fuzzy sets

Salahor Dmytro, Smołka Jakub

Journal of Computer Sciences Institute

|

2021

|

Vol. 19

144--150

PL

Klasyfikacja tematyki pracy według słów kluczowych jest aktualnym i ważnym zadaniem. W artykule opisano algorytmy klasyfikowania słów kluczowych według obszaru tematycznego. Model został opracowany przy użyciu dwóch algorytmów i przetestowany na danych testowych. Uzyskane wyniki porównano z wynikami innych istniejących algorytmów odpowiednich do tego zadania. Uzyskane wyniki modelu analizowano. Algorytm ten może być stosowany w zadaniach rzeczywistych.

EN

Classification of work’s subject area by keywords is an actual and important task. This article describes algorithms for classifying keywords by subject area. A model was developed using both algorithms and tested on test data. The results were compared with the results of other existing algorithms suitable for these tasks. The obtained results of the model were analysed. This algorithm can be used in real-life tasks.

8

Categorization of persons based on their mentions in Polish news texts

Pachocki Maciej, Wróblewska Anna

Journal of Automation Mobile Robotics and Intelligent Systems

|

2020

|

Vol. 14, No. 2

42--49

EN

Our goal described in this paper was to design, imple‐ ment and test a method of categorization of mentions of persons in Polish news texts. We gathered and classified the input data in order to measure the accuracy of the method. Train and test data were constructed by using lists of persons collected from YAGO knowledge base and Polish Wikipedia. During tests the efficiency of categori‐ zation depending on different representations of a per‐ son was studied. Experiments were executed on our and a chosen solution from literature. The results are shown and discussed in the paper.

9

Tackling the Problem of Class Imbalance in Multi-class Sentiment Classification: An Experimental Study

Lango Mateusz

Foundations of Computing and Decision Sciences

|

2019

|

Vol. 44, No. 2

151--178

EN

Sentiment classification is an important task which gained extensive attention both in academia and in industry. Many issues related to this task such as handling of negation or of sarcastic utterances were analyzed and accordingly addressed in previous works. However, the issue of class imbalance which often compromises the prediction capabilities of learning algorithms was scarcely studied. In this work, we aim to bridge the gap between imbalanced learning and sentiment analysis. An experimental study including twelve imbalanced learning preprocessing methods, four feature representations, and a dozen of datasets, is carried out in order to analyze the usefulness of imbalanced learning methods for sentiment classification. Moreover, the data difficulty factors - commonly studied in imbalanced learning - are investigated on sentiment corpora to evaluate the impact of class imbalance.

10

Analysis of data pre-processing methods for sentiment analysis of reviews

Parlar Tuba, Ozel Selma, Song Fei

Computer Science

|

2019

|

Vol. 20 (1)

123--141

EN

The goals of this study are to analyze the effects of data pre-processing methods for sentiment analysis and determine which of these pre-processing methods (and their combinations) are effective for English as well as for an agglutinative language like Turkish. We also try to answer the research question of whether there are any differences between agglutinative and non-agglutinative languages in terms of pre-processing methods for sentiment analysis. We find that the performance results for the English reviews are generally higher than those for the Turkish reviews due to the differences between the two languages in terms of vocabularies, writing styles, and agglutinative property of the Turkish language.

11

Experiments with language combinatorics in text classification: lessons learned and future implications

Ptaszynski M., Masui F.

Technical Transactions

|

2017

|

Vol. 11(114)

183--197

EN

This paper presents a meta-analysis of experiments performed with language combinatorics (LC), a novel language model generation and feature extraction method based on combinatorial manipulations of sentence elements (e.g., words). Along recent years LC has been applied to a number of text classification tasks, such as affect analysis, cyberbullying detection or future reference extraction. We summarize two of the most extensive experiments and discuss general implications for future implementations of combinatorial language model.

PL

W niniejszym artykule przedstawiono metaanalizę badań przeprowadzonych za pomocą kombinatoryki językowej (language combinatorics, LC), nowej metody generacji modelu języka i ekstrakcji cech, opartej o kombinacyjne manipulacje na elementach zdań (np. słowa). W trakcie ostatnich lat LC została zastosowana do wielu zadań z dziedziny klasyfikacji tekstu, takich jak analiza afektu, wykrywanie cyberagresji lub ekstrakcja odniesień do przyszłych wydarzeń. W niniejszym artykule podsumowujemy dwa z najbardziej obszernych doświadczeń i omawiamy ogólne implikacje dotyczące przyszłych zastosowań kombinatoryjnego modelu języka.

12

Wykorzystanie akceleracji sprzętowej przy implementacji metryk podobieństwa tekstów

Iwanecki Ł., Koryciak S., Dąbrowska-Boruch A., Wiatr K.

Pomiary Automatyka Kontrola

|

2014

|

R. 60, nr 7

426--428

PL

Artykuł opisuje badania na temat klasyfikatorów tekstów. Zadanie polegało na zaprojektowaniu akceleratora sprzętowego, który przyspieszyłby proces klasyfikacji tekstów pod względem znaczeniowym. Projekt został podzielony na dwie części. Celem części pierwszej było zaproponowanie sprzętowej implementacji algorytmu realizującego metrykę do obliczania podobieństwa dokumentów. W drugiej części zaprojektowany został cały systemem akceleratora sprzętowego. Kolejnym etapem projektowym jest integracja modelu metryki z system akceleracji.

EN

The aim of this project is to propose a hardware accelerating system to improve the text categorization process. Text categorization is a task of categorizing electronic documents into the predefined groups, based on the content. This process is complex and requires a high performance computing system and a big number of comparisons. In this document, there is suggested a method to improve the text categorization using the FPGA technology. The main disadvantage of common processing systems is that they are single-threaded – it is possible to execute only one instruction per a single time unit. The FPGA technology improves concurrence. In this case, hundreds of big numbers may be compared in one clock cycle. The whole project is divided into two independent parts. Firstly, a hardware model of the required metrics is implemented. There are two useful metrics to compute a distance between two texts. Both of them are shown as equations (1) and (2). These formulas are similar to each other and the only difference is the denominator. This part results in two hardware models of the presented metrics. The main purpose of the second part of the project is to design a hardware accelerating system. The system is based on a Xilinx Zynq device. It consists of a Cortex-A9 ARM processor, a DMA controller and a dedicated IP Core with the accelerator. The block diagram of the system is presented in Fig.4. The DMA controller provides duplex transmission from the DDR3 memory to the accelerating unit omitting a CPU. The project is still in development. The last step is to integrate the hardware metrics model with the accelerating system.

13

Classification of text documents by using expanded terms in Latent Semantic Analysis

Śmiałkowska B., Gibert M.

Theoretical and Applied Informatics

|

2013

|

Vol. 25, No. 3-4

239--250

EN

In this article attention is paid to improving the quality of text document classification. The common techniques of analysis of text documents used in classification are shown and the weakness of these methods arc stressed. Discussed here is the integration of quantitative and qualitative methods, which is increasing the quality of classification. In the proposed approach the expanded terms, obtained by using information patterns are used in the Latent Semantic Analysis. Finally empirical research is presented and based upon the quality measures of the text document classification, the effectiveness of the proposed approach is proved.

PL

W artykule skoncentrowano się na poprawie jakości klasyfikacji dokumentów tekstowych. Zostały przybliżone najpopularniejsze techniki analizy dokumentów tekstowych wykorzystywanych w klasyfikacji. Zwrócono uwagę na słabe strony opisanych technik. Omówiono możliwość integracji metod ilościowych i jakościowych analizy tekstu i jej wpływ na poprawę jakości klasyfikacji. Zaproponowano rozwiązanie, w którym rozbudowane wyrażenia otrzymane za pomocą wzorców informacyjnych są wykorzystywane w niejawnej analizie semantycznej. Ostatecznie w oparciu o miary jakości klasyfikacji dokumentów tekstowych zaprezentowano wyniki badań testowych, które potwierdzają skuteczność zaproponowanego rozwiązania.

14

A Hybrid Algorithm for Text Classification Problem

Liu X., Fu H.

Przegląd Elektrotechniczny

|

2012

|

R. 88, nr 1b

8-11

EN

This paper investigates a novel algorithm-EGA-SVM for text classification problem by combining support vector machines (SVM) with elitist genetic algorithm (GA). The new algorithm uses EGA, which is based on elite survival strategy, to optimize the parameters of SVM. Iris dataset and one hundred pieces of news reports in Chinese news are chosen to compare EGA-SVM, GA-SVM and traditional SVM. The results of numerical experiments show that EGA-SVM can improve classification performance effectively than the other algorithms. This text classification algorithm can be extended easily to apply to literatures in the field of electrical engineering.

PL

W artykule przedstawiono nowy algorytm klasyfikacji tekstu bazujący na mechanizmie SVM (support vector machine) I algorytmie genetycznym. Algorytm zbadano na podstawie bazy danych Iris i setek innych chińskich przykładów. Algorytm wykazał swoją skuteczność. Może być on łatwo rozszerzony na analizę tekstów w inżynierii elektrycznej.

15

A semi-automated approach to building text summarisation classifiers

Garcia-Constantino M., Coenen F., Noble P. J., Radford A., Setzkorn C.

Journal of Theoretical and Applied Computer Science

|

2012

|

Vol. 6, nr 4

7-23

EN

An investigation into the extraction of useful information from the free text element of questionnaires, using a semi-automated summarisation extraction technique, is described. The summarisation technique utilises the concept of classification but with the support of domain/human experts during classifier construction. A realisation of the proposed technique, SARSET (Semi-Automated Rule Summarisation Extraction Tool), is presented and evaluated using real questionnaire data. The results of this evaluation are compared against the results obtained using two alternative techniques to build text summarisation classifiers. The first of these uses standard rule-based classifier generators, and the second is founded on the concept of building classifiers using secondary data. The results demonstrate that the proposed semi-automated approach outperforms the other two approaches considered.

16

A Review of Artificial Intelligence Algorithms in Document Classification

Bilski A.

International Journal of Electronics and Telecommunications

|

2011

|

Vol. 57, No. 3

263-270

EN

With the evolution of Internet, the meaning and accessibility of text documents and electronic information has increased. The automatic text categorization methods became essential in the information organization and data mining process. A proper classification of e-documents, various Internet information, blogs, emails and digital libraries requires application of data mining and machine learning algorithms to retrieve the desired data. The following paper describes the most important techniques and methodologies used for the text classification. Advantages and effectiveness of contemporary algorithms are compared and their most notable applications presented.

17

Text classification using word sequences

Chudzian P.

Studia Informatica : systems and information technology

|

2008

|

Vol. 1(10)

75--85

EN

The article discusses the use of word sequences in text classification. As opposed to ngrams, word sequences are not of a fixed length and therefore allow the classifier to obtain flexibility necessary to operate on documents collected from various sources. Presented classifier is built upon the suffix tree structure which enables word sequences to take part in classification process. During classification, both single words and longer sequences are taken into account and have impact on the category assignment with respect to their frequency and length. The Suffix Tree Classifier and well known Naive Bayes Classifier are compared and their properties are discussed. Obtained results show that incorporating word sequences into text classification can increase accuracy and reveal some interesting relations between maximal length of used sequences and classifier's error rate.

18

Speech command based application enabling Internet navigation

Mięsikowska M.

Pomiary Automatyka Kontrola

|

2007

|

R. 53, nr 5

87-89

EN

The paper presents an attempt to create an application enabling the user to surf much easier the resources of the Internet with the help of voice commands, as well as to classify and arrange the browsed information. The application has two basic modules which enable browsing the information on the Internet. The first navigation module processes websites, isolates navigation elements , such as links to other websites, from them and gives an identification name to the elements, which enables the user to pronounce voice commands. The website is presented to the user in a practically original form. The second module also processes websites, isolating navigation elements from them. The only difference in operation of the both modules is the mode of processing the website and its final presentation. The second module isolates from the elements vocabulary, which makes it possible to classify the information included in the website, this way acquiring and displaying, an ordered set of navigation elements. The application was implemented in Java language with the use of Oracle software. For the system of recognition and understanding of speech the Sphinx 4 tool was used.

PL

W tej pracy przedstawiono próbę stworzenia aplikacji umożliwiającej swobodniejszą nawigację użytkownika wśród zasobów Internetu za pomocą poleceń mowy, klasyfikację oraz uporządkowanie przeglądanej informacji. Aplikacja posiada dwa zasadnicze moduły, przy pomocy których możliwe jest przeglądanie informacji w Internecie. Pierwszy moduł nawigacji, przetwarza strony internetowe, wyodrębnia z nich elementy nawigacyjne takie jak odnośniki do innych stron, oraz nadaje elementom identyfikacyjną nazwę, dzięki której użytkownik może wydawać słowne polecenia. Strona internetowa wyświetlona zostaje użytkownikowi w niemalże oryginalnej postaci. Drugi moduł również przetwarza strony internetowe, wyodrębniając z nich elementy nawigacyjne. Jedyną różnicą w działaniu obu modułów jest sposób przetwarzania strony i ostatecznej jej reprezentacji. Drugi moduł wyodrębnia z elementów słownictwo, dzięki któremu możemy sklasyfikować informację znajdującą się na stronie, uzyskując i wyświetlając w ten sposób uporządkowany zbiór elementów nawigacyjnych. Aplikacja zaimplementowana została w języku Java z wykorzystaniem oprogramowania Oracle. W przypadku systemu rozpoznawania mowy zastosowano narzędzie Sphinx-4.

19

Rough set Based Ensemble Classifier for Web Page Classification

Saha S., Murthy C.A., Pal S.K.

Fundamenta Informaticae

|

2007

|

Vol. 76, nr 1-2

171-187

EN

Combining the results of a number of individually trained classification systems to obtain a more accurate classifier is a widely used technique in pattern recognition. In this article, we have introduced a rough set based meta classifier to classify web pages. The proposed method consists of two parts. In the first part, the output of every individual classifier is considered for constructing a decision table. In the second part, rough set attribute reduction and rule generation processes are used on the decision table to construct a meta classifier. It has been shown that (1) the performance of the meta classifier is better than the performance of every constituent classifier and, (2) the meta classifier is optimal with respect to a quality measure defined in the article. Experimental studies show that the meta classifier improves accuracy of classification uniformly over some benchmark corpora and beats other ensemble approaches in accuracy by a decisive margin, thus demonstrating the theoretical results. Apart from this, it reduces the CPU load compared to other ensemble classification techniques by removing redundant classifiers from the combination.

20

Rare class text categorization with SVM ensemble

Silva C., Ribeiro B.

Przegląd Elektrotechniczny

|

2006

|

R. 82, nr 1

28-31

EN

Text Classification is the assignment of a class from a predetermined set to a new document. In real world applications the number of positive examples for most classes is limited, while the overall number of examples is huge. In this setting classifiers' performance can experience a not so graceful degradation, especially where false negatives are concerned. To handle this problem, we propose a committee of several SVM, where the learning strategy uses the separating margin as differentiating factor on positive classifications. While enabling robustness, the method improves performance by correcting errors of one classifier using the accurate output of others. We demonstrate the practicality and effectiveness of the method by simulation results on Reuters-21578 data set.

PL

Kategoryzacja tekstu to przypisanie nowego tekstu do odpowiedniej kategorii ze zdefiniowanego wcześniej zbioru. W praktycznych zastosowaniach liczba wzorców dla większości klas jest ograniczona, podczas gdy liczba wszystkich danych wejściowych jest ogromna. Przy takich właściwościach problemu, zbudowanie klasyfikatora dobrze spełniającego swoje zadanie nie jest trywialne. Aby rozwiązać ten problem, zaproponowano zespól kilku struktur SVM, w których uczenie opiera się na maksymalizacji marginesu separacji pomiędzy dwiema różnymi klasami. Metoda wprowadza odporność poprzez korekcję wyjścia jednego z klasyfikatorów dzięki wykorzystaniu informacji z wyjść pozostałych. Skuteczność metody zilustrowano na przykładzie symulacji dla zbioru danych Reuters-21578.