Search results - BazTech

1

Bag of words and embedding text representation methods for medical article classification

Cichosz Paweł

International Journal of Applied Mathematics and Computer Science

|

2023

|

Vol. 33, no. 4

603--621

EN

Text classification has become a standard component of automated systematic literature review (SLR) solutions, where articles are classified as relevant or irrelevant to a particular literature study topic. Conventional machine learning algorithms for tabular data, which can learn quickly from not necessarily large and usually imbalanced data with low computational demands, are well suited to this application, but they require that the text data be transformed to a vector representation. This work investigates the utility of different types of text representations for this purpose. Experiments are presented using the bag of words representation and selected representations based on word or text embeddings: word2vec, doc2vec, GloVe, fastText, Flair, and BioBERT. Four classification algorithms are used with these representations: a naive Bayes classifier, logistic regression, support vector machines, and random forest. They are applied to datasets consisting of scientific article abstracts from systematic literature review studies in the medical domain and compared with the pre-trained BioBERT model fine-tuned for classification. The obtained results confirm that the choice of text representation is essential for successful text classification. It turns out that, while the standard bag of words representation is hard to beat, fastText word embeddings make it possible to achieve roughly the same level of classification quality with the added benefit of much lower dimensionality and the capability of handling out-of-vocabulary words. More refined embedding methods based on deep neural networks, while much more demanding computationally, do not appear to offer substantial advantages for the classification task. The fine-tuned BioBERT classification model performs on par with conventional algorithms when they are coupled with their best text representation methods.
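
As a rough illustration of the bag of words representation named in this abstract, the sketch below turns short texts into count vectors over a shared vocabulary. The tokenizer, the function names and the sample sentences are assumptions made for this example; they are not taken from the paper.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(texts):
    """Map every distinct token in the corpus to a column, in sorted order."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: index for index, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Return a count vector; tokens outside the vocabulary are dropped."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical two-document corpus.
corpus = ["Aspirin reduces pain", "Pain pain relief"]
vocab = build_vocabulary(corpus)
vectors = [bag_of_words(doc, vocab) for doc in corpus]
```

The vector length equals the vocabulary size, and a word never seen in the corpus contributes nothing to the vector, which are the two limitations the abstract contrasts with fastText embeddings.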

2

A contemporary multi-objective feature selection model for depression detection using a hybrid pBGSK optimization algorithm

Kavi Priya Santhosam, Pon Karthika Kasirajan

International Journal of Applied Mathematics and Computer Science

|

2023

|

Vol. 33, no. 1

117--131

EN

Depression is one of the primary causes of global mental illnesses and an underlying reason for suicide. The user-generated text content available in social media forums offers an opportunity to build automatic and reliable depression detection models. The core objective of this work is to select an optimal set of features that may help in classifying depressive content posted on social media. To this end, a novel multi-objective feature selection technique (EFS-pBGSK) and machine learning algorithms are employed to train the proposed model. The novel feature selection technique incorporates a binary gaining-sharing knowledge-based optimization algorithm with population reduction (pBGSK) to obtain the optimized features from the original feature space. The extensive feature selector (EFS) is used to filter out the excessive features based on their ranking. Two text depression datasets collected from Twitter and Reddit forums are used for the evaluation of the proposed feature selection model. The experimentation is carried out using naive Bayes (NB) and support vector machine (SVM) classifiers for five different feature subset sizes (10, 50, 100, 300 and 500). The experimental outcome indicates that the proposed model can achieve superior performance scores. The top results are obtained using the SVM classifier for the SDD dataset with 0.962 accuracy, 0.929 F1 score, 0.0809 log-loss and 0.0717 mean absolute error (MAE). As a result, the optimal combination of features selected by the proposed hybrid model significantly improves the performance of the depression detection system.

3

The impact of green finance development on ecological protection based on machine learning

Zhang Ting

Ecological Chemistry and Engineering. S

|

2023

|

Vol. 30, no. 1

103--110

EN

In the context of today’s green development, the core task of the financial sector at all levels is to enhance the utilisation of resources and to guide the high-quality development of industries, especially by channelling funds originally concentrated in high-pollution and energy-intensive industries towards green and high-technology sectors, so as to achieve the harmonious development of the economy, resources and the environment. This paper proposes a green financial text classification model based on machine learning. The model consists of four modules: the input module, the data analysis module, the data category module, and the classification module. The data analysis module extracts information from the input data and the data category module extracts the green financial category information; the two types of information are then fused by an attention mechanism to classify green financial data within financial data. Extensive experiments are conducted on financial text datasets collected from the Internet to demonstrate the superiority of the proposed green financial text classification method.

4

Model of the text classification system using fuzzy sets

Salahor Dmytro, Smołka Jakub

Journal of Computer Sciences Institute

|

2021

|

Vol. 19

144--150

PL

Classifying the subject area of a work by its keywords is a current and important task. The article describes algorithms for classifying keywords by subject area. A model was developed using two algorithms and tested on test data. The results were compared with those of other existing algorithms suitable for this task. The results of the model were analysed. The algorithm can be applied to real-world tasks.

EN

Classification of a work’s subject area by keywords is a current and important task. This article describes algorithms for classifying keywords by subject area. A model was developed using both algorithms and tested on test data. The results were compared with the results of other existing algorithms suitable for these tasks. The obtained results of the model were analysed. This algorithm can be used in real-life tasks.

5

Experiments with language combinatorics in text classification: lessons learned and future implications

Ptaszynski M., Masui F.

Technical Transactions

|

2017

|

Vol. 11(114)

183--197

EN

This paper presents a meta-analysis of experiments performed with language combinatorics (LC), a novel language model generation and feature extraction method based on combinatorial manipulations of sentence elements (e.g., words). In recent years LC has been applied to a number of text classification tasks, such as affect analysis, cyberbullying detection and future reference extraction. We summarize two of the most extensive experiments and discuss general implications for future implementations of the combinatorial language model.

PL

This article presents a meta-analysis of studies carried out with language combinatorics (LC), a new method of language model generation and feature extraction based on combinatorial manipulations of sentence elements (e.g., words). In recent years LC has been applied to many text classification tasks, such as affect analysis, cyberbullying detection and extraction of references to future events. The article summarizes two of the most extensive experiments and discusses general implications for future applications of the combinatorial language model.

6

Wykorzystanie akceleracji sprzętowej przy implementacji metryk podobieństwa tekstów

Iwanecki Ł., Koryciak S., Dąbrowska-Boruch A., Wiatr K.

Pomiary Automatyka Kontrola

|

2014

|

Vol. 60, no. 7

426--428

PL

The article describes research on text classifiers. The task was to design a hardware accelerator that would speed up the semantic classification of texts. The project was divided into two parts. The aim of the first part was to propose a hardware implementation of an algorithm realizing a metric for computing document similarity. In the second part, the complete hardware accelerator system was designed. The next design stage is the integration of the metric model with the acceleration system.

EN

The aim of this project is to propose a hardware acceleration system that speeds up text categorization. Text categorization is the task of assigning electronic documents to predefined groups based on their content. The process is complex and requires a high-performance computing system and a large number of comparisons. This document proposes a method of improving text categorization using FPGA technology. The main disadvantage of common processing systems is that they are single-threaded: only one instruction can be executed per time unit. FPGA technology improves concurrency; in this case, hundreds of large numbers can be compared in one clock cycle. The whole project is divided into two independent parts. In the first part, a hardware model of the required metrics is implemented. Two metrics are useful for computing the distance between two texts; they are shown as equations (1) and (2). The formulas are similar, and the only difference is the denominator. This part results in two hardware models of the presented metrics. The main purpose of the second part is to design a hardware acceleration system. The system is based on a Xilinx Zynq device and consists of a Cortex-A9 ARM processor, a DMA controller and a dedicated IP core containing the accelerator. The block diagram of the system is presented in Fig. 4. The DMA controller provides duplex transmission between the DDR3 memory and the accelerating unit, bypassing the CPU. The project is still in development. The last step is to integrate the hardware metrics model with the acceleration system.

7

A Hybrid Algorithm for Text Classification Problem

Liu X., Fu H.

Przegląd Elektrotechniczny

|

2012

|

Vol. 88, no. 1b

8-11

EN

This paper investigates a novel algorithm, EGA-SVM, for the text classification problem, which combines support vector machines (SVM) with an elitist genetic algorithm (EGA). The new algorithm uses EGA, which is based on an elite survival strategy, to optimize the parameters of the SVM. The Iris dataset and one hundred Chinese news reports are used to compare EGA-SVM, GA-SVM and a traditional SVM. The results of numerical experiments show that EGA-SVM improves classification performance more effectively than the other algorithms. This text classification algorithm can easily be extended to literature in the field of electrical engineering.

PL

The article presents a new text classification algorithm based on the SVM (support vector machine) mechanism and a genetic algorithm. The algorithm was tested on the Iris dataset and hundreds of other Chinese examples. The algorithm proved effective. It can easily be extended to the analysis of texts in electrical engineering.
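
The elite survival strategy mentioned in this abstract can be sketched as a small genetic algorithm in which the best individuals are copied unchanged into the next generation. This is only a minimal sketch under stated assumptions: the two tuned values are hypothetical SVM parameters, and `toy_fitness` is a stand-in for a cross-validated accuracy score. The crossover and mutation operators are also illustrative choices, not the authors' method.

```python
import random


def elitist_ga(fitness, bounds, pop_size=20, generations=30, elite=2,
               mutation_scale=0.1, seed=0):
    """Maximize `fitness` over real-valued vectors constrained to `bounds`."""
    rng = random.Random(seed)
    # Random initial population inside the given bounds.
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        # Elitism: the best individuals survive unchanged.
        next_gen = [ind[:] for ind in ranked[:elite]]
        parents = ranked[:pop_size // 2]
        while len(next_gen) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover, then Gaussian mutation clipped to the bounds.
            child = [rng.choice(pair) for pair in zip(a, b)]
            child = [min(hi, max(lo, g + rng.gauss(0, mutation_scale * (hi - lo))))
                     for g, (lo, hi) in zip(child, bounds)]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


def toy_fitness(params):
    """Hypothetical stand-in for validation accuracy; peaks at (1.0, 0.5)."""
    c, gamma = params
    return -((c - 1.0) ** 2 + (gamma - 0.5) ** 2)


best, history = elitist_ga(toy_fitness, [(0.0, 10.0), (0.0, 1.0)])
```

Because the elite individuals are never mutated, the best fitness recorded in `history` cannot decrease from one generation to the next, which is the property that distinguishes an elitist GA from a plain one.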

8

Speech command based application enabling Internet navigation

Mięsikowska M.

Pomiary Automatyka Kontrola

|

2007

|

Vol. 53, no. 5

87-89

EN

The paper presents an attempt to create an application that lets the user browse Internet resources more easily with voice commands, and classify and arrange the browsed information. The application has two basic modules for browsing information on the Internet. The first, the navigation module, processes websites, isolates navigation elements such as links to other websites, and gives each element an identifying name, which the user can pronounce as a voice command. The website is presented to the user in practically its original form. The second module also processes websites and isolates their navigation elements. The only difference between the two modules is how the website is processed and finally presented. The second module extracts vocabulary from the elements, which makes it possible to classify the information on the website and thus to obtain and display an ordered set of navigation elements. The application was implemented in Java with the use of Oracle software. The Sphinx 4 tool was used for speech recognition and understanding.

PL

This work presents an attempt to create an application that allows the user to navigate Internet resources more freely using speech commands, and to classify and organize the browsed information. The application has two main modules through which information on the Internet can be browsed. The first, the navigation module, processes web pages, extracts navigation elements such as links to other pages, and gives the elements an identifying name that lets the user issue spoken commands. The web page is displayed to the user in almost its original form. The second module also processes web pages, extracting navigation elements from them. The only difference in the operation of the two modules is the way the page is processed and finally presented. The second module extracts vocabulary from the elements, which allows the information on the page to be classified, thereby obtaining and displaying an ordered set of navigation elements. The application was implemented in Java using Oracle software. The Sphinx-4 tool was used for the speech recognition system.