Search results
Searched in keywords: word embeddings
Results found: 6
1
EN
Phishing has been one of the most successful attacks in recent years. Motivated by growing financial gains, criminals constantly improve their email phishing methods. A key goal, therefore, is to develop effective detection methods that can cope with huge volumes of email data. In this paper, a solution using a BLSTM neural network and FastText word embeddings is proposed. The solution uses preprocessing techniques such as stop-word removal, tokenization, and padding. Two datasets, one balanced and one imbalanced, were used in three experiments; on the imbalanced dataset, the effect of the maximum token length was also investigated. Evaluation of the model yielded its best metrics on the imbalanced dataset: 99.12% accuracy, 98.43% precision, 99.49% recall, and 98.96% F1-score. The solution was compared to an existing one that uses a deep learning model and word embeddings. Finally, the model and solution architecture were implemented as a browser plug-in.
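A minimal sketch of such a classifier, assuming Keras with a precomputed FastText embedding matrix; the vocabulary size, padding length, and layer width below are illustrative placeholders, not the paper's settings:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 20_000   # assumed vocabulary size
EMBED_DIM = 300       # FastText vectors are commonly 300-dimensional
MAX_TOKENS = 200      # assumed padding length

# In the real pipeline this matrix is filled row by row with the FastText
# vector of each vocabulary word; zeros keep the sketch self-contained.
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM), dtype="float32")

embedding = layers.Embedding(VOCAB_SIZE, EMBED_DIM, trainable=False)
model = keras.Sequential([
    layers.Input(shape=(MAX_TOKENS,)),
    embedding,                                 # frozen FastText lookup
    layers.Bidirectional(layers.LSTM(128)),    # the BLSTM encoder
    layers.Dense(1, activation="sigmoid"),     # phishing vs. legitimate
])
embedding.set_weights([embedding_matrix])      # load the pretrained vectors
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```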
2
EN
This study experimentally evaluates the word vectors produced by three widely used embedding methods for word-level semantic similarity in Turkish. Three benchmark datasets, SimTurk, AnlamVer, and RG65_Turkce, are used to evaluate the word vectors produced by Word2Vec, GloVe, and FastText. The comparative analysis shows that the Turkish word vectors produced with GloVe and FastText achieve better correlation for word-level semantic similarity. The Turkish word coverage of FastText is also ahead of the other two methods, as only a limited number of out-of-vocabulary (OOV) words were observed in the FastText experiments. Another observation is that FastText and GloVe vectors achieve high Spearman correlation values on the SimTurk and AnlamVer datasets, both of which were prepared and evaluated entirely by native Turkish speakers; this further indicates that these datasets better represent the Turkish language in terms of morphology and inflection.
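A minimal sketch of this style of evaluation, assuming gensim KeyedVectors and a tab-separated benchmark file of word pairs with gold similarity scores; the file names are placeholders:

```python
# Spearman correlation between model cosine similarities and human judgments.
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

wv = KeyedVectors.load_word2vec_format("turkish_fasttext.vec")  # assumed path

model_scores, gold_scores = [], []
with open("simturk_pairs.tsv", encoding="utf-8") as f:  # assumed benchmark file
    for line in f:
        w1, w2, gold = line.strip().split("\t")
        if w1 in wv and w2 in wv:  # skip out-of-vocabulary pairs
            model_scores.append(wv.similarity(w1, w2))  # cosine similarity
            gold_scores.append(float(gold))

rho, _ = spearmanr(model_scores, gold_scores)
print(f"Spearman correlation: {rho:.3f} over {len(model_scores)} in-vocab pairs")
```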
3
Improving utilization of lexical knowledge in natural language inference
EN
Natural language inference (NLI), the problem of predicting the logical relationship between a pair of sentences, is central to natural language processing (NLP). Lexical knowledge, which represents relations between words, is often important for solving NLI problems. This knowledge can be obtained from an external knowledge base (KB), but only when such a resource is accessible. Instead of using a KB, we propose a simple architectural change for attention-based models. We show that by adding a skip connection from the input to the attention layer, we can better utilize the lexical knowledge already present in the pretrained word embeddings. Finally, we demonstrate that our strategy makes it straightforward to use an external source of knowledge by incorporating a second word-embedding space into the model.
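A minimal sketch of such a skip connection, assuming a PyTorch dot-product attention between two encoded sentences; the function name and shapes are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn.functional as F

def attend_with_skip(h_a, h_b, e_a, e_b):
    # h_a, h_b: encoder outputs, shape (batch, len, d_hidden)
    # e_a, e_b: pretrained input embeddings, shape (batch, len, d_embed)
    # Skip connection: concatenate the raw embeddings onto the encoder
    # states, so the attention weights can also be driven directly by the
    # lexical similarity already encoded in the pretrained vectors.
    a = torch.cat([h_a, e_a], dim=-1)
    b = torch.cat([h_b, e_b], dim=-1)
    scores = torch.bmm(a, b.transpose(1, 2))   # (batch, len_a, len_b)
    attn = F.softmax(scores, dim=-1)           # each token of a attends over b
    return torch.bmm(attn, h_b)                # context vectors for sentence a
```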
4
EN
The article introduces a new set of Polish word embeddings built using the KGR10 corpus, which contains more than 4 billion words. These embeddings are evaluated on the problem of recognizing temporal expressions (timexes) in Polish. We describe the process of KGR10 corpus creation and a new approach to the recognition problem using a Bidirectional Long Short-Term Memory (BiLSTM) network with an additional CRF layer, for which the choice of embeddings is essential. We present experiments and the conclusions drawn from them.
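A minimal sketch of a BiLSTM-CRF tagger of this kind, assuming PyTorch with the third-party pytorch-crf package; the dimensions and tag count are illustrative:

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, embed_dim=100, hidden=256, num_tags=5):
        super().__init__()
        # Precomputed word embeddings (e.g. trained on KGR10) are fed in
        # directly as (batch, seq_len, embed_dim) tensors.
        self.lstm = nn.LSTM(embed_dim, hidden // 2, bidirectional=True,
                            batch_first=True)
        self.proj = nn.Linear(hidden, num_tags)    # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, embeds, tags, mask):
        emissions = self.proj(self.lstm(embeds)[0])
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def predict(self, embeds, mask):
        emissions = self.proj(self.lstm(embeds)[0])
        return self.crf.decode(emissions, mask=mask)  # best tag sequences
```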
5
EN
The aim of this research is to construct meaningful user profiles that best describe user interests in the context of the media content they browse. We use two distinct state-of-the-art numerical text-representation techniques: LDA topic modeling and Word2Vec word embeddings. We train our models on a collection of news articles in Polish and compare them with a model built on a general language corpus. We compare the performance of these algorithms on two practical tasks. First, we perform a qualitative analysis of the semantic relationships for similar-article retrieval, and then we evaluate the predictive performance of distinct feature combinations for user gender classification. We apply the algorithms to a real-world dataset from the Polish news service Onet. Our results show that the choice of text representation depends on the task: Word2Vec is more suitable for text comparison, especially for short texts such as titles. In the gender classification task, the best performance is obtained with a combination of features: topics from the article text and word embeddings from the title.
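A minimal sketch of combining the two representations into one feature vector per article, assuming gensim; the tiny in-line corpus exists only to keep the example self-contained, and all names are illustrative:

```python
import numpy as np
from gensim import corpora, models

body_tokens = ["wybory", "sejm", "rzad"]      # assumed tokenized article body
title_tokens = ["wybory", "parlamentarne"]    # assumed tokenized title

dictionary = corpora.Dictionary([body_tokens])
lda = models.LdaModel(corpus=[dictionary.doc2bow(body_tokens)],
                      id2word=dictionary, num_topics=10)
w2v = models.Word2Vec([title_tokens], vector_size=100, min_count=1)

# Dense LDA topic distribution for the article body.
topic_vec = np.zeros(lda.num_topics)
for topic_id, prob in lda.get_document_topics(dictionary.doc2bow(body_tokens)):
    topic_vec[topic_id] = prob

# Mean Word2Vec vector for the short title.
title_vec = np.mean([w2v.wv[t] for t in title_tokens if t in w2v.wv], axis=0)

features = np.concatenate([topic_vec, title_vec])  # input to the classifier
```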
6
Word Embeddings for Morphologically Complex Languages
EN
Recent methods for learning word embeddings, such as GloVe or Word2Vec, have succeeded in the spatial representation of semantic and syntactic relations. We extend GloVe by introducing separate vectors for the base form and the grammatical form of a word, using a morphosyntactic dictionary for this purpose. This allows the vectors to better capture the properties of words. We also present model results for the word analogy test and introduce a new test based on WordNet.
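A minimal sketch of the standard vector-offset word analogy test (a : b :: c : ?), assuming gensim KeyedVectors; the file path and the Polish example are illustrative:

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("polish_vectors.vec")  # assumed path

# król - mężczyzna + kobieta should land near królowa if the
# semantic relation is captured by the vector space.
result = wv.most_similar(positive=["król", "kobieta"],
                         negative=["mężczyzna"], topn=1)
print(result)
```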