Search results - BazTech

1

Detection of a Source Code Plagiarism in a Student Programming Competition

Gniazdowski Zenon, Boniecki Maciej

Zeszyty Naukowe Warszawskiej Wyższej Szkoły Informatyki | 2019 | nr 21 | 74--94

EN

The article presents a system for testing the independence of solutions to algorithmic problems submitted by students in a student programming competition. First, the context is discussed, along with the resulting need to organize programming competitions. Then, an algorithm is proposed for examining the mutual similarity of the source code of submitted programs. Because the implemented algorithm was used in practice, the article also presents examples of its use in detecting source code plagiarism in two programming competitions conducted as part of classes on Algorithms and Numerical Methods. Finally, the effectiveness of the solutions used in the work is discussed.

2

Niewiarowski Artur

Computer Assisted Methods in Engineering and Science | 2019 | vol. 26, no. 3-4 | 163--175

EN

This paper presents a new algorithm for measuring the similarity between two text documents. The method is based on the structure of the so-called “edit distance matrix” (similarity matrix), whose elements are filled using a formula based on Levenshtein distances between sequences of sentences. The Levenshtein distance algorithm (LDA) is used as a replacement for various implementations of stemming or lemmatization methods. The proposed algorithm is fast and precise, and it can be applied to very large documents (e.g., books, diploma works, newspapers). It also appears to be versatile across the most common European languages, such as Polish, English, German, French and Russian. The presented tool is intended for all employees and students of the university who need to assess the level of similarity between analyzed documents. Results obtained in the paper were confirmed in the tests shown in the article.
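
For orientation, the Levenshtein distance that this abstract builds on can be sketched in a few lines of Python. This is the textbook dynamic-programming formulation, not the matrix formula proposed in the paper; the `similarity` helper and its scaling by the longer length are assumptions added here for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens)."""
    # prev[j] holds the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Hypothetical helper: map the distance to [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Because inflected forms of one word usually differ by only a few trailing characters, a small distance between two words can serve as a rough stand-in for stemming, which is the role the abstract assigns to the measure.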

3

The evaluation of text string matching algorithms as an aid to image search

Ochelska-Mierzejewska J.

Journal of Applied Computer Science | 2018 | Vol. 26, nr 1 | 33--62

EN

The main goal of this paper is to analyse intelligent text string matching methods (like fuzzy sets and relations) and evaluate their usefulness for image search. The present study examines the ability of different algorithms to handle multi-word and multi-sentence queries. Eight different similarity measures (N-gram, Levenshtein distance, Jaro coefficient, Dice coefficient, Overlap coefficient, Euclidean distance, Cosine similarity and Jaccard similarity) are employed to analyse the algorithms in terms of time complexity and accuracy of results. The outcomes are used to develop a hierarchy of methods, illustrating their usefulness to image search. The search response time increases significantly in the case of data sets containing several thousand images. The findings indicate that the analysed algorithms do not fulfil the response-time requirements of professional applications. Due to its limitations, the proposed system should be considered only as an illustration of a novel solution with further development perspectives. The use of Polish as the language of experiments affects the accuracy of measures. This limitation seems to be easy to overcome in the case of languages with simpler grammar rules (e.g. English).
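
Three of the set-based measures named above (Jaccard, Dice and Overlap) can be illustrated on character n-grams. The following Python sketch uses the standard definitions; the `ngrams` helper, the choice of bigrams, and the lower-casing step are assumptions made here, not details taken from the paper.

```python
def ngrams(text, n=2):
    """Set of character n-grams of a lower-cased string (hypothetical helper)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """|A & B| / |A | B|; two empty sets are treated as identical."""
    return len(a & b) / len(a | b) if a | b else 1.0


def dice(a, b):
    """2 |A & B| / (|A| + |B|); two empty sets are treated as identical."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 1.0


def overlap(a, b):
    """|A & B| / min(|A|, |B|); defined as 0.0 if either set is empty."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0
```

For example, `ngrams("night")` and `ngrams("nacht")` share only the bigram `ht`, so the three coefficients give different scores for the same pair of words.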

4

Sentence sentiment classification using fuzzy word matching combined with fuzzy sentiment classifier

Pietras M.

Przegląd Elektrotechniczny | 2015 | R. 91, nr 2 | 107-111

EN

This article focuses on semantic tagging of content in terms of sentiment, which may often lead to ambiguities between the primary sense of a word and its meaning in a particular expression. To address this issue, a specially modified Levenshtein distance algorithm with suffix mitigation was used to measure the similarity of words. Sentence sentiment classification was based on a fuzzy logic approach and a fuzzy classifier. The presented method was tested experimentally on the sentiment analysis of selected sentences in the Polish language. Limitations of the presented method and possible improvements are discussed.

PL

Artykuł skupia się na semantycznym tagowaniu zawartości tekstu w kategoriach znaczenia sentymentalnego, które często może prowadzić do niejednoznaczności między pierwotnym wydźwiękiem słowa i jego znaczeniem w danej wypowiedzi. Aby zmierzyć się z tym zagadnieniem zastosowano specjalnie zmodyfikowany algorytm na odległość Levenshteina z łagodzeniem znaczenia końcówki fleksyjnej wyrazu do pomiaru podobieństwa słów. Sentymentalna klasyfikacja zdań została oparta na logice rozmytej i podejściu rozmytego klasyfikatora. Przedstawiona metoda została eksperymentalnie sprawdzona z sentymentalnej analizy wybranych zdań w języku polskim. Ograniczenia prezentowanej metody oraz możliwe ulepszenia są również omawiane.

5

Parallelization of the Levenshtein distance algorithm

Niewiarowski A., Stanuszek M.

Czasopismo Techniczne. Nauki Podstawowe | 2014 | R. 111, z. 3-NP | 109--122

EN

This paper presents a method for the parallelization of the Levenshtein distance algorithm deployed on very large strings. The proposed approach was accomplished using .NET Framework 4.0 technology with a specific implementation of threads using the System.Threading.Task namespace library. The algorithms developed in this study were tested on a high performance machine using Xamarin Mono (for Linux RedHat/Fedora OS). The computational results demonstrate a high level of efficiency of the proposed parallelization procedure.

PL

Artykuł przedstawia metodę zrównoleglenia algorytmu analizy odległości edycyjnej Levenshteina dedykowaną bardzo dużym ciągom tekstowym. Zaproponowane rozwiązanie zostało zaimplementowane na platformie .NET Framework 4.0 z uwzględnieniem metod dostępnych w przestrzeni nazw System.Threading.Task. Zastosowane algorytmy przetestowano na komputerze wysokiej wydajności, w oparciu o narzędzia Xamarin Mono (dla SO Linux RedHat/Fedora). Otrzymane wyniki pokazują znacząco zwiększoną wydajność obliczeń dla przedstawionych w artykule rozwiązań.
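
The abstract does not say how the computation was divided among threads. One commonly used observation, given here only as a sketch in Python and not as the authors' .NET implementation, is that every cell of the distance table depends solely on cells from the two preceding anti-diagonals. Cells on the same anti-diagonal are therefore independent and could, in principle, be handed to separate workers. The function name is hypothetical, and the inner loop below runs sequentially; it only marks where such a split would be possible.

```python
def levenshtein_wavefront(a, b):
    """Levenshtein distance computed anti-diagonal by anti-diagonal."""
    n, m = len(a), len(b)
    # Full (n+1) x (m+1) table; row 0 and column 0 are the base cases.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    # All cells with i + j == k lie on one anti-diagonal.
    for k in range(2, n + m + 1):
        # Iterations of this inner loop do not depend on each other:
        # each reads only cells with i + j equal to k - 1 or k - 2.
        for i in range(max(1, k - m), min(n, k - 1) + 1):
            j = k - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[n][m]
```

For very large strings the full table is the main memory cost; keeping only the last two anti-diagonals would reduce it, at the price of slightly more index bookkeeping.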

6

Mechanizm identyfikacji i klasyfikacji treści [A mechanism for content identification and classification]

Niewiarowski A., Stanuszek M.

Studia Informatica | 2013 | Vol. 34, nr 2B | 205--222

PL

Artykuł opisuje mechanizm identyfikacji i klasyfikacji treści, oparty na metodzie ważenia terminów, bazującej na odwrotnej częstości dokumentowej, częstości wystąpienia terminu i odległości Levenshteina. Zaproponowany mechanizm zaimplementowano w program analizujący tematy i opisy prac dyplomowych, w celu automatycznego doboru promotorów i recenzentów.

EN

This paper presents a mechanism for identifying and classifying content, based on a term-weighting method that combines inverse document frequency, term frequency, and the Levenshtein distance. The proposed mechanism is implemented in a program that analyzes the topics and descriptions of diploma theses in order to select supervisors and reviewers automatically.

7

Guo L., Wang W., Chen F., Tang X., Wang W.

Przegląd Elektrotechniczny | 2012 | R. 88, nr 1b | 26-30

EN

Rapidly changing information technology makes data grow exponentially in all areas, and the quality of these huge amounts of data is a core problem. Data cleaning is an effective technique for solving data quality problems. This paper focuses on duplicate data cleaning techniques. It studies data quality at the architectural level, instance-level problems, multi-source and single-source problems, an application platform for cleaning duplicated records, and evaluation criteria. An improved detection method is proposed that combines a fuzzy clustering algorithm with the Levenshtein distance for data cleaning. It can accurately and quickly detect and remove duplicate raw data. The paper covers the detection process for similar duplicate records, the design of the main system framework, the implementation of the system's function modules, and an analysis of the results. The precision and recall rates are higher than those of several other data cleaning methods, which confirms the validity of the method. The experimental results show that the proposed method is effective in data detection and cleaning.

PL

Artykuł proponuje nowe metody czyszczenia danych z uwzględnieniem liczby przypadków, wielu źródeł, podwójnych rekordów i innych kryteriów oceny. Ulepszona metoda detekcji wykorzystuje algorytm rozmytego klastrowania z dystansem Levenshteina. W ten sposób szybko wykrywane są i usuwane podwójne wiersze danych.

8

Signatures recognition method by using the normalized Levenshtein distances

Doroz R., Wróbel K., Porwik P.

Journal of Medical Informatics & Technologies | 2009 | Vol. 13 | 73--77

EN

This study examines the effectiveness of normalized Levenshtein metrics in the recognition of handwritten signatures. Three methods of normalizing the Levenshtein metric were taken into consideration. In addition, it was determined which signature features are most important when signatures are compared using this metric. The following signature features were examined: coordinates of signature points, pen pressure in successive points, and different types of pen speed. The influence of individual parameters of the Levenshtein algorithm on the obtained results was also determined, and the best method of normalization was selected.
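
The abstract does not name the three normalization methods it compares. As a hedged illustration only, the Python sketch below shows three commonly used ways to scale a raw Levenshtein distance by the lengths of the compared sequences; the function names are hypothetical and the choice of formulas is an assumption, not a reconstruction of the study.

```python
def normalize_by_max(distance, len_a, len_b):
    """Divide by the length of the longer sequence."""
    longest = max(len_a, len_b)
    return distance / longest if longest else 0.0


def normalize_by_sum(distance, len_a, len_b):
    """Divide by the combined length of both sequences."""
    total = len_a + len_b
    return distance / total if total else 0.0


def normalize_with_distance(distance, len_a, len_b):
    """2d / (len_a + len_b + d): the distance also appears in the denominator."""
    denom = len_a + len_b + distance
    return 2 * distance / denom if denom else 0.0
```

All three return 0.0 for identical sequences, but they rank pairs of unequal length differently, which is one reason a comparison of normalization methods can change recognition results.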

9

Automatyczne sprawdzanie poprawności pisowni w języku polskim oparte na odległości Levenshteina [Automatic spell checking for Polish based on the Levenshtein distance]

Dorosz K.

Automatyka / Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie | 2008 | T. 12, z. 1 | 29-40

PL

Ogólnie stosowane metody sprawdzania poprawności pisowni wyrazów opierają się na wykorzystaniu zagadnienia odległości Levenshteina. Metody te do działania wymagają obecności słownika fleksyjnego języka, w którym sprawdzane wyrazy zostały napisane. Ze względu na to, że metody te zostały pierwotnie utworzone na potrzeby języka angielskiego, nie są optymalne w użyciu do przetwarzania tekstów w języku polskim. W niniejszym artykule zaprezentowano charakterystyczne cechy języka polskiego, które wpływają na budowę spellcheckera oraz propozycję pewnej adaptacji metody odległości Levenshteina z uwzględnieniem tych specyficznych cech. Nowy algorytm wykazuje się poprawą jakościową w poprawianiu tekstów napisanych w języku polskim.

EN

Today's widely used spellchecking methods are based on Levenshtein distance algorithms. They also require an inflectional dictionary of the language in which the checked words are written. These methods are not optimal for spellchecking texts written in Polish, because they were invented for, and optimized for, English texts. This article describes the characteristics of the Polish language that affect spellchecker design and proposes an adaptation of the Levenshtein distance method that takes these characteristics into account. The new algorithm improves the quality of corrections for texts written in Polish.