Search results - BazTech

1

TF-IDF inspired detection for cross-language source code plagiarism and collusion

Karnalim Oscar

Computer Science | 2020 | Vol. 21 (1) | pp. 113-134

EN

Several computing courses allow students to choose which programming language they use to complete a programming task. This can lead to cross-language code plagiarism and collusion, in which the copied code file is rewritten in another programming language. In response, this paper proposes a detection technique that can accurately compare code files written in various programming languages while requiring limited effort to accommodate such languages at the development stage. The only language-dependent feature used in the technique is the source code tokeniser, and no code conversion is applied. The impact of coincidental similarity is reduced by applying a TF-IDF inspired weighting, in which rare matches are prioritised. Our evaluation shows that the technique outperforms common techniques in academia for handling language conversion disguises. Furthermore, it is comparable to those techniques when dealing with conventional disguises.
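The idea of prioritising rare matches can be illustrated with a small sketch. This is not the paper's algorithm: it assumes each code file has already been reduced to a list of language-neutral token names (the token names, the bigram size, and all function names below are hypothetical), weights each token n-gram by a plain inverse document frequency, and compares two files by cosine similarity.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Return the contiguous token n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def idf_weights(corpus):
    """Map each n-gram to log(N / df), where df counts the files containing it."""
    n_files = len(corpus)
    df = Counter()
    for grams in corpus:
        df.update(set(grams))
    return {g: math.log(n_files / count) for g, count in df.items()}


def weighted_similarity(grams_a, grams_b, idf):
    """Cosine similarity of two TF-IDF vectors built from n-gram counts."""
    vec_a = {g: c * idf.get(g, 0.0) for g, c in Counter(grams_a).items()}
    vec_b = {g: c * idf.get(g, 0.0) for g, c in Counter(grams_b).items()}
    dot = sum(w * vec_b.get(g, 0.0) for g, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical generalised token streams for three submissions.
file_1 = ["id", "=", "num", "while", "id", "<", "num"]
file_2 = ["id", "=", "num", "while", "id", "<", "num"]
file_3 = ["id", "=", "num", "return", "id"]

corpus = [ngrams(f) for f in (file_1, file_2, file_3)]
idf = idf_weights(corpus)
print(weighted_similarity(corpus[0], corpus[1], idf))
print(weighted_similarity(corpus[0], corpus[2], idf))
```

In this toy corpus, the n-grams that every file contains receive a weight of zero, so file_1 and file_3, which share only those common n-grams, are not reported as similar, while the identical pair still scores highly.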

2

Importance of Text Data Preprocessing & Implementation in RapidMiner

Kalra V., Aggarwal R.

Annals of Computer Science and Information Systems | 2018 | Vol. 14 | pp. 71-75

EN

Data preparation is an important phase before applying any machine learning algorithm, and text data is no exception: it must be prepared before a machine learning algorithm is applied to it. Data preparation is carried out through preprocessing. Preprocessing text means cleaning out noise such as stop words, punctuation, and terms that carry little weight in the context of the text. In this paper, we describe in detail how to prepare data for machine learning algorithms using the RapidMiner tool. The preprocessing is followed by conversion of the bag of words into a term vector model, and we describe the various algorithms that can be applied in RapidMiner for data analysis and predictive modeling. We also discuss the challenges and recent applications of text mining.
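The preprocessing steps named in this abstract (removing punctuation and stop words, then turning the bag of words into term vectors) can be sketched outside RapidMiner. The following is a minimal standard-library Python illustration, not the RapidMiner workflow from the paper; the stop-word list and function names are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; an assumption, not RapidMiner's built-in list.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def preprocess(text):
    """Lowercase, keep only alphabetic runs (drops punctuation), remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def term_vectors(documents):
    """Return (vocabulary, vectors) with one term-count vector per document."""
    bags = [Counter(preprocess(doc)) for doc in documents]
    vocabulary = sorted(set().union(*bags))
    vectors = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, vectors


vocabulary, vectors = term_vectors(["The cat sat.", "A cat and a dog!"])
print(vocabulary)  # ['cat', 'dog', 'sat']
print(vectors)     # [[1, 0, 1], [1, 1, 0]]
```

Each row of the result is a document and each column a vocabulary term, which is the shape most classifiers expect as input.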

3

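Reprezentacje dokumentów tekstowych w kontekście wykrywania niechcianych wiadomości pocztowych w języku polskim z wtrąceniami w języku angielskim [Representations of text documents in the context of detecting unwanted e-mail messages in Polish with English insertions]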

Andruszkiewicz P.

Studia Informatica | 2009 | Vol. 30, no. 2A | pp. 273-286

PL (translated to English)

Classifying text documents involves creating representations of them. The large number of documents encourages attempts to represent them as economically as possible. This paper presents possible representations of text documents and ways of reducing them in the context of detecting unwanted e-mail messages in Polish with English insertions.

EN

A representation of text documents should be as small as possible while still giving high classification accuracy. This paper presents representations of text documents and ways of reducing them for spam detection in Polish text containing English phrases.