Search results - BazTech

Results found: 2

Searched for:
in keywords: duplikacja danych (data duplication)

A Gaussian-based WGAN-GP oversampling approach for solving the class imbalance problem

Zhou Qian, Sun Bo

International Journal of Applied Mathematics and Computer Science

2024

Vol. 34, no. 2

291--307

In practical applications of machine learning, the class distribution of the collected training set is usually imbalanced, i.e., there is a large difference among the sizes of different classes. The class imbalance problem often hinders the achievable generalization performance of most classifier learning algorithms to a large extent. To ameliorate the learning performance, some effective approaches have been proposed in the literature, where the recently presented GAN-based oversampling methods are very representative. However, their generated minority class examples have the risk of high similarity and duplication degree. To further ameliorate the quality of the generated minority class examples, i.e., to make the generated examples effectively expand the minority class region, a novel oversampling approach named the GWGAN-GP is proposed, which is based on the Gaussian distribution label within the framework of a Wasserstein generative adversarial network with gradient penalty (WGAN-GP). Our GWGAN-GP approach incorporates the Gaussian distribution as an input label, thereby making the generated examples more diverse and dispersive. The examples are then combined with the original dataset to form a balanced dataset, which is subsequently utilized to evaluate the classification performance of three selected classification algorithms. Experimental results on 16 imbalanced datasets demonstrate that the GWGAN-GP not only generates examples that better conform to the distribution of the original dataset, but also achieves superior classification performance. Specifically, when combined with the KNN classifier, the GWGAN-GP significantly outperforms other oversampling approaches considered in the study.
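
For orientation, the critic objective of the standard WGAN-GP framework that the abstract builds on is usually written as below. This is the generic formulation only; the abstract does not give the GWGAN-GP-specific loss or say how the Gaussian label enters it, so none of that is shown. The symbols are assumptions taken from the usual WGAN-GP notation: $P_r$ is the real data distribution, $P_g$ the generator distribution, $D$ the critic, and $\lambda$ the penalty weight.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
    - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
      \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\,\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The last term is the gradient penalty: it pushes the norm of the critic's gradient toward 1 at points interpolated between real and generated examples.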

Software tools to measure the duplication of information

Kostrzewski M.

Journal of Applied Computer Science

2014

Vol. 22, no. 1

61--71

Data stored in an average computer system are usually not unique; portions of the stored data are duplicated. When duplicated data are stored in separate files containing the source code of students' homework programs, the possibility of cheating should be seriously considered. This paper presents software tools built to detect the reuse of pieces of code in supplied text files. Three aspects of information matching are considered: identity, similarity, and analogy. The tools have proved useful in real-life situations.
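
As a rough illustration of two of the three aspects named in the abstract (identity and similarity), a minimal Python sketch using only the standard library might look like the following. It is not the author's tool: the function names `normalize`, `is_identical`, and `similarity` are hypothetical, and the normalization rule is an assumption. The third aspect, analogy, is not attempted here.

```python
import hashlib
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse whitespace runs and drop blank lines (an assumed
    normalization, so that pure layout changes do not hide copying)."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def _digest(text: str) -> str:
    # SHA-256 of the normalized text, used only as a cheap equality check.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def is_identical(a: str, b: str) -> bool:
    """Identity: the two texts are equal after normalization."""
    return _digest(a) == _digest(b)


def similarity(a: str, b: str) -> float:
    """Similarity: difflib's ratio in [0, 1] on the normalized texts."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

A submission pair could then be flagged for manual review when `is_identical` returns `True` or when `similarity` exceeds a threshold chosen by the instructor; any such threshold is a policy choice and not something the abstract specifies.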