Wyniki wyszukiwania - BazTech

1

Using reinforcement learning to select an optimal feature set

Akhiat Yassine, Zinedine Ahmed, Chahhou Mohamed

Journal of Automation Mobile Robotics and Intelligent Systems

|

2024

|

Vol. 18, No. 1

56--66

EN

Feature Selection (FS) is an essential research topic in the area of machine learning. FS, which is the process of identifying the relevant features and removing the irrelevant and redundant ones, is meant to deal with the high dimensionality problem for the sake of selecting the best performing feature subset. In the literature, many feature selection techniques approach the task as a research problem, where each state in the search space is a possible feature subset. In this paper, we introduce a new feature selection method based on reinforcement learning. First, decision tree branches are used to traverse the search space. Second, a transition similarity measure is proposed so as to ensure exploit-explore trade-off. Finally, the informative features are the most involved ones in constructing the best branches. The performance of the proposed approaches is evaluated on nine standard benchmark datasets. The results using the AUC score show the effectiveness of the proposed system.

2

Dimensionality reduction of rotor fault dataset based on joint embedding of multi-class graphs

Zhang Yongfei, Wang Jiaxuan, Zhao Rongzhen, Deng Linfeng

Eksploatacja i Niezawodność

|

2024

|

Vol. 26, no. 1

art. no. 177417

EN

Traditional dimensionality reduction techniques usually rely on a single or a limited number of similar graphs for graph embedding, which limits their ability to extract more information about the internal structure of the data. To address this problem, this study proposes a rotor fault dataset dimensionality reduction algorithm based on multi-class graph joint embedding (MCGJE). The algorithm first overcomes the defect that the traditional feature space cannot take both local and global information into account by constructing local and global median feature line graphs; secondly, based on the graph embedding framework, the algorithm also constructs a hypergraph structure for inscribing complex multivariate relationships between high-dimensional data in the feature space, which in turn enables it to contain more fault information. Finally, we conducted two different rotor fault simulation experiments. The results show that the MCGJE-based algorithm has robustdimensionality reduction capability and can significantly improve the accuracy of fault identification.

3

A contemporary multi-objective feature selection model for depression detection using a hybrid pBGSK optimization algorithm

Kavi Priya Santhosam, Pon Karthika Kasirajan

International Journal of Applied Mathematics and Computer Science

|

2023

|

Vol. 33, no. 1

117--131

EN

Depression is one of the primary causes of global mental illnesses and an underlying reason for suicide. The user generated text content available in social media forums offers an opportunity to build automatic and reliable depression detection models. The core objective of this work is to select an optimal set of features that may help in classifying depressive contents posted on social media. To this end, a novel multi-objective feature selection technique (EFS-pBGSK) and machine learning algorithms are employed to train the proposed model. The novel feature selection technique incorporates a binary gaining-sharing knowledge-based optimization algorithm with population reduction (pBGSK) to obtain the optimized features from the original feature space. The extensive feature selector (EFS) is used to filter out the excessive features based on their ranking. Two text depression datasets collected from Twitter and Reddit forums are used for the evaluation of the proposed feature selection model. The experimentation is carried out using naive Bayes (NB) and support vector machine (SVM) classifiers for five different feature subset sizes (10, 50, 100, 300 and 500). The experimental outcome indicates that the proposed model can achieve superior performance scores. The top results are obtained using the SVM classifier for the SDD dataset with 0.962 accuracy, 0.929 F1 score, 0.0809 log-loss and 0.0717 mean absolute error (MAE). As a result, the optimal combination of features selected by the proposed hybrid model significantly improves the performance of the depression detection system.

4

Research on communication emitter identification based on semi-supervised dimensionality reduction in complex electromagnetic environment

Ge Wei, Qi Lin, Tong Lin, Zhu Jun, Zhang Jing, Zhao Dongyang, Li Ke

Bulletin of the Polish Academy of Sciences. Technical Sciences

|

2023

|

Vol. 71, nr 4

art. no. e145766

EN

The individual identification of communication emitters is a process of identifying different emitters based on the radio frequency fingerprint features extracted from the received signals. Due to the inherent non-linearity of the emitter power amplifier, the fingerprints provide distinguishing features for emitter identification. In this study, approximate entropy is introduced into variational mode decomposition, whose features performed in each mode which is decomposed from the reconstructed signal are extracted while the local minimum removal method is used to filter out the noise mode to improve SNR. We proposed a semi-supervised dimensionality reduction method named exponential semi-supervised discriminant analysis in order to reduce the high-dimensional feature vectors of the signals, and LightGBM is applied to build a classifier for communication emitter identification. The experimental results show that the method performs better than the state-of-the-art individual communication emitter identification technology for the steady signal data set of radio stations with the same plant, batch and model.

5

Efficient multi-classifier wrapper feature-selection model. Application for dimension reduction in credit scoring

Bouaguel Waad

Computer Science

|

2022

|

T. 23 (1)

133--155

EN

The task of identifying the most relevant features for a credit-scoring application is a challenging task. Reducing the number of redundant and unwanted features is an inevitable task for improving the performance of a credit-scoring model. The wrapper approach is usually used in credit-scoring applications to identify the most relevant features. However, this approach suffers from the issue of subset generation and the use of a single classifier as an evaluation function. The problem here is that each classifier may give different results that can be interpreted differently. Hence, we propose an ensemble wrapper featureselection model in this study that is based on a multi-classifier combination. In the first stage, we address the problem of subset generation by minimizing the search space through a customized heuristic. Then, a multi-classifier wrapper evaluation is applied using two-classifier-arrangement approaches in order to select a set of mutually approved sets of relevant features. The proposed method was evaluated on four credit datasets and has shown good performance as compared to individual classifier results.

6

Multiobjective design of permanent magnet generator with dimensionality reduction in criteria space

Woźniak Piotr

Przegląd Elektrotechniczny

|

2020

|

R. 96, nr 7

86--91

EN

This paper addresses the problem of dimensionality reduction while preserving the characteristics of the Pareto set approximation in multiobjective optimization. The real-life engineering design problem for permanent magnet generator is considered. The Pareto front approximations with constraints, ranging from the five objectives to the set of two, are presented and compared.

PL

W artykule przedstawiono rozwiązanie problemu optymalizacji wielokryterialnej poprzez redukcję wymiarów w przestrzeni kryteriów. Rozważono właściwości zbioru Pareto w zadaniu projektowania generatora z magnesami trwałymi. Zaprezentowano i porównano aproksymacje frontu Pareto przy optymalizacji z ograniczeniami przy redukcji z pięciu dwóch kryteriów.

7

Novel approach for big data classification based on hybrid parallel dimensionality reduction using spark cluster

Ali Ahmed Hussein, Abdullah Mahmood Zaki

Computer Science

|

2019

|

Vol. 20 (4)

411--429

EN

The big data concept has elicited studies on how to accurately and efficiently extract valuable information from such huge dataset. The major problem during big data mining is data dimensionality due to a large number of dimensions in such datasets. This major consequence of high data dimensionality is that it affects the accuracy of machine learning (ML) classifiers; it also results in time wastage due to the presence of several redundant features in the dataset. This problem can be possibly solved using a fast feature reduction method. Hence, this study presents a fast HP-PL which is a new hybrid parallel feature reduction framework that utilizes spark to facilitate feature reduction on shared/distributed-memory clusters. The evaluation of the proposed HP-PL on KDD99 dataset showed the algorithm to be significantly faster than the conventional feature reduction techniques. The proposed technique required >1 minute to select 4 dataset features from over 79 features and 3,000,000 samples on a 3-node cluster (total of 21 cores). For the comparative algorithm, more than 2 hours was required to achieve the same feat. In the proposed system, Hadoop’s distributed file system (HDFS) was used to achieve distributed storage while Apache Spark was used as the computing engine. The model development was based on a parallel model with full consideration of the high performance and throughput of distributed computing. Conclusively, the proposed HP-PL method can achieve good accuracy with less memory and time compared to the conventional methods of feature reduction. This tool can be publicly accessed at https://github.com/ahmed/Fast-HP-PL.

8

Ensemble Methods for Improving Classification of Data Produced by Latent Dirichlet Allocation

Jankowski M.

Computer Science and Mathematical Modelling

|

2018

|

No. 8

17--28

EN

Topic models are very popular methods of text analysis. The most popular algorithm for topic modelling is LDA (Latent Dirichlet Allocation). Recently, many new methods were proposed, that enable the usage of this model in large scale processing. One of the problem is, that a data scientist has to choose the number of topics manually. This step, requires some previous analysis. A few methods were proposed to automatize this step, but none of them works very well if LDA is used as a preprocessing for further classification. In this paper, we propose an ensemble approach which allows us to use more than one model at prediction phase, at the same time, reducing the need of finding a single best number of topics. We have also analyzed a few methods of estimating topic number.

PL

Modelowanie tematyczne, jest popularną metodą analizy tekstów. Jednym z najbardziej popularnych algorytmów modelowania tematycznego jest LDA (Latent Dirichlet Allocation) [14]. W ostatnim czasie zostało zaproponowanych wiele nowych rozszerzeń tego modelu, które pozwalają na przetwarzanie dużych ilości danych. Jednym z problemów podczas użycia algorytmu LDA jest to, że liczba tematów musi zostać wybrana przed uruchomieniem algorytmu. Ten krok, wymaga wcześniejszej analizy i zaangażowania analityka danych. Powstało kilka metod, które pozwalają automatyzować ten krok, ale żadna z nich, nie działa dobrze, gdy LDA jest użyte do redukcji wymiarów przed klasyfikacją danych. W tej pracy, proponujemy podejście oparte o ensemble wielu modeli. Taki model, unika problemu wybrania jednego, najlepszego modelu LDA. Pokażemy, że takie podejście pozwala uzyskać niższy błąd klasyfikacji. Zaproponujemy również, dwie nowe metody wyboru liczby tematów, gdy chcemy użyć tylko pojedynczego modelu.

9

Intrinsic dimensionality detection criterion based on Locally Linear Embedding

Meng L., Breitkopf P.

Computer Science

|

2018

|

Vol. 19 (3)

345--356

EN

In this work, we revisit the Locally Linear Embedding (LLE) algorithm that is widely employed in dimensionality reduction. With a particular interest to the correspondences of the nearest neighbors in the original and embedded spaces, we observe that, when prescribing low-dimensional embedding spaces, LLE remains merely a weight-preserving rather than a neighborhood-preserving algorithm. Thus, we propose a \neighborhood-preserving ratio" criterion to estimate the minimal intrinsic dimensionality required for neighborhood preservation. We validate its efficiency on sets of synthetic data, including S-curve, Swiss roll, and a dataset of grayscale images.

10

Predicting pairwise relations with neural similarity encoders

Horn F., Müller K. R.

Bulletin of the Polish Academy of Sciences. Technical Sciences

|

2018

|

Vol. 66, nr 6

821--830

EN

Matrix factorization is at the heart of many machine learning algorithms, for example, dimensionality reduction (e.g. kernel PCA) or recommender systems relying on collaborative filtering. Understanding a singular value decomposition (SVD) of a matrix as a neural network optimization problem enables us to decompose large matrices efficiently while dealing naturally with missing values in the given matrix. But most importantly, it allows us to learn the connection between data points’ feature vectors and the matrix containing information about their pairwise relations. In this paper we introduce a novel neural network architecture termed similarity encoder (SimEc), which is designed to simultaneously factorize a given target matrix while also learning the mapping to project the data points’ feature vectors into a similarity preserving embedding space. This makes it possible to, for example, easily compute out-of-sample solutions for new data points. Additionally, we demonstrate that SimEc can preserve non-metric similarities and even predict multiple pairwise relations between data points at once.

11

A hybrid intelligent system for the prediction of Parkinson's Disease progression using machine learning techniques

Nilashi M., Ibrahim O., Ahmadi H., Shahmoradi L., Farahmand M.

Biocybernetics and Biomedical Engineering

|

2018

|

Vol. 38, no. 1

1--15

EN

Parkinson's Disease (PD) is a progressive degenerative disease of the nervous system that affects movement control. Unified Parkinson's Disease Rating Scale (UPDRS) is the baseline assessment for PD. UPDRS is the most widely used standardized scale to assess parkinsonism. Discovering the relationship between speech signal properties and UPDRS scores is an important task in PD diagnosis. Supervised machine learning techniques have been extensively used in predicting PD through a set of datasets. However, the most methods developed by supervised methods do not support the incremental updates of data. In addition, the standard supervised techniques cannot be used in an incremental situation for disease prediction and therefore they require to recompute all the training data to build the prediction models. In this paper, we take the advantages of an incremental machine learning technique, Incremental support vector machine, to develop a new method for UPDRS prediction. We use Incremental support vector machine to predict Total-UPDRS and Motor-UPDRS. We also use Non-linear iterative partial least squares for data dimensionality reduction and self-organizing map for clustering task. To evaluate the method, we conduct several experiments with a PD dataset and present the results in comparison with the methods developed in the previous research. The prediction accuracies of method measured by MAE for the Total-UPDRSand Motor-UPDRS were obtained respectively MAE = 0.4656 and MAE = 0.4967. The results of experimental analysis demonstrated that the proposed method is effective in predicting UPDRS. The method has potential to be implemented as an intelligent system for PD prediction in healthcare.

12

Short review of dimensionality reduction methods for failure detection

Pocha A., Misztal K., Morkisz P.

Schedae Informaticae

|

2017

|

Vol. 26

69--78

EN

Size of a dataset is often a challenge in real-life applications. Especially, when working with time series data, when the next sample is produced every few milliseconds and can include measurements from hundreds of sensors, one has to take dimensionality of the data into consideration. In this work, we compare various dimensionality reduction methods for time series data and check their performance on a failure detection task. We work on sensory data coming from existing machines.

13

Multilinear Filtering Based on a Hierarchical Structure of Covariance Matrices

Szwabe A., Misiorek P, Ciesielczyk M.

Schedae Informaticae

|

2015

|

Vol. 24

103--112

EN

We propose a novel model of multilinear filtering based on a hierarchical structure of covariance matrices – each matrix being extracted from the input tensor in accordance to a specific set-theoretic model of data generalization, such as derivation of expectation values. The experimental analysis results presented in this paper confirm that the investigated approaches to tensor-based data representation and processing outperform the standard collaborative filtering approach in the ‘cold-start’ personalized recommendation scenario (of very sparse input data). Furthermore, it has been shown that the proposed method is superior to standard tensor-based frameworks such as N-way Random Indexing (NRI) and Higher-Order Singular Value Decomposition (HOSVD) in terms of both the AUROC measure and computation time.

14

Dimensionality Reduction for Probabilistic Neural Network in Medical Data Classification Problems

Kusy M.

International Journal of Electronics and Telecommunications

|

2015

|

Vol. 61, No. 3

289–300

EN

This article presents the study regarding the problem of dimensionality reduction in training data sets used for classification tasks performed by the probabilistic neural network (PNN). Two methods for this purpose are proposed. The first solution is based on the feature selection approach where a single decision tree and a random forest algorithm are adopted to select data features. The second solution relies on applying the feature extraction procedure which utilizes the principal component analysis algorithm. Depending on the form of the smoothing parameter, different types of PNN models are explored. The prediction ability of PNNs trained on original and reduced data sets is determined with the use of a 10-fold cross validation procedure.

15

Tissue Classification Using Efficient Local Fisher Discriminant Analysis

Wang Z., Sun X., Sun L., Qian X.

Przegląd Elektrotechniczny

|

2013

|

R. 89, nr 3b

113--115

EN

A novel scatter-difference-based local Fisher discriminant analysis(SDLFDA) algorithm for tissue classification is proposed in this paper. SDLFDA explicitly considers the local manifold structure and interclass discrimination in gene expression data space. By using SDLFDA, each gene expression data can be projected into a lower-dimensional discriminative feature space. In addition, SDFLDA reduces the computational cost through QR decomposition. Experimental results demonstrate the effectiveness and efficiency of the proposed SDLFDA algorithm.

PL

W artykule przedstawiono algorytm analizy lokalnym wyróżnikiem Fisher’a opartym na różnicach rozproszenia (ang. SDLFDA), służący do klasyfikacji tkanek. Proponowana metoda pozwala na zmniejszenie wymiarowości przestrzeni wyróżnika, określającego dane GXD, a także redukcję kosztów obliczeniowych dzięki dekompozycji QR. Wyniki badań eksperymentalnych potwierdzają skuteczność i sprawność algorytmu.

16

Dimensionality reduction of dynamic mesh animations using HO-SVD

Romaszewski M., Gawron P., Opozda S.

Journal of Artificial Intelligence and Soft Computing Research

|

2013

|

Vol. 3, No. 4

277--289

EN

This work presents an analysis of Higher Order Singular Value Decomposition (HOSVD) applied to reduction of dimensionality of 3D mesh animations. Compression error is measured using three metrics (MSE, Hausdorff, MSDM). Results are compared with a method based on Principal Component Analysis (PCA) and presented on a set of animations with typical mesh deformations.

17

Random projections and Hotelling’s T2 statistics for change detection in high-dimensional data streams

Skubalska-Rafajłowicz E.

International Journal of Applied Mathematics and Computer Science

|

2013

|

Vol. 23, no. 2

447--461

EN

The method of change (or anomaly) detection in high-dimensional discrete-time processes using a multivariate Hotelling chart is presented. We use normal random projections as a method of dimensionality reduction. We indicate diagnostic properties of the Hotelling control chart applied to data projected onto a random subspace of Rn. We examine the random projection method using artificial noisy image sequences as examples.

18

Redukcja wymiarowości danych a selekcja cech w zastosowaniu do prognozowania maksymalnego dobowego obciążenia elektroenergetycznego

Siwek K.

Informatyka, Automatyka, Pomiary w Gospodarce i Ochronie Środowiska

|

2013

|

nr 2

9--12

PL

Prognozowanie obciążeń w systemie elektroenergetycznym jest ważnym problemem praktycznym zarówno z technicznego jak i ekonomicznego punktu widzenia. W małych systemach problem ten jest stosunkowo trudny do rozwiązania ze względu na dużą zmienność przebiegu obciążenia. Do jego rozwiązania niezbędne jest zastosowanie dobrego predykatora i wyselekcjonowanie cech procesu wpływających na prognozę. Artykuł przedstawia dwie metody selekcji cech – algorytm genetyczny oraz algorytmy redukcji wymiarowości. Jako predykator użyta była maszyna wektorów podtrzymujących działająca w trybie regresji (SVR). Zaprezentowano i omówiono uzyskane wyniki na rzeczywistych danych pomiarowych.

EN

Load forecasting task of small energetic region is a difficult problem due to high variability of power consumption. The accurate forecast of the power in the next hours is very important from the economic point of view. The most important problems in prediction are the choice of predictor and selection of features. Two methods of features selection was presented – simple selection using of genetic algorithm and dimensionality reduction methods for creating new features from many available measured data. As a predictor the Support Vector Machine working in regression mode (SVR) was chosen. The load forecasting results with SVR are presented and discussed.

19

Szacowanie selektywności zapytań oparte na transformacie Hougha i metodzie PCA

Augustyn D., Kostrzewa D.

Studia Informatica

|

2012

|

Vol. 33, nr 2A

211-227

PL

Oszacowanie selektywności zapytania jest istotnym elementem procesu uzyskiwania optymalnego planu wykonania tego zapytania. Wyznaczenie selektywności wymaga użycia nieparametrycznego estymatora rozkładu wartości atrybutu, na ogół histogramu. Wykorzystanie wielowymiarowego histogramu jako reprezentacji łącznego rozkładu wielowymiarowego jest nieekonomiczne z powodu zajętości pamięciowej takiej reprezentacji. W artykule zaproponowano nową metodę, nazwaną HPCA, oszczędną pod względem zajętości, gdzie rozkład dwuwymiarowy w przybliżeniu może być reprezentowany w postaci zbioru histogramów jednowymiarowych. Metoda HPCA opiera się na transformacji Hougha i metodzie analizy składowych głównych. Dzięki HPCA można uzyskiwać dokładniejsze oszacowania selektywności zapytań niż te, otrzymane przy wykorzystaniu standardowych 2-wymiarowych histogramów.

EN

Query selectivity estimation is an important element of obtaining optimal query execution plan. Selectivity estimation requires a nonparametric estimator of attribute values distribution – commonly a histogram. Using a multidimensional histogram as a representation of a joint multidimensional distribution of attributes values is not space-efficient. The paper introduces a new space-efficient method called HPCA, where a 2-dimesional distribution may be represented by a set of 1-dimensional histograms. HPCA is based on Hough transform and principal component analysis method. Using HPCA commonly gives more accurate selectivity estimation than standard methods based on a 2-dimensional histogram.

20

Motion capture as Data Source for Gait-based Human Identification

Josiński H., Świtoński A., MIchalczuk A., Wojciechowski K.

Przegląd Elektrotechniczny

|

2012

|

R. 88, nr 12b

201-204

EN

The authors present results of the research aiming at human identification based on tensor representation of the gait motion capture data. High-dimensional tensor samples were reduced by means of the multilinear principal component analysis (MPCA). For the purpose of classification the following methods from the WEKA software were used: k Nearest Neighbors (kNN), Naive Bayes, Multilayer Perceptron, and Radial Basis Function Network. The maximum value of the correct classification rate (CCR) was achieved for the classifier based on the multilayer perceptron.

PL

Autorzy prezentują wyniki badań nad identyfikacją osób na podstawie danych chodu uzyskanych za pomocą techniki motion capture. Redukcję wymiarowości przeprowadzono stosując algorytm wieloliniowej analizy składowych głównych (MPCA), który operuje na tensorowej reprezentacji danych. Dla potrzeb identyfikacji osób zastosowano szereg metod klasyfikacji dostępnych w pakiecie WEKA uzyskując największą skuteczność dla perceptronu wielowarstwowego. (Technika motion capture jako źródło danych dla identyfikacji osób na podstawie chodu).