Wyniki wyszukiwania - BazTech

1

A comparative study for outlier detection methods in high dimensional text data

Park Cheong Hee

Journal of Artificial Intelligence and Soft Computing Research

|

2023

|

Vol. 13, No. 1

5--17

EN

Outlier detection aims to find a data sample that is significantly different from other data samples. Various outlier detection methods have been proposed and have been shown to be able to detect anomalies in many practical problems. However, in high dimensional data, conventional outlier detection methods often behave unexpectedly due to a phenomenon called the curse of dimensionality. In this paper, we compare and analyze outlier detection performance in various experimental settings, focusing on text data with dimensions typically in the tens of thousands. Experimental setups were simulated to compare the performance of outlier detection methods in unsupervised versus semisupervised mode and uni-modal versus multi-modal data distributions. The performance of outlier detection methods based on dimension reduction is compared, and a discussion on using k-NN distance in high dimensional data is also provided. Analysis through experimental comparison in various environments can provide insights into the application of outlier detection methods in high dimensional data.

2

Signal-piloted processing metaheuristic optimization and wavelet decomposition based elucidation of arrhythmia for mobile healthcare

Qaisar Saeed Mian, Khan Sibghatullah I., Dallet Dominique, Tadeusiewicz Ryszard, Pławiak Paweł

Biocybernetics and Biomedical Engineering

|

2022

|

Vol. 42, no. 2

681--694

EN

The next generation healthcare systems will be based on the cloud connected wireless biomedical wearables. The key performance indicators of such systems are the compression, computational efficiency, transmission and power effectiveness with precision. The electrocardiogram (ECG) signals processing based novel technique is presented for the diagnosis of arrhythmia. It employs a novel mix of the Level-Crossing Sampling (LCS), Enhanced Activity Selection (EAS) based QRS complex selection, multirate processing, Wavelet Decomposition (WD), Metaheuristic Optimization (MO), and machine learning. The MIT-BIH dataset is used for experimentation. Dataset contains 5 classes namely, ‘‘Atrial premature contraction”, ‘‘premature ventricular contraction”, ‘‘right bundle branch block”, ‘‘left bundle branch block” and ‘‘normal sinus”. For each class, 450 cardiac pulses are collected from 3 different subjects. The performance of Marine Predators Algorithm (MPA) and Artificial Butterfly Optimization Algorithm (ABOA) is investigated for features selection. The selected features sets are passed to classifiers that use machine learning for an automated diagnosis. The performance is tested by using multiple evaluation metrics while following the 10-fold cross validation (10-CV). The LCS and EAS results in a 4.04-times diminishing in the average count of collected samples. The multirate processing lead to a more than 7-times computational effectiveness over the conventional fix-rate counter parts. The respective dimension reduction ratios and classification accuracies, for the MPA and ABOA algorithms, are 29.59-times & 22.19-times and 98.38% & 98.86%.

3

A new low SNR underwater acoustic signal classification method based on intrinsic modal features maintaining dimensionality reduction

Ju Yang, Wei Zhengxian, Li Huangfu, Feng Xiao

Polish Maritime Research

|

2020

|

nr 2

187--198

EN

The classification of low signal-to-noise ratio (SNR) underwater acoustic signals in complex acoustic environments and increasingly small target radiation noise is a hot research topic. . This paper proposes a new method for signal processing—low SNR underwater acoustic signal classification method (LSUASC)—based on intrinsic modal features maintaining dimensionality reduction. Using the LSUASC method, the underwater acoustic signal was first transformed with the Hilbert-Huang Transform (HHT) and the intrinsic mode was extracted. the intrinsic mode was then transformed into a corresponding Mel-frequency cepstrum coefficient (MFCC) to form a multidimensional feature vector of the low SNR acoustic signal. Next, a semi-supervised fuzzy rough Laplacian Eigenmap (SSFRLE) method was proposed to perform manifold dimension reduction (local sparse and discrete features of underwater acoustic signals can be maintained in the dimension reduction process) and principal component analysis (PCA) was adopted in the proces of dimension reduction to define the reduced dimension adaptively. Finally, Fuzzy C-Means (FCMs), which are able to classify data with weak features was adopted to cluster the signal features after dimensionality reduction. The experimental results presented here show that the LSUASC method is able to classify low SNR underwater acoustic signals with high accuracy.

4

A wavelet-based approach for business protocol discovery of web services from log files

Moudjari A., Kezzouli I., Talbi H., Draa A.

Bulletin of the Polish Academy of Sciences. Technical Sciences

|

2019

|

Vol. 67, nr 3

535--546

EN

Recently, business protocol discovery has taken more attention in the field of web services. This activity permits a better description of the web service by giving information about its dynamics. The latter is not supported by theWSDL language which concerns only the static part. The problem is that the only information available to construct the dynamic part is the set of log files saving the runtime interaction of the web service with its clients. In this paper, a new approach based on the Discrete Wavelet Transformation (DWT) is proposed to discover the business protocol of web services. The DWT allows reducing the problem space while preserving essential information. It also overcomes the problem of noise in the log files. The proposed approach has been validated using artificially-generated log files.

5

Dimension reduction for objects composed of vector sets

Szemenyei M., Vajda F.

International Journal of Applied Mathematics and Computer Science

|

2017

|

Vol. 27, no. 1

169--180

EN

Dimension reduction and feature selection are fundamental tools for machine learning and data mining. Most existing methods, however, assume that objects are represented by a single vectorial descriptor. In reality, some description methods assign unordered sets or graphs of vectors to a single object, where each vector is assumed to have the same number of dimensions, but is drawn from a different probability distribution. Moreover, some applications (such as pose estimation) may require the recognition of individual vectors (nodes) of an object. In such cases it is essential that the nodes within a single object remain distinguishable after dimension reduction. In this paper we propose new discriminant analysis methods that are able to satisfy two criteria at the same time: separating between classes and between the nodes of an object instance. We analyze and evaluate our methods on several different synthetic and real-world datasets.

6

An algorithm for reducing the dimension and size of a sample for data exploration procedures

Kulczycki P., Łukasik S.

International Journal of Applied Mathematics and Computer Science

|

2014

|

Vol. 24, no. 1

133--149

EN

The paper deals with the issue of reducing the dimension and size of a data set (random sample) for exploratory data analysis procedures. The concept of the algorithm investigated here is based on linear transformation to a space of a smaller dimension, while retaining as much as possible the same distances between particular elements. Elements of the transformation matrix are computed using the metaheuristics of parallel fast simulated annealing. Moreover, elimination of or a decrease in importance is performed on those data set elements which have undergone a significant change in location in relation to the others. The presented method can have universal application in a wide range of data exploration problems, offering flexible customization, possibility of use in a dynamic data environment, and comparable or better performance with regards to the principal component analysis. Its positive features were verified in detail for the domain’s fundamental tasks of clustering, classification and detection of atypical elements (outliers).

7

Classification methods for high-dimensional genetic data

Kalina J.

Biocybernetics and Biomedical Engineering

|

2014

|

Vol. 34, no. 1

10--18

EN

Standard methods of multivariate statistics fail in the analysis of high-dimensional data. This paper gives an overview of recent classification methods proposed for the analysis of high-dimensional data, especially in the context of molecular genetics. We discuss methods of both biostatistics and data mining based on various background, explain their principles, and compare their advantages and limitations. We also include dimension reduction methods tailor-made for classification analysis and also such classification methods which reduce the dimension of the computation intrinsically. A common feature of numerous classification methods is the shrinkage estimation principle, which has obtained a recent intensive attention in high-dimensional applications.

8

Marginal Discriminant Projection for Coal Mine Safety Data Dimensionality Reduction

Zhao Z., Qian J., Cheng J.

Przegląd Elektrotechniczny

|

2012

|

R. 88, nr 9b

79-83

EN

Marginal Fisher Analysis (MFA) is a novel dimensionality reduction algorithm. However, the two nearest neighborhood parameters are difficult to select when constructing graphs. In this paper, we propose a nonparametric method called Marginal Discriminant Projection (MDP) which can solves the problem of parameters selection in MFA. Experiment on several benchmark datasets demonstrated the effectiveness of our proposed method, and appreciate performance was achieved when applying on coal mine safety data dimensionality reduction.

PL

W artykule zaproponowano nieparametryczna metodę nazwaną MDOP (marginal discriminant projection) która pomaga rozwiązać problem selekcji danych w algorytmie MFA (marginal Fisher analysis). Metodę zastosowano do redukcji danych w systemach bezpieczeństwa klopalni węglowych.

9

Curvilinear dimensionality reduction of data for gearbox condition monitoring

Bartkowiak A., Zimroz R.

Przegląd Elektrotechniczny

|

2012

|

R. 88, nr 10b

268-271

EN

Our aim is to explore the CCA (Curvilinear Component Analysis) as applied to condition monitoring of gearboxes installed in bucket wheel excavators working in field condition, with the general goal to elaborate a probabilistic model describing the condition of the machine gearbox. To do it we need (a) information on the shape (probability distribution) of the analyzed data, and (b) some reduction of dimensionality of the data (if possible). We compare (for real set of data gathered in field conditions) the 2D representations yielded by the CCA and PCA methods and state that they are different. Our main result is: The analyzed data set describing the machine in a good state is composed of two different subsets of different dimensionality thus can not be modelled by one common Gaussian distribution. This is a novel statement in the domain of gearbox data analysis.

PL

W pracy przedstawiono wyniki prac nad zastosowaniem CCA (Curvilinear Component Analysis - analiza komponentów krzywoliniowych) do nieliniowej redukcji wymiarowości danych wykorzystywanych do diagnostyki przekładni planetarnej stosowanej w układach napędowych koparki kołowej. Do oceny stanu technicznego niezbędne jest zbudowanie modelu pobabilistycznego zbioru cech diagnostycznych. Modelowanie danych wielowymiarowych (gęstości prawdopodobieństwa) dla wszystkich wymiarów jest trudne, i ze względu na istniejącą redundancję, nieuzasadnione, dlatego prowadzi się badania nad redukcją wymiarowości zbiorów cech diagnostycznych. W artykule porównujemy dwuwymiarowe reprezentacje zbioru cech uzyskane metodami CCA i PCA (analiza składowych głównych) wykazując różnice w uzyskanych wynikach. Głównym wynikiem pracy jest identyfikacja w przestrzeni cech diagnostycznych dla przekładni w stanie prawidłowym dwóch podzbiorów danych o różnej rzeczywistej wymiarowości zatem nie mogą być one modelowane za pomocą jednego modelu o charakterystyce gaussowskiej. Interpretacja tych podzbiorów wiąże się z występowaniem różnych obciążeń maszyny.

10

Dimension Reduction and Domains of Attraction of Nonlinear Dynamical Systems

Hochlenert D.

Machine Dynamics Research

|

2011

|

Vol. 35, No. 4

32-48

EN

The appropriate modeling of technical systems usually results in dynamical systems having many or even an infinite number of degrees of freedom. Moreover, nonlinearities play an important role in many applications, so that the arising systems of nonlinear differential equations are difficult to analyze. However, it is well known that the asymptotic behavior of some high dimensional systems can be described by corresponding systems of much smaller dimension. The present paper deals with the dimension reduction of nonlinear systems close to a bifurcation point. Using the ideas of normal form theory, the asymptotic dynamics of the system is extracted by a nonlinear coordinate transformation. The solutions of the reducedordsr system are analyzed analytically with respect to their stability and their domains of attraction. Furthermore, the inverse of the near-identity transformations is used to construct adapted Lyapunov functions for the original system to estimate the attractors of the solutions as well. The procedure is applied to the Duffing equation and the equations of motion of a railway wheelset and compared with numerical solutions.

11

Analysis of correlation based dimension reduction methods

Shin Y. J., Park C. H.

International Journal of Applied Mathematics and Computer Science

|

2011

|

Vol. 21, no 3

549-558

EN

Dimension reduction is an important topic in data mining and machine learning. Especially dimension reduction combined with feature fusion is an effective preprocessing step when the data are described by multiple feature sets. Canonical Correlation Analysis (CCA) and Discriminative Canonical Correlation Analysis (DCCA) are feature fusion methods based on correlation. However, they are different in that DCCA is a supervised method utilizing class label information, while CCA is an unsupervised method. It has been shown that the classification performance of DCCA is superior to that of CCA due to the discriminative power using class label information. On the other hand, Linear Discriminant Analysis (LDA) is a supervised dimension reduction method and it is known as a special case of CCA. In this paper, we analyze the relationship between DCCA and LDA, showing that the projective directions by DCCA are equal to the ones obtained from LDA with respect to an orthogonal transformation. Using the relation with LDA, we propose a new method that can enhance the performance of DCCA. The experimental results show that the proposed method exhibits better classification performance than the original DCCA.

12

A hybrid approach to dimension reduction in classification

Krawczak M., Szkatuła G.

Control and Cybernetics

|

2011

|

Vol. 40, no 2

527-551

EN

In this paper we introduce a hybrid approach to data series classification. The approach is based on the concept of aggregated upper and lower envelopes, and the principal components here called 'essential attributes', generated by multilayer neural networks. The essential attributes are represented by outputs of hidden layer neurons. Next, the real valued essential attributes are nominalized and symbolic data series representation is obtained. The symbolic representation is used to generate decision rules in the IF. . . THEN. . . form for data series classification. The approach reduces the dimension of data series. The efficiency of the approach was verified by considering numerical examples.

13

Mining Outliers in Correlated Subspaces for High Dimensional Data Sets

Leng J., Hong T-P.

Fundamenta Informaticae

|

2010

|

Vol. 98, nr 1

71-86

EN

Outlier detection in high dimensional data sets is a challenging data mining task. Mining outliers in subspaces seems to be a promising solution, because outliers may be embedded in some interesting subspaces. Searching for all possible subspaces can lead to the problem called "the curse of dimensionality". Due to the existence of many irrelevant dimensions in high dimensional data sets, it is of paramount importance to eliminate the irrelevant or unimportant dimensions and identify interesting subspaces with strong correlation. Normally, the correlation among dimensions can be determined by traditional feature selection techniques or subspace-based clustering methods. The dimension-growth subspace clustering techniques can find interesting subspaces in relatively lower dimension spaces, while dimension-reduction approaches try to group interesting subspaces with larger dimensions. This paper aims to investigate the possibility of detecting outliers in correlated subspaces. We present a novel approach by identifying outliers in the correlated subspaces. The degree of correlation among dimensions is measured in terms of the mean squared residue. In doing so, we employ a dimension-reduction method to find the correlated subspaces. Based on the correlated subspaces obtained, we introduce another criterion called "shape factor" to rank most important subspaces in the projected subspaces. Finally, outliers are distinguished from most important subspaces by using classical outlier detection techniques. Empirical studies show that the proposed approach can identify outliers effectively in high dimensional data sets.

14

Random projection RBF nets for multidimensional density estimation

Skubalska-Rafajłowicz E.

International Journal of Applied Mathematics and Computer Science

|

2008

|

Vol. 18, no 4

455-464

EN

The dimensionality and the amount of data that need to be processed when intensive data streams are observed grow rapidly together with the development of sensors arrays, CCD and CMOS cameras and other devices. The aim of this paper is to propose an approach to dimensionality reduction as a first stage of training RBF nets. As a vehicle for presenting the ideas, the problem of estimating multivariate probability densities is chosen. The linear projection method is briefly surveyed. Using random projections as the first (additional) layer, we are able to reduce the dimensionality of input data. Bounds on the accuracy of RBF nets equipped with a random projection layer in comparison to RBF nets without dimensionality reduction are established. Finally, the results of simulations concerning multidimensional density estimation are briefly reported.