Outlier detection aims to find data samples that differ significantly from the rest of the data. Various outlier detection methods have been proposed and shown to detect anomalies in many practical problems. However, in high dimensional data, conventional outlier detection methods often behave unexpectedly due to a phenomenon called the curse of dimensionality. In this paper, we compare and analyze outlier detection performance in various experimental settings, focusing on text data with dimensions typically in the tens of thousands. Experimental setups were simulated to compare the performance of outlier detection methods in unsupervised versus semi-supervised mode and on uni-modal versus multi-modal data distributions. The performance of outlier detection methods based on dimension reduction is compared, and a discussion on using k-NN distance in high dimensional data is also provided. This experimental comparison across environments offers insights into applying outlier detection methods to high dimensional data.
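The k-NN distance criterion discussed in the abstract can be sketched as follows — a minimal illustration of the idea (not the paper's experimental code), scoring each point by its distance to the k-th nearest neighbour:

```python
import math

def knn_outlier_score(points, k):
    """Score each point by the distance to its k-th nearest neighbour;
    larger scores indicate more outlying points."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# A tight cluster plus one distant point: the distant point scores highest.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_outlier_score(data, k=2)
print(scores.index(max(scores)))  # → 4
```

In very high dimensions (the tens of thousands the abstract mentions), pairwise distances tend to concentrate, which is exactly why such scores need the careful evaluation the paper undertakes.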
Using reanalysed atmospheric data and applying a data-driven multiscale approximation to non-stationary dynamical processes, we undertake a systematic examination of the role of memory and dimensionality in defining the quasi-stationary states of the troposphere over the recent decades. We focus on the role of teleconnections characterised by either zonally-oriented wave trains or meridional dipolar structures. We consider the impact of various strategies for dimension reduction based on principal component analysis, diagonalization and truncation. We include the impact of memory by consideration of Bernoulli, Markovian and non-Markovian processes. We a priori explicitly separate barotropic and baroclinic processes and then implement a comprehensive sensitivity analysis to the number and type of retained modes. Our results show the importance of explicitly mitigating the deleterious impacts of signal degradation through ill-conditioning and undersampling, in preference to simple strategies based on thresholds in terms of explained variance. In both hemispheres, the results obtained for the dominant tropospheric modes depend critically on the extent to which the higher-order modes are retained, the number of free model parameters to be fitted, and whether memory effects are taken into account. Our study identifies the primary role of the circumglobal teleconnection pattern in both hemispheres for Bernoulli and Markov processes, and the transient nature and zonal structure of the Southern Hemisphere patterns in relation to their Northern Hemisphere counterparts. For both hemispheres, overfitted models yield structures consistent with the major teleconnection modes (NAO, PNA and SAM), which give way to zonally-oriented wave trains when either memory effects are ignored or the dimension is reduced via diagonalization. Where baroclinic processes are emphasised, circumpolar wave trains are manifest.
It is well known that nonparametric regression techniques do not perform well in high dimensional regression. However, nonparametric regression is successful in one- or low-dimensional regression problems and is much more flexible than the parametric alternative. Hence, for high dimensional regression tasks one would like to reduce the regressor space to a lower dimension and then use nonparametric methods for curve estimation. A possible dimension reduction approach is Sliced Inverse Regression (Li 1991). It allows one to find a basis of a subspace in the regressor space which still carries important information for the regression. The vectors spanning this subspace are found with a technique similar to Principal Component Analysis and can be assessed via the eigenvalues that belong to these vectors. Asymptotic and simulation results for the eigenvalues and vectors are presented.
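A minimal sketch of the SIR procedure described above (the toy data and function names are illustrative, not from the paper): the predictors are standardized, the response is sliced, and the top eigenvectors of the weighted covariance of the slice means span the estimated subspace, with eigenvalues indicating how much regression information each direction carries.

```python
import numpy as np

def sir_directions(X, y, n_slices=5):
    """Sliced Inverse Regression (Li 1991), simplified sketch."""
    n, p = X.shape
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Whiten the predictors via the inverse square root of the covariance.
    w, V = np.linalg.eigh(cov)
    cov_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Z = (X - mu) @ cov_inv_sqrt
    # Slice by the order of y and average the whitened predictors per slice.
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    evals, evecs = np.linalg.eigh(M)
    # Back-transform; sort directions by decreasing eigenvalue.
    directions = cov_inv_sqrt @ evecs[:, ::-1]
    return evals[::-1], directions

# Toy example: y depends on X only through its first coordinate,
# so the leading SIR direction should align with e1.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] + 0.1 * rng.normal(size=500)
evals, dirs = sir_directions(X, y)
lead = dirs[:, 0] / np.linalg.norm(dirs[:, 0])
print(abs(lead[0]) > 0.9)
```

The eigenvalue gap between the leading and remaining directions is the quantity the asymptotic results above concern.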
Dimension reduction and feature selection are fundamental tools for machine learning and data mining. Most existing methods, however, assume that objects are represented by a single vectorial descriptor. In reality, some description methods assign unordered sets or graphs of vectors to a single object, where each vector is assumed to have the same number of dimensions, but is drawn from a different probability distribution. Moreover, some applications (such as pose estimation) may require the recognition of individual vectors (nodes) of an object. In such cases it is essential that the nodes within a single object remain distinguishable after dimension reduction. In this paper we propose new discriminant analysis methods that are able to satisfy two criteria at the same time: separating between classes and between the nodes of an object instance. We analyze and evaluate our methods on several different synthetic and real-world datasets.
The paper deals with the issue of reducing the dimension and size of a data set (random sample) for exploratory data analysis procedures. The concept of the algorithm investigated here is based on a linear transformation to a space of smaller dimension which retains, as far as possible, the distances between particular elements. Elements of the transformation matrix are computed using the metaheuristic of parallel fast simulated annealing. Moreover, data set elements which have undergone a significant change in location relative to the others are eliminated or reduced in importance. The presented method can have universal application in a wide range of data exploration problems, offering flexible customization, possibility of use in a dynamic data environment, and comparable or better performance with respect to principal component analysis. Its positive features were verified in detail for the domain's fundamental tasks of clustering, classification and detection of atypical elements (outliers).
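The core idea — searching the entries of a transformation matrix so that pairwise distances survive the projection — can be sketched with a plain single-chain simulated annealing loop (the paper uses parallel fast simulated annealing; this simplified version only illustrates the objective):

```python
import math
import random

def pairwise_dists(pts):
    return [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]

def project(A, pts):
    return [tuple(sum(a * x for a, x in zip(row, p)) for row in A) for p in pts]

def stress(A, pts, target):
    # Total squared distortion of pairwise distances under the map x -> A x.
    return sum((d - t) ** 2
               for d, t in zip(pairwise_dists(project(A, pts)), target))

def anneal_projection(pts, out_dim, steps=2000, temp=1.0, seed=0):
    rng = random.Random(seed)
    in_dim = len(pts[0])
    A = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    target = pairwise_dists(pts)
    cur = best = initial = stress(A, pts, target)
    best_A = [row[:] for row in A]
    for step in range(steps):
        t = temp * (1 - step / steps) + 1e-9
        i, j = rng.randrange(out_dim), rng.randrange(in_dim)
        old = A[i][j]
        A[i][j] += rng.gauss(0, 0.1)
        s = stress(A, pts, target)
        # Accept improvements always, worse moves with Boltzmann probability.
        if s < cur or rng.random() < math.exp((cur - s) / t):
            cur = s
            if cur < best:
                best, best_A = cur, [row[:] for row in A]
        else:
            A[i][j] = old
    return best_A, initial, best

# Points lying in a 2-D plane of 3-D space: an exactly
# distance-preserving 2-D projection exists.
pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
best_A, initial, final = anneal_projection(pts, out_dim=2)
print(final <= initial)  # → True
```

The paper's element-elimination step (down-weighting points whose location changed significantly) is omitted here for brevity.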
The appropriate modeling of technical systems usually results in dynamical systems having many or even an infinite number of degrees of freedom. Moreover, nonlinearities play an important role in many applications, so that the arising systems of nonlinear differential equations are difficult to analyze. However, it is well known that the asymptotic behavior of some high dimensional systems can be described by corresponding systems of much smaller dimension. The present paper deals with the dimension reduction of nonlinear systems close to a bifurcation point. Using the ideas of normal form theory, the asymptotic dynamics of the system is extracted by a nonlinear coordinate transformation. The solutions of the reduced-order system are analyzed analytically with respect to their stability and their domains of attraction. Furthermore, the inverse of the near-identity transformations is used to construct adapted Lyapunov functions for the original system to estimate the attractors of the solutions as well. The procedure is applied to the Duffing equation and the equations of motion of a railway wheelset and compared with numerical solutions.
The dimensionality and the amount of data that need to be processed when intensive data streams are observed grow rapidly together with the development of sensor arrays, CCD and CMOS cameras and other devices. The aim of this paper is to propose an approach to dimensionality reduction as a first stage of training RBF nets. As a vehicle for presenting the ideas, the problem of estimating multivariate probability densities is chosen. The linear projection method is briefly surveyed. Using random projections as the first (additional) layer, we are able to reduce the dimensionality of input data. Bounds on the accuracy of RBF nets equipped with a random projection layer in comparison to RBF nets without dimensionality reduction are established. Finally, the results of simulations concerning multidimensional density estimation are briefly reported.
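The random projection layer described above can be sketched in a few lines — a Gaussian random matrix applied before the RBF units (an illustrative sketch; the paper's accuracy bounds concern the full trained net):

```python
import math
import random

def random_projection(dim_in, dim_out, seed=0):
    """Gaussian random projection matrix, scaled so that squared norms
    are preserved in expectation (Johnson-Lindenstrauss style)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0, 1) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]

def apply_projection(R, x):
    return [sum(r * v for r, v in zip(row, x)) for row in R]

def rbf_unit(center, width, x):
    # A single Gaussian radial basis function acting on the projected input.
    d2 = sum((c - v) ** 2 for c, v in zip(center, x))
    return math.exp(-d2 / (2 * width ** 2))

# A 100-dimensional input is reduced to 5 dimensions by the random
# layer before reaching the RBF units.
x = [1.0] * 100
R = random_projection(100, 5)
z = apply_projection(R, x)
print(len(z))  # → 5
```

Because the projection approximately preserves distances, the Gaussian responses computed on the reduced inputs stay close to those computed in the original space, which is the intuition behind the bounds mentioned above.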
The classification of low signal-to-noise ratio (SNR) underwater acoustic signals in complex acoustic environments, with increasingly small target radiation noise, is a hot research topic. This paper proposes a new signal processing method, the low SNR underwater acoustic signal classification method (LSUASC), based on dimensionality reduction that maintains intrinsic mode features. Using the LSUASC method, the underwater acoustic signal was first transformed with the Hilbert-Huang Transform (HHT) and the intrinsic mode was extracted. The intrinsic mode was then transformed into the corresponding Mel-frequency cepstrum coefficients (MFCC) to form a multidimensional feature vector of the low SNR acoustic signal. Next, a semi-supervised fuzzy rough Laplacian Eigenmap (SSFRLE) method was proposed to perform manifold dimension reduction (local sparse and discrete features of underwater acoustic signals can be maintained in the dimension reduction process), and principal component analysis (PCA) was adopted in the process of dimension reduction to define the reduced dimension adaptively. Finally, Fuzzy C-Means (FCM), which is able to classify data with weak features, was adopted to cluster the signal features after dimensionality reduction. The experimental results presented here show that the LSUASC method is able to classify low SNR underwater acoustic signals with high accuracy.
Outlier detection in high dimensional data sets is a challenging data mining task. Mining outliers in subspaces seems to be a promising solution, because outliers may be embedded in some interesting subspaces. Searching for all possible subspaces can lead to the problem called "the curse of dimensionality". Due to the existence of many irrelevant dimensions in high dimensional data sets, it is of paramount importance to eliminate the irrelevant or unimportant dimensions and identify interesting subspaces with strong correlation. Normally, the correlation among dimensions can be determined by traditional feature selection techniques or subspace-based clustering methods. The dimension-growth subspace clustering techniques can find interesting subspaces in relatively lower dimension spaces, while dimension-reduction approaches try to group interesting subspaces with larger dimensions. This paper aims to investigate the possibility of detecting outliers in correlated subspaces. We present a novel approach by identifying outliers in the correlated subspaces. The degree of correlation among dimensions is measured in terms of the mean squared residue. In doing so, we employ a dimension-reduction method to find the correlated subspaces. Based on the correlated subspaces obtained, we introduce another criterion called "shape factor" to rank the most important subspaces in the projected subspaces. Finally, outliers are identified in the most important subspaces by using classical outlier detection techniques. Empirical studies show that the proposed approach can identify outliers effectively in high dimensional data sets.
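The mean squared residue used above as a correlation measure can be sketched as follows (a Cheng-and-Church-style residue over a submatrix; the toy matrices are illustrative, not from the paper):

```python
def mean_squared_residue(matrix):
    """Mean squared residue of a (sub)matrix: low values indicate
    strongly correlated rows and columns."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    total = sum(row_means) / n_rows
    res = sum((matrix[i][j] - row_means[i] - col_means[j] + total) ** 2
              for i in range(n_rows) for j in range(n_cols))
    return res / (n_rows * n_cols)

# An additive (perfectly correlated) pattern has zero residue;
# perturbing one entry raises it.
coherent = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
perturbed = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 9.0]]
print(mean_squared_residue(coherent))   # → 0.0
print(mean_squared_residue(perturbed))
```

Subspaces (row/column subsets) with a low residue are the "correlated subspaces" in which the approach then searches for outliers.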
Dimension reduction is an important topic in data mining and machine learning. Especially dimension reduction combined with feature fusion is an effective preprocessing step when the data are described by multiple feature sets. Canonical Correlation Analysis (CCA) and Discriminative Canonical Correlation Analysis (DCCA) are feature fusion methods based on correlation. However, they are different in that DCCA is a supervised method utilizing class label information, while CCA is an unsupervised method. It has been shown that the classification performance of DCCA is superior to that of CCA due to the discriminative power using class label information. On the other hand, Linear Discriminant Analysis (LDA) is a supervised dimension reduction method and it is known as a special case of CCA. In this paper, we analyze the relationship between DCCA and LDA, showing that the projective directions by DCCA are equal to the ones obtained from LDA with respect to an orthogonal transformation. Using the relation with LDA, we propose a new method that can enhance the performance of DCCA. The experimental results show that the proposed method exhibits better classification performance than the original DCCA.
Marginal Fisher Analysis (MFA) is a novel dimensionality reduction algorithm. However, its two nearest-neighborhood parameters are difficult to select when constructing graphs. In this paper, we propose a nonparametric method called Marginal Discriminant Projection (MDP) which solves the problem of parameter selection in MFA. Experiments on several benchmark datasets demonstrated the effectiveness of our proposed method, and appreciable performance was achieved when it was applied to dimensionality reduction of coal mine safety data.
The article proposes a nonparametric method called MDP (Marginal Discriminant Projection), which helps solve the parameter selection problem in the MFA (Marginal Fisher Analysis) algorithm. The method was applied to data reduction in coal mine safety systems.
Sometimes feature representations of measured individuals are better described by spherical coordinates than Cartesian ones. The author proposes to introduce a preprocessing step in LDA based on the arctangent transformation of spherical coordinates. This nonlinear transformation does not change the dimension of the data, but in combination with LDA it leads to a dimension reduction if the raw data are not linearly separable. The method is presented using various examples of real and artificial data.
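The idea of the preprocessing step can be sketched in the 2-D case (a hypothetical minimal example, not the author's code): classes that are separated radially, such as two concentric rings, are not linearly separable in Cartesian coordinates but become so after the transform.

```python
import math

def to_polar_arctan(points):
    """Map 2-D Cartesian points to (arctan(radius), angle)."""
    out = []
    for x, y in points:
        r = math.hypot(x, y)
        out.append((math.atan(r), math.atan2(y, x)))
    return out

# Two concentric rings: after the transform, a threshold on the first
# coordinate separates them, so LDA can find a discriminant direction.
inner = [(math.cos(t), math.sin(t)) for t in (0.0, 1.0, 2.0, 3.0)]
outer = [(3 * math.cos(t), 3 * math.sin(t)) for t in (0.5, 1.5, 2.5, 3.5)]
t_in = to_polar_arctan(inner)
t_out = to_polar_arctan(outer)
print(max(p[0] for p in t_in) < min(p[0] for p in t_out))  # → True
```

The arctangent also bounds the radial coordinate, which keeps far-away points from dominating the LDA scatter matrices.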
Our aim is to explore CCA (Curvilinear Component Analysis) as applied to condition monitoring of gearboxes installed in bucket wheel excavators working in field conditions, with the general goal of elaborating a probabilistic model describing the condition of the machine gearbox. To do this we need (a) information on the shape (probability distribution) of the analyzed data, and (b) some reduction of the dimensionality of the data (if possible). We compare (for a real set of data gathered in field conditions) the 2D representations yielded by the CCA and PCA methods and find that they are different. Our main result is that the analyzed data set describing the machine in a good state is composed of two different subsets of different dimensionality and thus cannot be modelled by one common Gaussian distribution. This is a novel statement in the domain of gearbox data analysis.
The paper presents results of work on applying CCA (Curvilinear Component Analysis) to nonlinear dimensionality reduction of data used for diagnostics of a planetary gearbox in the drive systems of a bucket wheel excavator. Assessing the technical condition requires building a probabilistic model of the set of diagnostic features. Modelling multidimensional data (probability densities) across all dimensions is difficult and, given the existing redundancy, unjustified, which motivates research on reducing the dimensionality of diagnostic feature sets. The article compares two-dimensional representations of the feature set obtained with the CCA and PCA (Principal Component Analysis) methods, demonstrating differences in the results. The main result is the identification, in the diagnostic feature space of a gearbox in good condition, of two data subsets of different intrinsic dimensionality; hence they cannot be modelled with a single Gaussian model. The interpretation of these subsets is related to different machine loads.
Standard methods of multivariate statistics fail in the analysis of high-dimensional data. This paper gives an overview of recent classification methods proposed for the analysis of high-dimensional data, especially in the context of molecular genetics. We discuss methods of both biostatistics and data mining with various backgrounds, explain their principles, and compare their advantages and limitations. We also include dimension reduction methods tailor-made for classification analysis, as well as classification methods which reduce the dimension of the computation intrinsically. A common feature of numerous classification methods is the shrinkage estimation principle, which has recently received intensive attention in high-dimensional applications.
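One common instance of the shrinkage estimation principle mentioned above is linear shrinkage of a sample covariance matrix toward a diagonal target (a generic sketch of the principle, not a method from the survey):

```python
def shrink_covariance(S, target_diag, lam):
    """Linear shrinkage toward a diagonal target:
    S* = (1 - lam) * S + lam * T.
    Shrinkage keeps the estimate well-conditioned when the dimension
    exceeds the sample size."""
    p = len(S)
    return [[(1 - lam) * S[i][j] + (lam * target_diag[i] if i == j else 0.0)
             for j in range(p)] for i in range(p)]

# Off-diagonal entries are damped, diagonal entries pulled to the target.
S = [[2.0, 1.0], [1.0, 2.0]]
shrunk = shrink_covariance(S, target_diag=[1.0, 1.0], lam=0.5)
print(shrunk)  # → [[1.5, 0.5], [0.5, 1.5]]
```

In high-dimensional classifiers such as regularized discriminant analysis, this shrunken estimate replaces the singular sample covariance.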
In this paper we introduce a hybrid approach to data series classification. The approach is based on the concept of aggregated upper and lower envelopes, and on principal components, here called 'essential attributes', generated by multilayer neural networks. The essential attributes are represented by the outputs of hidden layer neurons. Next, the real-valued essential attributes are nominalized and a symbolic data series representation is obtained. The symbolic representation is used to generate decision rules in the IF...THEN form for data series classification. The approach reduces the dimension of the data series. The efficiency of the approach was verified on numerical examples.
Recently, business protocol discovery has attracted increasing attention in the field of web services. This activity permits a better description of a web service by giving information about its dynamics, which is not captured by the WSDL language, covering only the static part. The problem is that the only information available to construct the dynamic part is the set of log files recording the runtime interaction of the web service with its clients. In this paper, a new approach based on the Discrete Wavelet Transform (DWT) is proposed to discover the business protocol of web services. The DWT reduces the problem space while preserving essential information, and it also overcomes the problem of noise in the log files. The proposed approach has been validated using artificially-generated log files.
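The space-reducing role the DWT plays above can be illustrated with one level of the Haar transform (a generic sketch; the paper's actual encoding of log files is not reproduced here): keeping only the approximation coefficients halves the sequence length while smoothing noise.

```python
import math

def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform: returns
    (approximation, detail) coefficients for an even-length signal."""
    s = math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

# A step-like sequence, e.g. message counts per time bin from a log file.
seq = [2.0, 2.0, 2.0, 2.0, 8.0, 8.0, 8.0, 8.0]
approx, detail = haar_dwt(seq)
print(len(approx))  # → 4
print(detail)       # → [0.0, 0.0, 0.0, 0.0]
```

Small detail coefficients (here exactly zero) correspond to noise-level variation that can be discarded; applying the transform recursively to the approximation compounds the reduction.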