Czekanowski's diagram visualizes data so as to show how objects are related to each other. Often, obvious clusters of observations are directly visible. However, delimiting them exactly is not a straightforward task. We present here a development of the RMaCzek package that includes cluster identification in Czekanowski's diagrams.
PL
Czekanowski's diagram aims to present similarities within a statistical sample. Clear groupings of elements are most often visible in it. However, precisely determining the boundaries between clusters is not a trivial problem. In this work we present an extended version of the RMaCzek package that enables cluster analysis in Czekanowski's diagrams.
With the introduction of the Granular Computing paradigm into science, and of information granules in particular, the way of thinking about data has gradually changed. Both practitioners and scientists have stopped focusing on single data records and have begun to look at the analyzed data in a broader context, closer to the way people think. This kind of knowledge representation is expressed, in particular, in approaches based on linguistic modelling or fuzzy techniques such as fuzzy clustering. An attempt to understand the potential of data as information granules is therefore especially important from the point of view of data research methodology. In this study, we present special cases of using an innovative method of representing the information potential of variables by means of information granules. In a series of numerical experiments based on both artificially generated data and ecological data on changes in bird arrival dates in the context of climate change, we demonstrate the effectiveness of the proposed approach using classical, non-fuzzy measures to build information granules.
PL
With the introduction of the granular computing paradigm into science, and of information granules in particular, the way of thinking about data gradually changed. Both practitioners and scientists stopped focusing on individual data records and began to look at the analyzed data in a broader context, closer to human thinking. This kind of knowledge representation is expressed, in particular, in approaches based on linguistic modelling or fuzzy techniques such as fuzzy clustering. An attempt to understand the potential of data as information granules is therefore especially important from the point of view of data research methodology. In this study we present special cases of using an innovative method of representing the information potential of variables by means of information granules. In a series of numerical experiments based on both artificially generated data and ecological data on changes in bird arrival dates in the context of climate change, we demonstrate the effectiveness of the proposed approach using classical, rather than fuzzy, measures for building information granules.
Over the last decade, researchers have investigated to what extent cross-project defect prediction (CPDP) shows advantages over traditional defect prediction settings. These works do not take the training and testing data of defect prediction from the same project; instead, dissimilar projects are employed. Selecting the proper training data plays an important role in terms of the success of CPDP. In this study, a novel clustering method called complexFuzzy is presented for selecting the training data of CPDP. The method reveals the most defective instances that the experimental predictors exploit in order to complete the training. To that end, a fuzzy-based membership is constructed on the data sets. Hence, overfitting (which is a crucial problem in CPDP training) is alleviated. The performance of complexFuzzy is compared to its 5 counterparts on 29 data sets by utilizing 4 classifiers. According to the obtained results, complexFuzzy is superior to other clustering methods in CPDP performance.
Estimating travel time is one of the most important processes in logistics as well as in everyday life. In transportation services in particular, efficient time management can be a competitive advantage and improves customer satisfaction, both of which translate readily into business success. Therefore, in this study we analyze various travel time estimation methods in combination with the well-known Fuzzy C-Means (FCM) clustering algorithm. The proposed FCM-based solution has significant advantages, allowing for the determination of the optimal travel time. In an extensive numerical experiment, we apply the proposed method to estimate the duration of taxi trips in New York. By dividing the city into fine-grained areas and including travel-time information in the analysis, we obtained a model that accurately forecasts the speed of taxi travel. We also consider various competing approaches to building such a model.
This work presents two variants of a fuzzy clustering approach dedicated to determining the antecedents of the rules of a fuzzy rule-based classifier. The main idea consists in adding further prototypes ('prototypes in between') to those previously obtained with the fuzzy c-means method (ordinary prototypes). The 'prototypes in between' are determined from pairs of ordinary prototypes, and an algorithm based on distances and densities is proposed for finding such pairs. The classification accuracy obtained with the presented clustering approaches was verified on six benchmark datasets and compared with two reference methods.
The article describes the use of fuzzy clustering for recognizing types of two-phase gas-liquid flows. The authors give a detailed description of the process of acquiring three-dimensional tomographic data, the so-called raw tomographic data, and of new methods for collecting, interpreting and statistically processing data of this type. In addition, the article describes basic issues in fuzzy logic and fuzzy clustering, such as determining the vector of significant features or the operating principles of the fuzzy classifier (FCM), with reference to the specific kind of data used in the research. To justify the choice of fuzzy clustering, the authors present the results of the experiments carried out, which confirmed that fuzzy algorithms are very well suited to studying phenomena of a highly dynamic nature, which two-phase gas-liquid flows undoubtedly are.
EN
The paper describes the use of fuzzy clustering for the recognition of two-phase gas-liquid flows. The authors present a detailed description of the process of obtaining three-dimensional tomographic data, the so-called raw tomographic data, and new methods of collecting, interpreting and statistically processing such data. In addition, the article describes key issues in the field of fuzzy logic and fuzzy clustering, such as the determination of the primary features vector or the operating principle of the fuzzy classifier (FCM) for the specific type of data used in the study. To justify the choice of fuzzy clustering, the authors present the results of the experiments carried out, which confirmed that fuzzy algorithms are very well suited to the study of highly dynamic phenomena such as two-phase gas-liquid flows.
Fuzzy clustering is a popular unsupervised learning method used in cluster analysis. It allows a data point to belong to two or more clusters. Fuzzy c-means is the best-known method applied to cluster analysis; its shortcoming, however, is that the number of clusters needs to be predefined. This paper proposes a clustering approach based on Particle Swarm Optimization (PSO). The PSO approach determines the optimal number of clusters automatically with the help of a threshold vector. The algorithm first randomly partitions the data set into a preset number of clusters, and then uses a reconstruction criterion to evaluate the quality of the clustering results. The experiments conducted demonstrate that the proposed algorithm automatically finds the optimal number of clusters. Furthermore, to visualize the results, principal component analysis projection, conventional Sammon mapping, and fuzzy Sammon mapping were used.
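As an illustration of the baseline mentioned in the abstract above, the following is a minimal sketch of standard fuzzy c-means, in which every point receives a membership in every cluster and the number of clusters c must be supplied in advance. It is not the PSO-based method of the paper; the function name, the parameter defaults and the random initialization are assumptions made for this example.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Textbook fuzzy c-means (illustrative sketch, not the paper's PSO variant).

    X: (n, d) data array, c: number of clusters (must be chosen beforehand),
    m: fuzziness parameter (> 1). Returns (centers, memberships).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial membership matrix; each row sums to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Prototypes are membership-weighted means of the data.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Point-to-prototype distances; the floor avoids division by zero.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)).
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U
```

Each row of the returned membership matrix sums to 1, so a point lying between two groups gets a non-trivial share of both; taking the largest entry per row gives a crisp assignment when one is needed.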
The article describes decision-support algorithms for a hardware and software complex. The complex is used for several precision farming tasks: data mining, data processing, decision making and control of fertilizer application. It is designed to reduce costs and the environmental burden of potato cultivation. The complex is based on processing aerial photographs of potato fields.
The analysis of optokinetic nystagmus (OKN), one of the phenomena used in medical diagnosis, provides valuable information about the condition of the human visual system. Nystagmus consists of voluntary or involuntary eye movements made in response to stimuli that activate the optokinetic system. The electronystagmography (ENG) signal corresponding to the nystagmus has the form of a sawtooth waveform with fast components related to saccades. Accurate detection of the saccades in the ENG signal is the basis for further estimation of the nystagmus characteristics. The proposed algorithm is based on suitable filtering of the ENG signal, which yields a waveform with amplitude peaks corresponding to the fast eye rotations. The local maxima of the signal are recognized by means of fuzzy c-means clustering (FCM). The paper presents three variants of a saccade detection algorithm based on FCM. The performance of the procedures was investigated using artificial as well as real optokinetic nystagmus cycles. The proposed method provides high detection sensitivity and allows automatic and precise determination of saccade locations in the preprocessed ENG signal.
The electronystagmographic (ENG) signal with nystagmus has the form of a sawtooth wave consisting of a slow phase and a fast phase. The fast phase is a saccadic movement of the eyeball. Effective and accurate detection of saccades is of key importance in determining the character of the nystagmus. To detect saccade positions correctly, the ENG signal is filtered and the local maxima are detected using the fuzzy c-means method. The proposed algorithm is characterized by high sensitivity and allows automatic and precise localization of saccades in the ENG signal.
EN
The electronystagmography (ENG) signal corresponding to nystagmus has the form of a sawtooth waveform with fast components related to saccades. Accurate detection of saccades in the ENG signal is the basis for further estimation of the nystagmus characteristics. The proposed algorithm is based on suitable filtering of the ENG signal, which yields a waveform with amplitude peaks corresponding to the fast eye rotations. The local maxima of the signal are recognized by means of fuzzy c-means clustering (FCM). The proposed algorithm is highly sensitive and allows automatic and precise localization of the saccades in the ENG signal.
Some data sets contain clusters not in all dimensions, but only in subspaces. Known algorithms select attributes and identify clusters in subspaces. The paper presents a novel algorithm for subspace fuzzy clustering. Each data example has a fuzzy membership in the cluster. Each cluster is defined in a certain subspace, but the membership of the cluster's descriptors in the subspace (called the descriptor weight) is fuzzy (from the interval [0; 1]): the descriptors of a cluster can have partial membership in the subspace in which the cluster is defined. Thus the clusters are fuzzily defined in their subspaces. The clusters are defined by their centres, fuzziness and descriptor weights. The clustering algorithm is based on minimizing a criterion function. The paper is accompanied by experimental clustering results. The approach can be used to partition the input domain when extracting a rule base for neuro-fuzzy systems.
PL
Some data contain groups not in all dimensions, but in certain subspaces of the domain. The article presents an algorithm for clustering data in fuzzy subspaces. Each data example has a certain fuzzy membership in a group (cluster). Each cluster, in turn, is spanned in a certain subspace of the input domain. Clusters may be spanned in different subspaces. The clustering algorithm is based on minimizing a criterion function. As a result of its operation, the locations of the clusters, their fuzziness and the weights of their descriptors are determined. Results of experiments on clustering synthetic and real data are also presented.
Cluster analysis of the power consumed in successive time intervals can provide information both about characteristic segments of power consumption profiles and about the number of types of whole profiles. The article presents the results of running an implemented algorithm for clustering data vectors. The analysis covered measurement data on reactive power consumed in 15-minute intervals by individual customers over one year; both whole daily profiles and individual time points of power consumption within the day were clustered. When clustering whole profiles, the best partitions were obtained for a number of groups between 2 and 6; a wider spread in the number of classes was obtained when clustering points within the day. The developed algorithm was based on the fuzzy c-means clustering method with additional elements improving its functionality. The task was to partition a set of observations from a multidimensional space so that the data assigned to one group are more similar to each other than to any other group. The fuzzy approach allows individual elements to belong partially to many groups. Various kinds of clustering quality indexes were examined in the work. The results of the research allow us to conclude that the Fukuyama-Sugeno quality index and the fuzzy dispersion index prove the most useful.
EN
A cluster analysis of power consumption in successive time intervals can provide information about characteristic segments of power consumption profiles as well as the number of profile types. This paper presents the results of applying a clustering algorithm. The analysis covered measurements of reactive power consumption by individual customers; both whole daily profiles and individual time points of consumption were clustered. For clustering of profiles, the best results were obtained for between 2 and 6 clusters; a more dispersed range was observed when clustering individual time points. The adapted algorithm uses Fuzzy C-Means with additional elements improving its functionality. The paper presents test results for various cluster validity indexes. The experiments showed that the Fukuyama-Sugeno index and the Fuzzy Dispersion index are the most useful.
Short-term load forecasting (STLF) plays a decisive role in electric power system operation and planning. Accurate load forecasting not only reduces the generation costs of power systems, but also serves to maximize profit for participants in electricity markets. In recent years, power markets have grown more deregulated and competitive, adding to the complexity and uncertainties of load, and making it more difficult for conventional techniques to accurately forecast the load. To improve the accuracy of load forecasting, this paper suggests a hybrid method, called Gray-Fuzzy-Markov Chain Method (GFMCM), comprising three stages. In the first stage, daily load is forecasted by Gray model, with its training deviations classified, in a second stage, by fuzzy-set theory, and finally, fed into Markov chain model to predict future relative errors that might be supplied by the Gray model. The proposed approach has been verified by the historical data of power consumption in Ontario, PJM and Iranian electricity markets. The obtained forecasts by GFMCM proved to have better prediction properties compared to the other forecasting techniques, such as Gray models, specifically GM(1,1) and GM(1,2), ARIMA time series, wavelet-ARIMA and multi-layer perceptron (MLP) neural network.
PL
To improve the quality of energy consumption forecasting, the authors propose a hybrid method, GFMCM (Gray-Fuzzy-Markov Chain Method). In the first stage the load forecast is produced using the Gray model; fuzzy logic methods are then applied. The forecasting error is analyzed with the Markov method.
Advances in information technology are making data grow exponentially in all areas, and the quality of these huge amounts of data is a core problem. Data cleaning is an effective technology for solving data quality problems. This paper focuses on duplicate data cleaning techniques. It studies data quality at the architectural level, instance-level problems, multi-source and single-source problems, a duplicate-record cleaning application platform and the evaluation criteria. Within these studies, an improved detection method is proposed that combines a fuzzy clustering algorithm with the Levenshtein distance for data cleaning. It can accurately and quickly detect and remove duplicate raw data. The paper covers the detection process for similar duplicate records, the design of the main system framework, the implementation of the system function modules and an analysis of the results. The precision and recall rates are higher than those of several other data cleaning methods. These comparisons confirm the validity of the method. The experimental results show that the proposed method is effective in data detection and cleaning.
PL
The article proposes new data cleaning methods that take into account the number of instances, multiple sources, duplicate records and other evaluation criteria. The improved detection method uses a fuzzy clustering algorithm with the Levenshtein distance. In this way duplicate data rows are quickly detected and removed.
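The Levenshtein distance used in the entry above can be sketched as follows. This is only an illustration of the distance and of a simple threshold-based duplicate check; it does not reproduce the paper's fuzzy clustering step, and the function names and the 0.8 threshold are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def likely_duplicates(records, threshold=0.8):
    """Index pairs whose similarity reaches the (assumed) threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

The pairwise loop is quadratic in the number of records, which is one reason a clustering step is attractive in practice: it restricts comparisons to records that already fall in the same group.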
The paper presents an unsupervised approach to biomedical signal segmentation. The proposed segmentation process consists of several stages. In the first step, the state space of the signal is reconstructed. In the next step, the dimension of the reconstructed state space is reduced by projection onto the principal axes. The final step involves a fuzzy clustering method applied in the kernel feature space. In the experimental part, the fetal heart rate (FHR) signal is used. The FHR baseline and the acceleration or deceleration patterns are the main signal nonstationarities, but also the most clinically important signal features determined and interpreted in computer-aided analysis.
With the growing popularity of rough clustering, the soft computing research community is studying the relationships between rough and fuzzy clustering as well as their relative advantages. Both rough and fuzzy clustering are less restrictive than conventional clustering. Fuzzy clustering memberships are more descriptive than those of rough clustering. In some cases, descriptive fuzzy clustering may be advantageous, while in other cases it may lead to information overload. Many applications demand a combined approach that exploits the inherent strengths of each technique. Our objective is to examine the correlation between these two techniques. This paper provides an experimental description of how rough clustering results can be correlated with fuzzy clustering results. We illustrate the procedural steps for mapping fuzzy membership clustering to rough clustering. However, such a conversion is not always necessary, especially if one only needs lower and upper approximations. Experiments also show that descriptive fuzzy clustering may not always (particularly for high-dimensional objects) produce results as accurate as the direct application of rough clustering. We present an analysis of the results from both techniques.
Classification plays a very important role in medical diagnosis. This paper presents a fuzzy clustering method dedicated to classification algorithms. It focuses on two additional sub-methods that modify the prototypes obtained by clustering and lead to the final prototypes, which are used to create the fuzzy if-then rules of the classifier. The main goal of the work was to examine the performance of a classifier that uses such rules. Commonly used benchmark databases, including medical ones, were applied. To validate the results, each database was represented by 100 pairs of learning and testing subsets. The classification quality obtained was better than that of one of the best classifiers, the Lagrangian SVM, which suggests that the presented clustering with the additional sub-methods is appropriate for use in classification algorithms.
Classification methods can be divided into supervised and unsupervised methods. A supervised classifier requires a training set to estimate its parameters. When no training set is available, popular classifiers (e.g. K-Nearest Neighbors) cannot be used. Clustering methods are regarded as unsupervised classification methods. This paper presents an idea for unsupervised classification with popular classifiers. A fuzzy clustering method is used to create a learning set. The learning set includes only those patterns that best represent each class in the input dataset. The numerical experiment uses an artificial dataset as well as medical datasets (PIMA, Wisconsin Breast Cancer) and illustrates the usefulness of the proposed method.
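One way to picture the learning-set construction described in the abstract above is to keep only the patterns whose highest fuzzy membership is large and to label them with the corresponding cluster. The sketch below assumes a membership matrix U (one row per pattern, one column per cluster) is already available from some fuzzy clustering run; the selection rule, the function name and the 0.9 threshold are assumptions for illustration, since the abstract does not state how the best representatives are chosen.

```python
import numpy as np

def build_learning_set(X, U, min_membership=0.9):
    """Keep patterns that clearly belong to one cluster (assumed rule).

    X: (n, d) patterns, U: (n, c) fuzzy memberships with rows summing to 1.
    Returns the selected patterns and their cluster indices as class labels.
    """
    labels = U.argmax(axis=1)                   # most likely cluster per pattern
    confident = U.max(axis=1) >= min_membership  # drop ambiguous patterns
    return X[confident], labels[confident]
```

The resulting pairs of patterns and labels can then be passed to any supervised classifier as a training set; patterns with ambiguous memberships are simply left out.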
Fuzzy clustering is a well-established method for identifying the structure/fuzzy partitioning of Takagi-Sugeno (TS) fuzzy models. The clustering algorithms require choosing the fuzziness parameter m. Prior work in the area of pattern recognition shows that a suitable choice of m is application-dependent. Yet the default of m = 2 is commonly chosen. This paper examines the suitable choice of m for identifying TS models. The focus is on models that use the classifiers resulting from fuzzy clustering as multi-dimensional membership functions, or their projection and approximation. First, the differentiability and grouping properties of the fuzzy classifiers are analyzed to make a general recommendation of choosing m ∈ (1; 3). In addition, the effect of the cluster number c on the classification fuzziness is examined. Finally, requirements specific to TS modeling are introduced, which narrow down the suitable range for m. Building on algorithm analysis and four case studies (function approximation, a vehicle engine and an axial compressor application for nonlinear regression), it is demonstrated that choosing m ∈ (1; 1.3) for local and m ∈ (1; 1.5) for global estimation will typically provide good results.
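For orientation, the role of the fuzziness parameter m can be read from the standard fuzzy c-means membership formula. The notation below (u_ik for the membership of data point x_k in cluster i, v_i for the cluster prototypes, c for the number of clusters) is the usual textbook one and is an assumption here, since the abstract does not fix any symbols.

```latex
u_{ik} = \left[ \sum_{j=1}^{c}
  \left( \frac{\lVert x_k - v_i \rVert}{\lVert x_k - v_j \rVert} \right)^{\frac{2}{m-1}}
\right]^{-1}, \qquad m > 1 .
```

As m approaches 1 from above, the exponent 2/(m-1) grows without bound and the memberships approach crisp values of 0 or 1; as m grows large, the exponent tends to 0 and every membership tends to 1/c. Smaller values of m therefore give sharper partitions, which is the trade-off behind recommending a range for m.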
In this article, a novel concept is introduced that uses both unsupervised and supervised learning. For unsupervised learning, fuzzy clustering of microarray data is posed as a multiobjective optimization problem that simultaneously optimizes two internal fuzzy cluster validity indices to yield a set of Pareto-optimal clustering solutions. In this regard, a new multiobjective differential evolution based fuzzy clustering technique is proposed. Subsequently, for supervised learning, a fuzzy majority voting scheme together with a support vector machine is used to integrate the clustering information from all the solutions in the resulting Pareto-optimal set. The performance of the proposed clustering techniques is demonstrated on five publicly available benchmark microarray data sets. A detailed comparison is carried out with multiobjective genetic algorithm based fuzzy clustering, multiobjective differential evolution based fuzzy clustering, single-objective versions of differential evolution and genetic algorithm based fuzzy clustering, as well as the well-known fuzzy c-means algorithm. For the support vector machine, comparative studies of four different kernel functions are also reported. A statistical significance test has been performed to establish the statistical superiority of the proposed multiobjective clustering approach. Finally, a biological significance test has been carried out using a web-based gene annotation tool to show that the proposed integrated technique is able to produce biologically relevant clusters of coexpressed genes.