The large data sample partitioning for diagnosis and pattern recognition

Subbotin, S.; Gromaszek, K.

Article details

Article title

The large data sample partitioning for diagnosis and pattern recognition

Authors

Subbotin S., Gromaszek K.

Title variants

Partycjonowanie dużej próby danych do diagnozowania i rozpoznawania wzorców

Abstracts

The article addresses the problem of automatically partitioning an original sample into training and test samples for building diagnostic and recognition models from precedents. We propose a new method for forming the training and test samples. It preserves the most important topological properties of the original sample in the generated sub-sample, and it requires neither loading the original sample into computer memory nor multiple passes over it. This significantly reduces the sample size and the demands on computer resources. The proposed method is based on cluster analysis of the sample followed by identification of the exemplars located on class borders. The method automatically determines the number and coordinates of the cluster centers. It also processes exemplars sequentially, so that the distances between all pairs of exemplars need not be stored. In addition, the method transforms the multi-dimensional coordinate set into a one-dimensional one, which is then discretized to improve the generalization properties of the data. Estimates of the time and space complexity of the proposed method are derived. They make it possible to judge whether a particular problem can be solved and to estimate the computer resources required. Software implementing the proposed sampling method has been developed. Experiments were conducted to study the method on a real problem. The experimental results allow the proposed method to be recommended for practical use.

The article presents a solution to the problem of automatically partitioning an original data sample into training and test samples, proposing a new method for forming them. The method preserves the most important topological properties of the original sample in the generated subsets and requires neither loading the whole sample into the computer's main memory nor repeated scans of it. This allows a significant reduction in the sample size as well as in the computer's hardware requirements. The proposed method uses cluster analysis of the sample followed by identification of the exemplars located on class borders. The method automatically determines the number and coordinates of the cluster centers. At the same time, it processes exemplars serially without storing the distances between all exemplars. The method transforms the set of multi-dimensional coordinates into a one-dimensional one, which is additionally discretized to increase the representativeness of the resulting data. Estimates of the time and space complexity of the proposed method are determined. They make it possible to judge whether a given problem can be solved and to estimate the computer resources required. In addition, software implementing the described method was developed. A series of simulation studies confirmed the usefulness of the method in practice.
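
The abstract names the ingredients of the method (a single sequential pass, a mapping from multi-dimensional coordinates to a discretized one-dimensional coordinate, and selection of exemplars near class borders) but this record does not give the formulas. The Python sketch below only illustrates that general idea and is not the authors' algorithm. The mapping (mean of min-max scaled features), the selection rule (first exemplar seen per interval and class), the requirement that feature bounds `lo` and `hi` are known in advance, and the names `to_interval` and `partition_stream` are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Sequence, Set, Tuple

Exemplar = Tuple[Sequence[float], Hashable]  # (feature vector, class label)


def to_interval(features: Sequence[float], lo: Sequence[float],
                hi: Sequence[float], n_bins: int) -> int:
    """Map a feature vector to a discretized 1-D coordinate (assumed mapping)."""
    # Min-max scale each feature with the known bounds, then average.
    scaled = [(x - a) / (b - a) if b > a else 0.0
              for x, a, b in zip(features, lo, hi)]
    position = sum(scaled) / len(scaled)
    # Clamp so that out-of-range values still fall into a valid interval.
    return max(0, min(int(position * n_bins), n_bins - 1))


def partition_stream(stream: Iterable[Exemplar], lo: Sequence[float],
                     hi: Sequence[float],
                     n_bins: int = 10) -> Tuple[List[int], Set[int]]:
    """One pass over the stream; memory grows with bins x classes, not sample size.

    Returns the indices chosen for the training sample and the set of
    intervals that contain more than one class (a crude border indicator).
    All other indices would form the test sample.
    """
    first_seen: Dict[Tuple[int, Hashable], int] = {}
    labels_in_bin: Dict[int, Set[Hashable]] = defaultdict(set)
    for index, (features, label) in enumerate(stream):
        interval = to_interval(features, lo, hi, n_bins)
        # Keep only the first exemplar of each class in each interval.
        first_seen.setdefault((interval, label), index)
        labels_in_bin[interval].add(label)
    train_indices = sorted(first_seen.values())
    border_bins = {b for b, labels in labels_in_bin.items() if len(labels) > 1}
    return train_indices, border_bins
```

Because only one index per interval and class is retained, no pairwise distance matrix is ever built, which is the property the abstract emphasizes. Any real use would need the authors' actual transformation and border criterion from the full paper.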

Keywords

original sample; diagnosis; pattern recognition

data sample; diagnosis; pattern recognition

Publisher

Wydawnictwo SIGMA-NOT

Journal

Elektronika : konstrukcje, technologie, zastosowania

Year

2013

Volume

Vol. 54, no. 8

Pages

17-20

Physical description

Bibliography: 15 items.

Contributors

author

Subbotin S.

Zaporizhzhya National Technical University, Zaporizhzhya, Ukraine

author

Gromaszek K.

Lublin University of Technology, Lublin, Poland

Bibliography

[1] Ruan D., Intelligent hybrid systems, fuzzy logic, neural networks, and genetic algorithms. Springer, Berlin, 2012.
[2] Sumathi S., Paneerselvam S., Computational intelligence paradigms, theory & applications using MATLAB. CRC Press, Boca Raton, 2010.
[3] Subbotin S., et al., Intelligent information technologies of automated diagnosis and pattern recognition systems design, monograph. Smith Company ltd., Kharkov, 2012 (in Russian).
[4] Chaudhuri A., Stenger H., Survey sampling theory and methods. Chapman & Hall, New York, 2005.
[5] Subbotin S., Methods of sampling based on exhaustive and evolutionary search. Automatic Control and Computer Sciences. Vol. 47, No. 3, 2013, 113-121.
[6] Lavrakas P., Encyclopedia of survey research methods. Sage Publications, Thousand Oaks, 2008.
[7] Bernard H., Social research methods, qualitative and quantitative approaches. Sage Publications, Thousand Oaks, 2006.
[8] Ghosh S., Multivariate analysis, design of experiments, and survey sampling. Marcel Dekker Inc., New York, 1999.
[9] Plutowski M., Selecting training exemplars for neural network learning, dissertation... doctor of philosophy in computer science and engineering. University of California, San Diego, 1994.
[10] Subbotin S., The training set quality measures for neural network learning. Optical Memory and Neural Networks (Information Optics). Vol. 19, No. 2, 2010, 126-139.
[11] Everitt B., Cluster Analysis. John Wiley & Sons ltd., Chichester, 2011.
[12] Abonyi J., Feil B., Cluster analysis for data mining and system identification. Birkhäuser, Basel, 2007.
[13] Boguslaev A., et al., Progressive technologies of modeling, optimization and intelligent automation of aviation engine lifecycle stages, monograph. Motor Sich JSC, Zaporozhye, 2009 (in Russian).
[14] Subbotin S., Boichenko K., Automatic system of vehicle detection and recognition on the image. Software products and systems. No. 1, 2010, 114-116 (in Russian).
[15] Dubrovin V., et al., The plant recognition on remote sensing results by the feed-forward neural networks. Intelligent engineering systems through artificial neural networks. Vol. 10, Smart engineering systems design, neural networks, fuzzy logic, evolutionary programming, data mining, and complex systems. ASME Press, New York, 2000.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-e2563017-4b46-4401-a167-059844e60196