Metody reprezentacji atrybutów w zadaniach zgłębiania danych

Pelikant, A.

Article - details

Article title

Metody reprezentacji atrybutów w zadaniach zgłębiania danych

Authors

Pelikant A.

Title variants

Methods of attributes representation in data mining problems

Abstract

The paper discusses methods of attribute representation in the process of building data mining applications. It introduces the problem of reducing the number of dimensions of the analyzed problem space, that is, the number of significant attributes, and of standardizing the dimensions to a common range of variation. The main part of the paper discusses problems connected with the ways attributes are represented. These arise mainly from the need to discretize continuous data in models designed for non-continuous (discrete) data, and to give discrete data a continuous representation in models that require attributes of that type. Competing algorithms are presented, starting from "naive" ones and moving on to more elaborate approaches: bottom-up discretization, top-down discretization, and discretization based on Fisher's discriminant. The paper indicates where evaluation methods using the separability criterion or the ROC curve apply. The features of the presented solutions are illustrated with examples.
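To make the terms in the abstract concrete, the sketch below shows one common reading of standardizing an attribute to a common range (min-max scaling to [0, 1]) and of "naive" discretization (equal-width and equal-frequency binning). This record does not reproduce the paper's algorithms, so the function names, the choice of these particular techniques, and the sample data are assumptions made for illustration only.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to [0, 1] (assumed form of standardization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values: Sequence[float], k: int) -> List[int]:
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / k
    # The maximum falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values: Sequence[float], k: int) -> List[int]:
    """Assign bins by rank so each bin holds roughly the same number of values.

    Tied values may land in different bins in this simple version.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins


# Hypothetical sample attribute (ages), not taken from the paper.
ages = [18, 22, 25, 31, 40, 58]
scaled = min_max_scale(ages)
by_width = equal_width_bins(ages, 3)
by_frequency = equal_frequency_bins(ages, 3)
```

On this skewed sample the two naive schemes disagree: equal-width binning puts four of the six values into the first interval, while equal-frequency binning spreads them two per bin. Neither uses class labels, which is the usual motivation for supervised bottom-up or top-down discretization.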

Keywords

attribute representation; separability criterion; ROC curve; non-continuous data; discrete data; data mining

Publisher

Wyższa Szkoła Informatyki i Umiejętności

Journal

Zeszyty Naukowe Wyższej Szkoły Informatyki

Year

2008

Volume

Vol. 7, No. 1

Pages

76-97

Physical description

Bibliography: 16 items, figures.

Contributors

author

Pelikant A.

Wyższa Szkoła Informatyki, Katedra Inżynierskich Zastosowań Informatyki

Bibliography

[1] Kowalczyk A., Pelikant A.: Fuzzy clustering in relational databases, XII International Conference - System Modelling and Control SMC’2007
[2] Pelikant A.: Bazy danych w zastosowaniach praktycznych. Chapter: Kierunki rozwoju baz danych i technologii z nimi związanych, WSInf monograph
[3] Agata M., Pelikant A.: Support methods for weak learning algorithms – Adaboost, XII International Conference – System Modelling and Control SMC’2007
[4] Gomide F., Silva L., Yager R.: Participatory Learning Clustering, Forecasting and Evolutionary Fuzzy Systems, BISCSE’05 University of California, Berkeley, November 2005
[5] Ahmad A., Dey L.: A k-mean clustering algorithm for mixed numeric and categorical data, Data & Knowledge Engineering 63 (2007) 503–527 Elsevier
[6] Vapnik V.: The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[7] Vapnik V., Kotz S.: Estimation of Dependences Based on Empirical Data, Springer, 2006.
[8] Grąbczewski K., Duch W.: The separability of split value criterion. Proceedings of the 5th Conference on Neural Networks and Their Applications, pp. 201–208, Zakopane, Poland, 2000.
[9] Grąbczewski K., Duch W.: Heterogeneous forests of decision trees. Proceedings of International Conference on Artificial Neural Networks, Vol. 2415 of Lecture Notes in Computer Science, pp. 504–509. Springer, 2002.
[10] Dietterich T. G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1924, 1998.
[11] Bruha I., Berka P.: Discretization and fuzzification of numerical attributes in attribute-based learning. Fuzzy Systems in Medicine, Vol. 41 of Studies in Fuzziness and Soft Computing, pp. 112–138. Physica-Verlag (Springer), Heidelberg, 2000.
[12] Freund Y., Schapire R.: Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148–156, 1996.
[13] Freund Y., Schapire R.: A decision-theoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[14] John G. H., Langley P.: Estimating continuous distributions in Bayesian classifiers. Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, San Mateo, 1995. Morgan Kaufmann Publishers.
[15] Wilson D. R., Martinez T. R.: Improved heterogeneous distance functions, Journal of Artificial Intelligence Research, 11:1–34, 1997.
[16] Li B., Shen Y., Li B.: A New Algorithm for Computing the Minimum Hausdorff Distance Between Two Point Sets on A Line Under Translation, Information Processing Letters (2007).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BUJ5-0050-0093