Article title

Boruta - A System for Feature Selection

Publication languages
EN
Abstracts
EN
Machine learning methods are often used to classify objects described by hundreds of attributes; in many such applications a large fraction of the attributes may be entirely irrelevant to the classification problem. Moreover, one usually cannot decide a priori which attributes are relevant. In this paper we present an improved version of an algorithm for identifying the full set of truly important variables in an information system. It is an extension of the random forest method and uses the importance measure generated by the original algorithm. It iteratively compares the importances of the original attributes with the importances of their randomised copies. We analyse the performance of the algorithm on several examples of synthetic data, as well as on a biologically important problem: the identification of sequence motifs that are important for the aptameric activity of short RNA sequences.
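The comparison of original attributes against their randomised copies can be illustrated with a minimal Python sketch. This is not the authors' implementation. As a loudly stated assumption, the random forest importance measure is replaced here by a simple stand-in (the absolute Pearson correlation with the class label), and the attribute removal and statistical test used by a full procedure are omitted. The names `abs_correlation`, `shadow_hits`, `signal`, `noise_a` and `noise_b` are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def abs_correlation(xs, ys):
    """Stand-in importance (assumption): |Pearson correlation| with the label.

    The paper uses the random forest importance measure instead.
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def shadow_hits(columns, labels, n_iter=50, seed=0):
    """Count how often each attribute beats the best randomised copy."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_iter):
        # Randomised copies: shuffling a column breaks its link to the label.
        shadow_scores = []
        for values in columns.values():
            shuffled = list(values)
            rng.shuffle(shuffled)
            shadow_scores.append(abs_correlation(shuffled, labels))
        threshold = max(shadow_scores)
        # An original attribute scores a hit when it beats every copy.
        for name, values in columns.items():
            if abs_correlation(values, labels) > threshold:
                hits[name] += 1
    return hits


# Synthetic data (hypothetical): one informative attribute, two pure noise.
data_rng = random.Random(1)
n = 200
labels = [data_rng.randint(0, 1) for _ in range(n)]
columns = {
    "signal": [y + data_rng.gauss(0, 0.5) for y in labels],
    "noise_a": [data_rng.gauss(0, 1) for _ in range(n)],
    "noise_b": [data_rng.gauss(0, 1) for _ in range(n)],
}
hits = shadow_hits(columns, labels)
print(hits)
```

Under these assumptions, an attribute that is truly related to the label should collect a hit in nearly every iteration, while irrelevant attributes should beat the best randomised copy only occasionally. A full procedure would turn the hit counts into an accept or reject decision with a statistical test.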
Publisher
Year
Pages
271-285
Physical description
Bibliography: 25 items, charts.
Authors
author
author
Bibliography
  • [1] Bishop, C.M. (1996) Neural Networks for Pattern Recognition. Clarendon Press, Oxford
  • [2] Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Francisco.
  • [3] Vapnik, V. N. (1998) Statistical Learning Theory. Wiley, New York.
  • [4] Pawlak, Z. (1981) Information systems theoretical foundations, Inf. Syst. 6, 205-218.
  • [5] Breiman, L. (2001) Random Forests. Machine Learning 45, 5-32. Also see the bibliography at: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_papers.htm
  • [6] Diaz-Uriarte, R., Alvarez de Andres, S (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7:3.
  • [7] Segal, M. R., Barbour, J. D., Grant, R. M. (2004) Relating HIV-1 Sequence Variation to Replication Capacity via Trees and Forests. Stat. Appl. Gen. Mol. Biol., 3:2.
  • [8] Rudnicki, W. R., Kierczak, M., Koronacki, J., Komorowski, H. J. (2006) A Statistical Method for Determining Importance of Variables in an Information System. In Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H. S., Slowinski, R. (Eds.): Lecture Notes in Computer Science 4259/2006, 5th International Conference, RSCTC 2006, Kobe, Japan, November 6-8, 2006, Proceedings, 557-566.
  • [9] Qi, Y., Bar-Joseph, Z., Klein-Seetharaman, J. (2006) Evaluation of Different Biological Data and Computational Classification Methods for Use in Protein Interaction Prediction. Proteins, 63, 490-500.
  • [10] Guha, R., Jurs, P. C. (2003). Development of Linear, Ensemble, and Nonlinear Models for the Prediction and Interpretation of the Biological Activity of a Set of PDGFR Inhibitors. J. Chem. Inf. Comp. Sci., 44, 2179-2189.
  • [11] Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., Feuston, B. P. (2003). Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comp. Sci., 43, 1947-1958.
  • [12] Ward, M. M., Pajevic, S., Dreyfuss, J., Malley, J. D. (2006). Short-Term Prediction of Mortality in Patients with Systemic Lupus Erythematosus: Classification of Outcomes Using Random Forests. Arthritis and Rheumatism, 55, 74-80.
  • [13] Lunetta, K. L., Hayward, L. B., Segal, J., Eerdewegh, P. V, (2004). Screening Large-Scale Association Study Data: Exploiting Interactions Using Random Forests. BMC Genetics, 5:32.
  • [14] Bureau, A., Dupuis, J., Falls, K., Lunetta, K. L., Hayward, B., Keith, T. P., Eerdewegh, P. V. (2005) Identifying SNPs Predictive of Phenotype Using Random Forests. Gen. Epidem., 28:171-182.
  • [15] Statnikov, A., Wang, L., Aliferis, C. F. (2008) A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 9:319
  • [16] Strobl, C., Boulesteix, A.-L., Zeileis, A., Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution BMC Bioinformatics, 8:25
  • [17] Strobl, C., Zeileis, A. (2008). Danger: High Power! - Exploring the Statistical Properties of a Test for Random Forest Variable Importance. Technical Report Number 017. Department of Statistics, University of Munich.
  • [18] Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., Zeileis, A. (2008). Conditional Variable Importance for Random Forests. Technical Report Number 23. Department of Statistics, University of Munich.
  • [19] R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org
  • [20] Liaw, A., Wiener, M., (2002). Classification and Regression by randomForest. R News 2(3), 18-22.
  • [21] Kierczak, M., Rudnicki, W. R., Koronacki, J., Komorowski, H. J. Kierczak M, Rudnicki WR, and Komorowski J. Construction of rough sets-based classifiers
  • [22] Ellington, A. D., Szostak, J. W. (1990) In vitro selection of RNA molecules that bind specific ligands. Nature, 346, 818-822.
  • [23] Lee, J. F., Hesselberth, J. R, Meyers, L. A, Ellington, A. D., (2004) Aptamer database. Nucleic Acids Res., 32, D95-D100.
  • [24] Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer, L. S., Tacker, M., Schuster, P. (1994) Fast Folding and Comparison of RNA Secondary Structures (The Vienna RNA Package). Monatsh. Chem. 125, 167-188.
  • [25] Jiang, F., Kumar, R. A., Jones, R. A., Patel, D. J. (1996) Structural basis of RNA folding and recognition in an AMP-RNA aptamer complex. Nature, 382, 183-186.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-article-BUS8-0010-0072