Boruta - A System for Feature Selection

Kursa, M.B.; Jankowski, A.; Rudnicki, W.R.

Selected full texts from this journal

https://fi.episciences.org/

Abstract

Machine learning methods are often used to classify objects described by hundreds of attributes; in many such applications a large fraction of the attributes may be entirely irrelevant to the classification problem. Moreover, one usually cannot decide a priori which attributes are relevant. In this paper we present an improved version of the algorithm for identifying the full set of truly important variables in an information system. It is an extension of the random forest method that uses the importance measure generated by the original algorithm. It iteratively compares the importances of the original attributes with the importances of their randomised copies. We analyse the performance of the algorithm on several examples of synthetic data, as well as on a biologically important problem: the identification of sequence motifs that are important for the aptameric activity of short RNA sequences.
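
The comparison of original attributes with their randomised copies, as described in the abstract, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`abs_correlation`, `shadow_selection`), the parameters `rounds` and `min_hit_rate`, and the fixed hit-rate cut-off are all assumptions made for illustration. The importance score is also swapped: the paper uses the random forest importance measure, while this sketch uses absolute Pearson correlation with the labels so that it runs without third-party libraries.

```python
import random


def abs_correlation(column, labels):
    """Stand-in importance score: |Pearson correlation| with the labels.

    Assumption: the paper uses random forest importance; this proxy
    only keeps the sketch dependency-free.
    """
    n = len(column)
    mean_x = sum(column) / n
    mean_y = sum(labels) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(column, labels))
    var_x = sum((x - mean_x) ** 2 for x in column)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return abs(cov) / (var_x * var_y) ** 0.5


def shadow_selection(columns, labels, importance=abs_correlation,
                     rounds=50, min_hit_rate=0.9, seed=0):
    """Keep attributes that repeatedly beat the best randomised copy.

    columns: dict mapping attribute name -> list of values.
    The hit-rate cut-off is a simplification chosen for this sketch,
    not a decision rule taken from the paper.
    """
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(rounds):
        # Build one shuffled ("shadow") copy of every attribute and
        # record the highest importance any shadow achieves.
        shadow_scores = []
        for values in columns.values():
            shadow = list(values)
            rng.shuffle(shadow)
            shadow_scores.append(importance(shadow, labels))
        threshold = max(shadow_scores)
        # An original attribute scores a hit when it beats every shadow.
        for name, values in columns.items():
            if importance(values, labels) > threshold:
                hits[name] += 1
    return {name for name, count in hits.items()
            if count / rounds >= min_hit_rate}


# Toy data: 'signal' follows the label with 10% of entries flipped,
# 'noise' is constructed to have zero correlation with the label.
labels = [i % 2 for i in range(200)]
columns = {
    "signal": [1 - y if i % 10 == 0 else y for i, y in enumerate(labels)],
    "noise": [(i // 2) % 2 for i in range(200)],
}
selected = shadow_selection(columns, labels)
print(sorted(selected))
```

In this toy data `noise` can never beat a shadow, because its score is exactly zero, whereas `signal` is far more strongly correlated with the labels than a shuffled column would plausibly be. Shuffling destroys any relation between an attribute and the labels, so the best shadow score estimates how much importance can arise by chance alone.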

Keywords

random forests; Boruta algorithm; synthetic data sets; biological data

Publisher

IOS Press

Journal

Fundamenta Informaticae

Year

2010

Volume

Vol. 101, no. 4

Pages

271-285

Physical description

Bibliography: 25 items, charts.

Contributors

author

Kursa M.B.

author

Jankowski A.

author

Rudnicki W.R.

ICM, University of Warsaw, Pawińskiego 5a, Warsaw, Poland, W.Rudnicki@icm.edu.pl

Bibliography

[1] Bishop, C. M. (1996) Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
[2] Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Francisco.
[3] Vapnik, V. N. (1998) Statistical Learning Theory. Wiley, New York.
[4] Pawlak, Z. (1981) Information systems theoretical foundations, Inf. Syst. 6, 205-218.
[5] Breiman, L. (2001) Random Forests. Machine Learning, 45, 5-32. Also see the bibliography at: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_papers.htm
[6] Diaz-Uriarte, R., Alvarez de Andres, S (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7:3.
[7] Segal, M. R., Barbour, J. D., Grant, R. M. (2004) Relating HIV-1 Sequence Variation to Replication Capacity via Trees and Forests. Stat. Appl. Gen. Mol. Biol., 3:2.
[8] Rudnicki, W. R., Kierczak, M., Koronacki, J., Komorowski, H. J. (2006) A Statistical Method for Determining Importance of Variables in an Information System. In Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H. S., Slowinski, R. (Eds.): 5th International Conference, RSCTC 2006, Kobe, Japan, November 6-8, 2006, Proceedings. Lecture Notes in Computer Science 4259, 557-566.
[9] Qi, Y., Bar-Joseph, Z., Klein-Seetharaman, J. (2006) Evaluation of Different Biological Data and Computational Classification Methods for Use in Protein Interaction Prediction. Proteins, 63, 490-500.
[10] Guha, R., Jurs, P. C. (2004). Development of Linear, Ensemble, and Nonlinear Models for the Prediction and Interpretation of the Biological Activity of a Set of PDGFR Inhibitors. J. Chem. Inf. Comp. Sci., 44, 2179-2189.
[11] Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., Feuston, B. P. (2003). Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comp. Sci., 43, 1947-1958.
[12] Ward, M. M., Pajevic, S., Dreyfuss, J., Malley, J. D. (2006). Short-Term Prediction of Mortality in Patients with Systemic Lupus Erythematosus: Classification of Outcomes Using Random Forests. Arthritis and Rheumatism, 55, 74-80.
[13] Lunetta, K. L., Hayward, L. B., Segal, J., Eerdewegh, P. V. (2004). Screening Large-Scale Association Study Data: Exploiting Interactions Using Random Forests. BMC Genetics, 5:32.
[14] Bureau, A., Dupuis, J., Falls, K., Lunetta, K. L., Hayward, B., Keith, T. P., Eerdewegh, P. V. (2005) Identifying SNPs Predictive of Phenotype Using Random Forests. Gen. Epidem., 28:171-182.
[15] Statnikov, A., Wang, L., Aliferis, C. F. (2008) A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics, 9:319.
[16] Strobl, C., Boulesteix, A.-L., Zeileis, A., Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8:25.
[17] Strobl, C., Zeileis, A. (2008). Danger: High Power! - Exploring the Statistical Properties of a Test for Random Forest Variable Importance. Technical Report Number 017. Department of Statistics, University of Munich.
[18] Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., Zeileis, A. (2008). Conditional Variable Importance for Random Forests. Technical Report Number 23. Department of Statistics, University of Munich.
[19] R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org
[20] Liaw, A., Wiener, M., (2002). Classification and Regression by randomForest. R News 2(3), 18-22.
[21] Kierczak, M., Rudnicki, W. R., Koronacki, J., Komorowski, H. J. Kierczak M, Rudnicki WR, and Komorowski J. Construction of rough sets-based classifiers
[22] Ellington, A. D., Szostak, J.W. (1990) In vitro selection of RNA molecules that bind specific ligands. Nature, 346, 818-822.
[23] Lee, J. F., Hesselberth, J. R, Meyers, L. A, Ellington, A. D., (2004) Aptamer database. Nucleic Acids Res., 32, D95-D100.
[24] Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer, L. S., Tacker, M., Schuster, P. (1994) Fast Folding and Comparison of RNA Secondary Structures (The Vienna RNA Package). Monatsh. Chem., 125, 167-188.
[25] Jiang, F., Kumar, R. A., Jones, R. A., Patel, D. J. (1996) Structural basis of RNA folding and recognition in an AMP-RNA aptamer complex. Nature, 382, 183-186.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BUS8-0010-0072