Abstract
We show that the Monte Carlo feature selection algorithm for supervised classification proposed by Dramiński et al. (2008) is not biased towards features with many categories (levels or values). The algorithm, later extended to discover interdependencies between features, is surprisingly simple. It has been used successfully on many biological data sets and on transactional data of commercial origin, and it has never revealed any bias of this type. Nevertheless, its alleged unbiasedness required closer scrutiny, which is provided here. Admittedly, the algorithm does reveal some bias coming from another source, but that bias is negligible. Hence our final claim is that the algorithm is practically unbiased and that the results it provides can be considered fully reliable.
Pages
199-211
Physical description
Bibliography: 18 items, charts.
Authors
author
author
author
author
author
- Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
Bibliography
- Archer, K.J. and Kimes, R.V. (2008) Empirical Characterization of Random Forest Variable Importance Measures. Comp Stat & Data Anal, 52(4), 2249-2260.
- Breiman, L. and Cutler, A. (2008) Random Forests - Classification/Clustering Manual. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
- Chrysostomou, K., Chen, S.Y. and Liu, X. (2008) Combining Multiple Classifiers for Wrapper Feature Selection. Int. J. Data Mining, Modelling and Management, 1, 91-102.
- Diaz-Uriarte, R. and de Andres, S.A. (2006) Gene Selection and Classification of Microarray Data Using Random Forest. BMC Bioinformatics, 7(3), doi:10.1186/1471-2105-7-3.
- Dramiński, M., Rada-Iglesias, A., Enroth, S., Wadelius, C., Koronacki, J. and Komorowski, J. (2008) Monte Carlo Feature Selection for Supervised Classification. Bioinformatics, 24(1), 110-117.
- Dramiński, M., Kierczak, M., Koronacki, J. and Komorowski, J. (2010) Monte Carlo feature selection and interdependency discovery in supervised classification. In: J. Koronacki, Z.W. Ras, S.T. Wierzchon, J. Kacprzyk, eds., Advances in Machine Learning, vol. II, Springer, 371-385.
- Dudoit, S. and Fridlyand, J. (2003) Classification in Microarray Experiments. In: T. Speed, ed., Statistical Analysis of Gene Expression Microarray Data, Chapman & Hall/CRC, 93-158.
- Hothorn, T., Hornik, K. and Zeileis, A. (2006) Unbiased Recursive Partitioning: A Conditional Inference Framework. J. Computational and Graphical Statistics, 15, 651-674.
- Kierczak, M., Ginalski, K., Dramiński, M., Koronacki, J., Rudnicki, W. and Komorowski, J. (2009) A Rough Set-Based Model of HIV-1 Reverse Transcriptase Resistome. Bioinformatics and Biology Insights, 3, 109-127. http://www.la-press.com/a-rough-set-based-model-of-hiv-1-reversetranscriptase-resistome-a1685
- Kierczak, M., Dramiński, M., Koronacki, J. and Komorowski, J. (2010) Computational analysis of molecular interaction networks underlying change of HIV-1 resistance to selected reverse transcriptase inhibitors. Bioinformatics and Biology Insights, 4, 137-146. http://www.la-press.com/computational-analysis-of-molecular-interactionnetworks-underlying-ch-article-a2395
- Kierczak, M. (2009) From Physicochemical Properties to Interdependency Networks: A Monte Carlo Approach to Modeling HIV-1 Resistome and Post-translational Modifications. PhD Thesis, Uppsala University (for an introduction to the thesis see Publications at http://www.kierczak.pl)
- Li, Y., Campbell, C. and Tipping, M. (2002) Bayesian Automatic Relevance Determination Algorithms for Classifying Gene Expression Data. Bioinformatics, 18(10), 1332-1339.
- Lu, C., Devos, A., Suykens, J.A.K., Arus, C. and Van Huffel, S. (2007) Bagging Linear Sparse Bayesian Learning Models for Variable Selection in Cancer Diagnosis. IEEE Trans Inf Technol Biomed, 11, 338-347.
- Saeys, Y., Inza, I. and Larrañaga, P. (2007) A Review of Feature Selection Techniques in Bioinformatics. Bioinformatics, 23(19), 2507-2517.
- Strobl, C., Boulesteix, A.-L., Zeileis, A. and Hothorn, T. (2007) Bias in Random Forest Variable Importance Measures: Illustrations, Sources, and a Solution. BMC Bioinformatics, 8(25), doi:10.1186/1471-2105-8-25.
- Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T. and Zeileis, A. (2008) Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9(307), doi:10.1186/1471-2105-9-307.
- Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002) Diagnosis of multiple cancer types by nearest shrunken centroids of gene expressions. Proc Natl Acad Sci USA, 99, 6567-6572.
- Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2003) Class Prediction by Nearest Shrunken Centroids, with Applications to DNA Microarrays. Statistical Science, 18, 104-117.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-article-BATC-0007-0082