Uncovering Bivariate Interactions in High Dimensional Data Using Random Forests with Data Augmentation

Arevalillo, J. M.; Navarro, H.

Article - details

Article title

Uncovering Bivariate Interactions in High Dimensional Data Using Random Forests with Data Augmentation

Authors

Arevalillo J. M. , Navarro H.

Selected full texts from this journal

https://fi.episciences.org/

Abstracts

Random Forests (RF) is an ensemble technique for classification and regression that has become widely accepted in the bioinformatics community in the last few years. Its predictive strength, together with the information-rich utilities provided by its output, has made RF an efficient data mining tool for discovering patterns in high dimensional data. In this paper we propose a search strategy that exhaustively explores a subset of the input space using RF as the search engine. Our procedure takes the variables previously rejected by a sequential search procedure and uses the out-of-bag error rate of the ensemble, obtained when it is trained on an augmented data set, as the criterion for capturing hard-to-uncover bivariate patterns associated with an outcome variable. We show the performance of the procedure in some synthetic scenarios and give an application to a real microarray experiment to illustrate how it works for gene expression data.
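The search described in the abstract can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation. The abstract does not say how the data set is augmented, so the sketch assumes noise injection (one Gaussian-jittered copy of each observation). The forest is a simplified stand-in built from bagged randomized trees rather than a full Random Forests implementation. All names (`grow_tree`, `oob_error`, `rank_pairs`, `make_xor_data`, `noise_sd`, `n_copies`, and the `candidates` list standing for the variables rejected by an earlier sequential search) are hypothetical.

```python
import random
from itertools import combinations


def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def grow_tree(rows, labels, rng, depth=0, max_depth=8, n_thresholds=6):
    """Randomized tree: one random feature per node, best of a few thresholds."""
    majority = int(2 * sum(labels) >= len(labels))
    if depth == max_depth or len(set(labels)) == 1:
        return majority  # leaf: predicted class
    j = rng.randrange(len(rows[0]))
    best = None
    for t in rng.sample([r[j] for r in rows], min(n_thresholds, len(rows))):
        left = [y for r, y in zip(rows, labels) if r[j] <= t]
        right = [y for r, y in zip(rows, labels) if r[j] > t]
        if left and right:
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, t)
    if best is None:
        return majority  # no usable split on the sampled feature
    t = best[1]
    lo = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    hi = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    children = [
        grow_tree([r for r, _ in part], [y for _, y in part],
                  rng, depth + 1, max_depth, n_thresholds)
        for part in (lo, hi)
    ]
    return (j, t, children[0], children[1])


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree


def oob_error(X, y, pair, n_trees=20, n_copies=1, noise_sd=0.1, seed=0):
    """OOB error of the ensemble trained on a noise-augmented view of `pair`.

    Assumption of this sketch: the bootstrap is drawn over the original
    observations, and each drawn observation enters the bag together with
    its jittered copies.
    """
    rng = random.Random(seed)
    n = len(X)
    votes = [[0, 0] for _ in range(n)]
    for _ in range(n_trees):
        drawn = [rng.randrange(n) for _ in range(n)]
        rows, labels = [], []
        for i in drawn:
            base = [X[i][j] for j in pair]
            rows.append(base)
            labels.append(y[i])
            for _ in range(n_copies):  # augmentation by noise injection
                rows.append([v + rng.gauss(0.0, noise_sd) for v in base])
                labels.append(y[i])
        tree = grow_tree(rows, labels, rng)
        in_bag = set(drawn)
        for i in range(n):
            if i not in in_bag:  # only out-of-bag observations vote
                votes[i][predict(tree, [X[i][j] for j in pair])] += 1
    scored = [(v, y[i]) for i, v in enumerate(votes) if sum(v) > 0]
    wrong = sum(int(v[1] > v[0]) != label for v, label in scored)
    return wrong / len(scored)


def rank_pairs(X, y, candidates, **kwargs):
    """Exhaustively score every pair of candidate variables by OOB error."""
    return sorted((oob_error(X, y, pair, **kwargs), pair)
                  for pair in combinations(candidates, 2))


def make_xor_data(n=120, n_features=5, seed=1):
    """Synthetic data: the class depends only on the sign of x0 * x1."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n)]
    y = [int(row[0] * row[1] > 0) for row in X]
    return X, y
```

On the synthetic data from `make_xor_data`, calling `rank_pairs(X, y, range(5))` returns all pairs sorted by OOB error, so the interacting pair would be expected at or near the top even though neither of its variables is informative on its own. Drawing the bootstrap over original observations keeps jittered copies of an out-of-bag observation out of that tree's training set; this is a design choice of the sketch, not something stated in the abstract.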

Keywords

bivariate interactions; random forests; high dimensional data

Publisher

IOS Press

Journal

Fundamenta Informaticae

Year

2011

Volume

Vol. 113, no. 2

Pages

97-115

Physical description

Bibliography: 36 items, charts.

Contributors

author

Arevalillo J. M.

author

Navarro H.

Department of Statistics and Operational Research, UNED, Paseo Senda del Rey 9, 28040 Madrid, Spain, jmartin@ccia.uned.es

Bibliography

[1] Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., Levine, A.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, PNAS, 96, June 1999, 6745-6750.
[2] Ambroise, C., McLachlan, G. J.: Selection bias in gene extraction on the basis of microarray gene-expression data, PNAS, 99(10), May 2002, 6562-6566.
[3] Bø, T. H., Jonassen, I.: New feature subset selection procedures for classification of expression profiles, Genome Biology, 3(4), March 2002, 0017.1-0017.11.
[4] Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M., Yakhini, Z.: Tissue classification with gene expression profiles, Journal of Computational Biology, 7, 2000, 559-583.
[5] Berrar, D., Dubitzky, W.: Neural Plasma, IFIP International Federation for Information Processing, 217, 2006, 159-168.
[6] Bishop, C.: Training with noise is equivalent to Tikhonov regularization, Neural Comput, 7, 1995, 108-116.
[7] Blanco, R., Larrañaga, P., Inza, I., Sierra, B.: Gene Selection for cancer classification using wrapper approaches, International Journal of Pattern Recognition and Artificial Intelligence, 18(8), December 2004, 1373-1390.
[8] Breiman, L.: Random Forests, Machine Learning, 45(1), October 2001, 5-32.
[9] Breiman, L., Cutler, A., Liaw, A., Wiener, M.: Breiman and Cutler's random forests for classification and regression, http://CRAN.R-project.org/, 2008.
[10] Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees, Chapman & Hall, 1993.
[11] Brown, W., Gedeon, T., Groves, D.: Use of noise to augment training data: A neural network method of mineral-potential mapping in regions of limited known deposit examples, Natural Resources Research, 12, 2003, 141-152.
[12] Chilingaryan, A., Gevorgyan, N., Vardanyan, A., Jones, D., Szabo, A.: Multivariate approach for selecting sets of differentially expressed genes, Mathematical Biosciences, 176, 2002, 59-69.
[13] Chow, M., Moler, E., Mian, I.: Identifying marker genes in transcription profiling data using a mixture of feature relevance experts, Physiol. Genomics, 8(5), March 2001, 99-111.
[14] Conlin, A., Martin, E., Morris, A.: Data augmentation: an alternative approach to the analysis of spectroscopic data, Chemometrics and Intelligent Laboratory Systems, 44, 1998, 161-173.
[15] Díaz-Uriarte, R., Álvarez de Andrés, S.: Gene selection and classification of microarray data using random forest, BMC Bioinformatics, 7(3), January 2006.
[16] Dettling, M., Gabrielson, E., Parmigiani, G.: Searching for differentially expressed gene combinations, Genome Biology, 6, September 2005, R88.
[17] Dudoit, S., Yang, Y. H., Callow, M. J., Speed, T. P.: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments, Statistica Sinica, 12, 2002, 111-139.
[18] Genuer, R., Poggi, J.-M., Tuleau, C.: Random Forests: some methodological insights, Technical Report. INRIA, November 2008.
[19] Guan, Z., Zhao, H.: A semiparametric approach for marker gene selection based on gene expression data, Bioinformatics, 21(4), 2005, 529-536.
[20] Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene Selection for Cancer Classification using Support Vector Machines, Machine Learning, 46(1-3), January 2002, 389-422.
[21] Liaw, A., Wiener, M.: Classification and Regression by randomForest, R News, 8, 2002, 18-22.
[22] Lu, Y., Liu, P.-Y., Xiao, P., Deng, H.-W.: Hotelling's T² multivariate profiling for detecting differential expression in microarrays, Bioinformatics, 21(14), 2005, 3105-3113.
[23] Ng, V. W., Breiman, L.: Bivariate variable selection for classification problem, Technical Report. Department of Statistics. Berkeley, 2005, 1-22.
[24] Nilsson, R., Peña, J. M., Björkegren, J., Tegnér, J.: Detecting multivariate differentially expressed genes, BMC Bioinformatics, 8(150), May 2007.
[25] Pan, W.: A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments, Bioinformatics, 18(4), 2002, 546-554.
[26] Pepe, M. S., Longton, G., Anderson, G. L., Schummer, M.: Selecting Differentially Expressed Genes from Microarray Experiments, Biometrics, 59, March 2003, 133-142.
[27] Raviv, Y., Intrator, N.: Bootstrapping with noise: an effective regularization technique, Connection Science, 8(3), 1996, 355.
[28] Ruiz, R., Riquelme, J. C., Aguilar-Ruiz, J. S.: Incremental wrapper-based gene selection from microarray data for cancer classification, Pattern Recognition, 39(12), December 2006, 2383-2392.
[29] Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics, Bioinformatics, 23(19), 2007, 2507-2517.
[30] Sáiz-Abajo, M., Mevik, B., Segtnan, V., Næs, T.: Ensemble methods and data augmentation by noise addition applied to the analysis of spectroscopic data, Anal. Chim. Acta, 533, 2005, 147-159.
[31] Szabo, A., Boucher, K., Jones, D., Tsodikov, A. D.: Multivariate exploratory tools for microarray data analysis, Biostatistics, 4(4), 2003, 555-567.
[32] Thomas, J. G., Olson, J. M., Tapscott, S. J., Zhao, L. P.: An Efficient and Robust Statistical Modelling Approach to Discover Differentially Expressed Genes Using Genomic Expression Profiles, Genome Research, 11, 2001, 1227-1236.
[33] Tyekucheva, S., Chiaromonte, F.: Augmenting the bootstrap to analyze high dimensional genomic data, Test, 17(1), 2008, 1-18.
[34] Wang, S., Ethier, S.: A generalized likelihood ratio test to identify differentially expressed genes from microarray data, Bioinformatics, 20(1), 2004, 100-104.
[35] Yang, Y. H., Xiao, Y., Segal, M. R.: Identifying differentially expressed genes from microarray experiments via statistic synthesis, Bioinformatics, 21(7), 2005, 1084-1093.
[36] Zur, R., Jiang, Y., Pesce, L., Drukker, K.: Noise injection for training artificial neural networks: A comparison with weight decay and early stopping, Med. Phys, 36(10), 2009, 4810-4818.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BUS8-0022-0069