An efficient fuzzy interactive multi-objective optimization method is proposed to select a sub-optimal subset of genes from large-scale gene expression data. It is based on the binary particle swarm optimization (BPSO) algorithm tuned by a chaotic method. The proposed method selects a sub-optimal gene subset with the fewest features that can accurately distinguish between two classes, e.g. normal and cancerous samples. The method is evaluated on several publicly available microarray and RNA-sequencing gene expression datasets such as leukemia, colon cancer, central nervous system, lung cancer, ovarian cancer, prostate cancer and RNA-seq lung disease. The results indicate that the proposed method can identify the minimum number of genes needed to achieve the highest accuracy, sensitivity and specificity in classification. Achieving 100% accuracy on six of the seven datasets investigated in this study demonstrates the high capacity of the proposed algorithm to find a sub-optimal subset of genes. This approach is useful in clinical applications for extracting the genes most influential on a disease and for finding a treatment procedure for the disease.
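The abstract does not specify how the chaotic tuning or the fuzzy interactive multi-objective step works, so the following is only a minimal sketch of a generic binary PSO for gene selection under stated assumptions: the inertia weight is driven by a logistic map (one common chaotic variant), the two objectives are collapsed into a weighted sum (accuracy minus a penalty on subset size), and accuracy comes from a leave-one-out nearest-centroid classifier. All names, parameters and the toy data are hypothetical.

```python
import math
import random

def loo_accuracy(X, y, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier on the selected genes."""
    genes = [g for g, bit in enumerate(mask) if bit]
    correct = 0
    for i in range(len(X)):
        dists = {}
        for label in set(y):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            centroid = [sum(r[g] for r in rows) / len(rows) for g in genes]
            dists[label] = sum((X[i][g] - c) ** 2 for g, c in zip(genes, centroid))
        correct += min(dists, key=dists.get) == y[i]
    return correct / len(X)

def fitness(X, y, mask, penalty=0.1):
    """Weighted-sum stand-in for the multi-objective trade-off (an assumption)."""
    if not any(mask):
        return 0.0
    return loo_accuracy(X, y, mask) - penalty * sum(mask) / len(mask)

def chaotic_bpso(X, y, n_particles=10, n_iter=30, c1=2.0, c2=2.0, seed=0):
    rng = random.Random(seed)
    n = len(X[0])
    pos = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n_particles)]
    vel = [[0.0] * n for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(X, y, p) for p in pos]
    best = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[best][:], pbest_fit[best]
    z = 0.7  # logistic-map state (assumed start value, away from fixed points)
    for _ in range(n_iter):
        z = 4.0 * z * (1.0 - z)  # logistic map; z serves as the inertia weight
        for i in range(n_particles):
            for d in range(n):
                v = (z * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-6.0, min(6.0, v))  # clamp velocity
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))  # sigmoid transfer
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(X, y, pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

# Toy data: 8 samples x 3 genes; only gene 0 separates the classes.
X = [[5.0, 1.0, 3.0], [5.2, 2.0, 3.1], [4.8, 1.5, 2.9], [5.1, 1.2, 3.3],
     [1.0, 1.2, 3.0], [1.1, 1.9, 3.2], [0.9, 1.4, 2.8], [1.2, 1.6, 3.1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
best_mask, best_fit = chaotic_bpso(X, y)
```

Here `best_mask` is a 0/1 list marking the selected genes. A real study would replace the weighted sum with the paper's own multi-objective formulation and use a stronger classifier.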
DNA microarray data are expected to be of great help in developing efficient diagnosis and tumor classification. However, because the number of instances is small compared with the large number of genes, many computational learning methods have difficulty selecting small gene subsets. To select significant genes from high-dimensional data for tumor classification, several researchers are now exploring microarray data with various gene selection methods. However, existing gene selection techniques do not agree on the relevant gene subsets that improve classification accuracy. This motivates us to devise a new hybrid gene selection method that helps eliminate misleading genes and classify a disease correctly in less computational time. The proposed method consists of two stages. In the first stage, the EGS method, which uses a multi-layer approach and an f-score approach, is applied to filter noisy and redundant genes from the dataset. In the second stage, an adaptive genetic algorithm (AGA) works as a wrapper to identify, from the reduced datasets produced by EGS, significant gene subsets that can contribute to detecting cancer or tumors. The AGA uses support vector machine (SVM) and Naïve Bayes (NB) classifiers as the fitness function to select highly discriminating genes and maximize classification accuracy. The experimental results show that the proposed framework supports a significant reduction of cardinality and outperforms state-of-the-art gene selection methods in terms of accuracy and number of selected genes.
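The filter stage can be illustrated with a small sketch. The abstract does not define its f-score or the multi-layer part of EGS, so this assumes the commonly used two-class F-score (between-class separation of the means divided by the sum of within-class sample variances) and shows only the ranking step. The function names and toy data are hypothetical, and the AGA wrapper stage is not shown.

```python
from statistics import mean, variance

def f_score(pos, neg):
    """Two-class F-score of one gene from its values in each class."""
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1 denominator)
    return between / within if within > 0 else 0.0

def filter_genes(X, y, keep):
    """Return the indices of the `keep` genes with the highest F-score.

    X is a list of samples (each a list of expression values); y holds 0/1 labels.
    """
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        pos = [row[g] for row, label in zip(X, y) if label == 1]
        neg = [row[g] for row, label in zip(X, y) if label == 0]
        scores.append(f_score(pos, neg))
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return ranked[:keep]

# Toy data: 6 samples x 3 genes; only gene 0 differs between the classes.
X = [[5.0, 1.0, 3.0], [5.2, 2.0, 3.1], [4.8, 1.5, 2.9],
     [1.0, 1.2, 3.0], [1.1, 1.9, 3.2], [0.9, 1.4, 2.8]]
y = [1, 1, 1, 0, 0, 0]
selected = filter_genes(X, y, keep=1)
```

In a two-stage pipeline of this kind, the reduced gene list returned by the filter would be passed to the wrapper, which searches over subsets using classifier accuracy as its fitness.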
A major problem in applying DNA microarrays to classification is the dimensionality of the dataset. Recently we proposed a gene selection method based on Partial Least Squares (PLS) for finding the best genes for classification. The new idea is to use PLS not only as a multiclass approach but also to construct several binary selections using one-versus-rest and one-versus-one approaches. Ranked gene lists are highly unstable in the sense that a small change in the data set often leads to a large change in the resulting ordered list. In this article, we assess the stability of our approaches. We compare the variability of the ordered lists obtained from the proposed methods with those from the well-known Recursive Feature Elimination (RFE) method and the classical t-test method. This paper focuses on the effective identification of informative genes. As a result, a new strategy for finding a small subset of significant genes is designed. Our results on real cancer data show that our approach achieves a very high accuracy rate for different combinations of classification methods while at the same time giving very stable feature rankings.
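One way to make the notion of ranking stability concrete is sketched below. The abstract does not say which stability measure the authors use, so this assumes a simple one: rank genes by the absolute Welch t-statistic (standing in for the classical t-test method), recompute the top-k list on every leave-one-out subset of the samples, and report the mean pairwise Jaccard overlap of those lists. All names and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def t_statistic(a, b):
    """Absolute Welch t-statistic of one gene across the two classes."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0

def top_k(X, y, k):
    """Set of the k gene indices with the largest |t|."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 1]
        b = [row[g] for row, lab in zip(X, y) if lab == 0]
        scores.append(t_statistic(a, b))
    order = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return set(order[:k])

def stability(X, y, k):
    """Mean pairwise Jaccard overlap of top-k lists over leave-one-out subsets."""
    lists = []
    for i in range(len(X)):
        lists.append(top_k(X[:i] + X[i + 1:], y[:i] + y[i + 1:], k))
    overlaps = [len(p & q) / len(p | q) for p, q in combinations(lists, 2)]
    return mean(overlaps)

# Toy data: 8 samples x 3 genes; only gene 0 separates the classes.
X = [[5.0, 1.0, 3.0], [5.2, 2.0, 3.1], [4.8, 1.5, 2.9], [5.1, 1.2, 3.3],
     [1.0, 1.2, 3.0], [1.1, 1.9, 3.2], [0.9, 1.4, 2.8], [1.2, 1.6, 3.1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
score = stability(X, y, k=1)
```

A value near 1 means the top-k list barely changes when one sample is removed, and a value near 0 means the list is almost entirely replaced. The same harness could wrap any ranking method, such as PLS-based or RFE rankings, for a side-by-side comparison.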
The paper presents a hierarchical approach to the selection of genes responsible for cancer. The method consists of two stages. In the first stage, 8 different methods are used to rank genes by their discriminative ability, including 2 methods based on a linear SVM network, the Fisher discriminant, correlation analysis of the data, and statistical hypothesis tests (3 variants of the Kolmogorov-Smirnov method and the Wilcoxon test). In the second stage, based on the statistical results of selecting the 100 best genes chosen by each method, common features are sought; these are treated as the optimal features that best differentiate data samples belonging to different cancer classes. The paper concentrates on the results of numerical experiments and their analysis for three cancer cases: leukemia, prostate cancer and lung cancer. It is shown that the proposed approach yields good separation of different cancer types, visible both in the graphical image of the expression matrix distribution and in numerical measures of separation quality.
EN
The paper proposes a hierarchical approach to selecting the optimal set of genes for cancer recognition on the basis of gene expression microarrays. In the first stage, 8 different gene selection methods are applied to the gene expression microarray. They include the linear Support Vector Machine, the Fisher discriminant ratio, correlation analysis and statistical hypothesis tests (Kolmogorov-Smirnov, Wilcoxon-Mann-Whitney). On the basis of the statistical results of each selection method, the 100 most discriminative genes (the genes appearing most often in the selected set) are selected first. Then, in the second stage, the genes selected by all methods are compared. Only the genes selected simultaneously by all methods are chosen. In this way a small number of genes associated with the given cancer type is selected. Numerical experiments performed for different types of cancer (prostate cancer, lung cancer, leukemia) have demonstrated the efficiency of the proposed approach. The PCA distribution of the data and the distance measures associated with PCA have shown that the selected genes discriminate different cancer types very well. The graphical representation of the data also shows a significant improvement in the recognition ability of the selected genes.
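The two-stage idea (rank with several methods, then keep only genes that all methods agree on) can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: it uses three simple scorers (a Fisher discriminant ratio, absolute Pearson correlation with the class label, and a Mann-Whitney-style separation score) rather than the paper's 8 methods, and a top-k cut-off of 2 instead of 100 to suit the toy data. All function names are hypothetical.

```python
from statistics import mean, pstdev, variance

def split(X, y, g):
    """Values of gene g in class 1 and class 0."""
    a = [row[g] for row, lab in zip(X, y) if lab == 1]
    b = [row[g] for row, lab in zip(X, y) if lab == 0]
    return a, b

def fisher_ratio(X, y, g):
    a, b = split(X, y, g)
    denom = variance(a) + variance(b)
    return (mean(a) - mean(b)) ** 2 / denom if denom > 0 else 0.0

def abs_correlation(X, y, g):
    col = [row[g] for row in X]
    sx, sy = pstdev(col), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(col), mean(y)
    cov = mean([(c - mx) * (lab - my) for c, lab in zip(col, y)])
    return abs(cov / (sx * sy))

def mann_whitney_score(X, y, g):
    """|U / (n_a * n_b) - 0.5|: 0.5 means the classes are fully separated."""
    a, b = split(X, y, g)
    u = sum((ai > bj) + 0.5 * (ai == bj) for ai in a for bj in b)
    return abs(u / (len(a) * len(b)) - 0.5)

def consensus_genes(X, y, rankers, k):
    """Genes that appear in the top-k list of every ranking method."""
    genes = range(len(X[0]))
    common = set(genes)
    for score in rankers:
        top = sorted(genes, key=lambda g: score(X, y, g), reverse=True)[:k]
        common &= set(top)
    return sorted(common)

# Toy data: 8 samples x 3 genes; only gene 0 separates the classes.
X = [[5.0, 1.0, 3.0], [5.2, 2.0, 3.1], [4.8, 1.5, 2.9], [5.1, 1.2, 3.3],
     [1.0, 1.2, 3.0], [1.1, 1.9, 3.2], [0.9, 1.4, 2.8], [1.2, 1.6, 3.1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
rankers = [fisher_ratio, abs_correlation, mann_whitney_score]
selected = consensus_genes(X, y, rankers, k=2)
```

Intersecting the lists is deliberately strict: a gene survives only if every method ranks it highly, which is why the final set is much smaller than any single top-k list.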