
Results found: 14

Search results
Searched for:
in keywords: text categorization
EN
This paper provides a comprehensive assessment of basic feature selection (FS) methods that originate from nature-inspired (NI) meta-heuristics; two well-known filter-based FS methods are also included for comparison. The performance of the considered methods is compared on four balanced, high-dimensional, real-world text data sets with regard to accuracy, the number of selected features, and computation time. This study differs from existing studies in the extent of its experimental analyses, which were performed under different circumstances in which the classifier, feature model, and term-weighting scheme were varied. The results of the extensive experiments indicate that basic NI algorithms produce slightly different results than filter-based methods for the text FS problem. However, filter-based methods often provide better results while using fewer features and less computation time.
EN
Malware, short for malicious software, is software created with the intent of damaging hardware systems, stealing data, or causing disruption in order to make money, protest something, or even wage war between governments. Malware is often spread through applications downloaded from download platforms. Nowadays it is highly probable to encounter malware when installing applications on a smartphone. It is therefore very important to have tools that detect malware before it is loaded onto a hardware system. There are three main approaches to malware detection: i) static, ii) dynamic, and iii) hybrid. The static approach analyzes the suspicious program without executing it. The dynamic approach, on the other hand, executes the program in a controlled environment and obtains information from the operating system at runtime. The hybrid approach, as its name implies, combines the two. Although the static approach may seem to have some disadvantages, it is highly preferred because of its lower cost. In this paper, our aim is to develop a static malware detection system by using text categorization techniques. To reach this goal, we apply text mining techniques such as feature extraction with bag-of-words, n-grams, etc. to the manifest content of suspicious programs, and then apply text classification methods to detect malware. Our experimental results show that our approach is capable of detecting malicious applications with an accuracy between 94.0% and 99.3%.
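The pipeline described in this abstract (bag-of-words and n-gram features from manifest content, followed by a text classifier) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not name the classifier or the exact features, so a multinomial Naive Bayes with add-one smoothing is used as a stand-in, and the four training "manifests" are hypothetical toy strings of permission names, not data from the paper.

```python
import math
from collections import Counter, defaultdict

def ngram_features(text, n_max=2):
    """Return a bag (Counter) of word 1..n_max-grams from whitespace-split text."""
    tokens = text.lower().split()
    bag = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag

class MultinomialNB:
    """Multinomial Naive Bayes with add-one smoothing (a stand-in classifier)."""

    def fit(self, bags, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for bag, label in zip(bags, labels):
            self.feature_counts[label].update(bag)
            self.vocab.update(bag)
        return self

    def predict(self, bag):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            total = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for feat, freq in bag.items():
                if feat in self.vocab:  # features unseen in training are ignored
                    score += freq * math.log((counts[feat] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical toy data: permission names as they might appear in a manifest.
train = [
    ("SEND_SMS READ_CONTACTS INTERNET RECEIVE_BOOT_COMPLETED", "malware"),
    ("SEND_SMS INTERNET READ_PHONE_STATE", "malware"),
    ("INTERNET ACCESS_NETWORK_STATE", "benign"),
    ("CAMERA INTERNET WRITE_EXTERNAL_STORAGE", "benign"),
]
model = MultinomialNB().fit(
    [ngram_features(text) for text, _ in train],
    [label for _, label in train],
)
```

Calling `model.predict(ngram_features("SEND_SMS READ_CONTACTS"))` classifies a new manifest string. With only four toy examples the output says nothing about real detection accuracy.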
EN
Automatic text categorization presents many difficulties. Modern algorithms are getting better at extracting meaningful information from human language. However, they often significantly increase the complexity of computations. This increased demand for computational capabilities can be met by using hardware accelerators such as general-purpose graphics cards. In this paper we present a full processing flow for a document categorization system. Gram-Schmidt process signatures calculation up to 12 fold decrease in computing time of system components.
4
Content available Analysis of methods and means of text mining
EN
In the Big Data era, when data volume doubles every year, analyzing all this data has become a really complicated task. Text mining systems, techniques, and tools have therefore become the main instruments for analyzing tons of information, selecting the information that best suits your needs, and saving time for more interesting things. The main aims of this article are to explain the basic principles of this field and to give an overview of some interesting technologies that are widely used in text mining today.
EN
The increasing number of repositories of online documents has resulted in a growing demand for automatic categorization algorithms. However, in many cases texts should be assigned to more than one class. In this paper, a new multi-label classification algorithm for short documents is considered. The presented problem transformation algorithm, Labels Chain (LC), is based on the relationship between labels and consecutively uses result labels as new attributes in the subsequent classification process. The method is validated by experiments conducted on several real text datasets of restaurant reviews with different numbers of instances, using classifiers such as kNN, Naive Bayes, SVM and C4.5. The results show good performance of the LC method compared to problem transformation methods such as Binary Relevance and Label Powerset.
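The idea of feeding result labels back in as new attributes for the following classification step can be sketched as below. This is a generic chained problem-transformation sketch, not the paper's LC algorithm, whose details the abstract does not give. The class name `LabelChain`, the 1-nearest-neighbour base classifier, and the toy feature vectors are all assumptions chosen to keep the example self-contained.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """1-NN with squared Euclidean distance; returns the label of the closest row."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[best]

class LabelChain:
    """One binary classifier per label; each sees the features plus earlier labels."""

    def fit(self, X, Y):
        # X: list of feature vectors; Y: list of binary label vectors (fixed label order).
        self.stages = []
        for j in range(len(Y[0])):
            # Stage j is trained on the original features plus the true labels 0..j-1.
            augmented = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
            targets = [y[j] for y in Y]
            self.stages.append((augmented, targets))
        return self

    def predict(self, x):
        predicted = []
        for augmented, targets in self.stages:
            # At prediction time the previously *predicted* labels are the extra attributes.
            query = list(x) + predicted
            predicted.append(nearest_neighbor_predict(augmented, targets, query))
        return predicted

# Hypothetical toy data: features = [mentions_food, mentions_service],
# labels = [about_food, about_service].
X = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y = [[1, 0], [0, 1], [1, 1], [0, 0]]
chain = LabelChain().fit(X, Y)
```

`chain.predict([1, 1])` returns one 0/1 value per label. In contrast, Binary Relevance would train each label's classifier on the original features only, ignoring the other labels.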
EN
We deal with the problem of multiaspect text categorization, which calls for the classification of documents with respect to two, in a sense orthogonal, sets of categories. We briefly define the problem, mainly referring to our previous work, and study the application of the k-nearest neighbours algorithm. We propose a new technique meant to enhance the effectiveness of this algorithm when applied to the problem in question. We show some experimental results confirming the usefulness of the proposed approach.
EN
Feature selection, a procedure that selects a subset of the original features, is the main step in classification systems and one of the major challenges in text categorization. The high dimensionality of the feature space, which plays a key role in the text categorization process, increases the complexity of that process. This paper presents a novel feature selection method based on particle swarm optimization to improve the performance of text categorization. Particle swarm optimization is inspired by the social behavior of fish schooling or bird flocking. The complexity of the proposed method is very low because a simple classifier is applied. The performance of the proposed method is compared with that of other methods on the Reuters-21578 data set. Experimental results show the superiority of the proposed method.
EN
The similarity-based decision rule computes the similarity between a new test document and the existing documents of the training set, which belong to various categories. The new document is assigned to the category in which it has the maximum number of similar documents. A document-similarity-based supervised decision rule for text categorization is proposed in this article. The similarity measure determines the similarity between two documents by finding their distances to all the documents of the training set, and it can explicitly identify two dissimilar documents. The decision rule assigns a test document to the best of the competing categories if the best category beats the next competing category by a previously fixed margin. The proposed rule thus enhances the certainty of the decision. The salient feature of the decision rule is that it never assigns a document arbitrarily to a category when the decision is not sufficiently certain. The performance of the proposed decision rule for text categorization is compared with some well-known classification techniques, e.g., the k-nearest neighbor decision rule, support vector machines, and naive Bayes, using various TREC and Reuters corpora. The empirical results show that the proposed method performs significantly better than the other classifiers for text categorization.
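The margin-based part of this rule (count the similar training documents per category, then assign only if the best category beats the runner-up by a fixed margin) can be sketched as follows. The similarity measure here is an assumption: plain cosine similarity over word counts is used as a stand-in, whereas the paper's measure is based on distances to all training documents and is not reproduced. The function names, the `sim_threshold` and `margin` parameters, and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count bags (stand-in similarity measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify_with_margin(doc, training, sim_threshold=0.3, margin=1):
    """Return the category with the most similar training docs, or None to abstain."""
    query = Counter(doc.lower().split())
    votes = Counter()
    for text, category in training:
        if cosine(query, Counter(text.lower().split())) >= sim_threshold:
            votes[category] += 1
    ranked = votes.most_common(2)
    if not ranked:
        return None  # no similar document at all: abstain
    best_category, best_votes = ranked[0]
    runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
    # Assign only when the lead over the next category reaches the fixed margin.
    return best_category if best_votes - runner_up_votes >= margin else None

# Hypothetical toy corpus.
training = [
    ("stock market shares fall", "finance"),
    ("bank shares rise on market", "finance"),
    ("team wins football match", "sport"),
    ("football team market value", "sport"),
]
```

With this corpus, `classify_with_margin("market shares", training)` assigns a category, while an ambiguous query such as `"football market"` yields `None`, reflecting the rule's refusal to assign a document arbitrarily.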
PL
The article describes an approach to identifying relations between categories in a text data repository, based on Wikipedia. By analysing the similarity between articles, measures were defined that make it possible to identify relations between categories that had not previously been taken into account and to assign them weights reflecting their degree of significance. The results were evaluated automatically against the already existing category structure.
EN
In this article we present an approach to identifying relations between the categories that organize a repository of documents. We describe metrics of category relevance based on similarity measures between articles. The metrics have been used to discover relations between categories within the Wikipedia repository. The evaluation of the proposed method indicates that it makes it possible to reconstruct already existing associations in the category structure as well as to introduce new significant relations.
10
Content available remote Analysis of the Arabic using neural networks: an overview
EN
This paper is a quick review of some of the scholarly work aiming at solving various problems of the Arabic language using neural networks. It includes some research work concerning online recognition of handwritten Arabic characters, speech recognition, offline character text recognition, text categorization and recognition of printed text. This paper concludes that more research should be conducted in this area considering the importance of the Arabic language, the rapid growth of internet users in the Arab world, and the widespread usage of Arabic characters by many languages other than Arabic.
PL
The article presents methods for analysing the Arabic language using neural networks. The possibilities of recognizing handwriting, printed text, and speech were analysed.
11
Content available remote A novel text classification problem and its solution
EN
A new text categorization problem is introduced. As in the classical problem, there is a set of documents and a set of categories. However, in addition to being assigned to a specific category, each document belongs to a certain sequence of documents, referred to as a case. It is assumed that all documents in the same case belong to the same category. An example may be a set of news articles. Their categories may be sport, politics, entertainment, etc. In each category there exist cases, i.e., sequences of documents describing, for example, the evolution of some events. The problem considered is how to classify a document to the proper category and the proper case within this category. In the paper we formalize the problem and discuss two approaches to its solution.
PL
The article proposes a new task of categorizing text documents. As in the classical task, a set of text documents and a set of categories are considered. Unlike in the classical task, documents are assigned not only to a category but also to a specific sequence of documents within that category, called a case. It is assumed that all documents of a given case belong to the same category. An example may be a collection of press news items. They may belong to categories such as sport, politics, entertainment, etc. Within each category there are sequences of news items (cases) describing, for example, the development of certain events. The task is therefore to classify a document into the proper category and the proper case within it. The article formally defines the new categorization task and proposes two approaches to solving it.
12
Content available remote Linear SVM for organizing data
EN
This paper demonstrates that text categorization (TC) is a good automatic method for organizing data. Some features of the TC problem are described, and it is explained why the linear Support Vector Machine (SVM) is an appropriate technique for this task. The theoretical considerations are illustrated through examples in which the text categorization problem has been solved with SVMs.
PL
The article discusses text categorization as a good solution for automatic data organization. It presents the features of the TC problem and explains that the linear Support Vector Machine (SVM) method works very well for tasks of this type. The theoretical considerations are illustrated with examples of automatic document organization using SVM networks.
13
EN
This paper analyzes the effect that dimensionality reduction techniques have on the categorization of documents written in Basque. Classification techniques such as Naive Bayes, Winnow, SVMs and k-NN have been selected. The Singular Value Decomposition (SVD) dimensionality reduction technique, together with lemmatization and noun selection, has been used in our experiments. The results show that the approach combining SVD and k-NN on a lemmatized corpus gives the best accuracy rates of all, by a remarkable margin.
14
Content available remote Complexity control of SVM network applied to text categorisation
EN
In this paper we discuss a practical way of estimating the Vapnik-Chervonenkis (VC) dimension for Support Vector Machines (SVMs). It is demonstrated again that the VC dimension influences generalization ability. The results are illustrated through examples in which SVM networks are used for automatic text categorization.
PL
The article discusses a practical way of estimating the value of the Vapnik-Chervonenkis dimension (Vcdim) for Support Vector Machine (SVM) neural networks. The influence of Vcdim on the generalization ability of the network is presented. The results are illustrated with examples of applying SVM networks to automatic text categorization.