With the emergence of social networks and micro-blogs, a huge number of short text documents are generated daily, and effective tools are needed to organize and classify them. These documents have extremely sparse representations, which is the main cause of poor classification performance. We propose a new approach in which we identify relevant concepts in short text documents using the DBpedia Spotlight framework and enrich the text with information derived from the DBpedia ontology, which reduces the sparseness. We developed six variants of the text enrichment method and tested them on four short text datasets using seven classification algorithms. The results were compared with those of the baseline approach, with one another, and with some state-of-the-art text classification methods. Besides classification performance, we also evaluated the influence of the concept similarity threshold and the size of the training data. The results show that the proposed text enrichment approach significantly improves the classification of short texts and is robust with respect to different input sources, domains, and sizes of available training data. The enrichment methods are especially beneficial when only a small number of documents are available for training.
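The enrichment idea described in this abstract can be illustrated with a minimal Python sketch. It assumes the concept annotations have already been obtained (for instance from DBpedia Spotlight) and are passed in as plain dictionaries; the function name `enrich`, the dictionary keys, and the sample data are hypothetical, and the sketch shows only one generic way to append concept information, not any of the six variants evaluated in the paper.

```python
def enrich(text, annotations, threshold=0.5):
    """Append concept tokens whose similarity score meets the threshold.

    `annotations` is an assumed format: a list of dicts with the keys
    "uri", "similarity", and optionally "types".
    """
    extra = []
    for ann in annotations:
        # Skip concepts that fall below the similarity threshold.
        if ann["similarity"] < threshold:
            continue
        # Use the last URI segment as a concept token, e.g. "Barack_Obama".
        extra.append(ann["uri"].rsplit("/", 1)[-1])
        # Ontology types add more general terms that documents can share.
        extra.extend(ann.get("types", []))
    return (text + " " + " ".join(extra)) if extra else text


# Hypothetical annotations for a short text (illustrative values only).
annotations = [
    {"uri": "http://dbpedia.org/resource/Barack_Obama",
     "similarity": 0.99, "types": ["Person", "Politician"]},
    {"uri": "http://dbpedia.org/resource/Visit_(film)",
     "similarity": 0.12, "types": ["Film"]},
]

enriched = enrich("Obama visit Berlin", annotations, threshold=0.5)
print(enriched)
```

Raising the threshold keeps fewer, more reliable concepts, while lowering it adds more terms at the risk of noise, which is presumably why the threshold is evaluated as a separate factor.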
Short text classification is an important task used in many applications. However, few works have investigated applying Spiking Neural Networks (SNNs) to text classification, and to the best of our knowledge there have been no attempts to apply SNNs as classifiers of short texts. In this paper, we offer a comparative study of short text classification using SNNs. To this end, we selected and evaluated three popular implementations of SNNs: evolving Spiking Neural Networks (eSNN), the NeuCube implementation of SNNs, and the SNNTorch implementation, which is available as a Python package. To test the selected classifiers, we selected and preprocessed three publicly available datasets: the 20-newsgroup dataset and imbalanced and balanced PubMed datasets of medical publications. The preprocessed 20-newsgroup dataset consists of the first 100 words of each text, while for the PubMed datasets we use only the title of each publication. Documents are represented using TF-IDF encoding. We also propose a new encoding method for eSNN that can effectively encode input features whose values are non-uniformly distributed. The method works especially well with TF-IDF encoding. The results of our study suggest that SNNs may provide classification quality that in some cases matches or exceeds that of other types of classifiers.
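The preprocessing described in this abstract (truncating each text to its first 100 words and representing it with TF-IDF) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names `first_words` and `tfidf` are hypothetical, tokenization is simple whitespace splitting, and the weighting uses relative term frequency times `log(N / df)`, which is one common TF-IDF variant and not necessarily the one used in the paper.

```python
import math
from collections import Counter


def first_words(text, n=100):
    """Lowercase the text and keep only its first n whitespace-separated words."""
    return text.lower().split()[:n]


def tfidf(docs, n=100):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [first_words(d, n) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty texts
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Illustrative documents only.
docs = [
    "spiking neural networks classify text",
    "neural networks for images",
]
vectors = tfidf(docs)
print(vectors[0])
```

Terms that occur in every document receive a weight of zero under this variant, so the resulting feature values are far from uniformly distributed, which is the kind of input the abstract says the proposed eSNN encoding is designed to handle.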