The paper describes and evaluates a new method of compressing linguistically annotated text. A semantically motivated transcoding scheme is proposed in which the text is split into three distinct streams of data. By applying the scheme, the compressed text length can be reduced by as much as 67% compared to the baseline compression algorithm. An important advantage of the method is the feasibility of processing text in its compressed form.
2
Given a sequence S of n symbols over some alphabet Σ of size σ, we develop new compression methods that (i) are very simple to implement and (ii) provide O(1)-time random access to any symbol (or short substring) of the original sequence. Our simplest solution uses at most 2h + o(h) bits of space, where h = n(H_0(S) + 1), and H_0(S) is the zeroth-order empirical entropy of S. We discuss a number of improvements and trade-offs over the basic method. For example, we can achieve n(H_k(S) + 1) + o(n(H_k(S) + 1)) bits of space, for k = o(log_σ n). Several applications are discussed, including text compression, (compressed) full-text indexing and string matching.
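The space bound above is stated in terms of the zeroth-order empirical entropy H_0(S). As a point of reference (this is only the definition of the quantity, not the paper's data structure), H_0 can be computed directly from symbol frequencies:

```python
import math
from collections import Counter

def zeroth_order_entropy(s: str) -> float:
    """Zeroth-order empirical entropy H_0(S) in bits per symbol:
    -sum over symbols c of (n_c/n) * log2(n_c/n)."""
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For a sequence of n symbols, n * H_0(S) lower-bounds the output size of any compressor that codes each symbol independently of its context.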
3
We consider data compression using antidictionaries and give algorithms for faster compression and decompression. While the original method of Crochemore et al. uses finite transducers with ε-moves, we (de)compress using ε-free transducers. This is provably faster, assuming the data are non-negligibly compressible, but we have to account for the overhead of building the new machines. In general, they can be quadratic in size compared to the ones allowing ε-moves; we prove this bound optimal, as it is reached for de Bruijn words. However, in practice, the size of the ε-free machines turns out to be close to the size of the ones allowing ε-moves, and therefore we can achieve significantly faster (de)compression. We show our results for the files in the Calgary corpus.
4
The efficiency of textual input by disabled users can be considerably improved by employing prediction methods of the kind used in data compression algorithms. In this paper, the application of the PPM algorithm is described, along with a technique for utilizing multiple concurrent prediction models. In a prototype implementation, called PRIS, the method has been found to reduce the number of keystrokes necessary to enter textual data by 55 to 62 percent compared with plain text entry. Compared with the zero-order prediction used in most current textual input aids, PRIS reduces the number of keystrokes by a further 12 to 24 percent.
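The keystroke-saving idea can be illustrated with a toy order-1 character predictor (a deliberate simplification; PPM as used in PRIS blends predictions from contexts of several lengths with escape probabilities). Every character the user can accept from the prediction list is a keystroke saved:

```python
from collections import defaultdict, Counter

def train_order1(corpus: str):
    """Count next-character frequencies for each single-character context."""
    model = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        model[prev][nxt] += 1
    return model

def predict(model, prev: str, k: int = 3):
    """Top-k next-character candidates after context `prev`."""
    return [c for c, _ in model[prev].most_common(k)]

def keystrokes_saved(model, text: str, k: int = 3) -> int:
    """Count characters the user could accept from a k-item
    prediction list instead of typing them explicitly."""
    saved = 0
    for prev, nxt in zip(text, text[1:]):
        if nxt in predict(model, prev, k):
            saved += 1
    return saved
```

A real input aid would additionally predict whole words and rank candidates across several context lengths, which is where the reported 55-62 percent savings come from.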
5
In this article, we introduce new methods to compute the Longest Previous non-overlapping Factor (LPnF) table. The LPnF table stores, for each position of a string, the maximal length of a factor that re-occurs at an earlier position without overlapping; it is related to the Ziv-Lempel factorization of a text, which is useful for text and data compression. The LPnF table plays an important role in data compression, string algorithms and computational biology. We present three approaches to producing the LPnF table of a string: from its augmented position heap, from its position heap, and from its suffix heap, together with experimental results for all three solutions. The algorithms run in linear time and linear space.
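To make the definition concrete, here is a naive quadratic-time reference computation of the LPnF table (the heap-based algorithms in the abstract achieve this in linear time; this sketch only pins down what the table contains):

```python
def lpnf_table(s: str):
    """LPnF[i] = length of the longest factor starting at i that
    also occurs starting at some earlier position j, subject to the
    non-overlapping condition j + length <= i."""
    n = len(s)
    table = [0] * n
    for i in range(n):
        best = 0
        for j in range(i):
            length = 0
            # extend the match while it stays inside the string
            # and does not overlap position i
            while i + length < n and j + length < i and s[j + length] == s[i + length]:
                length += 1
            best = max(best, length)
        table[i] = best
    return table
```

For "abab" the table is [0, 0, 2, 1]: at position 2 the factor "ab" re-occurs at position 0 without overlap, while at position 3 only "b" does.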
6
In this paper we present state-of-the-art algorithms implementing three competitive approaches to text document compression, based respectively on dynamic, semi-dynamic and static word dictionaries. We test them experimentally on preselected sets of English and multilingual documents, then discuss the results, explaining the strengths and weaknesses of each approach.
7
Similarity detection is very important in fields such as spam detection, plagiarism detection and topic detection. The main algorithm for comparing text documents is based on Kolmogorov complexity, which is an ideal measure of the similarity of two strings over a given alphabet. Unfortunately, this measure is incomputable, so we must define several approximations, which are not metrics in the strict sense but in many circumstances come close to metric behaviour and can be used in practice.
PL
The article discusses methods for recognizing text similarity. The main algorithm used is Kolmogorov complexity; its chief limitation is that it is incomputable and its values are hard to process numerically, so a number of approximations are proposed.
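The best-known computable approximation of this kind is the Normalized Compression Distance (NCD), which substitutes a real compressor for the incomputable Kolmogorov complexity. A minimal sketch using zlib (the abstract does not name a specific compressor; zlib is an assumption here):

```python
import zlib

def clen(x: bytes) -> int:
    """Compressed length of x; stands in for Kolmogorov complexity K(x)."""
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    Near 0 for very similar strings, near 1 for unrelated ones.
    Not a true metric: the triangle inequality can fail slightly."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

The quality of the approximation depends directly on how well the chosen compressor models the shared structure of the two inputs.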