Wyniki wyszukiwania - BazTech

Ograniczanie wyników

Powiadomienia systemowe

Sesja wygasła!

Znaleziono wyników: 2

Liczba wyników na stronie

Wyniki wyszukiwania

Sortuj według:

Ogranicz wyniki do:

Google Books Ngrams Recompressed and Searchable

Grabowski Sz., Swacha J.

Foundations of Computing and Decision Sciences

2012

Vol. 37, No. 4

273--283

One of the research fields significantly affected by the emergence of “big data” is computational linguistics. A prominent example of a large dataset targeting this domain is the collection of Google Books Ngrams, made freely available, for several languages, in July 2009. There are two problems with Google Books Ngrams the textual format (compressed with Deflate) in which they are distributed is highly inefficient we are not aware of any tool facilitating search over those data, apart from the Google viewer, which, as a Web tool, has seriously limited use. In this paper we present a simple preprocessing scheme for Google Books Ngrams, enabling also search for an arbitrary n gram (i.e., its associated statistics) in average time below 0.2 ms. The obtained compression ratio, with Deflate (zip) left as the backend coder, is over 3 times higher than in the original distribution.

New algorithms for exact and approximate text matching

Grabowski Sz.

Zeszyty Naukowe. Rozprawy Naukowe / Politechnika Łódzka

2009

Z. 384

3-243

The presented dissertation focuses on various exact and approximate matching problems for textual data. Text should be understood rather broadly, including natural language, molecular biology and music information retrieval data. The work consists of five chapters, each dedicated to a separate problem. In the order of presentation, they deal with exact string matching, approximate string matching, matching with gaps (which could be considered a subclass of approximate string matching problems but the amount of contained material should justify a separate chapter), online compressed search and compressed full-text indexes. The author contributed to each of those research areas. Each chapter, however, starts with background presentation to place the author's achievements in proper context. Many of the algorithms proposed in the dissertation are based on bit-parallelism, i.e., a modern technique of making use of individual bits in a CPU word. In particular, two new bit-parallel techniques have been presented, one for efficient matching in the average case, the other to reduce time complexities in the worst case of bit-parallel algorithms making use on counters. Those are rather general techniques and they have been successfully applied for multiple known string matching problems. It has been shown that the problems of matching with gaps can be attacked from very different angles, and the arsenal of existing techniques in this area has been significantly expanded. The new results comprise the algorithmic techniques of bit-parallelism, sparse dynamic programming, compact bit-parallel NFA simulations and filtering. New algorithms are also presented for full-text searching in compressed data; they are both simple and efficient. Apart from theoretical analyses, most of the proposed algorithms have been experimentally validated on modern hardware and the achieved results usually place them among very competitive ones.

W rozprawie skupiono się na wybranych problemach wyszukiwania dokładnego i przybliżonego w tekście. Pojęcie tekstu winno być rozumiane szeroko, obejmując dane w językach naturalnych, sekwencje bioinformatyczne oraz zapisy utworów muzycznych (nutowe). Praca składa się z pięciu rozdziałów, z których każdy poświęcony jest osobnemu zagadnieniu. W kolejności, są to: wyszukiwanie dokładne, wyszukiwanie przybliżone, wyszukiwanie sekwencji z przerwami (ang. gaps), wyszukiwanie online w tekście skompresowanym oraz pełnotekstowe indeksy skompresowane. Rozprawa wnosi wkład w rozwój każdego z tych problemów. Każdy rozdział zaczyna się jednak od przedstawienia odnośnego stanu wiedzy. Wiele z zaproponowanych w pracy algorytmów wykorzystuje równoległość bitową, nowoczesną technikę obliczeń z wykorzystaniem poszczególnych bitów rejestru procesora. W szczególności, zaprezentowano dwie nowe techniki równoległości bitowej, jedną mającą na celu optymalizację przypadku średniego w wyszukiwaniu dokładnym, drugą redukującą złożoność w przypadku najgorszym w algorytmach wykorzystujących liczniki. Te dość ogólne techniki algorytmiczne zostały pomyślnie zastosowane w szeregu konkretnych znanych algorytmów wyszukiwania. W pracy pokazano, iż do wyszukiwania sekwencji z przerwami można stosować rozliczne podejścia, rozszerzając znacznie istniejący arsenał metod dla problemów tej kategorii. Nowe wyniki bazują m. in. na technikach algorytmicznych równoległości bitowej, programowania dynamicznego rzadkiego i oszczędnych bitowo-równoległych symulacjach automatów NFA. Przedstawiono również algorytmy wyszukiwania w danych skompresowanych, cechujące się zarówno wydajnością, jak i prostotą. Obok analiz teoretycznych, większość algorytmów zaimplementowano i poddano testom empirycznym, a osiągnięte wyniki zwykle pozwalają zaliczyć nowe metody do najefektywniejszych dla danych problemów.