Search results - BazTech

1

Query-condition-aware V-optimal histogram in range query selectivity estimation

Augustyn D. R.

Bulletin of the Polish Academy of Sciences. Technical Sciences | 2014 | Vol. 62, no. 2 | 287-303

EN

Finding the optimal query execution plan requires selectivity estimation. The selectivity value makes it possible to predict the size of a query result and thus to choose the best method of query execution. Many methods of obtaining selectivity are based on various estimators of the distribution of attribute values, most commonly histograms. The adaptive method proposed in this paper uses both the distribution of attribute values and the distribution of range query condition boundaries. A new type of histogram, the Query-Condition-Aware V-optimal histogram (QCA-V-optimal), is proposed as a non-parametric estimator of the probability density function of the attribute value distribution. This histogram also takes into account information about already processed queries. This information is represented by the one-dimensional Query Condition Distribution histogram (HQCD), an estimator of the include function PI, which is also introduced in this paper. PI describes the so-called regions of user interest, i.e. it shows how often regions of the attribute value domain were used by processed queries. The advantages of the proposed method based on QCA-V-optimal are presented. The experiments show that the mean relative selectivity estimation error is small compared to the errors obtained by methods based on the corresponding classical V-optimal and Equi-height histograms.
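To illustrate the classical baseline named in this abstract, the following minimal Python sketch builds a V-optimal histogram by dynamic programming and uses it to estimate range query selectivity. It does not implement the QCA-V-optimal histogram, whose construction is not given in the abstract. The function names `v_optimal_buckets` and `range_selectivity`, the integer-indexed attribute domain, and the uniform-spread assumption inside each bucket are assumptions made for this example.

```python
def v_optimal_buckets(freqs, num_buckets):
    """Split freqs (one row count per consecutive attribute value) into
    contiguous buckets that minimize the total within-bucket sum of
    squared errors (SSE) of the counts. Requires num_buckets <= len(freqs)."""
    n = len(freqs)
    # Prefix sums of counts and squared counts give any slice SSE in O(1).
    s = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, f in enumerate(freqs):
        s[i + 1] = s[i] + f
        s2[i + 1] = s2[i] + f * f

    def sse(i, j):
        # SSE of freqs[i:j] around the slice mean.
        total = s[j] - s[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[b][j]: minimal SSE when freqs[:j] is split into b buckets.
    cost = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    cost[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                c = cost[b - 1][i] + sse(i, j)
                if c < cost[b][j]:
                    cost[b][j] = c
                    cut[b][j] = i
    # Walk the stored cut points backwards to recover bucket boundaries.
    bounds = []
    j = n
    for b in range(num_buckets, 0, -1):
        i = cut[b][j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]


def range_selectivity(freqs, buckets, lo, hi):
    """Estimate the fraction of rows with lo <= value < hi, assuming rows
    are spread uniformly over the values inside each bucket."""
    total = sum(freqs)
    estimate = 0.0
    for i, j in buckets:
        bucket_rows = sum(freqs[i:j])
        overlap = max(0, min(hi, j) - max(lo, i))
        estimate += bucket_rows * overlap / (j - i)
    return estimate / total


# Hypothetical frequency vector with three flat regions.
freqs = [10, 10, 10, 50, 50, 2, 2, 2]
buckets = v_optimal_buckets(freqs, 3)
print(buckets, range_selectivity(freqs, buckets, 3, 5))
```

The dynamic program runs in O(n^2 * B) time for n distinct values and B buckets, which is acceptable for a sketch but is usually replaced by approximations on large domains.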

2

Applying prediction of the attribute value distribution to improve the accuracy of query selectivity estimation

Augustyn D.

Studia Informatica | 2013 | Vol. 34, no. 2A | 23-42

EN

The selectivity parameter is needed in the query optimization process. Obtaining query selectivity requires a non-parametric estimator of the attribute value distribution, i.e. a histogram. Histograms are built when statistics are updated. For large databases, statistics are updated rather seldom, e.g. only during periods of low system workload. As a result, histograms do not describe the current data distribution. To obtain more up-to-date histograms, a distribution prediction mechanism should be introduced, which leads to a more accurate selectivity estimation. This paper proposes a method of extrapolating the attribute value distribution. The method predicts the moments of the extrapolated distribution and then uses the maximum entropy principle to obtain that distribution subject to the predicted moment values.
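The two steps named in this abstract, predicting moments and then applying the maximum entropy principle, can be sketched in Python as follows. This is a simplified illustration, not the method from the paper: it constrains only the first moment (the mean), assumes a linear trend based on the last two observations, and works on a small discrete grid. The names `extrapolate_linear` and `max_entropy_given_mean` and the sample history are hypothetical.

```python
import math


def extrapolate_linear(history):
    """Predict the next value, assuming the last observed change repeats."""
    return history[-1] + (history[-1] - history[-2])


def max_entropy_given_mean(grid, target_mean, iters=200):
    """Maximum-entropy probabilities on `grid` subject to a fixed mean.
    The solution has the form p_i proportional to exp(lam * x_i); lam is
    found by bisection because the mean increases monotonically with lam.
    target_mean must lie strictly between min(grid) and max(grid)."""

    def dist(lam):
        shift = max(lam * x for x in grid)  # avoids overflow in exp
        weights = [math.exp(lam * x - shift) for x in grid]
        z = sum(weights)
        return [w / z for w in weights]

    def mean(p):
        return sum(pi * x for pi, x in zip(p, grid))

    lo, hi = -500.0, 500.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(dist(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist(0.5 * (lo + hi))


# Hypothetical means observed at three past statistics updates.
mean_history = [0.40, 0.45, 0.50]
predicted_mean = extrapolate_linear(mean_history)

grid = [i / 10 for i in range(11)]  # attribute domain [0, 1], 11 points
p = max_entropy_given_mean(grid, predicted_mean)

# Selectivity of the range condition 0.5 <= x <= 1.0 under the prediction.
selectivity = sum(pi for pi, x in zip(p, grid) if x >= 0.5)
print(predicted_mean, selectivity)
```

With several constrained moments the exponent becomes a polynomial in x and the multipliers are typically found by a multivariate solver such as Newton's method; the single-moment case is used here only because bisection keeps the sketch short.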

3

M2HSE - a selectivity estimation method for a class of range queries based on the multidimensional distribution of attribute values and marginal distributions

Augustyn D.

Studia Informatica | 2013 | Vol. 34, no. 2A | 43-56

EN

Selectivity is a parameter obtained by the database query optimizer for early estimation of the size of the data satisfying a query condition. This is needed to find the optimal query execution plan. Selectivity is commonly estimated using histograms, which are non-parametric estimators of attribute value distributions. Obtaining the selectivity of a query whose selection condition is based on several attributes requires a multidimensional histogram estimating the joint distribution of attribute values. The accuracy of multidimensional histograms decreases as the number of dimensions grows, which is well known as the curse of dimensionality. One-dimensional histograms built for single attributes, which describe marginal distributions, are more accurate, but they do not describe dependencies between attributes. This paper proposes a selectivity estimation method based on histograms describing both the joint distribution and the marginal ones. The method (named M2HSE) applies to a class of queries with a range selection condition based on many attributes. For such queries it may give more accurate selectivity estimates than classical methods that use only a multidimensional histogram describing the joint distribution or only marginal histograms (where the attribute value independence (AVI) rule is assumed).
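The two classical baselines contrasted in this abstract can be compared with a short Python sketch: one estimate uses a two-dimensional joint histogram, the other multiplies marginal selectivities under the AVI rule. The sketch does not implement M2HSE, whose details are not given in the abstract. The integer domain, the bin-aligned query, and the function names are assumptions chosen to keep the example small.

```python
DOMAIN = 100  # attribute values are assumed to be integers in [0, DOMAIN)


def bin_index(value, bins):
    """Map a value to its equi-width bin."""
    return value * bins // DOMAIN


def build_histograms(rows, bins):
    """Return the joint 2-D histogram and both marginal histograms."""
    joint = [[0] * bins for _ in range(bins)]
    for x, y in rows:
        joint[bin_index(x, bins)][bin_index(y, bins)] += 1
    marg_x = [sum(row) for row in joint]
    marg_y = [sum(joint[i][j] for i in range(bins)) for j in range(bins)]
    return joint, marg_x, marg_y


def selectivity_joint(joint, n, x_bins, y_bins):
    """Selectivity of a bin-aligned range query from the joint histogram."""
    hits = sum(joint[i][j] for i in range(*x_bins) for j in range(*y_bins))
    return hits / n


def selectivity_avi(marg_x, marg_y, n, x_bins, y_bins):
    """Selectivity under the AVI rule: the product of marginal selectivities."""
    sel_x = sum(marg_x[slice(*x_bins)]) / n
    sel_y = sum(marg_y[slice(*y_bins)]) / n
    return sel_x * sel_y


# Hypothetical, perfectly correlated data: y always equals x.
rows = [(i, i) for i in range(DOMAIN)]
joint, marg_x, marg_y = build_histograms(rows, bins=4)

# Query: x in [0, 50) AND y in [50, 100), i.e. x bins 0-1 and y bins 2-3.
x_bins, y_bins = (0, 2), (2, 4)
print(selectivity_joint(joint, len(rows), x_bins, y_bins))
print(selectivity_avi(marg_x, marg_y, len(rows), x_bins, y_bins))
```

No row satisfies the query, and the joint histogram reflects that, while the AVI estimate ignores the dependency between the attributes and reports a quarter of the table. In higher dimensions the joint histogram becomes coarse, which is the trade-off the abstract describes.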

4

Asymptotically error-optimal shape of sampling zone for query selectivity estimation method based on discrete cosine transform

Augustyn D. R.

Theoretical and Applied Informatics | 2012 | Vol. 24, no. 1 | 3-22

EN

Query selectivity estimation is critical for efficient query execution by database management systems. The query execution method strongly depends on an early estimate of the size of the query result. This estimate determines the data access method used later during query execution. For a single-table query, selectivity is the fraction of table rows that satisfy the query condition. For the selection condition of a range query on an attribute with a continuous domain, selectivity equals a definite integral of the probability density function (PDF) of the attribute value distribution. For a compound selection condition based on many attributes, a space-efficient non-parametric estimator of the multivariate PDF is needed. A known approach is considered that represents the multidimensional PDF by a Discrete Cosine Transform (DCT) spectrum, i.e. the spectrum of a multidimensional histogram. The energy compaction property of the DCT makes it possible to omit the region of spectrum coefficients with small absolute values without a significant loss of selectivity estimation accuracy. The area of relevant spectrum coefficients is called a sampling zone. Experimental results from previous works by other authors show that, for a predetermined zone size, the reciprocal shape of the sampling zone gives the smallest selectivity estimation error. The main result of this work is a theoretical confirmation of those results, which had previously been only experimental. The paper presents a proof, for the two-dimensional case, of the theorem that the reciprocal shape of the sampling zone is asymptotically error-optimal. The proof is based on the calculus of variations and the isoperimetric problem.
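The DCT-based representation described in this abstract can be sketched in Python for the two-dimensional case. The sketch computes an orthonormal 2-D DCT-II of a small square histogram, keeps only the coefficients inside a reciprocal sampling zone, and estimates selectivity from the reconstruction. The zone is assumed here to be the set of index pairs (u, v) with (u + 1) * (v + 1) <= b; that definition, the function names, and the sample histogram are assumptions of this example rather than details taken from the paper.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        rows.append([scale * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
                     for j in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def dct2(h):
    """2-D DCT of a square matrix: C * H * C^T."""
    c = dct_matrix(len(h))
    return matmul(matmul(c, h), transpose(c))


def idct2(s):
    """Inverse 2-D DCT: C^T * S * C (C is orthonormal)."""
    c = dct_matrix(len(s))
    return matmul(matmul(transpose(c), s), c)


def reciprocal_zone(s, b):
    """Zero every coefficient (u, v) outside (u + 1) * (v + 1) <= b."""
    n = len(s)
    return [[s[u][v] if (u + 1) * (v + 1) <= b else 0.0 for v in range(n)]
            for u in range(n)]


def block_selectivity(h, rows, cols):
    """Fraction of total mass in the bin-aligned block rows x cols."""
    total = sum(map(sum, h))
    hits = sum(h[i][j] for i in range(*rows) for j in range(*cols))
    return hits / total


# Hypothetical smooth 8 x 8 histogram (a single bump).
n = 8
hist = [[math.exp(-((i - 3) ** 2 + (j - 4) ** 2) / 8.0) for j in range(n)]
        for i in range(n)]

spectrum = dct2(hist)
approx = idct2(reciprocal_zone(spectrum, b=8))  # keeps 20 of 64 coefficients

print(block_selectivity(hist, (2, 6), (3, 7)))
print(block_selectivity(approx, (2, 6), (3, 7)))
```

Because the coefficient at (0, 0) is always inside the zone, the reconstruction preserves the total mass of the histogram; the discarded high-frequency coefficients only redistribute mass between bins.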