Search results - BazTech

1

Revisiting the optimal probability estimator from small samples for data mining

Cestnik Bojan

International Journal of Applied Mathematics and Computer Science | 2019 | Vol. 29, no. 4 | 783--796

EN

Estimation of probabilities from empirical data samples has drawn close attention in the scientific community and has been identified as a crucial phase in many machine learning and knowledge discovery research projects and applications. In addition to trivial and straightforward estimation with relative frequency, more elaborate methods for estimating probabilities from small samples have been proposed and applied in practice (e.g., Laplace’s rule, the m-estimate). Piegat and Landowski (2012) proposed a novel method for estimating probabilities from small samples, Eph√2, that is optimal with respect to the mean absolute error of the estimation result. In this paper we show that, even though Piegat’s formula is articulated differently, it is in fact a special case of the m-estimate, where pa = 1/2 and m = √2. In the context of an experimental framework, we present an in-depth analysis of several probability estimation methods with respect to their mean absolute errors and demonstrate their potential advantages and disadvantages. We extend the analysis from single-instance samples to samples with a moderate number of instances. We define small samples for the purpose of estimating probabilities as samples containing either fewer than four successes or fewer than four failures, and we justify the definition by analysing probability estimation errors on various sample sizes.
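The relationship described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: it assumes the usual form of the m-estimate, (s + m·pa) / (n + m), with s successes in n trials, and all function names are hypothetical. The small-sample check follows the definition quoted in the abstract.

```python
import math


def relative_frequency(successes, n):
    """Plain relative frequency s / n (undefined for n = 0)."""
    return successes / n


def m_estimate(successes, n, m, prior):
    """m-estimate in its usual form: (s + m * pa) / (n + m)."""
    return (successes + m * prior) / (n + m)


def laplace(successes, n):
    """Laplace's rule for two outcomes: (s + 1) / (n + 2)."""
    return (successes + 1) / (n + 2)


def eph_sqrt2(successes, n):
    """Eph-sqrt(2) written as an m-estimate with pa = 1/2 and m = sqrt(2),
    as stated in the abstract."""
    return m_estimate(successes, n, math.sqrt(2), 0.5)


def is_small_sample(successes, n):
    """Small sample per the abstract: fewer than four successes
    or fewer than four failures."""
    failures = n - successes
    return successes < 4 or failures < 4
```

Laplace's rule is itself the m-estimate with pa = 1/2 and m = 2, so in this form the estimators differ only in how strongly they pull the estimate toward 1/2.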

2

Specialized, MSE-optimal m-estimators of the rule probability especially suitable for machine learning

Piegat A., Landowski M.

Control and Cybernetics | 2014 | Vol. 43, no. 1 | 133--160

EN

The paper presents an improved sample-based estimation of rule probability, which is an important indicator of rule quality and credibility in machine learning systems. It concerns rules obtained, e.g., with the use of decision trees and rough set theory. Particular rules are frequently supported by only a small or very small number of data items. The rule probability is mostly investigated with the use of global estimators, such as the frequency, Laplace, or m-estimator, constructed for the full probability interval [0,1]. The paper shows that the precision of rule probability estimation can be considerably increased by using m-estimators that are specialized for the interval [phmin, phmax] given by the problem expert. The paper also presents a new interpretation of the m-estimator parameters that can be optimized in the estimators.

3

Mean square error optimal completeness estimator Eph2 of probability

Piegat A., Landowski M.

Journal of Theoretical and Applied Computer Science | 2013 | Vol. 7, no. 3 | 3--20

EN

The paper presents the optimal estimator of probability for the binomial and multinomial case, called the "completeness estimator Eph2", and a theoretical proof of its optimality. The accuracy of the estimator was compared with that of the universally used frequency estimator. The comparison was carried out both theoretically and experimentally. Both comparisons show the superiority of the completeness estimator Eph2 over the frequency estimator frh = nh/n. A proven solution of the single-case problem is given.

4

Optimal estimator of hypothesis probability for data mining problems with small samples

Piegat A., Landowski M.

International Journal of Applied Mathematics and Computer Science | 2012 | Vol. 22, no. 3 | 629--645

EN

The paper presents a new (to the best of the authors' knowledge) estimator of probability called the "[...] completeness estimator" along with a theoretical derivation of its optimality. The estimator is especially suitable for a small number of sample items, which is a feature of many real problems characterized by data insufficiency. The control parameter of the estimator is not assumed in an a priori, subjective way, but was determined on the basis of an optimization criterion (the least absolute errors). The estimator was compared with the universally used frequency estimator of probability and with Cestnik's m-estimator with respect to accuracy. The comparison was carried out both theoretically and experimentally. The results show the superiority of the [...] completeness estimator over the frequency estimator for the probability interval ph ∈ (0.1, 0.9). The frequency estimator is better for ph ∈ [0, 0.1] and ph ∈ [0.9, 1].
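The kind of accuracy comparison described in this abstract can be illustrated with an exact mean-absolute-error calculation over a binomial sample. This is only a sketch under stated assumptions, not the authors' experimental setup: the m-estimate parameters (m = √2, prior 1/2) are taken from the Cestnik (2019) abstract in entry 1 rather than from this abstract, and the function names are hypothetical.

```python
import math


def frequency(k, n):
    """Frequency estimator k / n."""
    return k / n


def m_estimate(k, n, m=math.sqrt(2), prior=0.5):
    """m-estimate (k + m * prior) / (n + m); the defaults are an assumption
    borrowed from the Cestnik (2019) abstract."""
    return (k + m * prior) / (n + m)


def mean_absolute_error(estimator, p, n):
    """Exact expected |estimate - p| when k ~ Binomial(n, p)."""
    total = 0.0
    for k in range(n + 1):
        weight = math.comb(n, k) * p**k * (1 - p) ** (n - k)
        total += weight * abs(estimator(k, n) - p)
    return total


if __name__ == "__main__":
    # Compare both estimators for a single-item sample at a few true probabilities.
    for p in (0.0, 0.05, 0.5, 0.95, 1.0):
        print(p,
              mean_absolute_error(frequency, p, 1),
              mean_absolute_error(m_estimate, p, 1))
```

For a single-item sample the frequency estimator has zero error at p = 0 and p = 1, while the shrunken estimate has the smaller error at p = 0.5, which matches the qualitative pattern the abstract reports.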

5

A New Definition of a Collision Zone For a Geometrical Model For Ship-Ship Collision Probability Estimation

Montewka J., Goerlandt F., Kujala P.

TransNav : International Journal on Marine Navigation and Safety of Sea Transportation | 2011 | Vol. 5, no. 4 | 497--504

EN

In this paper, a study on a newly developed geometrical model for ship-ship collision probability estimation is conducted. Most of the models that are used for ship-ship collisions consider a collision between two ships to be a physical contact between them. The model discussed in this paper defines the collision criterion in a novel way. A critical distance between two meeting ships, at which such a meeting situation can be considered a collision, is calculated with the use of a ship motion model. This critical distance is named the minimum distance to collision (MDTC). Numerous factors affect the MDTC value: the ship type, the angle of intersection of the ships’ courses, the relative bearing between the encountering ships and the maneuvering pattern. They are discussed in the paper.

6

A Method for Assessing a Causation Factor for a Geometrical MDTC Model for Ship-Ship Collision Probability Estimation

Montewka J., Goerlandt F., Lammi H., Kujala P.

TransNav : International Journal on Marine Navigation and Safety of Sea Transportation | 2011 | Vol. 5, no. 3 | 365--373

EN

In this paper a comparative method for assessing a causation factor for a geometrical model for ship-ship collision probability estimation is introduced. The results obtained from the model are compared with the results of an analysis of near-collisions based on recorded AIS data and then with the historical data on maritime accidents in the Gulf of Finland. The causation factor is obtained for three different meeting types, for a chosen location and prevailing traffic conditions there.

7

Heterogeneous distance functions for prototype rules : influence of parameters on probability estimation

Blachnik M., Duch W., Wieczorek T.

Studia Informatica : systems and information technology | 2006 | Vol. 1(7) | 19--30

EN

An interesting and little explored way to understand data is based on prototype rules (P-rules). The goal of this approach is to find optimal similarity (or distance) functions and positions of the prototypes to which unknown vectors are compared. In real applications, similarity functions frequently involve different types of attributes, such as continuous, discrete, binary or nominal. Heterogeneous distance functions that can handle such diverse information are usually based on a probability distance measure, such as the Value Difference Metric (VDM). For continuous attributes, the calculation of probabilities requires estimation of probability density functions. This process requires careful selection of several parameters that may have an important impact on the overall classification accuracy. In this paper, various heterogeneous distance functions based on the VDM measure are presented, among them some new heterogeneous distance functions based on different types of probability estimation. Results of many numerical experiments with such distance functions are presented on artificial and real datasets, and quite simple P-rules are extracted for several heterogeneous databases.
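For readers unfamiliar with VDM, the following sketch shows the metric in its commonly used form for a single nominal attribute: the distance between two attribute values is the sum over classes of |P(c | x) - P(c | y)|^q, with the conditional probabilities estimated by relative frequency. The abstract does not give a formula, so this form, the exponent q and the function names are assumptions; the density-estimation variants for continuous attributes studied in the paper are not shown.

```python
from collections import Counter, defaultdict


def fit_vdm(values, labels):
    """Estimate P(class | attribute value) by relative frequency.

    Returns a dict mapping each attribute value to a list of class
    probabilities, ordered by the sorted class labels.
    """
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    classes = sorted(set(labels))
    return {
        value: [counter[c] / sum(counter.values()) for c in classes]
        for value, counter in counts.items()
    }


def vdm(probs, x, y, q=2):
    """VDM distance between attribute values x and y:
    the sum over classes of |P(c | x) - P(c | y)| ** q."""
    return sum(abs(px - py) ** q for px, py in zip(probs[x], probs[y]))


if __name__ == "__main__":
    # Toy data: attribute values and their class labels (made up for illustration).
    colours = ["red", "red", "red", "blue", "blue", "green"]
    classes = ["a", "a", "b", "b", "b", "a"]
    probs = fit_vdm(colours, classes)
    print(vdm(probs, "red", "blue"))
```

Because the conditional probabilities here come from plain relative frequencies, values seen only a few times get extreme estimates; swapping in a smoothed estimator is one way the choice of probability estimation can influence the resulting distances.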