Search results - BazTech

1

Dirichlet's principle revisited : an inverse Dirichlet's principle definition and its bound estimation improvement using stochastic combinatorics

Štěpánek Lubomír, Habarta Filip, Malá Ivana, Marek Luboš

Annals of Computer Science and Information Systems | 2022 | Vol. 32 | pp. 113--116

EN

Dirichlet's principle, also known as the pigeonhole principle, states that if n items are put into m containers, with n > m, then at least one container contains more than one item. In this work, we focus instead on an inverse Dirichlet's principle (obtained by switching items and containers), which reads as follows: if n items are put into m containers and n < m, then at least one container contains no item. Furthermore, we refine Dirichlet's principle using discrete combinatorics within a probabilistic framework. Treating the principle stochastically, we show that the number of items n may even be greater than or equal to m while it remains very likely that at least one container holds no item. The inverse formulation of the problem, rather than the original one, may have practical applications, particularly in view of the derived effective upper-bound estimates for the number of items, as demonstrated in several applied mini-studies.
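The probabilistic reading of the inverse principle can be illustrated with a short calculation. The sketch below assumes that each of the n items is placed independently and uniformly at random into one of the m containers; the abstract does not state the paper's exact model, so this assumption and the function names `prob_some_container_empty` and `largest_n_with_empty_container` are illustrative only. The probability that at least one container stays empty follows from inclusion-exclusion over the surjections.

```python
from fractions import Fraction
from math import comb


def prob_some_container_empty(n: int, m: int) -> Fraction:
    """P(at least one of m containers is empty) after n uniform random placements.

    Assumption (not taken from the abstract): items are placed
    independently and uniformly at random.
    """
    # Number of placements that leave no container empty (surjections),
    # by inclusion-exclusion over the k containers forced to be empty.
    surjections = sum(
        (-1) ** k * comb(m, k) * (m - k) ** n for k in range(m + 1)
    )
    return 1 - Fraction(surjections, m ** n)


def largest_n_with_empty_container(m: int, threshold: Fraction) -> int:
    """Largest n for which an empty container still has probability >= threshold.

    Requires 0 < threshold <= 1; the probability is non-increasing in n,
    so a simple upward scan is enough.
    """
    n = 0
    while prob_some_container_empty(n + 1, m) >= threshold:
        n += 1
    return n
```

For n < m the function returns exactly 1, which is the deterministic inverse principle; for n >= m it returns a value below 1 that can be compared against a chosen threshold to obtain an upper bound on n.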

2

A short note on post-hoc testing using random forests algorithm: Principles, asymptotic time complexity analysis, and beyond

Štěpánek Lubomír, Habarta Filip, Malá Ivana, Marek Luboš

Annals of Computer Science and Information Systems | 2022 | Vol. 30 | pp. 489--497

EN

When testing whether a continuous variable differs between categories of a factor variable or their combinations, taking into account other continuous covariates, one may use an analysis of covariance. Several post-hoc methods, such as Tukey’s honestly significant difference test, Scheffé’s, Dunn’s, or Nemenyi’s test, are well established for the case when the analysis of covariance rejects the hypothesis that there is no difference between any categories. However, these methods are statistically rigid and usually require statistical assumptions to be met. In this work, we address the issue using a random forest-based algorithm, practically assumption-free, that classifies individual observations into the factor’s categories using the dependent continuous variable and covariates as input. The higher the proportion of trees classifying the observations into two different categories, the more likely a statistical difference between the categories is. To adjust the method’s type I error rate, we change the complexity of the random forest’s trees by pruning, which modifies the proportion of highly complex trees. Besides simulations that demonstrate a relationship between the tree pruning level, tree complexity, and type I error rate, we analyze the asymptotic time complexity of the proposed random forest-based method compared to established techniques.

3

Alternatives for Greedy Discrete Subsampling: Various Approaches Including Cluster Subsampling of COVID-19 Data With No Response Variable

Štěpánek Lubomír, Habarta Filip, Malá Ivana, Marek Luboš

Annals of Computer Science and Information Systems | 2021 | Vol. 26 | pp. 103--111

EN

An exhaustive selection of all possible combinations of n = 400 from N = 698 observations of the COVID-19 dataset was used as a benchmark. Building a random set of subsamples and choosing the one that minimized an averaged sum of squares of each variable's category frequencies returned results similar to those of a "forward" subselection, which reduces the dataset one observation at a time while steadily lowering the same metric. This works similarly to k-means clustering (with a random number of clusters) applied to the original dataset's observations, followed by choosing a subsample from each cluster proportionally to its size. However, the approaches differ significantly in asymptotic time complexity.
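The random-subsample approach can be sketched in a few lines. The metric below is one possible reading of "an averaged sum of squares of each variable's category frequency": for every variable, the squared relative frequencies of its categories are summed, and the sums are averaged over variables. Using relative rather than absolute frequencies is an assumption, as are the names `balance_score` and `best_random_subsample`; the paper may define the metric differently.

```python
import random
from collections import Counter


def balance_score(rows):
    """Average over variables of the sum of squared relative category frequencies.

    `rows` is a non-empty list of equal-length tuples of categorical values.
    The score is lowest when categories are evenly represented.
    """
    n_rows = len(rows)
    n_vars = len(rows[0])
    total = 0.0
    for j in range(n_vars):
        counts = Counter(row[j] for row in rows)
        total += sum((c / n_rows) ** 2 for c in counts.values())
    return total / n_vars


def best_random_subsample(rows, n, trials, rng):
    """Draw `trials` random subsamples of size n and keep the lowest-scoring one."""
    best, best_score = None, float("inf")
    for _ in range(trials):
        candidate = rng.sample(rows, n)  # sampling without replacement
        score = balance_score(candidate)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy data (hypothetical): one variable with six "a" and two "b" observations.
rows = [("a",)] * 6 + [("b",)] * 2
subsample, score = best_random_subsample(rows, n=4, trials=200, rng=random.Random(0))
```

The cost grows linearly with the number of trials, whereas the exhaustive benchmark would have to score every one of the C(N, n) combinations.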

4

A random forest-based approach for survival curves comparing: principles, computational aspects and asymptotic time complexity analysis

Štěpánek Lubomír, Habarta Filip, Malá Ivana, Marek Luboš

Annals of Computer Science and Information Systems | 2021 | Vol. 25 | pp. 301--311

EN

The log-rank test and Cox’s proportional hazards model can be used to compare survival curves but are limited by strict statistical assumptions. In this study, we introduce a novel, assumption-free method based on a random forest algorithm that can compare two or more survival curves. The proportion of the random forest’s trees with sufficient complexity is close to the test’s p-value estimate. Pruning the trees in the model modifies their complexity and, thus, both the method’s robustness and statistical power. The discussed results are confirmed by a simulation study that varies the survival curves and the tree pruning level.

5

Analysis of asymptotic time complexity of an assumption-free alternative to the log-rank test

Štěpánek Lubomír, Habarta Filip, Malá Ivana, Marek Luboš

Annals of Computer Science and Information Systems | 2020 | Vol. 21 | pp. 453--460

EN

Comparison of two time-event survival curves, representing the evolution of two groups of individuals over time, is relatively common in applied biostatistics. Although the log-rank test is the suggested tool for this problem, a rich statistical toolbox exists to overcome some of the log-rank test's limitations. However, all of these methods are limited by relatively strict statistical assumptions. In this study, we introduce a new robust method for comparing two time-event survival curves. We briefly discuss selected issues of the robustness of the log-rank test and analyse in more detail some properties of the proposed method, chiefly its asymptotic time complexity. The new method models individual time-event survival curves in a discrete combinatorial way as orthogonal monotonic paths, which enables direct estimation of the p-value as it was originally defined. We also briefly investigate how the area bounded by two survival curves plotted on a plane chart is related to the test’s p-value. Finally, using simulated time-event data, we check the robustness of the introduced method in comparison with the log-rank test. Based on the theoretical analysis and simulations, the introduced method seems to be a promising and valid alternative to the log-rank test, particularly for comparing two time-event curves regardless of any statistical assumptions.
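The link between the area bounded by two survival curves and a p-value can be illustrated with a rough sketch. This is not the authors' path-counting method: it is a plain permutation test that uses the area between two empirical step survival curves as the test statistic, and it assumes positive event times with no censoring. The function names are hypothetical.

```python
import random


def survival_at(times, t):
    """Empirical survival S(t): share of individuals whose event time exceeds t."""
    return sum(1 for x in times if x > t) / len(times)


def area_between_curves(times_a, times_b):
    """Area between two empirical step survival curves (no censoring assumed)."""
    grid = sorted({0.0, *times_a, *times_b})
    area = 0.0
    for left, right in zip(grid, grid[1:]):
        # Both curves are constant on [left, right), so evaluate them at `left`.
        gap = abs(survival_at(times_a, left) - survival_at(times_b, left))
        area += gap * (right - left)
    return area


def permutation_p_value(times_a, times_b, n_perm, rng):
    """Share of random relabelings whose area is at least the observed area."""
    observed = area_between_curves(times_a, times_b)
    pooled = list(times_a) + list(times_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a = pooled[: len(times_a)]
        perm_b = pooled[len(times_a):]
        if area_between_curves(perm_a, perm_b) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

A call such as `permutation_p_value(group_a, group_b, 1000, random.Random(1))` returns a value in (0, 1]; a larger observed area between the curves yields a smaller p-value.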