Wyniki wyszukiwania - Biblioteka Nauki

1

The handling of missing binary data in language research

100%

Pichette F. , Beland S. , Jolani S. , Leśniewska J.

Studies in Second Language Learning and Teaching

|

2015

|

tom 5

|

nr 1

153-169

EN

Researchers are frequently confronted with unanswered questions or items on their questionnaires and tests, due to factors such as item difficulty, lack of testing time, or participant distraction. This paper first presents results from a poll confirming previous claims (Rietveld & van Hout, 2006; Schafer & Gra- ham, 2002) that data replacement and deletion methods are common in research. Language researchers declared that when faced with missing answers of the yes/no type (that translate into zero or one in data tables), the three most common solutions they adopt are to exclude the participant’s data from the analyses, to leave the square empty, or to fill in with zero, as for an incorrect answer. This study then examines the impact on Cronbach’s α of five types of data insertion, using simulated and actual data with various numbers of participants and missing percentages. Our analyses indicate that the three most common methods we identified among language researchers are the ones with the greatest impact n Cronbach's α coefficients; in other words, they are the least desirable solutions to the missing data problem. On the basis of our results, we make recommendations for language researchers concerning the best way to deal with missing data. Given that none of the most common simple methods works properly, we suggest that the missing data be replaced either by the item’s mean or by the participants’ overall mean to provide a better, more accurate image of the instrument’s internal consistency.

2

The handling of missing binary data in language research

100%

Pichette F. , Béland S. , Jolani S. , Leśniewska J.

|

nr 1

153-169

EN

Researchers are frequently confronted with unanswered questions or items on their questionnaires and tests, due to factors such as item difficulty, lack of testing time, or participant distraction. This paper first presents results from a poll confirming previous claims (Rietveld & van Hout, 2006; Schafer & Gra- ham, 2002) that data replacement and deletion methods are common in research. Language researchers declared that when faced with missing answers of the yes/no type (that translate into zero or one in data tables), the three most common solutions they adopt are to exclude the participant’s data from the analyses, to leave the square empty, or to fill in with zero, as for an incorrect answer. This study then examines the impact on Cronbach’s α of five types of data insertion, using simulated and actual data with various numbers of participants and missing percentages. Our analyses indicate that the three most common methods we identified among language researchers are the ones with the greatest impact n Cronbach's α coefficients; in other words, they are the least desirable solutions to the missing data problem. On the basis of our results, we make recommendations for language researchers concerning the best way to deal with missing data. Given that none of the most common simple methods works properly, we suggest that the missing data be replaced either by the item’s mean or by the participants’ overall mean to provide a better, more accurate image of the instrument’s internal consistency.

3

Predicting in multivariate incomplete time series. Application of the expectation-maximisation algorithm supplemented by the Newton-Raphson method

100%

Korczyński A.

|

tom 68

|

nr 1

17-46

EN

Statistical practice requires various imperfections resulting from the nature of data to be addressed. Data containing different types of measurement errors and irregularities, such as missing observations, have to be modelled. The study presented in the paper concerns the application of the expectation-maximisation (EM) algorithm to calculate maximum likelihood estimates, using an autoregressive model as an example. The model allows describing a process observed only through measurements with certain level of precision and through more than one data series. The studied series are affected by a measurement error and interrupted in some time periods, which causes the information for parameters estimation and later for prediction to be less precise. The presented technique aims to compensate for missing data in time series. The missing data appear in the form of breaks in the source of the signal. The adjustment has been performed by the EM algorithm to a hybrid version, supplemented by the Newton-Raphson method. This technique allows the estimation of more complex models. The formulation of the substantive model of an autoregressive process affected by noise is outlined, as well as the adjustment introduced to overcome the issue of missing data. The extended version of the algorithm has been verified using sampled data from a model serving as an example for the examined process. The verification demonstrated that the joint EM and Newton-Raphson algorithms converged with a relatively small number of iterations and resulted in the restoration of the information lost due to missing data, providing more accurate predictions than the original algorithm. The study also features an example of the application of the supplemented algorithm to some empirical data (in the calculation of a forecasted demand for newspapers).

4

Hybrid multiple imputation in a large scale complex survey

100%

Razzak H. , Heumann C.

|

tom 20

|

nr 4

33-58

EN

Large-scale complex surveys typically contain a large number of variables measured on an even larger number of respondents. Missing data is a common problem in such surveys. Since usually most of the variables in a survey are categorical, multiple imputation requires robust methods for modelling highdimensional categorical data distributions. This paper introduces the 3-stage Hybrid Multiple Imputation (HMI) approach, computationally efficient and easy to implement, to impute complex survey data sets that contain both continuous and categorical variables. The proposed HMI approach involves the application of sequential regression MI techniques to impute the continuous variables by using information from the categorical variables, already imputed by a non-parametric Bayesian MI approach. The proposed approach seems to be a good alternative to the existing approaches, frequently yielding lower root mean square errors, empirical standard errors and standard errors than the others. The HMI method has proven to be markedly superior to the existing MI methods in terms of computational efficiency. The authors illustrate repeated sampling properties of the hybrid approach using simulated data. The results are also illustrated by child data from the multiple indicator survey (MICS) in Punjab 2014.

5

Computational intensive methods for prediction and imputation in time series analysis

100%

Neves M. , Cordeiro C.

Discussiones Mathematicae Probability and Statistics

|

2011

|

tom 31

|

nr 1-2

121-139

EN

One of the main goals in times series analysis is to forecast future values. Many forecasting methods have been developed and the most successful are based on the concept of exponential smoothing, based on the principle of obtaining forecasts as weighted combinations of past observations. Classical procedures to obtain forecast intervals assume a known distribution for the error process, what is not true in many situations. A bootstrap methodology can be used to compute distribution free forecast intervals. First an adequately chosen model is fitted to the data series. Afterwards, and inspired on sieve bootstrap, an AR(p) is used to filter the series of the random component, under the stationarity hypothesis. The centered residuals are then resampled and the initial series is reconstructed. This methodology will be used to obtain forecasting intervals and for treating missing data, which often appear in a real time series. An automatic procedure was developed in R language and will be applied in simulation studies as well as in real examples.

6

An Analysis of Probabilistic Approximations for Rule Induction from Incomplete Data Sets

88%

Clark P. G. , Grzymala-Busse J. W. , Hippe Z. S.

Fundamenta Informaticae

|

2014

|

tom Vol. 132, nr 3

365--379

EN

The main objective of our research was to test whether the probabilistic approximations should be used in rule induction from incomplete data. For our research we designed experiments using six standard data sets. Four of the data sets were incomplete to begin with and two of the data sets had missing attribute values that were randomly inserted. In the six data sets, we used two interpretations of missing attribute values: lost values and “do not care” conditions. In addition we used three definitions of approximations: singleton, subset and concept. Among 36 combinations of a data set, type of missing attribute values and type of approximation, for five combinations the error rate (the result of ten-fold cross validation) was smaller than for ordinary (lower and upper) approximations; for other four combinations, the error rate was larger than for ordinary approximations. For the remaining 27 combinations, the difference between these error rates was not statistically significant.

7

On classification with missing data using rough-neuro-fuzzy systems

88%

Nowicki R.

|

tom 20

|

nr 1

55-67

EN

The paper presents a new approach to fuzzy classification in the case of missing data. Rough-fuzzy sets are incorporated into logical type neuro-fuzzy structures and a rough-neuro-fuzzy classifier is derived. Theorems which allow determining the structure of the rough-neuro-fuzzy classifier are given. Several experiments illustrating the performance of the roughneuro-fuzzy classifier working in the case of missing features are described.

8

Chemometric exploration of sea water chemical component data sets with missing elements

88%

Smoliński A. , Falkowska L. , Pryputniewicz D.

Oceanological and Hydrobiological Studies

|

2008

|

tom Vol. 37, No.3

49-62

EN

The results of the application of chemometric methods, such as principal component analysis (PCA) and its generalization for N-way data, the Tucker3 model, for the analysis of an environmental data set are presented. The analyzed data consists of concentration values of chemical compounds of organic matter, and their transformed products, in a short-term study of a sea water column measured at the Gdańsk Deep (.[fi]= 55°1’N, [lambda] = 19°10’E). The main goal of this paper is to present improved approaches for exploration of data sets with missing elements, based on the expectation-maximization (EM) algorithm. The most common methods for dealing with missing data, generally consisting of setting the missing elements to zero or to mean values of the measured data, are often unacceptable as they destroy data correlations or influence interpretation of relationships between objects and variables. The EM algorithm may be built into different computational procedures used for exploratory analysis (i.e. EM/PCA or EM/TUCKER3).

9

Drzewa klasyfikacyjne w medycynie

75%

Owczarek A. J.

Annales Academiae Medicae Silesiensis

|

2014

|

tom 68

|

nr 6

449-456

EN

The paper presents the use of computerized diagnostic decision support systems for medical diagnostics in medicine. The structure of a classical decision tree and the advantages and disadvantages of using classification trees have been discussed. Moreover, the paper deals with the effect of classification trees with respect to other classic statistical methods, such as discriminant analysis and logistic regression, taking into account the problem of variable multicollinearity and the problem of the occurrence of so-called missing data. Additionally, some examples of the application of classification trees in medicine have been shown.

PL

W pracy zaprezentowano wykorzystanie w medycynie komputerowych systemów diagnostyki medycznej. Przedstawiono budowę klasycznego drzewa decyzyjnego oraz zalety i wady stosowania drzew klasyfikacyjnych. Ponadto omówiono działanie drzew klasyfikacyjnych w świetle innych klasycznych metod statystycznych, takich jak analiza dyskryminacyjna czy regresja logistyczna, z uwzględnieniem problemu współliniowości zmiennych czy problemu występowania tzw. danych niepełnych. Podano wybrane przykłady zastosowania drzew klasyfikacyjnych w medycynie.

10

MODELE HARMONICZNE ZE ZŁOŻONĄ SEZONOWOŚCIĄ W PROGNOZOWANIU SZEREGÓW CZASOWYCH Z LUKAMI SYSTEMATYCZNYMI

75%

Szmuksta-Zawadzka M. , Zawadzki J.

Metody Ilościowe w Badaniach Ekonomicznych

|

2013

|

tom 14

|

nr 3

81-90

PL

W modelowaniu zmiennych ze złożoną sezonowością dla pełnych danych i danych z lukami niesystematycznymi mogą być wykorzystywane zarówno modele ze zmiennymi zero-jedynkowymi jak i modele harmoniczne. Natomiast w przypadku występowania luk systematycznych- jedynie oszczędne modele harmoniczne. W modelach tych każdy rodzaj wahań opisywany jest za pomocą odrębnych zestawów składowych sinuso- i kosinusoidalnych. Rozważania teoretyczne zostaną zilustrowane przykładem empirycznym.

EN

In the modeling of the variables with complex seasonality for complete time series and with unsystematic data gaps can be used both types of models: with dummy variables and harmonic models. However, in modeling variable with systematic gaps can be used only harmonic models. In these models, each type of fluctuation is described by separate sets of sine- and cosine component. Theoretical considerations are illustrated by an empirical example.

11

Censored Random Variable as a Form of Coping with Missing Data in Studying the Leachability of Heavy Metals from Hardening Slurries

75%

Szarek Ł. , Kledyński Z. , Szarek Ł. , Kledyński Z.

|

12

Energy associated tuning method for short-term series forecasting by complete and incomplete datasets

75%

Rodríguez-Rivero C. , Pucheta J. , Laboret S. , Sauchelli V. , Patińo D.

Journal of Artificial Intelligence and Soft Computing Research

|

2017

|

tom Vol. 7, No. 1

5--16

EN

This article presents short-term predictions using neural networks tuned by energy associated to series based-predictor filter for complete and incomplete datasets. A benchmark of high roughness time series from Mackay Glass (MG), Logistic (LOG), Henon (HEN) and some univariate series chosen from NN3 Forecasting Competition are used. An average smoothing technique is assumed to complete the data missing in the dataset. The Hurst parameter estimated through wavelets is used to estimate the roughness of the real and forecasted series. The validation and horizon of the time series is presented by the 15 values ahead. The performance of the proposed filter shows that even a short dataset is incomplete, besides a linear smoothing technique employed; the prediction is almost fair by means of SMAPE index. Although the major result shows that the predictor system based on energy associated to series has an optimal performance from several chaotic time series, in particular, this method among other provides a good estimation when the short-term series are taken from one point observations.

13

Z badań nad metodami prognozowania na podstawie niekompletnych szeregów czasowych z wahaniami okresowymi (sezonowymi)

63%

Szmuksta-Zawadzka M. , Zawadzki J.

|

nr numer specjalny 1

140-154

PL

Praca została poświęcona syntetycznemu omówieniu wyników wieloletnich badań autorów nad zastosowaniami metod prognozowania w warunkach braku pełnej informacji w szeregach czasowych z wahaniami sezonowymi. Rozważania odnosić się będą do dwóch rodzajów luk w danych: systematycznych i niesystematycznych. Z lukami systematycznymi mamy do czynienia wtedy, gdy nie są dostępne informacje liczbowe przynajmniej o jednym podokresie w całym przedziale czasowym „próby”. Rozpatrywane będą metody prognozowania zarówno dla danych oryginalnych (z sezonowością) jak i danych, z których wyeliminowano wahania sezonowe. Egzemplifikacją rozważań o charakterze teoretycznym będzie przykład empiryczny.

EN

This work presents discussion about results of long-term of authors research on applications of different forecasting methods in condition of lack of full information. There will be considered two types of gaps in data: systematic and unsystematic. The systematic gaps in data are only when we have not any information about at least one sub-period in the whole of analyzed data. There will be presented two types of methods applied to time series with and without seasonal component. Exemplification of theoretical considerations will be an empirical example.

14

On classification with missing data using rough-neuro-fuzzy systems

63%

Nowicki R. K.

International Journal of Applied Mathematics and Computer Science

|

2010

|

tom Vol. 20, no 1

55-67

EN

The paper presents a new approach to fuzzy classification in the case of missing data. Rough-fuzzy sets are incorporated into logical type neuro-fuzzy structures and a rough-neuro-fuzzy classifier is derived. Theorems which allow determining the structure of the rough-neuro-fuzzy classifier are given. Several experiments illustrating the performance of the roughneuro-fuzzy classifier working in the case of missing features are described.

15

Analiza metod uzupełniania brakujących danych w procesie konstruowania syntetycznych wskaźników zrónoważonego rozwoju transportu

63%

Barczak A.

Logistyka

|

2012

|

tom nr 3

PL

Ze względu na złożoność problematyki zrównoważonego rozwoju transportu, w procesie podejmowania decyzji, a także w procesie edukacji i informowania społeczeństwa celowe jest zastosowanie syntetycznych wskaźników. Ich wieloaspektowość powoduje jednak, że dane do ich wyznaczania nie zawsze są dostępne. Konieczne jest zatem wykorzystanie numerycznych metod wstawiania brakujących danych. W artykule przeprowadzono analizę metod wstawiania brakujących danych pod kątem ich wykorzystania oraz przeprowadzono numeryczną symulację uzupełniania brakujących danych dla przebiegów emisji zanieczyszczeń wywołanych ruchem pojazdu.

EN

Due to the complexity of sustainable development in transportation, in the processes of decision-making as well as in education and public information, it is advisable to apply composite indicators. However, because of the multiplicity of aspects involved in the process of composite indicators usage the necessary data is not always available. It is therefore required to use numerical methods to insert the missing data. In the article, the analysis of missing data imputation methods was conducted and the numerical simulation of imputation of missing data was performed for emissions measured during on-road test.

16

Missing data estimation based on the chaining technique in survey sampling

63%

Singh Thakur N. , Shukla D.

|

tom 23

|

nr 4

91-111

EN

Sample surveys are often affected by missing observations and non-response caused by the respondents' refusal or unwillingness to provide the requested information or due to their memory failure. In order to substitute the missing data, a procedure called imputation is applied, which uses the available data as a tool for the replacement of the missing values. Two auxiliary variables create a chain which is used to substitute the missing part of the sample. The aim of the paper is to present the application of the Chain-type factor estimator as a means of source imputation for the non-response units in an incomplete sample. The proposed strategies were found to be more efficient and bias-controllable than similar estimation procedures described in the relevant literature. These techniques could also be made nearly unbiased in relation to other selected parametric values. The findings are supported by a numerical study involving the use of a dataset, proving that the proposed techniques outperform other similar ones.

17

The problem of imputation of the missing data from the continuous counts of road traffic

63%

Spławińska M.

Archives of Civil Engineering

|

2015

|

tom Vol. 61, nr 1

131--145

EN

Missing traffic data is an important issue for road administration. Although numerous ways can be found to impute them in foreign literature (inter alia, the most effective method, that is Box-Jenkins models), in Poland, still only proven and simplified methods are applied. The article presents the analyses including an assessment of the completeness of the existing traffic data and works related to the construction of SARIMA model. The study was conducted on the basis of hourly traffic volumes, derived from the continuous traffic counts stations located in the national road network in Poland (Golden River stations) from the years 2005 – 2010. As a result, the proposed model was used to impute the missing data in the form of SARIMA (1.1,1)(0,1,1)168. The newly developed model can be used effectively to fill in the missing required days of measurement for estimating AADT by AASHTO method. In other cases, due to its accuracy and laboriousness of the process, it is not recommended.

18

Empirical Evaluation of Methods of Filling the Missing Data in Learning Probabilistic Models

63%

Falkowski A. A. , Łupińska-Dubicka A.

Advances in Computer Science Research

|

2018

|

tom Nr 14

55--67

EN

Missing data is a common problem in statistical analysis and most practical databases contain missing values of some of their attributes. Missing data can appear for many reasons. However, regardless of the reason for the missing values, even a small percent of missing data can cause serious problems with analysis reducing the statistical power of a study and leading to draw wrong conclusions. In this paper the results of handling missing observations in learning probabilistic models were presented. Two data sets taken from UCI Machine Learning Repository were used to learn the quantitative part of the Bayesian networks. To provide the opportunity to compare selected data sets did not contain any missing values. For each model data sets with variety of levels of missing values were artificially generated. The main goal of this paper was to examine whether omitting observations has an influence on model’s reliability. The accuracy was defined as the percentage of correctly classified records and has been compared to the results obtained in the data set not containing missing values.

PL

Brakujące dane są częstym problemem w analizie statystycznej, a większość baz danych zawiera brakujące wartości niektórych z ich atrybutów. Brakujące dane mogą pojawiać się z wielu powodów. Jednak bez względu na przyczynę brakujących wartości nawet ich niewielki procent może spowodować poważne problemy z analizą, zmniejszając siłę statystyczną badania i prowadząc do wyciągnięcia błędnych wniosków. W artykule przedstawiono wyniki uzupełniania danych brakujących w uczeniu modeli probabilistycznych. Dwa zestawy danych pobrane z repozytorium uczenia maszynowego UCI posłużyły do wytrenowania ilościowej części sieci bayesowskich. Aby zapewnić możliwość porównania wybrane zbiory danych nie zawierały żadnych brakujących wartości. Dla każdego modelu zbiory danych z różnymi poziomami brakujących wartości zostały sztucznie wygenerowane. Głównym celem tego artykułu było zbadanie, czy braki w obserwacjach mają wpływ na niezawodność modelu. Dokładność została zdefiniowana jako procent poprawnie zaklasyfikowanych rekordów i została porównana z wynikami uzyskanymi w zbiorze danych niezawierającym brakujących wartości.

19

Aproksymacja stężeń zanieczyszczeń powietrza za pomocą neuronowych modeli szeregów czasowych

51%

Hoffman S.

Inżynieria i Ochrona Środowiska

|

2009

|

tom T. 12, nr 3

231-239

PL

W pracy oceniono możliwości aproksymacji stężeń zanieczyszczeń mierzonych na stacjach monitoringu powietrza. Do predykcji stężeń wykorzystano neuronowe modele szeregów czasowych. Jakość modelowania testowano na rzeczywistych danych pochodzących ze stacji monitoringu powietrza Łódź-Widzew, zarejestrowanych w latach 2004-2008. Analizie poddano względnie kompletny zbiór danych, obejmujący stężenia 6 podstawowych zanieczyszczeń powietrza: O3, NO2, NO, PM10, SO2, CO. Celem badawczym było określenie i porównanie dokładności predykcji stężeń różnych zanieczyszczeń powietrza. Modelowanie przeprowadzono, stosując sztuczne sieci neuronowe. Trening sieci odbywał się przy użyciu liniowego algorytmu pseudoinwersji. Wyjściem modelu było stężenie wybranego zanieczyszczenia w określonym czasie. Wejściami były wartości stężeń zarejestrowane w godzinach wcześniejszych. Każdy model charakteryzowały dwie wielkości: horyzont prognozy i liczba wartości opóźnionych. W analizie określono dokładność predykcji stężeń wybranych zanieczyszczeń dla stałej liczby wartości opóźnionych równej 24 przy zmieniającym się horyzoncie prognozy od 1 do 240 godz. Jako kryterium jakości modelowania przyjęto wartość błędu aproksymacji.

EN

An assessment of quality of air pollutants concentration modeling was the main research purpose. The examination was made by means of artificial neural networks, which were employed to create time-series models. The quality of approximation was tested on the actual set of air monitoring data, gathered over a 5-year period at the measure site in Lodz-Widzew (Central Poland). The examined time-series involved hourly concentrations of main air pollutants: O3, NO2, NO, PM10, SO2, CO. The research aim was the estimation and the comparison of prediction accuracy for different air pollutants. Time-series models were characterized by two parameters which might influence the prediction quality: lookahead and steps. For all models the constant number of steps equal 24 hours was assumed. The effect of changes of lookahead in the range 1÷ 240 hours was analyzed. It was stated that the decreasing of precision of time-series models with the increase of lookahead is observed. The drop of accuracy depends on pollutant. The furthest reasonable prognosis may be done for ozone concentration. Approximation accuracy shortens in the order: O3, CO, SO2, PM10, NO2, NO.

20

Censored random variable as a form of coping with missing data in studying the leachability of heavy metals from hardening slurries

51%

Szarek Ł. , Kledyński Z.

Archives of Civil Engineering

|

2021

|

tom Vol. 67, nr 1

233--247

EN

Missing data in test result tables can significantly impact the analysis quality, especially in relation to technical sciences, where the mechanism generating missing data is often non-random, and their presence depends on the non-observed part of studied variables. In such cases, the application of an inappropriate method for dealing with missing data will lead to bias in the estimated distribution parameters. The article presents a relatively simple method to implement in dealing with missing data generated as a result of the MNAR mechanism, which utilizes the censored random variable. This procedure does not modify the variable distribution form, which is why it ensures objective and efficient estimation of distribution parameters within studies affected by certain restrictions of technical or physical nature (censored distribution), with a relatively low workload. Furthermore, it does not require the application of specialized software. A prerequisite for using this method is the knowledge of the frequency and cause of missing data. The method for estimating the random variable censored distribution parameters was shown based on the example of studying the leachability of selected heavy metals from a hardening slurry. The analysis results were compared with classical methods for dealing with missing data, such as, ignoring missing data observations (listwise or pairwise deletion), single imputation and stochastic regressive imputation.

PL

Braki danych w tablicach wyników badań mogą w znaczący sposób wpływać na jakość analizy, szczególnie w naukach technicznych, gdzie mechanizm generujący braki danych często jest nielosowy, a ich występowanie zależy od części nieobserwowanej badanych zmiennych. W takich przypadkach zastosowanie nieodpowiedniej metody radzenia sobie z brakami danych prowadzi do obciążenia estymowanych parametrów rozkładu. W artykule przedstawiono stosunkowo prostą w implementacji metodę radzenia sobie z brakami danych powstałymi w wyniku mechanizmu MNAR wykorzystującą rozkład cenzurowany. Procedura ta nie modyfikuje postaci rozkładu zmiennej, przez co zapewnia obiektywne i skuteczne estymowanie parametrów rozkładu w badaniach dotkniętych pewnymi ograniczeniami natury technicznej lub fizycznej, przy stosunkowo niskim nakładzie pracy. Ponadto nie wymaga zastosowania specjalistycznego oprogramowania. Warunkiem koniecznym zastosowania metody jest znajomość częstości występowania braków danych oraz ich przyczyny. Sposób estymacji parametrów rozkładu cenzurowanego zmiennej losowej przedstawiono na przykładzie badania wymywalności wybranych metali ciężkich z zawiesiny twardniejącej. Wyniki analizy porównano z klasycznymi sposobami radzenia sobie z brakami danych: pominięciem obserwacji z brakami danych, imputacją oraz stochastyczną imputacją regresyjną.