Search results - BazTech

1

Analysis of the Effectiveness of Selected Machine Learning Algorithms in the Classification of Satellite Image Content Depending on the Size of the Training Sample

Kupidura Przemysław, Niemyski Stanisław

Teledetekcja Środowiska | 2024 | T. 64 | 24--38

EN

The article analyzes the accuracy of three popular machine learning (ML) methods, Maximum Likelihood Classifier (MLC), Support Vector Machine (SVM), and Random Forest (RF), as a function of training sample size. A Landsat 8 satellite image was classified into 6 basic land cover classes using 10 variants of training sample size (from 2664 to 34711 pixels); the individual results were then assessed and compared. For each classification variant, an error matrix was built and used to calculate accuracy metrics: F1-score, precision and recall (for individual classes), and overall accuracy and the kappa index of agreement (for the classification as a whole). In all analyzed cases, accuracy improved as the training sample grew. MLC was the most sensitive to sample size, performing best with the largest training sample and worst with the smallest. SVM was the least sensitive and achieved the highest accuracy of the three algorithms with the smallest training sample.

PL

The article presents an analysis of the accuracy of 3 popular machine learning methods: Maximum Likelihood Classifier (MLC), Support Vector Machine (SVM), and Random Forest (RF), depending on the size of the training sample. The analysis consisted of classifying the content of a Landsat 8 satellite image (divided into 6 basic land cover classes) in 10 variants of training sample size (from 2664 to 34711 pixels), assessing the individual results, and comparing the results obtained. For each classification variant, an error matrix was prepared, and on that basis accuracy metrics were calculated: F1-score, precision and recall (for individual classes), as well as overall accuracy and the Kappa index of agreement (for the classification as a whole). The analysis showed that a larger training sample improved the accuracy of the classification results in all analyzed cases. MLC was the most sensitive to this factor, performing best with the largest training sample and worst with the smallest, while SVM was the least sensitive and achieved the highest accuracy of the three algorithms with the smallest training sample.
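
The metrics named in this abstract can all be derived from a single error (confusion) matrix. The sketch below shows one way to do that in plain Python. The function name `classification_metrics`, the matrix orientation (rows as reference classes, columns as assigned classes), and the 2-class example values are assumptions for illustration and are not taken from the article.

```python
from typing import Dict, List, Union

Metrics = Dict[str, Union[float, List[float]]]


def classification_metrics(matrix: List[List[int]]) -> Metrics:
    """Compute per-class and overall metrics from a square error matrix.

    Assumed layout: matrix[i][j] counts pixels of reference class i
    that were assigned to class j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]                       # reference totals
    col_sums = [sum(row[j] for row in matrix) for j in range(n)]  # assigned totals
    correct = [matrix[i][i] for i in range(n)]

    precision = [c / s if s else 0.0 for c, s in zip(correct, col_sums)]
    recall = [c / s if s else 0.0 for c, s in zip(correct, row_sums)]
    f1 = [2 * p * r / (p + r) if p + r else 0.0 for p, r in zip(precision, recall)]

    overall_accuracy = sum(correct) / total
    # Agreement expected by chance, computed from the marginal totals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (overall_accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "overall_accuracy": overall_accuracy,
        "kappa": kappa,
    }


# Hypothetical 2-class error matrix (not data from the article).
example = [[50, 10],
           [5, 35]]
result = classification_metrics(example)
print(result["overall_accuracy"], result["kappa"])
```

Repeating this calculation for each of the 10 training-sample variants would give one row of metrics per variant and per classifier, which is the kind of table such a comparison rests on.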

2

Application of machine learning tools for seismic reservoir characterization study of porosity and saturation type

Topór Tomasz, Sowiżdżał Krzysztof

Nafta-Gaz | 2022 | R. 78, nr 3 | 165--175

EN

The application of machine learning (ML) tools and data-driven modeling has become a standard approach to many problems in exploration geology and has contributed to the discovery of new reservoirs. This study explores the application of two ML ensemble methods, random forest (RF) and extreme gradient boosting (XGBoost), to derive porosity and saturation type (gas/water) in multihorizon sandstone formations from Miocene deposits of the Carpathian Foredeep. The training of the ML algorithms was divided into two stages. First, the RF algorithm was used to compute porosity from seismic attributes and well location coordinates. The results were then used as an extra feature for saturation type modeling with the XGBoost algorithm. XGBoost was run with and without well location coordinates to evaluate the influence of spatial information on modeling performance. The hyperparameters of each model were tuned using Bayesian optimization. To check the robustness of the trained models, 10-fold cross-validation was performed. The results were evaluated on the training and testing sets using standard regression and classification metrics. The root mean square error (RMSE) of the RF porosity prediction was close to 0.053 on both the training and testing sets, providing no evidence of overfitting. Feature importance analysis revealed that the most influential variables for porosity prediction were the spatial coordinates and the seismic attribute sweetness. The results of XGBoost modeling (variant 1) demonstrated that the algorithm could accurately predict saturation type despite class imbalance. The sensitivity of XGBoost was high: 0.862 on the training data and 0.920 on the testing data. The XGBoost model relied mainly on computed porosity and spatial coordinates. Sensitivity on both the training and testing sets dropped by about 10% when well location coordinates were removed (variant 2). In this case, the three most influential features were computed porosity, seismic amplitude contrast, and the iso-frequency component (15 Hz) attribute. The results were imported into Petrel software to present the spatial distribution of porosity and saturation type. The latter parameter was given with a probability distribution, which allows potential gas-enriched target zones to be identified.

PL

Machine learning methods are now a routine tool for solving many problems in exploration geology and contribute to the discovery of new reservoirs. This work presents the application of two machine learning algorithms, random forests (RF) and gradient-boosted trees (XGBoost), to determine porosity and saturation type (gas/water) in sandstone formations that are potential gas-bearing horizons in the Miocene deposits of the Carpathian Foredeep. The machine learning process was divided into two stages. In the first stage, RF was used to compute porosity from seismic attribute data and well location coordinates. The results were used as an additional feature when modeling saturation type with the XGBoost algorithm. XGBoost modeling was carried out in two variants, with and without well locations, to assess the influence of spatial information on modeling performance. Hyperparameter tuning for the individual models was carried out using Bayesian optimization. The modeling results were evaluated on the training and testing sets using standard metrics for regression and classification problems. In addition, 10-fold cross-validation was performed to strengthen the reliability of the trained models. The root mean square error (RMSE) of the modeled porosity on the training and testing sets was close to 0.053, which indicates no overfitting. Feature importance analysis revealed that the variables with the greatest influence on porosity prediction were the well location coordinates and the seismic attribute sweetness. The results of XGBoost modeling (variant 1) showed that the algorithm can accurately predict saturation type despite the class imbalance problem. The sensitivity of the XGBoost model in detecting potential gas zones was high for both the training and testing sets (0.862 and 0.920). In its predictions, the model relied mainly on computed porosity and well coordinates. Sensitivity on the training and testing sets dropped by about 10% when the well location coordinates were removed (XGBoost variant 2). In this case, the three most important features were computed porosity, the seismic attribute amplitude contrast, and the iso-frequency component (15 Hz) attribute. The results were imported into Petrel to present the spatial distribution of porosity and saturation type. The latter parameter was presented together with a probability distribution, which gave insight into the zones with the highest gas potential.
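
The evaluation protocol described in this abstract (10-fold cross-validation, RMSE for the regression stage, sensitivity for the classification stage) can be sketched in plain Python. The helper names, the shuffling seed, and the toy values below are assumptions. The sketch does not reproduce the RF or XGBoost models, and it does not stratify folds by class, which one might want for imbalanced data.

```python
import math
import random
from typing import Iterator, List, Sequence, Tuple


def k_fold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def sensitivity(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Share of truly positive samples (e.g. gas-saturated) predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if tp + fn else 0.0


# Toy usage with made-up numbers.
folds = list(k_fold_indices(25, k=10))
print(len(folds))
print(rmse([0.10, 0.20, 0.30], [0.12, 0.18, 0.33]))
print(sensitivity([1, 1, 0, 1], [1, 0, 0, 1]))
```

Comparing RMSE on the training folds with RMSE on the held-out folds is the kind of check behind a "no evidence of overfitting" statement: similar values on both suggest the model is not merely memorizing the training data.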

3

Carbon emissions futures price forecasting with random forest

Pawłowski Piotr

Rynek Energii | 2021 | Nr 2 | 88--92

EN

Accurate carbon price forecasts are necessary for energy and financial market participants. However, the nonstationary and nonlinear nature of carbon price time series makes it relatively hard to capture rapid price fluctuations. The literature on carbon price forecasting has grown visibly over the last decade, focusing on ARIMA, GARCH, or hybrid models combining characteristics of linear and nonlinear predictive methods. The development of machine learning techniques and widely available computing power have made it possible to test more computationally demanding algorithms such as XGBoost or Random Forest. In this article, a Random Forest model with additional parameter tuning was used to predict the carbon emissions futures price one day ahead. The final results, evaluated on a testing dataset, indicate that the proposed model performs better than a classic linear model and that parameter tuning can further enhance model accuracy. Overall, the developed approach provides an effective method for predicting the carbon price.

PL

Accurate forecasts of carbon dioxide emission prices are essential for participants in energy and financial markets. The nonstationary and nonlinear character of carbon dioxide emission prices makes it relatively difficult to capture their rapid fluctuations. The literature on forecasting carbon dioxide emission prices has developed considerably over the last decade, concentrating mainly on ARIMA, GARCH, or hybrid models that combine features of linear and nonlinear predictive methods. The development of machine learning techniques and the wide availability of computing power have made it possible to test more advanced algorithms such as XGBoost or Random Forest. In this article, a model based on random forests, with additional parameter tuning, was used to forecast the day-ahead price of carbon dioxide emission futures contracts. The final results, verified on a test dataset, showed that the proposed model performs better than a classic linear model and that parameter tuning can further increase its accuracy. The developed approach thus provides an effective method for predicting the next-day carbon dioxide emission price.
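
A day-ahead forecast of this kind is usually framed as supervised learning on a time series. The sketch below shows one possible framing: lagged prices as features and a chronological train/test split, so that the test set contains only the most recent days. The abstract does not state which input features were used, so the lag features, the function names, and the toy prices are all assumptions.

```python
from typing import List, Sequence, Tuple


def make_lagged_dataset(
    prices: Sequence[float], n_lags: int = 3
) -> Tuple[List[List[float]], List[float]]:
    """Turn a price series into (features, target) pairs for one-day-ahead prediction."""
    features: List[List[float]] = []
    targets: List[float] = []
    for t in range(n_lags, len(prices)):
        features.append(list(prices[t - n_lags:t]))  # the n_lags previous prices
        targets.append(prices[t])                    # the price to be predicted
    return features, targets


def chronological_split(features, targets, test_fraction: float = 0.2):
    """Hold out the most recent observations for testing, without shuffling."""
    n_test = int(round(len(features) * test_fraction))
    split = len(features) - n_test
    return features[:split], targets[:split], features[split:], targets[split:]


# Toy price series (not real carbon futures data).
prices = [10, 11, 12, 13, 14, 15]
X, y = make_lagged_dataset(prices, n_lags=2)
X_train, y_train, X_test, y_test = chronological_split(X, y, test_fraction=0.25)
print(X_train, y_train)
print(X_test, y_test)
```

Any regressor, a random forest or a linear model, could then be fitted on `X_train`/`y_train` and compared on `X_test`/`y_test`. Splitting by time rather than at random avoids training on days that come after the ones being predicted.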

4

Application of machine learning algorithms to predict permeability in tight sandstone formations

Topór Tomasz

Nafta-Gaz | 2021 | R. 77, nr 5 | 283--292

EN

The application of machine learning algorithms in petroleum geology has opened a new chapter in oil and gas exploration. Machine learning algorithms have been successfully used to predict crucial petrophysical properties when characterizing reservoirs. This study uses machine learning to predict permeability under confining stress conditions for samples from tight sandstone formations. The models were constructed using two machine learning algorithms of varying complexity (multiple linear regression [MLR] and random forests [RF]) and trained on a dataset that combined basic well information, basic petrophysical data, and rock type from a visual inspection of the core material. Feature engineering was applied for the RF algorithm to increase the number of predictors. To check the robustness of the trained models, 10-fold cross-validation was performed. The MLR and RF applications demonstrated that both algorithms can accurately predict permeability under constant confining pressure (R² of 0.800 for MLR vs. 0.834 for RF). The RF accuracy was about 3% better than that of the MLR and about 6% better than the linear reference regression (LR) that used only porosity. Porosity was the most influential feature in both models. In the case of RF, depth was also significant in the permeability predictions, which could be evidence of hidden interactions between porosity and depth. Local interpretation revealed features common to the outliers: in both the training and testing sets, they had moderate-to-low porosity (3–10%) and lacked fractures. In the test set, calcite or quartz cementation also led to poor permeability predictions. The workflow, which uses the tidymodels concept, will be further applied in more complex examples to predict spatial petrophysical features from seismic attributes using various machine learning algorithms.

PL

The application of machine learning algorithms in petroleum geology has opened a new chapter in the search for oil and gas deposits. Machine learning algorithms have been successfully used to predict key petrophysical properties that characterize a reservoir. In this work, machine learning methods were applied to predict permeability under constant reservoir pressure for tight gas sandstone formations. The models were constructed using algorithms of differing complexity (multiple linear regression, MLR, and random forests, RF) and then trained on data containing basic well information, basic petrophysical parameters, and rock type derived from macroscopic and microscopic descriptions of core samples. Rock type was decoded and subjected to feature engineering to extract additional variables for the model. Training on the training set was carried out with 10-fold cross-validation. The results show that both algorithms can predict permeability with high accuracy (R² = 0.800 for MLR vs. R² = 0.834 for RF). The accuracy of the RF model is about 3% better than that of MLR and about 6% better than that of the reference model (a linear regression model with a single variable, porosity). For both models, porosity was the most important parameter in predicting permeability. In addition, sample depth proved to be an important feature in the random forest model, which may indicate additional interactions between variables. The samples in the training and testing sets for which the models performed poorly shared a porosity of 3% to 10% and a lack of fractures. In the test set, low accuracy of the permeability predictions was additionally associated with the presence of calcite and quartz cementation. The workflow, which draws on current modeling practice and is built around the tidymodels package, will continue to be used to predict spatial petrophysical properties from seismic attributes.
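
The models in this record are compared by their coefficient of determination (R²). As a reminder of what that number measures, here is a minimal sketch of the standard definition, R² = 1 - SS_res / SS_tot, in plain Python. The function name and the toy values are assumptions and are unrelated to the permeability data.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are identical")
    return 1.0 - ss_res / ss_tot


# Toy observed and predicted values (made up for illustration).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(r_squared(observed, predicted))
```

An R² of 0.834 would mean that the model accounts for about 83% of the variance in the observed target, under this definition.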

5

A deep learning model integrating SK-TPCNN and random forests for brain tumor segmentation in MRI

Yang Tiejun, Song Jikun, Li Lei

Biocybernetics and Biomedical Engineering | 2019 | Vol. 39, no. 3 | 613--623

EN

The segmentation of brain tumors in magnetic resonance imaging (MRI) images plays an important role in early diagnosis, treatment planning, and outcome evaluation. However, because gliomas vary greatly in structure, segmentation accuracy is low. This paper proposes an automatic segmentation method that integrates a small kernels two-path convolutional neural network (SK-TPCNN) with random forests (RF), and presents the feature extraction ability of SK-TPCNN and the joint optimization capability of the model. The SK-TPCNN structure, which combines small and large convolutional kernels, can enhance the nonlinear mapping ability, avoid overfitting, and increase the diversity of features. The features learned by SK-TPCNN are then passed to the RF classifier to implement the joint optimization. The RF classifier effectively integrates redundant features and classifies each MRI image voxel as normal brain tissue or one of the different parts of the tumor. The proposed algorithm is validated and evaluated on the training dataset of the Brain Tumor Segmentation Challenge (BraTS) 2015 and achieves improved performance.

6

Data mining methods for prediction of air pollution

Siwek K., Osowski S.

International Journal of Applied Mathematics and Computer Science | 2016 | Vol. 26, no. 2 | 467--478

EN

The paper discusses data mining methods for the prediction of air pollution. Two tasks are important in this problem: the generation and selection of prognostic features, and the design of the final system for forecasting next-day pollution. An advanced set of features, created on the basis of atmospheric parameters, is proposed. This set is analyzed, and the features most important for prediction are selected. Two methods of feature selection are compared: one applies a genetic algorithm (a global approach), and the other a linear stepwise fit method (a locally optimized approach). On the basis of this analysis, two sets of the most predictive features are selected. These sets are used in the prediction of the atmospheric pollutants PM10, SO2, NO2 and O3. Two approaches to prediction are compared. In the first, the selected features are applied directly to a random forest (RF), which forms an ensemble of decision trees. In the second, intermediate predictors built on the basis of neural networks (the multilayer perceptron, the radial basis function network and the support vector machine) are used. They form an ensemble whose outputs are integrated into the final prognosis. The paper shows that preselecting the most important features, combined with an ensemble of predictors, significantly increases the accuracy of atmospheric pollution forecasts.
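
To illustrate what a "locally optimized" stepwise selection looks like, the sketch below implements a generic greedy forward selection: at each step it adds the single feature that most improves a score and stops when no feature helps. This is not the stepwise fit procedure used in the paper, whose details the abstract does not give. The function name, the feature names, and the additive `toy_score` are hypothetical; in practice the score would come from a model's validation performance.

```python
from typing import Callable, Dict, List, Sequence


def forward_stepwise(
    features: Sequence[str],
    score: Callable[[List[str]], float],
    max_features: int,
) -> List[str]:
    """Greedily add the feature that most improves `score` until none helps."""
    selected: List[str] = []
    remaining = list(features)
    best_score = float("-inf")
    while remaining and len(selected) < max_features:
        # Pick the best one-feature extension of the current subset.
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Hypothetical usefulness of each atmospheric feature (made-up values).
usefulness: Dict[str, float] = {
    "temperature": 3.0,
    "wind_speed": 2.0,
    "humidity": 0.0,
    "pressure": -1.0,
}


def toy_score(subset: List[str]) -> float:
    """Stand-in for a validation score: sum of the made-up usefulness values."""
    return sum(usefulness[f] for f in subset)


chosen = forward_stepwise(list(usefulness), toy_score, max_features=4)
print(chosen)
```

Because each step only looks one feature ahead, the search can miss combinations that are useful only together. A global method such as a genetic algorithm explores whole subsets at once, at a higher computational cost.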

7

Imitation learning of car driving skills with decision trees and random forests

Cichosz P., Pawełczak Ł.

International Journal of Applied Mathematics and Computer Science | 2014 | Vol. 24, no. 3 | 579--597

EN

Machine learning is an appealing and useful approach to creating vehicle control algorithms, both for simulated and real vehicles. One common and often applicable learning scenario is learning by imitation, in which the behavior of an exemplary driver provides training instances for a supervised learning algorithm. This article follows this approach in the domain of simulated car racing, using the TORCS simulator. In contrast to most prior work on imitation learning, a symbolic decision tree knowledge representation is adopted, which combines potentially high accuracy with human readability, an advantage that can be important in many applications. Decision trees are demonstrated to be capable of representing high-quality control models, reaching the performance level of sophisticated pre-designed algorithms. This is achieved by enhancing the basic imitation learning scenario to include active retraining, automatically triggered on control failures. It is also demonstrated how better stability and generalization can be achieved by sacrificing human readability and using decision tree model ensembles. The methodology for learning control models contributed by this article can hopefully be applied to solve real-world control tasks, as well as to develop video game bots.
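
The idea of imitation learning with retraining triggered by control failures can be shown on a deliberately tiny problem. The sketch below is a toy, not the article's method: the "car" has a single state (lateral offset from the track centre), the learner is a depth-1 decision tree (a stump), and the simulator, the `expert` driver, and the choice to relabel all states visited before a failure are assumptions made for illustration. TORCS is not involved.

```python
from typing import Callable, List, Tuple

Example = Tuple[float, int]  # (lateral offset from track centre, steering action -1 or +1)
Policy = Callable[[float], int]


def expert(x: float) -> int:
    """Hypothetical exemplary driver: always steer back toward the centre (x = 0)."""
    return -1 if x > 0 else 1


def fit_stump(data: List[Example]) -> Policy:
    """Fit a depth-1 decision tree of the form 'if x > t then right else left'."""
    xs = sorted({x for x, _ in data})
    # Candidate rules: two constant policies, then one split between each pair of neighbours.
    rules = [(float("-inf"), -1, -1), (float("-inf"), 1, 1)]
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        rules += [(t, -1, 1), (t, 1, -1)]

    def errors(rule: Tuple[float, int, int]) -> int:
        t, right, left = rule
        return sum((right if x > t else left) != action for x, action in data)

    t, right, left = min(rules, key=errors)
    return lambda x: right if x > t else left


def run_episode(policy: Policy, steps: int = 20, step_size: float = 0.3,
                limit: float = 1.0) -> Tuple[List[float], bool]:
    """Drive from the centre; return (visited offsets, whether the car left the track)."""
    x, visited = 0.0, []
    for _ in range(steps):
        visited.append(x)
        x += step_size * policy(x)
        if abs(x) > limit:
            return visited, True  # control failure
    return visited, False


def imitation_with_retraining(max_rounds: int = 5) -> Tuple[Policy, int]:
    """Start from one demonstration; retrain whenever an episode ends in failure."""
    data: List[Example] = [(0.5, expert(0.5))]
    policy = fit_stump(data)
    for retrainings in range(max_rounds):
        visited, failed = run_episode(policy)
        if not failed:
            return policy, retrainings
        # On failure, ask the expert what it would have done in the visited states.
        data += [(x, expert(x)) for x in visited]
        policy = fit_stump(data)
    return policy, max_rounds


policy, retrainings = imitation_with_retraining()
print(retrainings, policy(0.5), policy(-0.5))
```

The initial policy has seen only one demonstration and steers in one direction until it leaves the track. The failure triggers a query to the expert on the states that led there, and the retrained stump then keeps the car on the track. The learned rule stays readable as a single threshold test, which is the interpretability argument the abstract makes for decision trees.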

8

Prognozowanie cen energii elektrycznej na rynku dnia następnego metodami data mining [Forecasting day-ahead electricity market prices with data mining methods]

Fijorek K., Mróz K., Niedziela K., Fijorek D.

Rynek Energii | 2010 | nr 6 | 46--50

PL

The aim of the article is to examine the potential of selected, recently proposed data mining methods for forecasting electricity prices on the day-ahead market. Particular emphasis was placed on examining the random forest algorithm. Forecasts obtained with random forests were compared with forecasts generated by a linear regression model, a median regression model, and support vector regression.

EN

The aim of the paper is to study the capabilities of selected, recently introduced supervised data mining algorithms in the context of day-ahead electricity price forecasting. Special emphasis was placed on random forests. The quality of random forest forecasts was compared with that of forecasts obtained from multiple regression, median regression, and support vector regression.

9

Lasy losowe - ocena jakości prognostycznej cech [Random forests: assessment of the prognostic quality of features]

Krętowska M.

Zeszyty Naukowe Politechniki Białostockiej. Informatyka | 2007 | Z. 2 | 67--77

PL

In this work, the absolute prediction error is used to assess the prognostic quality of individual features. The prognostic tool, random forests, is constructed to obtain an estimator of the survival function. It is then compared with the Kaplan-Meier estimator of the survival function, built under the assumption that the population is homogeneous. The forests are composed of dipolar survival trees. Using the dipolar criterion function makes it possible to exploit incomplete information about the time of failure that comes from censored observations.

EN

In the paper, predictive accuracy, measured as the absolute predictive error, is used to evaluate the quality of covariates. The prognostic tool, a random forest, is built to obtain an aggregated survival function. This function is compared with the Kaplan-Meier estimator of the survival function, computed under the assumption that the population is homogeneous. The induction of each individual dipolar survival tree is based on the minimization of a piecewise-linear function, the dipolar criterion. The algorithm makes it possible to use information from censored observations, for which the exact survival time is unknown.
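
The Kaplan-Meier estimator that serves as the reference in this record can be written in a few lines. The sketch below is a plain-Python version of the standard product-limit formula, S(t) = product over event times t_i <= t of (1 - d_i / n_i), where d_i is the number of failures at t_i and n_i the number still at risk. The function name and the toy data are assumptions, and the dipolar survival trees themselves are not reproduced here.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct observed failure time.

    events[i] is True if the failure was observed at times[i]
    and False if the observation was censored at that time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = censored = 0
        # Group all observations that share the same time.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                failures += 1
            else:
                censored += 1
            i += 1
        if failures:
            survival *= 1 - failures / at_risk
            curve.append((t, survival))
        # Both failed and censored subjects leave the risk set after time t.
        at_risk -= failures + censored
    return curve


# Toy data: five subjects, two of them censored (made-up values).
toy_times = [1, 2, 2, 3, 4]
toy_events = [True, True, False, True, False]
for t, s in kaplan_meier(toy_times, toy_events):
    print(t, round(s, 3))
```

Censored subjects never trigger a drop in the curve, but they do count in the risk set until their censoring time. That is how incomplete observations still contribute information.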

10

Ensembles of dipolar trees for prediction of survival time

Krętowska M.

Biocybernetics and Biomedical Engineering | 2007 | Vol. 27, no. 3 | 67--75

EN

In the paper, the application of a random forest to the prediction of survival time is presented. The observed data loss function is based on inverse probability of censoring weights. The random forest consists of a sequence of multivariate regression trees built on learning sets randomly generated from the given dataset. The regression trees find the splits in their internal nodes by minimizing a dipolar criterion function.