Liquid-based cytology (LBC) is a widely used tool for cervical cancer diagnosis. However, the accuracy and efficiency of LBC-based cervical cancer classification are still limited by the lack of standardized, scalable, and objective cytological assessment protocols. To address these gaps, this study develops and evaluates a machine learning framework that integrates various feature extraction techniques, feature selection methods, and machine learning classifiers to improve cervical cancer detection. The results demonstrate that handcrafted and local binary pattern features achieve the best overall performance, with SVM, gradient boosting, and histogram-based gradient boosting reaching 95.92% accuracy, highlighting the strength of combining morphological and texture descriptors to maximize their discriminative potential. Moreover, we provide a systematic comparison of different classification pipelines, offering insights into the feasibility of hybrid approaches, particularly in resource-constrained medical environments. The promising results highlight the potential impact of machine learning in modern medical diagnostics, providing a clinically relevant, highly accurate, and efficient classification method for LBC slides.
Parkinson's disease (PD) is a progressive neurological disorder that affects millions worldwide, leading to motor dysfunction and significant reductions in quality of life. Early diagnosis is pivotal for initiating timely treatment and improving long-term patient outcomes, yet existing diagnostic methods, which often rely on clinical evaluations and imaging, are prone to delays and varying accuracy. This study presents an innovative, non-invasive approach to early PD detection through the analysis of handwriting patterns, offering a potential alternative to traditional diagnostic techniques. Leveraging a publicly available and meticulously normalized handwriting dataset, our approach applies advanced data processing methods to identify subtle neuromotor impairments associated with PD. Through the integration of robust feature selection processes and cutting-edge machine learning models, we achieved a high accuracy rate of 83.02%, highlighting the method's reliability. The findings suggest that this approach could significantly enhance early PD detection, leading to more personalized therapeutic strategies that align with the stages of disease progression and potentially delaying the onset of severe symptoms.
This research aims to develop a new transfer function that transforms continuous space into binary space using the Polar Lights Optimizer (PLO) algorithm for the feature selection problem. The PLO algorithm relies on simulating the behaviour of the aurora borealis to balance the exploration and exploitation of the binary space. A new transfer function, called the tent-shaped transfer function, has been incorporated into the algorithm to improve its performance. The proposed function was tested on seven datasets and compared with traditional transfer functions such as the S-shaped and V-shaped function families. The results showed that the tent-shaped transfer function outperforms them in terms of feature selection accuracy and reduces the number of features more effectively, which enhances the algorithm's ability to improve performance and reduce computational complexity.
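As an illustration, the S-shaped, V-shaped, and tent-shaped families can all be viewed as maps from a continuous search-agent position to a bit-setting probability. A minimal sketch follows; the paper's exact tent-shaped definition is not reproduced here, so the piecewise-linear form (and its width parameter `a`) is an assumption for illustration:

```python
import math

def s_shaped(x):
    """Classic sigmoid, the prototype of the S-shaped family."""
    return 1.0 / (1.0 + math.exp(-x))

def v_shaped(x):
    """|tanh|, the prototype of the V-shaped family."""
    return abs(math.tanh(x))

def tent_shaped(x, a=2.0):
    """Illustrative piecewise-linear 'tent' peaking at x = 0 (assumed form)."""
    return max(0.0, 1.0 - abs(x) / a)

def binarize(position, tf, threshold=0.5):
    """Convert a continuous position vector to a binary feature mask."""
    return [1 if tf(x) > threshold else 0 for x in position]

mask = binarize([-3.0, -0.2, 0.1, 2.5], tent_shaped)  # -> [0, 1, 1, 0]
```

In a feature selection setting, each bit of the resulting mask indicates whether the corresponding feature is kept.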
In this paper, we propose a new hybrid approach that combines the Generalized Normal Distribution Optimization Algorithm (GNDOA) with fuzzy C-means clustering (FCM). It is designed for processing unsupervised datasets and targets the shortcomings of conventional feature selection and clustering techniques. The proposed GNDOA-FCM uses the normalized normal distribution concept along with FCM to produce more accurate and efficient clustering outputs, leading to accelerated detection in the survey region. The Calinski-Harabasz index helps find the number of clusters with high compactness within each cluster and good separation from the other clusters. The performance of the proposed hybrid GNDOA-FCM approach is tested extensively on different benchmark datasets. The results are compared with existing clustering methods using evaluation metrics such as the silhouette score and feature selection accuracy. Experimental results show that the proposed method can be flexibly configured to obtain higher clustering quality and is more effective than conventional techniques.
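The Calinski-Harabasz index used above is the ratio of between-cluster to within-cluster dispersion, normalized by degrees of freedom; higher values indicate compact, well-separated clusters. A small self-contained sketch (hard cluster labels are assumed here for simplicity, whereas FCM produces fuzzy memberships):

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: (B / (k-1)) / (W / (n-k)), where B is the
    between-cluster dispersion and W the within-cluster dispersion."""
    n = len(points)
    dim = len(points[0])
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    k = len(clusters)
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    between = within = 0.0
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * sum((centroid[d] - overall[d]) ** 2 for d in range(dim))
        within += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in members)
    return (between / (k - 1)) / (within / (n - k))
```

Scanning this index over candidate values of k and keeping the maximizer is the usual way of choosing the number of clusters.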
Early and accurate diagnosis of thyroid disorders is essential due to their prevalence and health impact. To enhance interpretability in clinical settings, we propose a comprehensive workflow for transparent thyroid disease prediction using a multiclass classification problem with five diagnostic categories. A dataset of 9172 samples with 31 features was used to train various machine and deep learning models. A dual-layered framework combining Feature Selection (ETC, MI, RFE) and Explainable AI (SHAP, LIME) improved performance and transparency. Gradient Boosting achieved the highest accuracy (0.97). SHAP explained global feature influence, while LIME clarified individual predictions. Our approach supports interpretable, reliable AI-based diagnostic tools for thyroid disorder classification.
The efficacy of machine learning algorithms significantly depends on the adequacy and relevance of the features in the dataset. Hence, feature selection precedes the classification process. In this study, a hybrid feature selection approach, integrating filter and wrapper methods, was employed. This approach not only enhances classification accuracy, surpassing the results achievable with filter methods alone, but also reduces processing time compared to exclusive reliance on wrapper methods. Results indicate a general improvement in algorithm performance with the application of the hybrid feature selection approach. The study utilized the Taiwanese Bankruptcy and Statlog (German Credit Data) datasets from the UCI Machine Learning Repository. These datasets exhibit an unbalanced class distribution, necessitating data preprocessing that accounts for this imbalance. After addressing the datasets' unbalanced nature, feature selection and subsequent classification processes were executed.
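The two-stage idea can be sketched as follows: a cheap correlation-based filter shortlists candidate features, and a greedy wrapper then scores subsets with the target classifier. The shortlist size, scoring callback, and greedy forward-selection scheme are illustrative assumptions, not the study's exact configuration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation, the filter-stage relevance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def hybrid_select(X, y, score_fn, shortlist=5):
    """Filter stage: keep the `shortlist` features most correlated with y.
    Wrapper stage: greedy forward selection driven by score_fn(indices),
    e.g. a cross-validated classifier accuracy."""
    n_features = len(X[0])
    ranked = sorted(range(n_features),
                    key=lambda j: abs(pearson([row[j] for row in X], y)),
                    reverse=True)[:shortlist]
    selected, best = [], -1.0
    improved = True
    while improved:
        improved = False
        for j in ranked:
            if j in selected:
                continue
            s = score_fn(selected + [j])
            if s > best:
                selected, best, improved = selected + [j], s, True
    return selected, best
```

Because the wrapper only ever evaluates subsets of the shortlist, the expensive classifier is called far fewer times than in a pure wrapper search.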
Feature Selection (FS) is an essential research topic in the area of machine learning. FS, which is the process of identifying the relevant features and removing the irrelevant and redundant ones, is meant to deal with the high dimensionality problem for the sake of selecting the best performing feature subset. In the literature, many feature selection techniques approach the task as a search problem, where each state in the search space is a possible feature subset. In this paper, we introduce a new feature selection method based on reinforcement learning. First, decision tree branches are used to traverse the search space. Second, a transition similarity measure is proposed so as to ensure the exploit-explore trade-off. Finally, the informative features are the ones most involved in constructing the best branches. The performance of the proposed approach is evaluated on nine standard benchmark datasets. The results, using the AUC score, show the effectiveness of the proposed system.
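The exploit-explore trade-off at the heart of such reinforcement-learning selectors is often realized by an epsilon-greedy rule; a minimal sketch (the paper's own transition similarity measure is not reproduced here):

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick the next feature to add to the current branch: with probability
    epsilon, explore a random feature; otherwise exploit the feature with
    the highest estimated value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

rng = random.Random(0)
choice = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)  # exploits -> 1
```

Decaying epsilon over episodes shifts the search from exploration toward exploitation as value estimates stabilize.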
Autism spectrum disorder (ASD) poses formidable challenges in early diagnosis and intervention, requiring efficient methods for identification and treatment. By utilizing machine learning, the risk of ASD can be accurately and promptly evaluated, thereby optimizing the analysis and expediting treatment access. However, high-dimensional data degrades classifier performance. In this regard, feature selection is considered an important process that enhances classifier results. In this paper, a chaotic binary butterfly optimization algorithm based feature selection and data classification (CBBOAFS-DC) technique is proposed. It involves preprocessing and feature selection along with data classification. Besides, a binary variant of the chaotic BOA (CBOA) is presented to choose an optimal set of features. In addition, the CBBOAFS-DC technique employs bacterial colony optimization with a stacked sparse auto-encoder (BCO-SSAE) model for data classification. This model makes use of the BCO algorithm to optimally adjust the weight and bias parameters of the SSAE model to improve classification accuracy. Experiments show that the proposed scheme offers better results than benchmarked methods.
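Chaotic variants of swarm optimizers typically replace pseudo-random draws with a deterministic chaotic sequence. The sketch below uses the logistic map to drive binary feature masks; the exact chaotic map and update rule of CBOA are assumptions here, made only to illustrate the idea:

```python
def logistic_map(x):
    """Chaotic logistic map; r = 4 gives fully chaotic behaviour on (0, 1)."""
    return 4.0 * x * (1.0 - x)

def chaotic_binary_positions(n_agents, n_features, seed=0.7):
    """Generate binary feature masks whose 'randomness' comes from the
    chaotic sequence rather than a pseudo-random generator."""
    x = seed
    masks = []
    for _ in range(n_agents):
        row = []
        for _ in range(n_features):
            x = logistic_map(x)
            row.append(1 if x > 0.5 else 0)
        masks.append(row)
    return masks
```

The sequence is deterministic for a given seed, which makes runs reproducible while still covering the search space well.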
The scope of this paper is that it investigates and proposes a new clustering method that takes into account the timing characteristics of frequently used feature words and the semantic similarity of microblog short texts, as well as designing and implementing microblog topic detection based on the clustering results. The aim of the proposed research is to provide a new cluster overlap reduction method based on the division of semantic memberships, to address the limited semantic expression and diverse content of short microblog texts. First, by defining the time-series frequent word set of the microblog text, a feature word selection method for hot topics is given; then, according to the time-series recurring feature word set, the initial clustering of the microblog texts is obtained.
With the advent of social media, the volume of photographs uploaded on the internet has increased exponentially. The task of efficiently recognizing and retrieving human facial images is inevitable and essential at this time. In this work, a feature selection approach for recognizing and retrieving human face images using a hybrid cheetah optimization algorithm is proposed. Deep feature extraction from the images is done using deep convolutional neural networks. The hybrid cheetah optimization algorithm, an improved version of the cheetah optimization algorithm fused with a genetic algorithm, is used to choose optimum features from the extracted deep features. The chosen features are used for finding the best-matching images in the image database. Image matching is performed by an approximate nearest neighbor search for the query image over the image database, and similar images are retrieved. By constructing a k-NN graph for the images, the efficiency of image retrieval is enhanced. The proposed system's performance is evaluated against benchmark datasets such as LFW, Multi-PIE, Color FERET, DigiFace-1M and CelebA. The evaluation results show that the proposed methodology is superior to various existing methodologies.
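The retrieval stage described above can be illustrated with a small cosine-similarity search and k-NN graph over feature vectors. This is a plain exact-search sketch, not the approximate nearest neighbor index the paper uses:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_graph(vectors, k=2):
    """Link each vector to its k most cosine-similar neighbours; retrieval
    can then walk this graph instead of scanning the whole database."""
    graph = []
    for i, v in enumerate(vectors):
        order = sorted((j for j in range(len(vectors)) if j != i),
                       key=lambda j: cosine(v, vectors[j]), reverse=True)
        graph.append(order[:k])
    return graph

def retrieve(query, vectors, top=1):
    """Return the indices of the `top` most similar database vectors."""
    order = sorted(range(len(vectors)),
                   key=lambda j: cosine(query, vectors[j]), reverse=True)
    return order[:top]
```

With deep descriptors, the same routines apply unchanged; only the vectors are higher-dimensional.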
Knowing the expected milk yield can help dairy farmers make better decisions and manage their herds more effectively. The objective of this study was to build and compare predictive models to forecast daily milk yield over a long duration. A machine-learning pipeline was developed, and five baseline models as well as a novel stacking model were built for the prediction of milk yield on the CowNflow dataset, using 414 Holstein cattle records collected from 1983 to 2019. Four different feature selection methods were applied to evaluate the essential biological characteristics and feeding-related features that affect milk yield. The results showed that the overall performance of the predictive models improved after proper feature selection, with the R2 value increasing to 0.811 and the root mean squared error (RMSE) decreasing to 3.627. The stacking model achieved the best performance, with an R2 value of 0.85, a mean absolute error (MAE) of 2.537, and an RMSE of 3.236. This research provides benchmark information for the prediction of milk yield on the CowNflow dataset and identifies useful factors, such as dry matter (DM) intake and lactation month, for long-term milk yield prediction.
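The stacking idea reduces to fitting a meta-learner on the base models' out-of-fold predictions. The tiny two-model, grid-searched convex combination below is an illustrative assumption, not the paper's actual stacking configuration:

```python
def fit_meta_weights(base_preds, y):
    """Toy stacking meta-learner: grid-search a convex weight w over two
    base models' out-of-fold predictions to minimise squared error.
    base_preds is a list of (pred_model1, pred_model2) pairs."""
    best_w, best_err = 0.0, float("inf")
    for step in range(101):
        w = step / 100
        err = sum((w * p1 + (1 - w) * p2 - t) ** 2
                  for (p1, p2), t in zip(base_preds, y))
        if err < best_err:
            best_w, best_err = w, err
    return best_w

def stack_predict(p1, p2, w):
    """Combine the two base predictions with the learned weight."""
    return w * p1 + (1 - w) * p2
```

Real stacking implementations fit a full regression model (often regularized) over all base learners' predictions, but the mechanism is the same.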
Predicting stock price trends is a challenging puzzle. The immediate price of a stock is affected by a countless number of factors, so there is essentially no way to accurately predict short-term stock prices from dynamic, incomplete, erratic, and chaotic data. However, by analyzing key financial indicators, it is possible to gain an accurate understanding of a company's operations, make a quantitative assessment of its value, and thus make a reasonable prediction of the long-term trend of its stock price. In the FedCSIS 2024 Data Science Challenge, participants are asked to predict the trends of stocks chosen from the Standard & Poor's 500 index. In this paper, we apply a wrapper feature selection method that tightly combines the steps of feature selection and model building to obtain better prediction models and provide insight into the indicators. After selecting the best set of features, we train two kinds of gradient boosting machines: a multi-classification model and a regression model, for class and risk-return performance prediction, respectively. Finally, a high-confidence voting strategy is used to determine the trading action (buy, sell, or hold). Experimental and competition results demonstrate the effectiveness of the methodology.
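A high-confidence voting rule of the kind described might look as follows; the thresholds and the way the classifier probabilities and the regressor's risk-return estimate are combined are assumptions for illustration:

```python
def high_confidence_action(class_probs, expected_return,
                           buy_th=0.6, sell_th=0.6):
    """Hypothetical voting rule: act only when the classifier is confident
    AND the regressor's expected risk-return agrees; otherwise hold."""
    up, down = class_probs["up"], class_probs["down"]
    if up > buy_th and expected_return > 0:
        return "buy"
    if down > sell_th and expected_return < 0:
        return "sell"
    return "hold"
```

Requiring agreement between two independently trained models trades coverage for precision, which suits a trading setting where a wrong action is costlier than no action.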
The sandy conglomerate reservoir in layer Es3 of the Liaohe Eastern Depression has good potential for oil reservoir exploration and has been identified as a key area for future exploration. The low porosity and permeability, complex lithology, and strong heterogeneity of the target layer make it difficult to predict favorable reservoirs. The objective of this study is to analyze and process conventional logging data to extract feature parameters that affect lithology by establishing a decision tree lithology classifier. Principal component analysis is used to reduce data dimensionality, and the elbow method is applied to the clustering algorithm to establish the optimal number of clusters for the automatic classification of reservoir types. Further, support vector machines are used for lithology classification based on features with higher classification capabilities. The results show that the support vector machine lithology recognition method based on feature selection achieved an accuracy of 91.8%. The processing of actual well data has verified the feasibility of the method. Based on the combination of core experiments and oil testing results, the characteristics of three types of reservoirs were presented, and potential reservoir zones were proposed for drilling wells. The comprehensive analysis and the practical application of the developed method reveal that the class I reservoir has high hydrocarbon production and could be the most favorable reservoir in the Es3 sandy conglomerate. The processing data of lithology identification and reservoir classification evaluation are consistent with core data and hydrocarbon production data, verifying the effectiveness and practicability of the method proposed in this paper. The results of this study will serve as a reference for low porosity and permeability sandy conglomerate reservoir evaluation based on machine learning in the target area.
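The elbow method referred to above picks the cluster count at which the inertia (within-cluster sum of squares) curve bends most sharply; one common approximation selects the point with the largest second-order difference:

```python
def elbow_k(inertias, ks):
    """Approximate the elbow of an inertia curve by the largest
    second-order difference; `inertias[i]` is the within-cluster sum of
    squares obtained with `ks[i]` clusters (ks must be consecutive)."""
    best, best_curv = ks[1], float("-inf")
    for i in range(1, len(inertias) - 1):
        curv = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if curv > best_curv:
            best, best_curv = ks[i], curv
    return best
```

More elaborate variants measure the distance of each point from the line joining the curve's endpoints, but the second-difference rule captures the same intuition.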
The paper presents special forms of an ensemble of classifiers for the analysis of medical images based on the application of deep learning. The study analyzes different structures of convolutional neural networks applied in the recognition of two types of medical images: dermoscopic images for melanoma and mammograms for breast cancer. Two approaches to ensemble creation are proposed. In the first approach, the images are processed by a convolutional neural network, and the flattened vector of image descriptors is subjected to feature selection by applying different selection methods. As a result, different sets of a limited number of diagnostic features are generated. In the next stage, these sets of features represent input attributes for the classical classifiers: a support vector machine, a random forest of decision trees, and softmax. By combining different selection methods with these classifiers, an ensemble classification system is created and integrated by majority voting. In the second approach, different structures of convolutional neural networks are directly applied as the members of the ensemble. The efficiency of the proposed classification systems is investigated and compared on medical data representing dermoscopic images of melanoma and breast cancer mammogram images. Thanks to the fusion of the results of the many classifiers forming an ensemble, accuracy and all other quality measures have been significantly increased for both types of medical images.
This article presents a model based on machine learning for the selection of the characteristics that most influence the low industrial yield of cane sugar production in Cuba. The set of data used in this work corresponds to a period of ten years of sugar harvests from 2010 to 2019. A process of understanding the business and of understanding and preparing the data is carried out. The accuracy of six rule learning algorithms is evaluated: CONJUNCTIVERULE, DECISIONTABLE, RIDOR, FURIA, PART and JRIP. The results obtained allow us to identify R417, R379, R378, R419a, R410, R613, R1427 and R380 as the indicators that most influence low industrial performance.
Reliability is one of the key factors used to gauge software quality. Software defect prediction (SDP) is one of the most important activities affecting the measurement of software reliability. Additionally, the high dimensionality of the features has a direct effect on the accuracy of SDP models. The objective of this paper is to propose a hybrid binary whale optimization algorithm (BWOA) based on taper-shaped transfer functions, combined with a KNN classifier, for solving the feature selection and dimension reduction problems as a new software defect prediction method. The values of the real vector that represents the individual encoding are converted to a binary vector using four types of taper-shaped transfer functions, enhancing the ability of BWOA to reduce the dimension of the search space. The performance of the proposed method (T-BWOA-KNN) was evaluated using eleven standard software defect prediction datasets from the PROMISE and NASA repositories with the K-Nearest Neighbor (KNN) classifier, and seven evaluation metrics were used to assess its effectiveness. The experimental results show that T-BWOA-KNN produced promising results compared to other methods, including ten methods from the literature and the four variants of T-BWOA with the KNN classifier. In addition, the obtained results are compared and analyzed against other methods from the literature in terms of the average number of selected features (SF) and accuracy rate (ACC) using the Kendall W test. For most datasets, T-BWOA-KNN achieved promising performance compared with the other methods.
Many countries have adopted public health approaches that aim to address the particular challenges faced during the Coronavirus disease 2019 (COVID-19) pandemic. Researchers mobilized to manage and limit the spread of the virus, and multiple artificial intelligence-based systems have been designed to automatically detect the disease. Among these systems are voice-based ones, since the virus has a major impact on voice production due to dysfunction of the respiratory system. In this paper, we investigate and analyze the effectiveness of cough analysis for accurately detecting COVID-19. To do so, we distinguished COVID-positive patients from healthy controls. After extracting the gammatone cepstral coefficients (GTCC) and the Mel-frequency cepstral coefficients (MFCC), we performed feature selection (FS) and classification with multiple machine learning algorithms. By combining all features with the 3-nearest neighbor (3NN) classifier, we achieved the highest classification results: the model detects COVID-19 patients with accuracy and an F1-score above 98 percent. When applying FS, the highest accuracy and F1-score were achieved by the same model with the ReliefF algorithm, losing only 1 percent of accuracy while keeping just 12 features instead of the original 53.
This paper introduces an early prognostic model that attempts to predict patient severity for ICU admission and to detect the most significant features affecting the prediction process using clinical blood data. The proposed model predicts ICU admission for high-severity patients during the first two hours of hospital admission, which would help assist clinicians in decision-making and enable the efficient use of hospital resources. The Hunger Games Search (HGS) meta-heuristic algorithm and a support vector machine (SVM) have been integrated to build the proposed prediction model and to select the most informative features from blood test data. Experiments have shown that using HGS for selecting features with the SVM classifier achieved excellent results compared with four other meta-heuristic algorithms. The model that used the features selected by the HGS algorithm accomplished the topmost results (98.6% and 96.5% for the best and mean accuracy, respectively) compared to using the features selected by other popular optimization algorithms.
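Wrapper feature selection with a metaheuristic such as HGS typically minimizes a fitness that trades classification error against subset size; a minimal sketch (the weight `alpha` is a conventional choice in this literature, not necessarily the paper's):

```python
def fs_fitness(error_rate, n_selected, n_total, alpha=0.99):
    """Typical wrapper fitness for metaheuristic feature selection:
    lower is better; alpha weights classifier error far above the
    fraction of features kept, so accuracy dominates but ties are
    broken in favour of smaller subsets."""
    return alpha * error_rate + (1 - alpha) * (n_selected / n_total)
```

Each candidate feature mask is scored by training the SVM on the selected columns, plugging its validation error into this fitness, and letting the optimizer evolve the masks.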
Snoring is a typical and intuitive symptom of obstructive sleep apnea hypopnea syndrome (OSAHS), a sleep-related respiratory disorder with adverse effects on people's lives. Detecting snoring sounds in whole-night recordings is the first and most important step in the snoring analysis of OSAHS. An automatic snoring detection system based on the wavelet packet transform (WPT) with an eXtreme Gradient Boosting (XGBoost) classifier is proposed in this paper; it recognizes snoring sounds in episodes enhanced by the generalized subspace noise reduction algorithm. A feature selection technique based on correlation analysis is applied to select the most discriminative WPT features. The selected features yield a high sensitivity of 97.27% and a precision of 96.48% on the test set. The recognition performance demonstrates that the WPT is effective for analyzing snoring and non-snoring sounds, and the difference is exhibited much more comprehensively by sub-bands with smaller frequency ranges. The distribution of snoring sound energy lies mainly in the middle and low frequency bands; there is also an evident difference between snoring and non-snoring sounds in the high frequency band.
In recent times, software developers have widely used instant messaging and collaboration platforms, as these platforms help them explore new technologies, raise development-related issues, and seek solutions from their peers virtually. Gitter is one such platform with a heavy user base. It generates a tremendous volume of data, the analysis of which is helpful for gaining insights into trends in open-source software development and the developers' inclination toward various technologies. Classification techniques can be deployed for this purpose. The selection of an apt word embedding for a given dataset of text messages plays a vital role in determining the performance of classification techniques. In the present work, a comparative analysis of nine word embeddings in combination with seventeen classification techniques, in both one-vs-one and one-vs-rest settings, has been performed on the GitterCom dataset for categorizing text messages into one of the pre-determined classes based on their purpose. Further, two feature selection methods have been applied, and the SMOTE technique has been used for handling data imbalance, resulting in a total of 612 classification pipelines for analysis. The experimental results show that word2vec, GloVe with a 300-dimensional vector size, and GloVe with a 100-dimensional vector size are the three top-performing word embeddings across different classification techniques. The models trained using ANOVA-selected features performed similarly to those trained using all features. Finally, using the SMOTE technique helps models achieve better prediction ability.
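SMOTE handles imbalance by synthesizing minority-class samples as interpolations between a minority point and one of its nearest minority neighbours; a minimal sketch (parameters such as `k` are illustrative, and real implementations work on the embedded feature vectors):

```python
import random

def smote(minority, n_new, k=2, seed=42):
    """Generate n_new synthetic minority samples. Each synthetic point
    lies on the segment between a random minority point and one of its
    k nearest minority neighbours (squared Euclidean distance)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic
```

Because every synthetic point lies between two genuine minority samples, SMOTE enlarges the minority region without simply duplicating existing examples.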