Search results - BazTech

1

Traveling salesman problem parallelization by solving clustered subproblems

Romanuke Vadim

Foundations of Computing and Decision Sciences | 2023 | Vol. 48, No. 4 | 453--481

EN

A method of parallelizing the solution of the traveling salesman problem is suggested, where the solver is a heuristic algorithm. The problem is parallelized by clustering the nodes into a given number of groups. Every group (cluster) is an open-loop subproblem that can be solved independently of the other subproblems. The solutions of the subproblems are then aggregated into a closed-loop route that is an approximate solution to the initial traveling salesman problem. The clusters should be enumerated so that the connection between two “neighboring” subproblems (with successive numbers) is as short as possible. For this, the destination node of each open-loop subproblem is selected farthest from the depot and closest to the starting node of the subsequent subproblem. The initial set of nodes can be clustered manually by covering them with a finite regular-polygon mesh having the required number of cells. The efficiency of the parallelization is increased by solving all the subproblems in parallel, but the problem should have at least 1000 nodes or so. Then, with no more than a few hundred nodes in a cluster, the genetic algorithm is especially efficient because the routine calculations executed during every iteration take less time.
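The cluster-solve-aggregate idea in this abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: it assumes a square mesh (one kind of regular-polygon mesh) enumerated in serpentine order, swaps the genetic algorithm for a simple nearest-neighbor heuristic, and picks each cluster's starting node as the node closest to the centroid of the preceding cluster. All function names are hypothetical, nodes are assumed to be distinct (x, y) tuples, and the thread pool only shows that the subproblems are independent; in CPython a process pool would be needed for real speedup.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def grid_clusters(nodes, cols, rows):
    """Group nodes by the cells of a cols x rows square mesh, enumerating the
    cells in serpentine order so that successive clusters are adjacent."""
    min_x, max_x = min(p[0] for p in nodes), max(p[0] for p in nodes)
    min_y, max_y = min(p[1] for p in nodes), max(p[1] for p in nodes)
    w = (max_x - min_x) / cols or 1.0  # fall back to 1.0 for a degenerate extent
    h = (max_y - min_y) / rows or 1.0
    cells = {}
    for p in nodes:
        c = min(int((p[0] - min_x) / w), cols - 1)
        r = min(int((p[1] - min_y) / h), rows - 1)
        cells.setdefault((r, c), []).append(p)
    order = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in cs)
    return [cells[k] for k in order if k in cells]  # skip empty cells

def centroid(cluster):
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def open_route(cluster, start):
    """Open-loop subproblem: nearest-neighbor route through one cluster
    (a stand-in for the genetic algorithm mentioned in the abstract)."""
    remaining = list(cluster)
    remaining.remove(start)
    route = [start]
    while remaining:
        nxt = min(remaining, key=lambda p: dist(route[-1], p))
        remaining.remove(nxt)
        route.append(nxt)
    return route

def clustered_tour(nodes, cols=2, rows=2):
    clusters = grid_clusters(nodes, cols, rows)
    cents = [centroid(c) for c in clusters]
    # Start nodes are fixed before solving, so the subproblems are independent.
    starts = [min(c, key=lambda p: dist(p, cents[k - 1]))
              for k, c in enumerate(clusters)]
    with ThreadPoolExecutor() as pool:
        routes = list(pool.map(open_route, clusters, starts))
    # Concatenating the open routes in cluster order gives the closed tour
    # (the last node is implicitly joined back to the first).
    return [p for route in routes for p in route]

def tour_length(tour):
    return sum(dist(tour[i], tour[(i + 1) % len(tour)]) for i in range(len(tour)))
```

Calling `clustered_tour(nodes, cols=3, rows=3)` returns every node exactly once; `tour_length` then measures the closed route, which can be compared against a single-run heuristic on the unclustered problem.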

2

Porting of finite element integration algorithm to Xeon Phi coprocessor-based HPC architectures

Krużel Filip, Banaś Krzysztof, Iacomo Mauro

Computer Assisted Methods in Engineering and Science | 2023 | Vol. 30, no. 4 | 427--459

EN

In the present article, we describe the implementation of the finite element numerical integration algorithm for the Xeon Phi coprocessor. The coprocessor was an extension of the many-core specialized unit for calculations, and its performance was comparable with the corresponding GPUs. Its main advantages were the built-in 512-bit vector registers and the ease of transferring existing codes from traditional x86 architectures. In the article, we move the code developed for a standard CPU to the coprocessor. We compare its performance with our OpenCL implementation of the numerical integration algorithm, previously developed for GPUs. The GPU code is tuned to fit into a coprocessor by our auto-tuning mechanism. Tests included two types of tasks to solve, using two types of approximation and two types of elements. The obtained timing results allow comparing the performance of highly optimized CPU and GPU codes with the performance of a Xeon Phi coprocessor. This article answers whether such massively parallel architectures perform better using the CPU or GPU programming method. Furthermore, we compare the Xeon Phi architecture with Intel’s i9 13900K CPU, the latest available at the time of writing. This comparison determines whether the old Xeon Phi architecture remains competitive in today’s computing landscape. Our findings provide valuable insights for selecting the most suitable hardware for numerical computations and the appropriate algorithmic design.

3

The potential for real-time testing of high-frequency trading strategies through a developed tool during volatile market conditions

Vaitonis Mantas, Korovkinas Konstantinas

Applied Computer Science | 2023 | Vol. 19, no 2 | 63--81

EN

This study presents a method for testing high-frequency trading (HFT) algorithms on GPUs using kernel parallelization, code vectorization, and multidimensional matrices. The research evaluates HFT strategies within algorithmic cryptocurrency trading in volatile market conditions, particularly during the COVID-19 pandemic. The study's objective is to provide an efficient and comprehensive approach to assessing the efficiency and profitability of HFT strategies. The results show that the method effectively evaluates the efficiency and profitability of HFT strategies, as demonstrated by the Sharpe ratio of 2.29 and the Sortino ratio of 2.88. The authors suggest that further study on HFT testing methods could be conducted using a tool that directly connects to electronic marketplaces, enabling real-time receipt of high-frequency trading data and simulation of trade decisions. Finally, the study introduces a novel method for testing HFT algorithms on GPUs, offering promising results in assessing the efficiency and profitability of HFT strategies during volatile market conditions.
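For readers unfamiliar with the two ratios quoted in this abstract, the following Python sketch shows one common way to compute them from a series of per-period returns. The abstract does not say which conventions the authors used, so the choices here are assumptions: the sample standard deviation for the Sharpe ratio, the root-mean-square of below-target returns for the Sortino ratio, and no annualization. The function names are hypothetical.

```python
import math

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((e - mean) ** 2 for e in excess) / (len(excess) - 1)
    return mean / math.sqrt(var)

def sortino_ratio(returns, target=0.0):
    """Mean excess return divided by the downside deviation, which
    penalizes only the returns that fall below the target."""
    excess = [r - target for r in returns]
    mean = sum(excess) / len(excess)
    downside = math.sqrt(sum(min(0.0, e) ** 2 for e in excess) / len(excess))
    return mean / downside
```

Because the Sortino ratio ignores upside volatility, a strategy whose large deviations are mostly gains scores higher on it than on the Sharpe ratio, which is consistent with the two values reported above.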

4

An optimized parallel implementation of non-iteratively trained recurrent neural networks

El Zini Julia, Rizk Yara, Awad Mariette

Journal of Artificial Intelligence and Soft Computing Research | 2021 | Vol. 11, No. 1 | 33--50

EN

Recurrent neural networks (RNN) have been successfully applied to various sequential decision-making tasks, natural language processing applications, and time-series predictions. Such networks are usually trained through back-propagation through time (BPTT) which is prohibitively expensive, especially when the length of the time dependencies and the number of hidden neurons increase. To reduce the training time, extreme learning machines (ELMs) have been recently applied to RNN training, reaching a 99% speedup on some applications. Due to its non-iterative nature, ELM training, when parallelized, has the potential to reach higher speedups than BPTT. In this work, we present Opt-PR-ELM, an optimized parallel RNN training algorithm based on ELM that takes advantage of the GPU shared memory and of parallel QR factorization algorithms to efficiently reach optimal solutions. The theoretical analysis of the proposed algorithm is presented on six RNN architectures, including LSTM and GRU, and its performance is empirically tested on ten time-series prediction applications. Opt-PR-ELM is shown to reach up to 461 times speedup over its sequential counterpart and to require up to 20x less time to train than parallel BPTT. Such high speedups over new generation CPUs are extremely crucial in real-time applications and IoT environments.

5

Dependency between Tiles’ Sizes and Program Execution Time

Sushko S., Chemerys O.

Measurement Automation Monitoring | 2018 | Vol. 64, No. 2 | 28--30

EN

The paper is dedicated to aspects of software optimization. The optimization problem is described. Tiling and parallelization methods were applied to the test applications. Several tests were performed to estimate the influence of tile sizes on the computational time. The obtained results show a complicated dependency between tile sizes and processing time. Numerical characteristics of the obtained results and the corresponding pictures are presented.
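As an illustration of the tiling transformation this abstract refers to, here is a minimal Python sketch of a tiled (blocked) matrix multiplication with an adjustable tile size. It is a generic textbook example, not one of the paper's test applications, and the function name and `tile` parameter are hypothetical. Pure Python hides most cache effects, so the sketch shows the loop structure rather than realistic timings.

```python
def matmul_tiled(a, b, tile=32):
    """Multiply matrices a (n x m) and b (m x p), given as lists of rows,
    visiting the iteration space tile by tile."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # The three outer loops step over tiles of the i, k and j dimensions.
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # The three inner loops stay inside one tile, so the same
                # small blocks of a, b and c are reused before moving on.
                for i in range(ii, min(ii + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + tile, m)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + tile, p)):
                            row_c[j] += aik * row_b[j]
    return c
```

The result is the same for every tile size; only the order of memory accesses changes. In a compiled language, timing this loop nest for a range of `tile` values is the kind of experiment that exposes the dependency between tile size and execution time.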

6

MIDACO parallelization scalability on 200 MINLP benchmarks

Schlueter M., Munetomo M.

Journal of Artificial Intelligence and Soft Computing Research | 2017 | Vol. 7, No. 3 | 171--181

EN

This contribution presents a numerical evaluation of the impact of parallelization on the performance of an evolutionary algorithm for mixed-integer nonlinear programming (MINLP). On a set of 200 MINLP benchmarks, the performance of the MIDACO solver is assessed with a parallelization factor gradually increasing from one to three hundred. The results demonstrate that the efficiency of the algorithm can be significantly improved by parallelized function evaluation. Furthermore, the results indicate that the efficiency scales up in a nearly linear way, which implies that this approach is promising even for very large parallelization factors. The presented research is especially relevant to CPU-time-consuming real-world applications, where only a small number of serially processed function evaluations can be calculated in reasonable time.
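Parallelized function evaluation, as described in this abstract, means that a block of candidate solutions is handed to several workers at once instead of being evaluated one by one. The Python sketch below shows only that pattern. It does not use MIDACO or its API: `objective`, `evaluate_block`, `best_of_block` and `parallel_factor` are hypothetical names, and the thread pool is an assumption that suits objectives which wait on external simulations; a CPU-bound Python objective would need a process pool.

```python
from concurrent.futures import ThreadPoolExecutor

def objective(x):
    """Hypothetical stand-in for an expensive black-box objective."""
    return sum((xi - 1.0) ** 2 for xi in x)

def evaluate_block(candidates, parallel_factor):
    """Evaluate a block of candidates concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=parallel_factor) as pool:
        return list(pool.map(objective, candidates))

def best_of_block(candidates, parallel_factor):
    """Return the candidate with the smallest objective value and that value."""
    values = evaluate_block(candidates, parallel_factor)
    best = min(range(len(candidates)), key=values.__getitem__)
    return candidates[best], values[best]
```

With a parallelization factor of P, an evolutionary algorithm can request P evaluations per step, so the wall-clock cost of one step stays close to the cost of a single evaluation as long as P workers are available.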

7

The procedure of construction of mathematical models for nonlinear dynamical systems based on optimization approach

Stakhiv P., Kozak Y., Vasylchyshyn I.

Przegląd Elektrotechniczny | 2016 | R. 92, nr 7 | 103--107

EN

Using optimization for mathematical models’ construction is a universal approach that can be applied to a wide set of objects. Efficient application of this approach to nonlinear dynamical objects requires a combination of methods to be used in order to obtain a good model and to reasonably limit the amount of required computations. An overview of such methods with the example of a complex model construction is provided in this paper.

PL

The use of optimization in constructing mathematical models of objects is a widely applied approach. To optimize nonlinear dynamical objects efficiently, techniques that simplify the problem and speed up the computations must be used. The paper reviews such methods and then applies them to construct a complex model.

8

Evaluation of efficient computational work division in parallel Monte Carlo grain growth algorithm

Sitko M., Madej Ł.

Computer Methods in Materials Science | 2016 | Vol. 16, No. 3 | 113--120

EN

Implementation of parallel version of the Monte Carlo (MC) grain growth algorithm is the subject of the present paper. First, modifications of the classical MC grain growth algorithm required for the parallel execution are presented. Then, schemes for the MC space division between subsequent computational threads/nodes are discussed. Finally, implementation details of different parallelization approaches based on OpenMP and MPI are presented and compared.

PL

The paper presents an implementation of a parallel version of the grain growth algorithm based on the Monte Carlo (MC) method. The first part presents the modifications of the classical MC-based grain growth algorithm that allow the application to run in parallel. Next, different divisions of the computational space between individual computational subdomains are described. The results of the presented implementation, based on OpenMP and MPI, are presented and compared in terms of computational speedup and the maximum reduction of simulation execution time.

9

A parallelized model for coupled phase field and crystal plasticity simulation

Lin M., Prahl U.

Computer Methods in Materials Science | 2016 | Vol. 16, No. 3 | 156--162

EN

The predictive simulation of materials with strong interaction between microstructural evolution and mechanical deformation requires the coupling of two or more multi-physics models. The coupling between the phase-field method and various mechanical models has drawn growing interest. Here, we propose a coupled multi-phase-field and crystal plasticity model that respects the anisotropic mechanical behavior of crystalline materials. The difference in computational complexity and solver requirements between these models presents a challenging problem for coupling and parallelization. The proposed method enables parallel computation of both models using different numerical solvers with different time discretization. Finally, two demonstrative examples are given with an application to the austenite-ferrite transformation in iron-based alloys.

PL

Obtaining realistic computational capabilities for material models that link microstructure evolution with deformation requires coupling two or more physical models. The coupling between the phase-field model and various mechanical models has recently attracted the interest of researchers. This work proposes coupling a multi-phase-field model with a crystal plasticity model that accounts for the anisotropic behavior of polycrystalline materials. The difference in computational complexity and solver requirements between these models is a challenge for coupling and parallelizing the computations. The proposed method enables parallel computation with both models by using numerical solvers with different time discretization. The work concludes with two examples applying the method to the austenite-ferrite transformation in iron alloys.

10

Practical Implementation of Prestack Kirchhoff Time Migration on a General Purpose Graphics Processing Unit

Liu G., Li C.

Acta Geophysica | 2016 | Vol. 64, no. 4 | 1051--1063

EN

In this study, we present a practical implementation of prestack Kirchhoff time migration (PSTM) on a general purpose graphics processing unit. First, we consider the three main optimizations of the PSTM GPU code, i.e., designing a configuration based on a reasonable execution, using the texture memory for velocity interpolation, and applying an intrinsic function in device code. This approach can achieve a speedup of nearly 45 times on an NVIDIA GTX 680 GPU compared with CPU code when a larger imaging space is used, where the PSTM output is a common reflection point that is gathered as I[nx][ny][nh][nt] in matrix format. However, this method requires more memory space, so the limited imaging space cannot fully exploit the GPU resources. To overcome this problem, we designed a PSTM scheme with multi-GPUs for imaging different seismic data on different GPUs using an offset value. This process can achieve the peak speedup of the GPU PSTM code and it greatly increases the efficiency of the calculations without changing the imaging result.

11

How message passing interface (MPI) accelerates a coalescent-based whole genome simulator

Cyran K. A., Myszor D.

Studia Informatica | 2014 | Vol. 35, nr 4 | 59--72

PL

Computer simulations are regarded as one of the pillars of modern science. The article describes a further kind of optimization of the program GENOME: A rapid coalescent-based whole genome simulator, aimed at shortening the time spent waiting for results. The modifications are based on parallelizing the execution of processes using MPI technology and HPC clusters. The resulting solution was tested on the Ziemowit HPC cluster, which belongs to Śląska Biofarma. The results indicate that the modifications significantly shorten the application's execution time.

EN

Computer simulations are one of the pillars of contemporary science. In this paper we present a further type of improvement introduced into GENOME: A rapid coalescent-based whole genome simulator. The modifications are based on parallelization of processes with the use of MPI technology. The influence of the introduced modifications has been tested on the Ziemowit HPC cluster, which is installed at Silesian Biofarma. The results indicate that the time needed to generate outcomes can be reduced significantly if the proposed modifications are applied.

12

Dyskretyzacja z nadzorem tablic danych przy użyciu wielordzeniowego procesora karty graficznej (GPU) [Supervised discretization of data tables using a multi-core graphics card processor (GPU)]

Maciura Ł.

Przegląd Elektrotechniczny | 2014 | R. 90, nr 5 | 114--117

PL

This article describes an algorithm developed for discretizing data tables, which massively parallelizes the computation of the optimal cut by examining a very large number of attributes simultaneously on a multi-core graphics card processor (GPU) and the central processor (CPU). This is made possible by NVIDIA CUDA technology. The article also compares the speed of the traditional and the parallelized algorithm.

EN

This paper describes an algorithm developed for the discretization of arrays, which massively parallelizes the calculation of the optimal cut by simultaneously examining a large number of attributes using a multi-core graphics card processor (GPU) and the central processing unit (CPU). This is made possible by NVIDIA CUDA technology. The paper also compares the speed of the traditional and the parallelized algorithm.

13

The Impact of Calculation Precision on the Process of Mathematical Model Construction with the Use of Optimization

Stakhiv P., Byczkowska-Lipińska L., Kozak Y.

Przegląd Elektrotechniczny | 2013 | R. 89, nr 3a | 283--285

EN

Using the computing power of GPU cards for macromodel construction based on the optimization approach can significantly decrease model construction time. Unfortunately, most GPU cards do not work well with double-precision calculations. The paper compares optimization processes conducted in single precision and in double precision. It is shown that reducing the computation precision to single-precision values worsens neither the precision of the obtained model nor the number of required iterations of the optimization algorithm.

PL

Using the computing power of graphics card processors to build mathematical macromodels can shorten computation time. Unfortunately, most processors used as GPUs do not work in double precision. The article shows that reducing the computation accuracy to single precision neither degrades the quality of the obtained model nor increases the number of iterations of the optimization algorithm.

14

Parallelized algorithms for finding similar images and object recognition

Frączek R., Cyganek B., Wiatr K.

Computer Science | 2013 | Vol. 14 (1) | 113--127

EN

The paper addresses the issue of searching for similar images and objects in a repository of information. The contained images are annotated with the help of sparse descriptors. In the presented research, different color and edge histogram descriptors were used. To measure similarities among images, various color descriptors are compared. For this purpose, different distance measures were employed. In order to decrease execution time, several code optimization and parallelization methods are proposed. The results of these experiments, as well as a discussion of the advantages and limitations of different combinations of methods, are presented.

15

Parallelization of the Block Encryption Algorithm Based on Logistic Map

Burak D.

Przegląd Elektrotechniczny | 2012 | R. 88, nr 10b | 198--200

EN

In this paper the results of parallelizing the block encryption algorithm based on logistic map are presented. The data dependence analysis of loops was applied in order to parallelize this algorithm. The OpenMP standard is used for presenting the parallelism of the algorithm. The efficiency measurement for a parallel program is shown.

PL

The article presents the results of parallelizing a block encryption algorithm based on the logistic map. Data dependence analysis was applied in order to parallelize the algorithm. The OpenMP standard was used to express the parallelism of the algorithm. The results of efficiency measurements for the parallel program are shown.

16

Parallelization of calculations using GPU in optimization approach for macromodels construction

Stakhiv P., Strubytska I., Kozak Y.

Przegląd Elektrotechniczny | 2012 | R. 88, nr 3a | 7--9

EN

Construction of mathematical models for nonlinear dynamical systems using optimization requires significant computational effort to solve the optimization task. Most of the CPU time is spent by the optimization procedure on goal function calculations, which are repeated many times for different model parameters. This makes it possible to parallelize the calculations on processors with a SIMD architecture. The effectiveness of such parallelization is investigated in this paper.

PL

Solving optimization problems for nonlinear dynamical systems requires a large computational effort. Most of the processor time is consumed by computing the value of the goal function, which is repeated many times for different model parameters. This makes it possible to use the SIMD architecture to parallelize the computations. The subject of the presented research is the effectiveness of such parallelization.

17

Acceleration of information-theoretic data analysis with graphics processing units

Sluga D., Curk T., Zupan B., Lotrič U.

Przegląd Elektrotechniczny | 2012 | R. 88, nr 2 | 136--139

EN

Information-theoretic measures are frequently employed to assess the degree of feature interactions when mining attribute-value data sets. For large data sets, obtaining these measures quickly poses an unmanageable computational burden. In this work we examine the applicability of consumer graphics processing units supporting CUDA architecture to speed-up the computation of information-theoretic measures. Our implementation was tested on a variety of data sets, and compared with the performance of sequential algorithms running on the central processing unit.

PL

Information measures such as mutual information are often used to determine the degree of feature interdependence when mining data sets described by attributes. For large data sets, computing these measures directly leads to a significant increase in computational cost. The work is devoted to the possibilities of using programmable graphics cards to speed up the computation of information measures. Our implementation was tested on various data sets and compared with a sequential implementation on the central processor.
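Mutual information, the example measure named in this entry, can be computed for two discrete attributes from simple co-occurrence counts. The Python sketch below is a sequential reference version of that calculation, of the kind a GPU implementation would be compared against. It is not the authors' code, and the function name is hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information, in bits, between two equally long sequences
    of discrete attribute values, estimated from relative frequencies."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)   # marginal counts
    pxy = Counter(zip(xs, ys))          # joint counts
    # I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())
```

Scoring feature interactions requires this quantity for many attribute pairs, and each pair can be processed independently, which is what makes the workload a natural fit for a GPU.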

18

Parallelization of the ARIA Encryption Standard

Burak D.

Pomiary Automatyka Kontrola | 2012 | R. 58, nr 2 | 222--225

EN

This paper presents the results of parallelizing the ARIA encryption standard. The data dependence analysis of loops was applied in order to parallelize this algorithm. The OpenMP standard was chosen for expressing the parallelism of the algorithm. It is shown that the standard can be divided into parallelizable and unparallelizable parts. The study found that the most time-consuming loops of the algorithm are suitable for parallelization. The efficiency measurement for a parallel program is presented.

PL

The article presents the process of parallelizing the Korean encryption standard ARIA. Data dependence analysis of program loops was carried out in order to reduce the data dependences that block parallelization of the algorithm. The OpenMP standard, version 3.0, was chosen to express the parallelism of the most computationally time-consuming loops, which are responsible for encrypting and decrypting data in the form of data blocks. It is shown that the parallelized version of the algorithm consists of a sequential part, containing the input/output instructions, and a parallel part, with the most time-consuming program loops parallelized effectively. Speedup measurements are included for the parallelized encryption standard and for the data encryption and decryption processes using two, four, eight, sixteen, and thirty-two threads on an eight-processor server based on quad-core Intel Xeon processors.

19

Massive Jacobi Power Flow Based On SIMD-Processor

Vilacha Perez C., Moreira Meira J. C., Miguez Garcia E., Fernandez Otero A.

Przegląd Elektrotechniczny | 2011 | R. 87, nr 10 | 236--240

EN

This paper presents an implementation of the Jacobi power flow algorithm to be run on a single instruction multiple data (SIMD) unit processor. The purpose is to be able to solve a large number of power flows in parallel as quickly as possible. This well-known algorithm was modified taking into account the characteristics of the SIMD architecture. The results show a significant speed-up of the algorithm compared to the time required to solve the problem on a conventional CPU, even when a more efficient sequential algorithm, such as Newton-Raphson, is used. The accuracy of the results has been validated against the results for the IEEE-118 standard network. This paper also shows a case where the proposed algorithm is used to calculate a statistical load power flow using the Monte Carlo method.

PL

The article presents an implementation of the Jacobi power flow algorithm intended to run on a SIMD (single instruction, multiple data) processor. The aim is to solve a large number of power flows in parallel in the shortest possible time. This well-known algorithm was modified to take into account the characteristics of the SIMD architecture. The results indicate a significant speedup of the algorithm compared with the time needed to solve the problem on a conventional CPU, even when the most efficient sequential algorithm, such as the Newton-Raphson method, is used. The accuracy of the results was confirmed by comparison with the results obtained for the IEEE-118 standard network. The article also presents a case in which the proposed algorithm is used for statistical power flow calculations using the Monte Carlo method.

20

Hybrid-parallel formulation of fundamental quantum-chemical algorithms

Mazur G., Makowski M., Kuna D.

Computer Science | 2011 | Vol. 12 | 163--168

EN

Hybrid-parallel variants of Hartree-Fock, Kohn-Sham and second-order Møller-Plesset perturbation theory are described. Their efficiency with respect to the serial and MPI-based parallel implementations is measured and briefly analyzed. It is shown that while hybrid parallelization provides increased efficiency in all cases, the magnitude of the effect strongly depends on the features of the particular algorithm.

PL

Hybrid-parallel variants of the Hartree-Fock and Kohn-Sham methods and of second-order Møller-Plesset perturbation theory are presented. Their efficiency is compared with the serial implementation and with the implementation parallelized using the message passing interface (MPI). It is shown that hybrid parallelization provides increased efficiency in all analyzed cases, while the magnitude of the speedup strongly depends on the features of the particular algorithm.