Search results - BazTech

1

Performance and scalability experiments with a large-scale air pollution model on the EuroHPC petascale supercomputer DISCOVERER

Ostromsky Tzvetan

Annals of Computer Science and Information Systems | 2022 | Vol. 32 | pp. 81-84

EN

The basic parallel versions of the Danish Eulerian Model (UNI-DEM) have been implemented on the new petascale supercomputer DISCOVERER, installed last year in Sofia, Bulgaria, by the company Atos. DISCOVERER is part of the European High Performance Computing Joint Undertaking (EuroHPC), which is building a network of 8 powerful supercomputers across the European Union (3 pre-exascale and 5 petascale). The results of scalability experiments with the basic MPI and the hybrid MPI-OpenMP parallel implementations of UNI-DEM on DISCOVERER are presented here. They are compared with similar earlier experiments performed on the Mare Nostrum III supercomputer (also petascale) at the Barcelona Supercomputing Centre, the most powerful supercomputer in Spain at that time, which is currently being upgraded to the pre-exascale Mare Nostrum V, also part of the EuroHPC JU infrastructure.

2

Parallel computing of two-parameter bifurcation diagrams of an electric arc model with chaotic dynamics using Nvidia CUDA and OpenMP technologies

Pala Artur, Machaczek Marek

Przegląd Elektrotechniczny | 2019 | Vol. 95, No. 3 | pp. 138-142

EN

This paper presents parallel and massively parallel calculations of two-parameter bifurcation diagrams of an electric arc model. A simple dynamical model of the electric arc is used; such a model can exhibit complex two-parameter bifurcations with periodic and chaotic responses. Two different parallel computing technologies were used to implement the calculations: the parallel computations use the OpenMP library on CPUs, and the massively parallel computations use Nvidia CUDA technology on GPUs.

PL

The article presents parallel and massively parallel computations of two-parameter bifurcation diagrams for an electric arc model. The analysis uses a dynamical model of the electric arc with periodic and chaotic responses. Two different technologies were used to carry out the computations. The parallel computations were implemented using the OpenMP library and CPU processors. The massively parallel computations were implemented using Nvidia CUDA technology and GPU processors.

3

Hybrid MPI/Open-MP acceleration approach for high-order schemes for CFD

Saczek Michał, Wawrzak Karol, Tyliszczak Artur, Boguslawski Andrzej

TASK Quarterly : scientific bulletin of Academic Computer Centre in Gdansk | 2018 | Vol. 22, No. 3 | pp. 179-193

EN

The paper presents a hybrid MPI+OpenMP (Message Passing Interface/Open Multi-Processing) algorithm used for parallel programs based on the high-order compact method. The main tools used to implement parallelism in the computations are OpenMP and MPI, which differ in the memory model they are based on: OpenMP works on shared memory and MPI on distributed memory, whereas the hybrid model combines the two. The tests performed and described in this paper show significant advantages of the combined MPI/OpenMP approach. The test computations needed to verify the capabilities of MPI, OpenMP, and the hybrid of both tools were carried out using the academic high-order solver SAILOR. The results obtained seem very promising for accelerating simulations of fluid flows and, more generally, applications using high-order methods.

4

Using GPU Accelerators for Parallel Simulations in Material Physics

Uchroński M., Potasz P., Szymańska-Kwiecień A., Hruszowiec M.

Computational Methods in Science and Technology | 2018 | Vol. 24, No. 4 | pp. 249-258

EN

This work focuses on parallel simulation of electron-electron interactions in materials with non-trivial topological order (i.e. Chern insulators). The problem of interacting electron systems can be solved by diagonalizing a many-body Hamiltonian matrix in a basis of configurations of electrons distributed among the possible single-particle energy levels (the configuration interaction method). The number of possible configurations grows exponentially with the number of electrons and energy levels; 12 electrons occupying 24 energy levels corresponds to a Hilbert space dimension of about 10^6. Solving such a problem requires effective computational methods and highly efficient optimization of the source code. The work focuses on many-body effects related to strongly interacting electrons on flat bands with non-trivial topology. Such systems are expected to be useful for studying and understanding new topological phases of matter, and in the future they may be used to design novel nanomaterials. A heterogeneous architecture based on GPU accelerators and MPI nodes will be used to improve the performance and scalability of the parallel solution of interacting electron systems.
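The Hilbert-space dimension quoted in the abstract can be checked with a short count. Assuming spinless electrons and at most one electron per single-particle level (an assumption; the abstract does not state how spin is treated), a configuration is a choice of 12 occupied levels out of 24:

```latex
\dim\mathcal{H} = \binom{24}{12} = \frac{24!}{12!\,12!} = 2\,704\,156 \approx 2.7\times 10^{6},
\qquad
\binom{48}{24} = 32\,247\,603\,683\,100 \approx 3.2\times 10^{13}.
```

Doubling both the number of electrons and the number of levels enlarges the basis by about seven orders of magnitude, which illustrates why such problems call for GPU- and MPI-based parallelism.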

5

Optimization of Machine Learning Process Using Parallel Computing

Grzeszczyk Michał K.

Advances in Science and Technology. Research Journal | 2018 | Vol. 12, No. 4 | pp. 81-87

EN

The aim of this paper is to discuss the use of parallel computing in supervised machine learning processes in order to reduce computation time. This way of computing has gained popularity because sequential computing is often insufficient for large-scale problems such as complex simulations or real-time tasks. After presenting the foundations of machine learning and neural network algorithms, as well as three types of parallel models, the author briefly describes the experiments carried out and the results obtained. The image recognition experiments, run on five empirical datasets, show a significant reduction in computation time compared to classical algorithms. Finally, possible directions for further research on parallel optimization of computation time in supervised perceptron learning are briefly outlined.

6

Polyhedral Source-to-Source Compiler

Adamski D., Jabłoński G., Perek P., Napieralski A.

Elektronika : konstrukcje, technologie, zastosowania | 2016 | Vol. 57, No. 12 | pp. 3-13

EN

This paper describes a novel Polyhedral Source-to-Source Compiler (PSSC) that automatically recognizes parallel regions of C/C++ code and annotates them with OpenMP/OpenACC pragmas. The proposed source-to-source compiler uses the polyhedral model to detect and optimize parallel loops. Loop optimization is performed on the intermediate code representation by the Polly compiler, and the result is then mapped back to the original source code. This approach combines the simplicity and efficiency of intermediate representation (IR) code optimization with the readability of the output code. Experimental results show that the proposed compiler achieves performance comparable to that of the original Polly compiler.

PL

The article describes a novel source-to-source compiler that uses the polyhedral model to automatically detect C/C++ code that can be executed in parallel. Source code fragments that can be parallelized are annotated with OpenMP/OpenACC pragmas. The compiler tracks the changes introduced into the intermediate code by the Polly compiler and then maps these transformations onto the source code. The approach presented in the article combines the benefits of optimizing intermediate code with the ease of porting high-level code across platforms. Performance measurements showed that the compiler parallelizes high-level code as efficiently as the underlying Polly compiler.

7

Evaluation of efficient computational work division in parallel Monte Carlo grain growth algorithm

Sitko M., Madej Ł.

Computer Methods in Materials Science | 2016 | Vol. 16, No. 3 | pp. 113-120

EN

The implementation of a parallel version of the Monte Carlo (MC) grain growth algorithm is the subject of the present paper. First, the modifications of the classical MC grain growth algorithm required for parallel execution are presented. Then, schemes for dividing the MC space between computational threads/nodes are discussed. Finally, implementation details of different parallelization approaches based on OpenMP and MPI are presented and compared.

PL

The work presents an implementation of a parallel version of the Monte Carlo (MC) grain growth algorithm. The first part presents the modifications of the classical MC-based grain growth algorithm that allow the application to run in parallel. Next, different ways of dividing the computational space among the individual computational subdomains are described. The results of the OpenMP- and MPI-based implementation are presented and compared in terms of computational speedup and the maximum reduction of simulation execution time.

8

A parallelized model for coupled phase field and crystal plasticity simulation

Lin M., Prahl U.

Computer Methods in Materials Science | 2016 | Vol. 16, No. 3 | pp. 156-162

EN

The predictive simulation of materials with strong interaction between microstructural evolution and mechanical deformation requires coupling two or more multi-physics models. The coupling between the phase-field method and various mechanical models has drawn growing interest. Here, we propose a coupled multi-phase-field and crystal plasticity model that accounts for the anisotropic mechanical behavior of crystalline materials. The differences in computational complexity and solver requirements between these models make coupling and parallelization challenging. The proposed method enables parallel computation of both models using different numerical solvers with different time discretizations. Finally, two demonstrative examples are given, with an application to the austenite-ferrite transformation in iron-based alloys.

PL

Achieving realistic predictive capability in material models that couple microstructure evolution with deformation requires coupling two or more physical models. The coupling between the phase-field model and various mechanical models has recently attracted researchers' interest. The work proposes coupling a multi-phase-field model with a crystal plasticity model that accounts for the anisotropic behavior of polycrystalline materials. The difference in computational complexity and solver requirements between these models poses a challenge for coupling and for parallelizing the computations. The proposed method enables the computations of both models to be parallelized by using numerical solvers with different time discretizations. The work concludes with two examples applied to the austenite-ferrite transformation in iron alloys.

9

Comparison of OpenMP and CUDA parallel computing methods (original title: Porównanie metod obliczeń równoległych OpenMP i CUDA)

Maj Michał

Zeszyty Naukowe WSEI. Seria Transport i Informatyka | 2015 | Vol. 5, No. 1 | pp. 19-27

PL

Parallel programming means writing programs so that they can be executed simultaneously on many processors. For this article, two parallelized programs were written (one in CUDA C and one in OpenMP, targeting the CPU), together with one sequential (non-concurrent) program. The fastest approach turned out to be the CUDA program that uses zero-copy (non-copied) memory. A drawback of CUDA is that it works only with NVIDIA hardware.

EN

Parallel programming means developing programs that can be executed truly concurrently on multiprocessor platforms. For the tests, two parallel programs were developed, one in the CUDA C language and the other using the OpenMP library, together with an equivalent sequential (non-parallel) program. The most efficient parallelization was achieved by the CUDA program with page-locked memory. CUDA is handicapped by being limited to NVIDIA hardware.

10

Parallel computing techniques on enhancement of thermal-hydraulic analysis of fluid flow networked systems

Fedorov M.

Przegląd Elektrotechniczny | 2015 | Vol. 91, No. 3 | pp. 18-23

EN

The considerable computation time of sequential algorithms for simulating thermal and flow distribution in fluid flow networked systems (FFNSs) in practical applications motivates the study of their parallel implementation. The mathematical model formulated and studied in the paper requires the solution of a set of nonlinear equations, which are solved by the Newton-Raphson method. An object-oriented solver automatically formulates the equations for networks of arbitrary topology. The hydraulic model chosen as a benchmark consists of nodal flow equations and loop equations. A general decomposition algorithm for the analysis of flow and temperature distribution in an FFNS is presented, and the speedup of its parallel implementation is demonstrated.

PL

A model is proposed for the parallel simulation of the static analysis of liquid flow in flow networks. The model reduces to solving systems of nonlinear equations with the Newton-Raphson method. A decomposition algorithm for analyzing the distribution of flows and temperatures in a flow network is presented, together with speedup results for its parallel implementation.

11

Analysis of parallelisation of 3D-CEMBS model using technologies like OpenACC and OpenMP

Piotrowski P.

Biuletyn Instytutu Morskiego w Gdańsku | 2015 | Vol. 30, No. 1 | pp. 10-15

EN

Oceanographic models utilise parallel computing techniques to increase their performance. Computer hardware constantly evolves, and software should follow to better utilise the potential of modern hardware. The number of CPU cores with access to shared memory increases as hardware evolves. To fully utilise the possibilities that new hardware presents, the parallelisation techniques employed in oceanographic models, which were designed with distributed-memory systems in mind, have to be revised. This research analyses the 3D-CEMBS model to assess the feasibility of using OpenMP and OpenACC technologies to increase performance, using static code analysis and profiling. The findings show that the main performance problems are attributable to a task decomposition designed for distributed-memory systems; to fully utilise modern shared-memory systems, other task decomposition strategies need to be employed. The presented 3D-CEMBS model analysis is the first stage of wider research on oceanographic models as a specific class of parallel applications. In the long term, the research is intended to propose design patterns tailored to oceanographic models that exploit their characteristics to achieve better utilisation of evolving hardware architectures.

PL

Oceanographic models use parallel processing to increase performance. Computer hardware keeps evolving, so software should change along with it to fully exploit the potential of modern hardware. As hardware develops, the number of processor cores with access to shared memory increases. To fully use the capabilities of new hardware, the parallelization techniques used in oceanographic models must be revised. Oceanographic models were often designed with distributed-memory systems in mind. This research focuses on analyzing the 3D-CEMBS model with respect to using OpenMP and OpenACC technologies to increase the model's performance. To this end, static code analysis and profiling of the model were carried out. The results show that the model's main performance problem stems from a task decomposition designed for distributed-memory systems. To fully exploit modern shared-memory computers, other task decomposition strategies must be introduced.

12

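Using multithreading in the design and implementation of an efficient game engine (original title: Wykorzystanie wielowątkowości w projektowaniu i implementacji wydajnego silnika gier)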

Jabłoński S.

Zeszyty Naukowe Wydziału ETI Politechniki Gdańskiej. Technologie Informacyjne | 2014 | Vol. 22 | pp. 93-104

PL

The article aims to present the design and implementation of efficient game engines that use multiprocessor hardware architectures. The work analyzes examples of multithreaded programming in the context of computer games and game engines. The author presents example models of parallel processing and reference architectures of game engines that use native system threading interfaces and OpenMP technology. Particular attention is paid to communication and synchronization methods in computer games. The second part of the publication presents the Intel Threading Building Blocks library, its features, and its architectural differences from other available technologies. The article concludes with a practical application of Intel Threading Building Blocks in the author's own game engine, AyumiEngine.

EN

The paper aims to present the design and implementation of efficient game engines that use multiprocessor hardware architectures. The article analyzes examples of multithreading in computer games and game engines. The author presents parallel processing architecture models and examples of game engines that use native system thread interfaces and OpenMP technology. Particular attention is devoted to communication and synchronization methods in computer games. The second part of the publication presents the features of the Intel Threading Building Blocks library and its differences from other technologies. The article concludes with an example of practical implementation in the author's game engine, AyumiEngine.

13

The comparison of parallel sorting algorithms implemented on different hardware platforms

Żurek D., Pietroń M., Wielgosz M., Wiatr K.

Computer Science | 2013 | Vol. 14, No. 4 | pp. 679-691

EN

Sorting is a common problem in computer science. There are many well-known sorting algorithms designed for sequential execution on a single processor. Recently, many-core and multi-core platforms have enabled the creation of highly parallel algorithms: standard processors consist of multiple cores, and hardware accelerators such as GPUs are available. Graphics cards, with their parallel architecture, provide new opportunities to speed up many algorithms. In this paper, we describe the results of implementing several different parallel sorting algorithms on GPU cards and multi-core processors. Then, a hybrid algorithm is presented, consisting of parts executed on both platforms (a standard CPU and a GPU). Recent literature on implementing sorting algorithms on GPUs lacks a fair comparison between many-core and multi-core platforms; in most cases, it reports the execution time of sorting algorithms on the GPU platform against a single CPU core.

14

Parallelized algorithms for finding similar images and object recognition

Frączek R., Cyganek B., Wiatr K.

Computer Science | 2013 | Vol. 14, No. 1 | pp. 113-127

EN

The paper addresses the issue of searching for similar images and objects in an information repository. The stored images are annotated with sparse descriptors. In the presented research, different color and edge histogram descriptors were used. To measure similarity among images, various color descriptors are compared using different distance measures. To decrease execution time, several code optimization and parallelization methods are proposed. The results of these experiments are presented, together with a discussion of the advantages and limitations of different combinations of methods.

15

Parallelization of the Block Encryption Algorithm Based on Logistic Map

Burak D.

Przegląd Elektrotechniczny | 2012 | Vol. 88, No. 10b | pp. 198-200

EN

In this paper, the results of parallelizing a block encryption algorithm based on the logistic map are presented. Data dependence analysis of loops was applied in order to parallelize the algorithm, and the OpenMP standard is used to express its parallelism. Efficiency measurements for the parallel program are shown.

PL

The article presents the results of parallelizing a block encryption algorithm based on the logistic map. Data dependence analysis was applied to parallelize the algorithm. The OpenMP standard was used to express the algorithm's parallelism. Measurements of the parallel program's efficiency are shown.

16

Use of the tiling method inside synchronization of free slices of code in OpenMP standard in order to achieve speedup enhancement

Gozdalik M.

Pomiary Automatyka Kontrola | 2012 | Vol. 58, No. 2 | pp. 202-205

EN

In the last few years, many methods aimed at enhancing the speedup of parallel programs have been developed. In this paper, three methods are tested with respect to speedup enhancement: tiling, slicing, and tiling inside slicing. Sections 3, 4, and 5 describe the theoretical basis of the chosen transformations, and the transformation algorithms are presented as operations on a polyhedral model. The costs of the transformations are also discussed. The UTDSP benchmark was used for the experimental studies, with one representative sample chosen from each of its sections. The results were also examined with respect to data locality.

PL

The article addresses the problem of choosing a loop transformation method to obtain the highest possible speedup. The UTDSP benchmark from the University of Toronto was selected for the study. A representative was chosen from each section of the benchmark and subjected to the tiling and slicing transformations and to tiling inside slicing. The first chapter introduces loop transformations. The second chapter gives theoretical background on the polyhedral model as the loop representation on which the transformations are performed; the resulting model is the basis for generating source code. The following chapters give a theoretical description of the tiling and slicing transformations, presenting algorithms for constructing them together with the mathematical operations that describe them on the polyhedral model. The final part of the work examines the effect of the selected transformations on program speedup. The results are presented as aggregated speedup charts for the individual applications.

17

Parallelization of the ARIA Encryption Standard

Burak D.

Pomiary Automatyka Kontrola | 2012 | Vol. 58, No. 2 | pp. 222-225

EN

In this paper, the results of parallelizing the ARIA encryption standard are presented. Data dependence analysis of loops was applied in order to parallelize the algorithm, and the OpenMP standard was chosen to express its parallelism. It is shown that the standard can be divided into parallelizable and non-parallelizable parts. The study found that the most time-consuming loops of the algorithm are suitable for parallelization. Efficiency measurements for the parallel program are presented.

PL

The article presents the process of parallelizing ARIA, the Korean encryption standard. Data dependence analysis of program loops was carried out to reduce the data dependences that block parallelization of the algorithm. OpenMP version 3.0 was chosen to express the parallelism of the most computationally expensive loops, which are responsible for encrypting and decrypting data in blocks. It is shown that the parallelized version of the algorithm consists of a sequential part containing the input/output instructions and a parallel part, with the most time-consuming loops parallelized efficiently. Speedup measurements are included for the parallelized encryption standard and for the encryption and decryption processes using two, four, eight, sixteen, and thirty-two threads on an eight-processor server built with Quad Core Intel Xeon processors.

18

Automatic parallelization of embedded-system application code (original title: Automatyczne zrównoleglanie kodu aplikacji systemów wbudowanych)

Pałkowski M.

Pomiary Automatyka Kontrola | 2010 | Vol. 56, No. 7 | pp. 656-658

PL

The article presents a technique for automatically parallelizing application code in order to use the computing power of multi-core processors in embedded systems efficiently. The technique is based on analyzing data dependences in program loops, partitioning their iteration spaces, and identifying independent code fragments. The result of the transformation is parallel code that complies with the OpenMP standard and is equivalent to its sequential counterpart, making it possible to speed up computations on an industrial computer.

EN

In a fairly conservative group of solutions, such as industrial computers, increasingly advanced miniaturization of processing units is becoming noticeable. The size and power consumption of these units are important, but processing efficiency is also significant. Installing multi-core processors in embedded systems makes it possible to run parallel code written with the OpenMP standard. Multi-core programming speeds up calculations; for example, test and measurement-processing systems can process more measurement data. For this purpose, techniques for transforming program code into a parallel form are necessary, and loop parallelization transformations are particularly significant because the vast majority of calculations take place in loops. There are many loop parallelization techniques, such as unimodular and affine transformations. However, these techniques extract parallelism only for a specific set of loops and can fail to find the full parallelism available in a loop. In this paper, the Iteration Space Slicing Framework is presented. The framework was designed to extract parallelism in loops automatically and to overcome the limitations of well-known techniques. The result of the transformation is parallel code with OpenMP pragmas. The speedup, efficiency, and locality of the code are examined, and future continuation of the work is discussed.

19

Automatic tuning framework for parallelized programs

Burak D., Radziewicz M., Wierciński T.

Pomiary Automatyka Kontrola | 2010 | Vol. 56, No. 12 | pp. 1526-1528

EN

The complexity of computers has grown tremendously in recent years, partly because multi-processor and multi-core architectures are in widespread use. Programs must be parallelized to make the most of the computing power of multi-core processors. Using parallelizing compilers for automatic parallelization and data locality optimization of sequential programs reduces software costs. This paper describes the WIZUTIC Compiler Framework, developed at the Faculty of Computer Science and Information Technology of the West Pomeranian University of Technology. The application uses the source code of the PLUTO parallel compiler developed at the Ohio State University by Uday Bondhugula. Simulated annealing and the Bees algorithm are used to find suitable source code transformations for given program features. Experimental results using the Data Encryption Standard (DES) algorithm are described, and the speedups of the encryption and decryption processes are presented.

PL

The article presents WIZUTIC, the authors' own compiler for parallelization and data locality optimization, and its use to reduce the processing time of the DES encryption algorithm. WIZUTIC, a source-to-source compiler for C code, was built from the source code of Uday Bondhugula's PLUTO compiler, which optimizes data locality with the tiling transformation and parallelizes program loops using coarse-grained parallelism. Compilation used an iterative compilation technique and two optimization methods, simulated annealing (SA) and the bees algorithm (BA), to determine a suitable tile size for the tiling transformation. Experimental results are presented for the DES algorithm in ECB mode. The experiments were carried out on an 8-processor machine with Quad Core Intel Xeon E7310 processors, using the GNU GCC compiler with OpenMP version 3.0 and the Intel VTune code profiler.

20

A fuzzy model in speedup prediction process for parallel applications written in OpenMP

Gozdalik M.

Pomiary Automatyka Kontrola | 2010 | Vol. 56, No. 12 | pp. 1484-1487

EN

A common method to assess code parallelization quality is to measure program execution time and calculate speedup and efficiency. Generally, both the parallel and the sequential programs must be executed and their execution times captured to determine these quality parameters. However, with a good profiling tool, it is easier to determine parameters such as the bus utilization ratio than to measure program execution time. Given information about processor and memory ratios, it is possible to estimate the quality parameters with satisfactory results. This paper provides an example solution for predicting the effectiveness of parallel programs written in OpenMP. A fuzzy model was designed for this purpose, and results for a matrix multiplication program are presented. The fuzzy model and its modus operandi are described. The parameters for estimating efficiency and speedup were obtained from Intel processor event counts and serve as the input values of the fuzzy model. According to processor events, the input parameters were divided into two groups, each representing one submodel of the whole fuzzy model. This makes it possible to measure only some of the processor events to estimate program efficiency. More details on these parameters are given in separate sections.

PL

The article addresses the problem of assessing the quality of generated parallel code. Measuring program execution time to determine speedup is inefficient and in some cases even infeasible. With a profiling tool dedicated to a given processor type, it is possible to build a model that estimates the efficiency of a running program from level-2 cache and processor parameters. Such a solution makes it possible to assess the quality of the generated code and to decide on that basis whether further optimization is worthwhile. To measure the memory and processor parameters, it is enough to run the program for a certain time slice without waiting for it to finish, and there is no need to modify the program's source code. This article presents a fuzzy model that estimates the efficiency of generated OpenMP source code.
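The quality parameters the model estimates are the usual ones. With T_1 the sequential execution time and T_p the execution time on p threads, speedup and efficiency are defined as follows; the timings in the example are hypothetical and only illustrate the arithmetic:

```latex
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p};
\qquad T_1 = 120\,\mathrm{s},\; T_8 = 20\,\mathrm{s}
\;\Rightarrow\; S(8) = 6,\; E(8) = 0.75.
```

The fuzzy model aims to estimate these values from processor and cache event counts, without running the program to completion to obtain T_p.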