Search results - BazTech

1

Porting of finite element integration algorithm to Xeon Phi coprocessor-based HPC architectures

Krużel Filip, Banaś Krzysztof, Iacomo Mauro

Computer Assisted Methods in Engineering and Science

|

2023

|

Vol. 30, no. 4

427--459

EN

In the present article, we describe the implementation of the finite element numerical integration algorithm for the Xeon Phi coprocessor. The coprocessor was a specialized many-core extension unit for calculations, and its performance was comparable with that of the corresponding GPUs. Its main advantages were the built-in 512-bit vector registers and the ease of transferring existing codes from traditional x86 architectures. In the article, we move the code developed for a standard CPU to the coprocessor. We compare its performance with our OpenCL implementation of the numerical integration algorithm, previously developed for GPUs. The GPU code is tuned to fit the coprocessor by our auto-tuning mechanism. Tests included two types of tasks to solve, using two types of approximation and two types of elements. The obtained timing results allow the performance of highly optimized CPU and GPU codes to be compared with that of a Xeon Phi coprocessor. This article answers whether such massively parallel architectures perform better using the CPU or the GPU programming method. Furthermore, we have compared the Xeon Phi architecture with Intel’s i9 13900K CPU, the latest available at the time of writing. This comparison determines whether the old Xeon Phi architecture remains competitive in today’s computing landscape. Our findings provide valuable insights for selecting the most suitable hardware for numerical computations and the appropriate algorithmic design.

2

Effect of in series and in parallel flow heater configuration of solar heat system for industrial processes

Ghabour Rajab, Korzenszky Péter

Science, Technology and Innovation

|

2021

|

Vol. 14, no. 3-4

18--26

EN

The boiler is an enclosed vessel that transfers the energy from fuel combustion or electricity into hot water or steam. This hot water or pressurized steam is then used to transfer heat to a given heat process. The required amount of hot water or steam usually varies throughout the day, which is also reflected in the daily or monthly load. Therefore, there are several configurations for connecting the boiler to the solar heating system that ensure the required temperature of the final output. The boiler can be connected in series or in parallel to improve the efficiency of the overall process as well as to reduce the running costs. This paper presents a simulation study of a solar heating system for industrial processes. Two flow-heater system configurations are designed to cover the heat demand of a pasteurising factory in Budapest, Hungary. Configuration “A” consists of a solar heating system for hot water preparation using an in-series flow heater configuration, while configuration “B” consists of the same solar system but with a parallel flow heater configuration. These system configurations are modelled using T*sol software to evaluate the system performance under the Hungarian climate from five different aspects: required collector area, glycol ratio, volume flow rate, relative tank capacity, and tank height-to-diameter ratio. According to the optimum design parameters, the in-series configuration is better than the parallel one by 3.14% at 45 m² collector area, 0.45% at 25% glycol ratio, 0.42% at 50 l/h · m² volume flow rate, 2.05% at 50 l/m² relative tank capacity, and 0.42% at 1.8 tank height-to-diameter ratio, respectively. The results show that the in-series configuration is better than the parallel configuration in terms of solar fraction from all five aspects.

3

Naturally parallel measuring system based on FPGA hardware

Długopolski Jacek, Richert Maria

Journal of Machine Construction and Maintenance - Problemy Eksploatacji

|

2019

|

no. 2

113--119

EN

FPGA (Field Programmable Gate Array) technology, usually somewhat overlooked, has been developed alongside microprocessor technology almost from the very beginning. The ability of the system designer or end user to influence the internal structure of the integrated circuit opens up otherwise unattainable possibilities for building flexible, fully massively parallel systems that fit in almost a single integrated circuit. Among other things, this allows fully parallel multi-point measuring systems to be built. This manuscript proposes an architecture for such an exemplary FPGA-based multichannel measurement system and presents the results of its practical use in studying the operation of a tubular heat exchanger in automotive air conditioning.

PL

The technology of FPGA (Field Programmable Gate Array) programmable integrated circuits, often somewhat overlooked, has developed in parallel with microprocessor technology almost from the very beginning. The ability of the system designer or end user to influence the internal structure of the integrated circuit offers possibilities, unattainable with ordinary processors, for building flexible and fully massively parallel systems that fit almost entirely within a single integrated circuit. Among other things, this makes it possible to build fully parallel multi-point measuring systems. This article presents a proposed architecture for such an exemplary FPGA-based multichannel measurement system, together with the results of its practical use in studying the operation of a tubular heat exchanger in automotive air conditioning.

4

On the maximal dimensionality of tiles in tiled code generated by means of Affine Transformations

Bielecki W., Pałkowski M.

Przegląd Elektrotechniczny

|

2015

|

R. 91, nr 11

158-161

EN

Tiling (blocking) is a very important iteration reordering transformation for both improving data locality and extracting loop nest parallelism. Affine transformations are one of the most powerful approaches to generating tiled code. Tile dimensionality has a strong impact on tiled code performance. This paper presents a way to discover, before tiling, the maximal dimensionality of tiles in code generated by means of affine transformations.

XX

Tiling is a very important iteration reordering transformation, both for improving loop locality and for extracting parallelism in a program loop nest. Affine transformations are among the most powerful approaches to implementing the tiling technique. The article presents a way to discover, before tiling is applied, the maximal dimensionality of tiles in code generated by means of affine transformations, which has a strong impact on code performance.
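
To make the notion of tile dimensionality concrete, the following is a minimal hand-written sketch of tiling a three-deep matrix multiplication loop nest with three-dimensional tiles. It is not the affine-transformation-based code generation studied in the paper; the function names and the tile size are assumptions chosen only for illustration.

```python
def matmul_naive(a, b):
    """Reference loop nest: c[i][j] += a[i][p] * b[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation, reordered into tile x tile x tile blocks."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    # The three outer loops enumerate tile origins (3-D tiles).
    for ii in range(0, n, tile):
        for jj in range(0, m, tile):
            for pp in range(0, k, tile):
                # The three inner loops scan the iterations inside one tile;
                # min() clips tiles that stick out of the iteration space.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, m)):
                        for p in range(pp, min(pp + tile, k)):
                            c[i][j] += a[i][p] * b[p][j]
    return c
```

Both functions execute exactly the same set of iterations; only the order differs. Tiling fewer loops (for example only `i` and `j`) would give two-dimensional tiles, which is the kind of choice the maximal tile dimensionality constrains.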

5

Automatic Extraction of Parallelism for Mobile Devices

Pałkowski M.

Przegląd Elektrotechniczny

|

2015

|

R. 91, nr 11

162-166

EN

This paper presents the Iteration Space Slicing (ISS) framework aimed at automatic parallelization of code for Mobile Internet Devices (MID). ISS algorithms permit us to extract coarse-grained parallelism available in arbitrarily nested parameterized loops. The loops are parallelized and transformed into a multi-threaded application for the Android OS. Experiments are carried out by means of the benchmark suites (UTDSP and NPB) using an ARM quad-core processor. Performance benefits and power consumption are studied. Related and future work are discussed.

XX

The article presents the extraction of independent code fragments for mobile devices. The tool enables coarse-grained parallelization of arbitrarily nested parameterized program loops into multi-threaded code for the Android system. Experiments were carried out on benchmark loop suites (UTDSP and NPB) using a quad-core ARM processor. An analysis of performance and power consumption is presented, along with related solutions.

6

Effective expectation maximization algorithm implementation using multicore computer systems

Kasitskij A., Bidyuk P., Gozhyi A.

Informatyka, Automatyka, Pomiary w Gospodarce i Ochronie Środowiska

|

2014

|

nr 4

35--37

EN

The paper considers the popular expectation maximization (EM) algorithm, which is widely used in modern data processing systems to solve various problems, including optimization and parameter estimation. The aim of the study was to reduce the execution time of the algorithm. The execution speed of the EM algorithm was increased by using the multicore architecture of modern computer systems. Necessary modifications aimed at better parallelism were proposed for the implementation of the EM algorithm. The efficiency of the software implementation was tested on the classic problem of separating a mixture of Gaussian random variables. It is shown that in the mixture separation problem the performance of the EM algorithm degrades when the distance between the mean values of the distributions is less than three standard deviations, which is fully consistent with the three-sigma rule. In such cases, it is very important to have an efficient EM algorithm implementation to be able to process them in a reasonable time.

PL

The article describes the popular EM (expectation maximization) algorithm, which is widely used in modern data processing systems to solve various problems, including optimization and parameter estimation. The aim of the study was to reduce the algorithm's execution time. The multicore architecture of modern computer systems was used to increase the execution speed of the EM algorithm. Necessary modifications aimed at better parallelism in the implementation of the EM algorithm were proposed. The effectiveness of the software implementation was tested on the classic problem of separating Gaussian random variables. It was shown that in the mixture separation problem the performance of the EM algorithm degrades when the distance between the means of the distributions is less than three standard deviations, which is fully consistent with the three-sigma rule. In such cases it is very important to have an efficient implementation of the EM algorithm in order to process them in a reasonable time.
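
For readers unfamiliar with the mixture separation problem, here is a minimal sequential sketch of EM for a one-dimensional mixture of two Gaussians. It is an illustration under stated assumptions, not the implementation from the paper: the initialisation, the fixed iteration count and all names are hypothetical. The loop over data points in the E-step is independent per point, which makes it the natural candidate for splitting across cores; the abstract does not say which modifications the authors actually made.

```python
import math


def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_two_gaussians(data, iterations=200):
    """Fit weights, means and standard deviations of a 2-component mixture."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_std = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / n)
    # Assumed crude initialisation: means at the extremes, equal weights.
    weights = [0.5, 0.5]
    means = [min(data), max(data)]
    stds = [overall_std, overall_std]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        # Each point is handled independently (data-parallel part).
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], stds[k]) for k in range(2)]
            total = p[0] + p[1]
            if total == 0.0:  # guard against numerical underflow
                resp.append((0.5, 0.5))
            else:
                resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate parameters from the weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            stds[k] = math.sqrt(max(var, 1e-12))
    return weights, means, stds
```

Calling `em_two_gaussians` on samples drawn from two Gaussians whose means are several standard deviations apart should recover means close to the true ones; moving the means closer than about three standard deviations is the regime in which the abstract reports degraded performance.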

7

Programming synchronization-free parallelism using Intel Threading Building Blocks

Bielecki W., Palkowski M.

Pomiary Automatyka Kontrola

|

2011

|

R. 57, nr 11

1380-1383

EN

Extracting synchronization-free parallelism by means of the Iteration Space Slicing Framework results in parallel pseudo-code that is independent of the parallel computer architecture and of the API/library; hence, it cannot be compiled directly. To produce parallel programs for shared memory multiprocessors, Threading Building Blocks (TBB), a library supporting scalable parallel programming based on the standard C++ language, can be applied. In this paper, we present how to benefit from TBB in practice on the basis of pseudo-code representing synchronization-free slices produced by a tool using the Omega Library. Results of experiments with the NAS benchmark suite are presented.

PL

Applying a technique based on the extraction of synchronization-free parallelism in program loops makes it possible to generate pseudo-code that is independent of the computer architecture and of the programming language or library. Such code cannot be compiled directly; the pseudo-code must be transformed into real parallel code. Intel Threading Building Blocks, a library supporting scalable parallel programming in standard C++, can be used for this purpose. It requires neither a special programming language nor special compilers. An advantage of the Threading Building Blocks library is that it can be run in any hardware and software environment and operating system. The article presents the benefits of building parallel applications with TBB. It explains how statement instances of code fragments are found using the Omega library, how parallel pseudo-code is first created, and how that pseudo-code is then transformed into parallel code using TBB. The proposed approach was verified on a set of test loops from the NAS benchmark. The speed-up and efficiency of the parallel code were examined, as well as its scalability with respect to the varying computation size of the loops studied.

8

Akceleracja obliczeń komputerowych za pomocą układów graficznych z wykorzystaniem technologii CUDA

Stefanowicz Ł., Wiśniewski R., Wiśniewska M.

Pomiary Automatyka Kontrola

|

2011

|

R. 57, nr 8

954-956

PL

The article presents the possibility of using graphics processing units to accelerate computer computations. It describes nVidia's CUDA technology and architecture, as well as the basic extensions relative to the C language standards. The paper discusses the authors' own test algorithms and the research methodology used to determine the effectiveness of accelerating computations with GPUs in comparison with traditional CPU-based solutions.

EN

The paper deals with the application of graphics processing units (GPUs) to the acceleration of computer operations and computations. Traditional computation methods are based on the central processing unit (CPU), which has to handle all computer operations and tasks. Such a solution is particularly ineffective in the case of distributed systems, where some sub-tasks can be performed in parallel. Many parallel threads can accelerate computing, which results in a shorter execution time. The paper presents the new CUDA technology and architecture. The idea of CUDA is to apply GPUs to computation in order to achieve better performance than traditional methods, in which CPUs are used. GPUs can perform multi-threaded calculations. Therefore, especially for tasks where concurrency can be applied, CUDA may greatly speed up the computation process. The effectiveness of CUDA technology was verified experimentally. The investigations and experiments used the authors' own test modules. The library of benchmarks consists of various algorithms, from simple iteration scripts to video processing methods. The results obtained from calculations performed on the CPU and on the GPU are compared and discussed.

9

Using transitive closure and transitive reduction to extract coarse-grained parallelism in program loops

Bielecki W., Pałkowski M., Siedlecki K.

Pomiary Automatyka Kontrola

|

2010

|

R. 56, nr 8

976-979

EN

A technique for extracting coarse-grained parallelism available in loops is presented. It is based on splitting a set of dependence relations into two sets. The first one is used for generating code scanning slices, while the second one permits us to insert send and receive functions to synchronize the execution of the slices. The paper demonstrates how to remove redundant synchronization in the generated code by means of the transitive reduction operation. Results of experiments are discussed: how many synchronization points can be removed, and the speed-up and efficiency of the examined parallel loops.

PL

The article presents a technique for extracting coarse-grained parallelism in program loops. It is based on dividing the dependence relations into two sets: the first is used to generate code scanning independent fragments, while the second is used to insert send and receive functions that synchronize these fragments. These operations were implemented with semaphores, but another construct, more efficient for a given environment, may be used instead. The algorithm generates code with marked synchronization points but does not impose their implementation. The article analyzes a technique for finding and eliminating redundant synchronization points. Extraction of parallelism by means of code fragments is based on the transitive closure operation, also known from graph theory. This operation is also used to compute the transitive reduction, by means of which redundant synchronization is eliminated. Removing unnecessary communication between computation threads is important because its handling cost is significant, especially on shared-memory computers. The goal is therefore to obtain coarse-grained parallel code. The results of the experiments were also examined in terms of speed-up and computational efficiency.
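
The role of transitive reduction in removing redundant synchronization can be illustrated on a small explicit graph. The sketch below assumes that synchronization requirements form a finite acyclic set of (sender, receiver) pairs; the paper works with symbolic dependence relations, so this enumeration-based version and its function names are illustrative assumptions only.

```python
def transitive_closure(edges):
    """Return all pairs (a, b) such that b is reachable from a."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def transitive_reduction(edges):
    """Drop every edge already implied by a longer path (acyclic input)."""
    closure = transitive_closure(edges)
    nodes = {n for edge in edges for n in edge}
    return {
        (a, b)
        for (a, b) in edges
        # (a, b) is redundant if some other node c lies on a path a -> c -> b.
        if not any((a, c) in closure and (c, b) in closure for c in nodes)
    }
```

For synchronization points `{(1, 2), (2, 3), (1, 3)}`, the pair `(1, 3)` is dropped because the ordering it enforces already follows from `(1, 2)` and `(2, 3)`.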

10

Wyznaczenie punktów reprezentatywnych niezależnych fragmentów kodu w grafie zależności pętli programowych

Bielecki W., Pałkowski M., Klimek T.

Metody Informatyki Stosowanej

|

2010

|

nr 1 (22)

13-20

PL

The article presents a new algorithm for determining representative points, characterized by lower computational complexity than the solution in [6-7]. The success of determining the points depends only on computing the exact transitive closure of the union of the loop dependence relations. In addition, a number of basic operations must be performed, such as intersection, scalar product, union, application of a relation to a set, inversion, and projection. The RUSC relation is built in multiple stages, which allows intermediate simplifications of its form. The described approach was implemented and tested for effectiveness on the NAS set of test loops. Further research is planned to examine the proposed algorithm with other sets of test loops and to further improve the algorithms for determining fragments for an arbitrary dependence topology, with a view to generating efficient parallel code.

EN

An algorithm for finding representatives of synchronization-free slices available in program loops is presented. It is based on the transitive closure of a union of dependence relations describing all the dependences in program loops. An algorithm to calculate the transitive closure is studied. Both algorithms are implemented by means of the Omega library. The results of experiments with the NAS Parallel Benchmark are discussed.

11

Extracting representative loop statement instances of synchronization-free slices

Bielecki W., Palkowski M., Beletska A.

Pomiary Automatyka Kontrola

|

2009

|

R. 55, nr 10

807-810

EN

Extracting synchronization-free parallelism by means of the Iteration Space Slicing Framework consists of two steps. First, representative loop statement instances of slices are extracted. Next, slices are reconstructed from their representatives and parallel code scanning slices and elements of each slice is generated. In this paper, we present how to benefit from this technique in practice. We explain how to extract representative loop statement instances of slices by means of the Omega Library enlarged by four new functions allowing us to simplify the process of extracting slice representatives. Results of experiments with the NAS and UTDSP benchmarks are presented.

PL

The development of multicore architectures drives the search for algorithms for automatic parallelization of applications. The article describes the parallelization of program loops by extracting independent code fragments. Extracting synchronization-free parallelism in program loops by partitioning the iteration space consists of two steps. First, the statement instances that are the beginnings of code fragments are found. Next, the code fragments are completed with all their statements and parallel code is generated. The article presents the benefits of this approach. It explains how statement instances of code fragments are found using the Omega library extended with new functions that simplify the search for statements belonging to code fragments. The description of the proposed approach is complemented by a set of experiments on the NAS and UTDSP test loops.
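
The two steps described above can be sketched on an explicitly enumerated iteration space. The sketch assumes that a synchronization-free slice is a weakly connected component of the dependence graph and that its representative is the lexicographically smallest statement instance in it; the paper computes these symbolically with the Omega Library, whereas the union-find enumeration and all names below are illustrative assumptions.

```python
def slices_and_representatives(instances, dependences):
    """Map each slice representative to the sorted instances of its slice."""
    parent = {p: p for p in instances}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Instances linked by a dependence must run in the same slice.
    for src, dst in dependences:
        parent[find(src)] = find(dst)

    groups = {}
    for p in instances:
        groups.setdefault(find(p), []).append(p)
    # Step 1: the representative is the lexicographic minimum of a slice.
    # Step 2: the slice is rebuilt as all instances connected to it.
    return {min(group): sorted(group) for group in groups.values()}


# Hypothetical loop nest: for i in 0..2, for j in 1..3: a[i][j] = a[i][j-1]
instances = [(i, j) for i in range(3) for j in range(4)]
dependences = [((i, j - 1), (i, j)) for i in range(3) for j in range(1, 4)]
slices = slices_and_representatives(instances, dependences)
```

In this example every row `i` forms its own slice, so the three slices can be executed by separate threads without any synchronization between them.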

12

Obliczenie potęgi k znormalizowanej afinicznej relacji

Bielecki W., Klimek T., Trifunovic K.

Metody Informatyki Stosowanej

|

2008

|

nr 2 (Tom 15)

5-20

PL

The article considers perfectly nested affine loops, in which the lower and upper loop bounds, as well as references to arrays and conditional statements, are specified by affine functions whose arguments are the indices of the enclosing loops and, optionally, structural parameters.

EN

An approach to calculating the power k of a normalized affine relation is presented. A way to normalize an arbitrary affine relation is discussed. The approach is illustrated by an example. It is clarified how to calculate the positive transitive closure and the transitive closure of a relation on the basis of the power k of the relation. Results of experiments are discussed. It is demonstrated how the calculated power k of a relation can be used for extracting both coarse- and fine-grained parallelism available in program loops. Future research is outlined.
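
The link between the power k of a relation and its transitive closure can be shown on a finite relation stored as a set of pairs. The paper derives a symbolic, parameterized formula for the power k; the explicit enumeration below, the function names and the example relation are assumptions for illustration. The positive transitive closure is the union of all powers with k ≥ 1, and the transitive closure additionally contains the identity relation.

```python
def compose(r, s):
    """Return {(a, c) : (a, b) in r and (b, c) in s}."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def power(r, k):
    """Return the relation r composed with itself k times (k >= 1)."""
    result = set(r)
    for _ in range(k - 1):
        result = compose(result, r)
    return result


def positive_transitive_closure(r):
    """Union of r^1, r^2, ... until a power adds no new pairs."""
    closure = set()
    current = set(r)
    while current and not current <= closure:
        closure |= current
        current = compose(current, r)
    return closure


def transitive_closure(r, domain):
    """Positive transitive closure plus the identity on `domain`."""
    return positive_transitive_closure(r) | {(x, x) for x in domain}


# Hypothetical dependence relation {i -> i + 2 : 0 <= i <= 7}
relation = {(i, i + 2) for i in range(8)}
```

For this relation the power k is `{i -> i + 2k}` restricted to the pairs that stay inside the original range, which is the kind of closed form the symbolic approach aims to produce.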

13

Hermite spline interpolation on patches for parallelly solving the Vlasov-Poisson equation

Crouseilles N., Latu G., Sonnendrücker E.

International Journal of Applied Mathematics and Computer Science

|

2007

|

Vol. 17, no 3

335-349

EN

This work is devoted to the numerical simulation of the Vlasov equation using a phase space grid. In contrast to Particle-In-Cell (PIC) methods, which are known to be noisy, we propose a semi-Lagrangian-type method to discretize the Vlasov equation in the two-dimensional phase space. As this kind of method requires a huge computational effort, one has to carry out the simulations on parallel machines. For this purpose, we present a method using patches decomposing the phase domain, each patch being devoted to a processor. Some Hermite boundary conditions allow for the reconstruction of a good approximation of the global solution. Several numerical results demonstrate the accuracy and the good scalability of the method with up to 64 processors. This work is a part of the CALVI project.
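
The idea that a patch can be interpolated locally once values and derivatives are known at its boundaries can be sketched in one dimension with a single cubic Hermite polynomial per patch. This is a deliberate simplification and an assumption: the paper uses Hermite boundary conditions for splines on patches of a two-dimensional phase space, and the function names and patch layout below are hypothetical.

```python
def hermite_cubic(x, x0, x1, f0, f1, d0, d1):
    """Cubic matching values f0, f1 and derivatives d0, d1 at x0, x1."""
    h = x1 - x0
    t = (x - x0) / h
    h00 = 2 * t**3 - 3 * t**2 + 1   # weight of f0
    h10 = t**3 - 2 * t**2 + t       # weight of h * d0
    h01 = -2 * t**3 + 3 * t**2      # weight of f1
    h11 = t**3 - t**2               # weight of h * d1
    return h00 * f0 + h10 * h * d0 + h01 * f1 + h11 * h * d1


def interpolate_on_patches(x, patch_edges, f, df):
    """Evaluate the interpolant using only data of the patch containing x."""
    for x0, x1 in zip(patch_edges, patch_edges[1:]):
        if x0 <= x <= x1:
            # Only boundary values and derivatives of this patch are needed,
            # so each patch could be handled by a different processor.
            return hermite_cubic(x, x0, x1, f(x0), f(x1), df(x0), df(x1))
    raise ValueError("x lies outside all patches")
```

Because each patch uses only its own boundary data, no information from the interior of neighboring patches is required during interpolation, which is what makes the decomposition attractive for parallel machines.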

14

Minimalizacja komunikacji pomiędzy subpopulacjami oraz wybór populacji początkowej dla lokalnych algorytmów genetycznych

Jasińska-Suwada A., Dzwinel W., Łagudza M.

Archiwum Informatyki Teoretycznej i Stosowanej

|

2000

|

T. 12, z. 1

17-36

PL

In the article we discuss the problems of minimizing communication between subpopulations and of selecting a suitable initial population in local genetic algorithms, in which the elements of the population are placed on a two-dimensional grid and the crossover operation is allowed only between neighboring individuals. These algorithms make it possible to solve deceptive problems with widely spanned schemata that are prone to disruption. The population was divided into two parts by a barrier with a window that allows crossover between elements belonging to different subpopulations. Tests showed that the width of the window has little effect on the efficiency of the search for the extremum. This leads to the conclusion that, for the described algorithm, communication between subpopulations can be reduced considerably. Tests are presented showing that specific ways of preparing the initial population can significantly affect the results obtained. The "checkerboard" distribution and the bilateral distribution make it possible to find the solution for some problems much more efficiently than a random initial population.

EN

In this paper we deal with minimizing the communication between subpopulations and selecting the initial population for local genetic algorithms (LGA). In an LGA, individuals are located at the nodes of a two-dimensional grid, and the crossover operation is allowed only between neighbors. This type of algorithm is suitable for multidimensional, multimodal and deceptive problems with widely spanned schemata, which can be easily disrupted by the crossover operation. To minimize the communication between subpopulations in a parallel implementation of LGA, the initial population was divided into two parts by a barrier. A "window" in the barrier allows for the transfer of individuals between the subpopulations. The tests show that for a narrow window width (25%) the algorithm efficiency is approximately the same as for an "open" subpopulation, while giving the lowest communication rate. The tests also show that particular types of initial population affect the efficiency of LGA. The checkerboard and bilateral patterns enable one to find the solution more efficiently than the random population.