Search results - BazTech

1

Effectiveness of Fast Fourier Transform implementations on GPU and CPU

Puchała D., Stokfiszewski K., Szczepaniak B., Yatsymirskyy M.

Przegląd Elektrotechniczny | 2016 | R. 92, nr 7 | 69--71

EN

In this paper, we compare the effectiveness of selected variants of radix-2 Fast Fourier Transform (FFT) algorithms implemented on both Graphics Processing Units (GPUs) and Central Processing Units (CPUs). The algorithms considered differ in memory consumption and in the arrangement of their data-flow paths, which affects global memory coalescing and cache exploitation. The results make it possible to identify the FFT variants best suited to GPU and CPU architectures, to confirm that computing the FFT on GPUs is worthwhile, and to formulate a guideline for implementing fast algorithms for various linear transforms.

XX

This paper presents the results of a comparison of the effectiveness of selected variants of radix-2 fast Fourier transform (FFT) algorithms implemented both for graphics processors (GPU) and for typical central processing units (CPU). The algorithms considered differ in their memory requirements and in the form of their data-flow graphs, which affect how coherently the global memory and the cache memory of GPUs and CPUs are used. The results make it possible to identify the FFT variants best suited to GPU and CPU architectures, to confirm that GPU-oriented FFT implementations are worthwhile, and to formulate general guidelines for GPU-oriented implementations of fast linear transform algorithms.
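
For readers unfamiliar with the radix-2 FFT named in the abstract above, the following is a minimal Python sketch of one textbook variant (iterative, in-place, decimation-in-time). It is an illustration only and is not necessarily one of the variants evaluated in the paper; the function name `fft_radix2` is hypothetical. The point it shows is that the distance between the two elements of each butterfly doubles from stage to stage, which is the kind of data-flow arrangement that influences memory coalescing and cache use.

```python
import cmath


def fft_radix2(x):
    """Iterative in-place radix-2 decimation-in-time FFT (illustrative sketch).

    The input length must be a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = [complex(v) for v in x]

    # Bit-reversal permutation: reorder the input so that the butterfly
    # stages below can run in place and produce output in natural order.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    # log2(n) butterfly stages; the partner distance `half` doubles each stage.
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)  # twiddle factor increment
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                t = w * a[start + k + half]
                a[start + k] = u + t
                a[start + k + half] = u - t
                w *= w_step
        size *= 2
    return a
```

The sketch needs no extra memory beyond a copy of the input, at the cost of strided accesses in the later stages. Variants that keep a fixed access pattern across stages typically trade this for a second buffer.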

2

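Symulacja metody kwantowych trajektorii dla problemów optyki kwantowej oraz informatyki kwantowej [Simulation of the quantum trajectories method for problems in quantum optics and quantum informatics]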

Wiśniewska J., Sawerwain M.

Symulacja w Badaniach i Rozwoju | 2015 | Vol. 6, nr 1 | 67--75

PL

The article presents a parallel implementation of a variant of the Monte Carlo method for simulating the dynamics of open quantum systems, the so-called quantum trajectories method (QTM). The implementation was built with CUDA technology and covers the numerical procedure that carries out the QTM algorithm. The article also shows the performance of the resulting numerical methods for QTM relative to other known implementations.

EN

The chapter presents a parallel implementation of a Monte Carlo method for simulating the dynamics of open quantum systems, known as the Quantum Trajectories Method (QTM). The implementation uses CUDA technology and is based on a numerical procedure that realizes the QTM algorithm. The chapter also compares the performance of the developed numerical methods with that of other existing implementations.

3

Lattice structure for parallel calculation of orthogonal wavelet transform on GPUs with CUDA architecture

Puchała D., Szczepaniak B., Yatsymirskyy M.

Przegląd Elektrotechniczny | 2015 | R. 91, nr 7 | 52--54

EN

In this paper, we present a novel approach to calculating the discrete wavelet transform (DWT) on modern Graphics Processing Units (GPUs) with CUDA architecture, which takes advantage of a highly parallel lattice structure. Experimental results obtained for model signals show that the proposed approach is several times faster than sequential calculations carried out on the CPU, even when the time required for data transfers is counted in addition to the calculation time.

XX

The article proposes a new approach to computing the discrete wavelet transform (DWT) on modern graphics processors (GPUs) with CUDA architecture, based on highly parallel lattice structures. Experiments on model signals show that the proposed approach yields a several-fold speed-up over sequential computation on the CPU, taking into account not only the computation time but also the data transfer times.

4

Optimization and application of GPU calculations in material science

Korpała G., Kawalla R.

Computer Methods in Materials Science | 2015 | Vol. 15, No. 1 | 185--191

EN

Modern Graphics Processing Units (GPUs), combined with very fast Video Random Access Memory (VRAM), provide very high computational performance and outrun the conventional combination of a Central Processing Unit (CPU) and Random Access Memory (RAM) in parallel computing. This work presents a concept for using the CPU and GPU in parallel to process and manage large amounts of data. The computer algebra system (CAS) Wolfram Mathematica is used for the numerical calculation of a large Finite Difference Model (FDM). The CUDA-link feature of Mathematica was used to set up a parallel working environment in which computations run on the available CPUs and on Nvidia GPUs at the same time. An advanced desktop computer system was set up that combines a high-end desktop CPU with four TITAN GK110 Kepler GPUs from Nvidia. It is shown that the calculation time can be reduced by using shared memory and by optimizing the block and/or register size to minimize data communication between GPU and VRAM. Results for diffusion, the stress field, and the deformation field are shown for a deformation sample calculated numerically from crystal plasticity, with over four million FDM elements calculated by each of the four graphics cards. It is clearly shown that the overall calculation time depends strongly on the time needed to store the data, both for the final result and for the intermediate results of the individual numerical increments. Nevertheless, a promising application of parallel computing to research in materials science is presented and investigated, showing the possibilities for new approaches and/or more detailed calculations in a reasonable time.

PL

Modern graphics processors (GPUs), combined with very fast VRAM, are a high-performance computational tool that, for parallel computing, considerably outpaces the conventional central processing unit (CPU) with RAM. The paper presents a concept for an application that performs parallel computations on CPU/GPU processors and can process and manage large amounts of data. The CAS computing environment Wolfram Mathematica was used to solve large models with the finite difference method (FDM), and Mathematica's CUDA-link functionality was used to parallelize computations on CPUs and Nvidia GPUs simultaneously. On this basis, an advanced computer system was developed that allows computation on a CPU combined with four TITAN GK110 Kepler GPUs from Nvidia. It is shown that the computation time was reduced by using shared memory and by optimizing the block or register size in order to minimize data transfer between GPU and VRAM. Results are presented for diffusion, the stress field, and the strain field of a deformed example sample, obtained from a crystal plasticity model with over four million FDM elements, for which the computations were carried out on four graphics cards. The computations clearly showed that the total computation time depends strongly on the memory access time for the data, both for obtaining the final results and for the intermediate results of the individual time steps. Nevertheless, the paper presents promising results on the use of parallel computing in materials engineering, showing the possibilities for using new methods and more accurate computations in an acceptable time.

5

Using shared memory as a cache in cellular automata water flow simulations on GPUs

Topa P., Młocek P.

Computer Science | 2013 | Vol. 14 (3) | 385--401

EN

Graphics Processing Units (GPUs) have recently attracted a lot of interest as an efficient platform for general-purpose computation. The Cellular Automata approach, which is inherently parallel, makes it possible to implement high-performance simulations. This paper shows how shared memory on a GPU can be used to improve the performance of Cellular Automata models. In our previous work, we proposed algorithms for a Cellular Automata model that use only GPU global memory. Using a profiling tool, we found bottlenecks in that approach. In this paper, we introduce modifications that take advantage of fast shared memory. The modified algorithm is presented in detail, and the results of profiling and performance tests are reported. Our distinctive contribution is a comparison of the efficiency of the same algorithm working with global memory and with shared memory.