Search results - BazTech

1

CUDA accelerated Medical Segmentation metrics with MedEval 3D

Mitura Jakub, Chrapko Beata E.

Zeszyty Naukowe Warszawskiej Wyższej Szkoły Informatyki | 2022 | nr 26 | 7--19

EN

Medical segmentation metrics are crucial for the development of correct segmentation algorithms in the medical imaging domain. For large three-dimensional arrays representing studies such as CT, PET/CT or MRI, the availability of a library implementing high-performance metrics is of critical importance. MedEval3D was created to fulfill this need through CUDA acceleration. Most of the implemented metrics, such as the Dice coefficient and the Jaccard coefficient, are based on the confusion matrix, which enables effective reuse of calculations across multiple metrics and improves performance in such use cases. Additionally, algorithms such as interclass correlation and Mahalanobis distance are also introduced. In both cases their implementations are significantly faster than their counterparts from other available libraries. Lastly, a programming interface to all of the metrics was created in the Julia programming language.
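The reuse of confusion-matrix counts across metrics, as described in the abstract above, can be illustrated with a minimal sketch. This is plain Python written for illustration only; it is not the MedEval3D API (which the abstract says is exposed in Julia), and the function names `confusion_counts`, `dice` and `jaccard` are assumptions introduced here.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two equal-length binary masks (flattened voxels)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def dice(tp, fp, fn):
    """Dice coefficient from confusion counts; two empty masks count as a perfect match."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def jaccard(tp, fp, fn):
    """Jaccard coefficient from the same confusion counts."""
    denom = tp + fp + fn
    return tp / denom if denom else 1.0


# Hypothetical toy masks: the counts are computed once and shared by both metrics.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(pred, truth)
dice_value = dice(tp, fp, fn)
jaccard_value = jaccard(tp, fp, fn)
```

The single pass over the voxels is the expensive step; each further metric derived from the four counts costs only a few arithmetic operations.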

2

Implementation of numerical integration to high-order elements on the GPUs

Krużel Filip, Banaś Krzysztof, Nytko Mateusz

Computer Assisted Methods in Engineering and Science | 2020 | Vol. 27, no. 1 | 3--26

EN

This article presents ways to implement a resource-consuming algorithm on hardware with a limited amount of memory, which is the GPU. Numerical integration for higher-order finite element approximation was chosen as an example algorithm. To perform computational tests, we use a non-linear geometric element and solve the convection-diffusion-reaction problem. For calculations, a Tesla K20m graphics card based on Kepler architecture and Radeon R9 280X based on Tahiti XT architecture were used. The results of computational experiments were compared with the theoretical performance of both GPUs, which allowed an assessment of actual performance. Our research gives suggestions for choosing the optimal design of algorithms as well as the right hardware for such a resource-demanding task.

3

Cooperation of CUDA and Intel multi-core architecture in the independent component analysis algorithm for EEG data

Gajos-Balińska Anna, Wójcik Grzegorz M., Stpiczyński Przemysław

Bio-Algorithms and Med-Systems | 2020 | Vol. 16, no. 3 | art. no. 20200044

EN

Objectives: The electroencephalographic signal is highly susceptible to external disturbances. Therefore, an important element of its processing is thorough cleaning. Methods: One of the common methods of signal improvement is independent component analysis (ICA). However, it is a computationally expensive algorithm, hence methods are needed to decrease its execution time. One of the ICA algorithms (fastICA) and parallel computing on the CPU and GPU were used to reduce the algorithm execution time. Results: This paper presents the results of a study on an implementation of fastICA that uses a multi-core architecture and the computational capabilities of the GPU. Conclusions: The use of such a hybrid approach shortens the execution time of the algorithm.

4

GPU-based parallel algorithm of interaction induced light scattering simulation in fluids

Dawid Aleksander

TASK Quarterly : scientific bulletin of Academic Computer Centre in Gdansk | 2019 | Vol. 23, No 1 | 5--17

EN

We parallelized the sequential algorithm of the four-body correlation function so that each combination of two pairs (i, j) and (k, l) was averaged over time in a separate calculation thread. The generator of pairs used as the input for this algorithm was also parallelized and connected with the 4-body correlation function calculations. We used our algorithm to accelerate extremely intensive calculations of the 4-body polarizability anisotropy correlation functions, which were very important to estimate the interaction-induced light scattering spectrum. The resulting C code was used to test our algorithm on Graphics Processing Units (GPUs) with the Compute Unified Device Architecture (CUDA) technology from NVIDIA® Corporation. As a result, we achieved a 12-fold acceleration of the 4-body correlation function calculations in comparison to the Central Processing Unit (CPU) core. The peak performance of the GPU calculations was registered at 19 times faster than the CPU core. We also found that the acceleration depended on memory consumption. In single precision mode, the relative error between the CPU and GPU calculations was found to be within 0.1%.

5

Wykorzystanie GPGPU do obliczeń ekspozycji ludności na narażenie pola elektrycznego

Wroński Jacek W., Rzeźniczak Krzysztof, Michalski Igor

Przegląd Telekomunikacyjny + Wiadomości Telekomunikacyjne | 2019 | nr 6 | 291--294, CD

PL

W niniejszym artykule przedstawiono metodę wykorzystania procesorów graficznych do obliczeń wartości poziomów niejonizujących pól elektromagnetycznych, pochodzących od systemów radiokomunikacyjnych, stanowiących potencjalne źródło narażeń ludności na pole elektromagnetyczne. Czasy obliczeń porównano z metodami wykorzystującymi przetwarzanie równoległe na procesorach CPU.

EN

This article presents a method of using GPGPU to compute the levels of non-ionizing electromagnetic fields (EMF) originating from radiocommunication systems, which are a potential source of human exposure to EMF. Computation times on the GPGPU are compared with those of parallel computations performed on the CPU.

6

Acceleration of Signal Processing Algorithms in Seekers Using Graphics Processing Units

Turek Piotr

Problemy Mechatroniki : uzbrojenie, lotnictwo, inżynieria bezpieczeństwa | 2019 | Vol. 10, Nr 1 (35) | 91--98

EN

The paper presents a discussion on the issue of possible acceleration of radiolocation signal processing algorithms in seekers using graphics processing units. A concept and implementation examples of algorithms performing digital data filtering on general purpose central and graphics processing units are introduced. The results of performance comparison of central and graphics processing units during computing discrete convolution are presented at the end of the paper.

PL

W artykule zamieszczono rozważania na temat możliwości akceleracji algorytmów przetwarzania sygnałów radiolokacyjnych w głowicach samonaprowadzania z wykorzystaniem procesorów graficznych. Przedstawiono koncepcję oraz przykłady implementacji algorytmów realizujących cyfrową filtrację na procesorach klasycznych oraz graficznych ogólnego przeznaczenia. Wyniki porównania wydajności centralnych i graficznych jednostek przetwarzania podczas obliczania dyskretnego splotu przedstawiono na końcu artykułu.
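The discrete convolution used as the benchmark in the abstract above can be sketched as follows. This is a minimal CPU reference in plain Python, not the paper's implementation; the function name `convolve` and the two-tap moving-average kernel are assumptions chosen for illustration.

```python
def convolve(signal, kernel):
    """Full discrete convolution: y[n] = sum over k of signal[k] * kernel[n - k]."""
    n_out = len(signal) + len(kernel) - 1
    out = [0.0] * n_out
    for n in range(n_out):
        # Each output sample depends only on the inputs, never on other outputs,
        # so a GPU version could plausibly assign one thread per output sample.
        acc = 0.0
        for k in range(len(signal)):
            j = n - k
            if 0 <= j < len(kernel):
                acc += signal[k] * kernel[j]
        out[n] = acc
    return out


# Hypothetical example: a two-tap moving-average FIR filter applied to a short signal.
moving_average = [0.5, 0.5]
filtered = convolve([1.0, 2.0, 3.0], moving_average)
```

Because the outer loop iterations are independent, digital filtering of this kind is a natural candidate for the data-parallel execution that the paper compares against the CPU.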

7

Multi-core and many-core SPMD parallel algorithms for construction of basins of attraction

Silveira Marcos, Gonçalves Paulo J.P., Balthazar José M.

Journal of Theoretical and Applied Mechanics | 2019 | Vol. 57 nr 4 | 1067--1079

EN

Construction of basins of attraction, used for the analysis of nonlinear dynamical systems which present multistability, is computationally very expensive. Because of the long runtime needed, in many cases the construction of basins does not have any practical use. Numerical time integration is currently the bottleneck of algorithms used for the construction of such basins. The integrations related to each set of initial conditions are independent of each other. The assignment of each integration to a separate thread therefore seems very attractive, and parallel algorithms which use this approach to construct the basins are presented here. Two versions are considered, one for multi-core and another for many-core architectures, both based on an SPMD approach. The algorithm is tested on three systems: the classic nonlinear Duffing system, a non-ideal system exhibiting the Sommerfeld effect, and an immunodynamic system. The results for all examples demonstrate the versatility of the proposed parallel algorithm, showing that the multi-core parallel algorithm using MPI has nearly ideal speedup and efficiency.
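The independence of the integrations noted in the abstract above can be sketched with a toy example. The sketch assumes an unforced, damped double-well Duffing oscillator with a damping value of 0.2, a fixed-step RK4 integrator, and a coarse grid of initial conditions; all of these choices, and the function names, are illustrative assumptions rather than the systems or parameters used in the paper.

```python
def duffing_rhs(x, v, damping=0.2):
    """Unforced double-well Duffing oscillator: x'' + damping*x' - x + x**3 = 0."""
    return v, -damping * v + x - x ** 3


def final_state(ic, dt=0.01, steps=5000):
    """Integrate one initial condition (x0, v0) with classical fixed-step RK4."""
    x, v = ic
    for _ in range(steps):
        k1x, k1v = duffing_rhs(x, v)
        k2x, k2v = duffing_rhs(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v)
        k3x, k3v = duffing_rhs(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v)
        k4x, k4v = duffing_rhs(x + dt * k3x, v + dt * k3v)
        x += dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
        v += dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
    return x, v


def basin_label(ic):
    """Label an initial condition by the well it ends in: +1 (x > 0) or -1."""
    x, _ = final_state(ic)
    return 1 if x > 0 else -1


# Each grid point is integrated independently of all others, so this map is the
# part that a multi-core or many-core version would distribute across workers.
grid = [(x0, v0) for x0 in (-1.5, -0.5, 0.5, 1.5) for v0 in (-0.5, 0.0, 0.5)]
labels = list(map(basin_label, grid))
```

Since `basin_label` takes one initial condition and shares no state, the built-in `map` could be swapped for a parallel map such as `multiprocessing.Pool.map` without changing the function itself.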

8

Algorytm SLAM z pominięciem czujników odometrycznych w kontekście różnych typów procesorów obliczeniowych

Fiedeń Mateusz, Miotk Michał, Dąbek Przemysław, Muraszkowski Artur

Interdisciplinary Journal of Engineering Sciences | 2019 | Vol. 7, no. 1 | 28--37

PL

SLAM jest to algorytm równoczesnego mapowania otoczenia i lokalizowania się na tworzonej mapie. Wykorzystywany jest w robotach autonomicznych przeznaczonych do pracy w nieznanym bądź dynamicznie zmieniającym się otoczeniu. W swojej podstawowej formie wykorzystuje czujnik odległości, taki jak lidar bądź radar oraz dane o przesunięciu pozyskiwane z enkoderów. Dzięki zastosowaniu odpowiednich strategii dodawania kolejnych skanów oraz filtracji pobieranych danych uzyskuje się dokładne mapy, jednak użycie enkoderów, nie zawsze jest możliwe. W artykule poruszony zostaje temat pozycjonowania i mapowania przy użyciu lidaru bez wykorzystywania dodatkowych czujników zapewniających dane odometryczne. Zaproponowany zostaje odpowiedni algorytm oraz dyskusja dotycząca zastosowanych procesorów obliczeniowych, na których jest uruchamiany (wyłącznie CPU oraz z wykorzystaniem GPU wspierającego technologię CUDA). Zaprezentowane są wyniki w formie wykresów zależności czasu od iteracji, uzyskanych chmur punktów, a także parametrów sprzętowych obserwowanych w trakcie działania algorytmu.

EN

SLAM stands for simultaneous localization and mapping. It is used in the construction of autonomous robots designed to work in topographically unknown areas or dynamically changing environments. In its simplest form it uses a distance sensor, for example a lidar, and displacement data obtained from encoders. With appropriate strategies for adding successive scans and filtering the acquired data, it can create accurate maps with minimal computing power. However, the use of encoders is not always possible, as in the case of boats, legged robots or drones. To solve this problem, an algorithm is proposed that allows localization and mapping in the described situation, together with a discussion of the types of processors used by the program. Because of the specifics of the task, many simultaneously obtained measurements must be matched with the map being created. For this purpose, the differences were examined between a version of the algorithm that uses only the CPU, spreading the task across different processor threads, and a version that uses GPU acceleration, performing calculations on many parallel CUDA cores. Both implementations were tested in a corridor inside a building, with results in the form of charts comparing the time needed to complete individual iterations.

9

Handling Non-determinism in Spiking Neural P Systems : Algorithms and Simulations

Carandang Jym Paul, Cabarle Francis George C., Adorna Henry Natividad, Hernandez Nestine Hope S., Martínez-del-Amor Miguel Ángel

Fundamenta Informaticae | 2019 | Vol. 164, nr 2-3 | 139--155

EN

The Spiking Neural P system is a computing model inspired by how the neurons in a living being are interconnected and exchange information. As a model in membrane computing, it is a non-deterministic and massively parallel system. The latter makes the GPU a good candidate for accelerating the simulation of these models. A matrix representation for systems with and without delay has been previously designed, and algorithms for simulating them with deterministic systems were also developed. So far, non-determinism has been problematic for the design of parallel simulators. In this work, an algorithm for simulating non-deterministic spiking neural P systems with delays is presented. In order to study how the simulations are accelerated on a GPU, this algorithm was implemented in CUDA and used to simulate non-uniform and uniform solutions to the Subset Sum problem as a case study. The analysis is completed with a comparison of the time and space resources used on the GPU by such simulations.

10

Równoległa realizacja przykładowego algorytmu genetycznego z wykorzystaniem akceleratorów GPU

Ratuszniak P., Stasiak A., Łańcucki R.

Zeszyty Naukowe Wydziału Elektroniki i Informatyki Politechniki Koszalińskiej | 2018 | Nr 13 | 63--78

PL

W artykule zaprezentowano praktyczną implementację aplikacji rozwiązującej przykładowy algorytm genetyczny z wykorzystaniem akceleratorów GPU. W tym przypadku zdecydowano się na rozwiązanie za pomocą algorytmu genetycznego typowego problemu optymalizacyjnego, jakim jest problem komiwojażera. Dodatkowo w celu wykorzystania mocy karty graficznej w tworzonej aplikacji wykorzystano technologię programowania na karcie graficznej – technologię Nvidia CUDA.

EN

The paper presents a practical implementation of a local desktop application that runs an example genetic algorithm with the use of GPU accelerators. In this case, a genetic algorithm was used to solve a typical optimization problem, the travelling salesman problem. Additionally, Nvidia CUDA programming technology was used to harness the power of the GPU in the application.

11

Scalable Method of Searching for Full-period Nonlinear Feedback Shift Registers with GPGPU : New List of Maximum Period NLFSRs

Augustynowicz P., Kanciak K.

International Journal of Electronics and Telecommunications | 2018 | Vol. 64, No. 2 | 167--171

EN

This paper addresses the problem of efficiently searching for Nonlinear Feedback Shift Registers (NLFSRs) with a guaranteed full period. The maximum possible period for an n-bit NLFSR is 2ⁿ - 1 (the all-zero state is omitted). A multi-stage hybrid algorithm which utilizes the power of Graphics Processing Units (GPUs) was developed for data-parallel throughput computation. Use of the abovementioned algorithm yields an extended list of n-bit NLFSRs with maximum period for 7 cryptographically applicable types of feedback functions.
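The full-period condition stated in the abstract above (period 2ⁿ - 1 with the all-zero state omitted) can be checked by brute force for very small registers. The sketch below is only such a check; it is not the multi-stage hybrid GPU search from the paper. The Fibonacci-style register layout, the function names and both 3-bit feedback functions are assumptions chosen for illustration.

```python
def register_period(feedback, n, start=1):
    """Steps until an n-bit Fibonacci-style shift register returns to `start`.

    Bit i of the integer state holds x_i; each step drops x_0, shifts the rest
    down, and places feedback(x_0..x_{n-1}) in the top position.
    """
    mask = (1 << n) - 1
    state = start
    for step in range(1, (1 << n) + 1):
        bits = [(state >> i) & 1 for i in range(n)]
        new_bit = feedback(bits) & 1
        state = ((state >> 1) | (new_bit << (n - 1))) & mask
        if state == start:
            return step
    return None  # `start` is not on a cycle (the feedback is not invertible)


def has_full_period(feedback, n):
    """True when the cycle through state 1 visits all 2**n - 1 nonzero states."""
    return register_period(feedback, n) == (1 << n) - 1


def linear_feedback(bits):
    # Hypothetical linear special case: x0 XOR x1.
    return bits[0] ^ bits[1]


def nonlinear_feedback(bits):
    # Hypothetical nonlinear candidate: x0 XOR (x1 AND NOT x2).
    return bits[0] ^ (bits[1] & (1 ^ bits[2]))


linear_period = register_period(linear_feedback, 3)
nonlinear_period = register_period(nonlinear_feedback, 3)
```

In this toy case the linear feedback reaches the maximum period of 7, while the nonlinear candidate cycles after 6 steps and would be rejected. Every candidate feedback function can be tested independently of the others, which is one reason a search of this kind lends itself to data-parallel hardware.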

12

Robust and efficient finite-difference-time-domain modelling of the propagation of nonlinear elastic waves

Pandala A., Shivaprasad S., Krishnamurthy C. V., Balasubramaniam K.

Badania Nieniszczące i Diagnostyka | 2018 | nr 2 | 11--21

EN

A robust finite-difference-time-domain (FDTD) scheme to model non-linear elastic wave propagation in a homogeneous isotropic material is presented. A formulation based on a rotated staggered grid scheme in a displacement-velocity-stress configuration incorporating both geometric and material nonlinearities is proposed. By adopting a parsimonious algorithm, the computational memory requirement is reduced by 50%. Simulations are accelerated by exploiting the massive data parallelism innate to the FDTD approach using parallel computation on Graphical Processing Units with NVIDIA CUDA's API. For the proposed numerical scheme, the grid convergence criterion and accuracy over propagating distances are investigated. The study is also extended to determine the contribution from geometric and material models at various input amplitude levels. The time and frequency domain signals obtained from the proposed scheme are verified with a commercial finite element solver. The simulation runtime for an aluminium sample of dimensions 20 mm x 10 mm using a 5 MHz pulse is of the order of one minute, which makes the proposed numerical scheme attractive for modelling nonlinear elastic waves in large domains.

PL

W artykule przedstawiono odporny schemat metody różnic skończonych w dziedzinie czasu (FDTD) do modelowania propagacji nieliniowych fal sprężystych w jednorodnym materiale izotropowym. Zaproponowano podejście oparte na rotowanych siatkach przestawnych w układzie przemieszczenie-prędkość-naprężenie obejmującym zarówno nieliniowość geometryczną, jak i materiałową. Zastosowanie algorytmu redukcji oszczędnej, zmniejszyło zapotrzebowanie na pamięć obliczeniową o 50%. Symulacje są przyspieszane przez wykorzystanie olbrzymiego paralelizmu danych wbudowanego w podejście FDTD z wykorzystaniem obliczeń równoległych na jednostkach przetwarzania graficznego (GPU) wyposażonych w interfejs API NVIDIA CUDA. Dla proponowanego schematu numerycznego badane jest kryterium zbieżności siatki i dokładność w funkcji odległości propagacji. Badanie rozszerzono również w celu określenia wkładu modeli geometrycznych i materiałowych na różnych poziomach amplitudy wejściowej. Sygnały w dziedzinie czasu i częstotliwości uzyskane z proponowanego schematu są weryfikowane za pomocą komercyjnego oprogramowania wykorzystującego metodę elementów skończonych. Czasy pracy dla symulacji propagacji impulsu o częstotliwości 5 MHz w próbce aluminium o wymiarach 20 mm x 10 mm są rzędu jednej minuty, co sprawia, że proponowany schemat liczbowy jest atrakcyjny dla modelowania nieliniowych fal sprężystych w dużych domenach.

13

Sequential Classification of Palm Gestures Based on A* Algorithm and MLP Neural Network for Quadrocopter Control

Wodziński M., Krzyżanowska A.

Metrology and Measurement Systems | 2017 | Vol. 24, nr 2 | 265--276

EN

This paper presents an alternative approach to sequential data classification, based on traditional machine learning algorithms (neural networks, principal component analysis, multivariate Gaussian anomaly detector) and on finding the shortest path in a directed acyclic graph using the A* algorithm with a regression-based heuristic. Palm gestures were used as an example of sequential data, and a quadrocopter was the controlled object. The study includes the creation of a conceptual model and the practical construction of a system that uses the GPU to ensure real-time operation. The results present the classification accuracy for the chosen gestures and a comparison of the computation time between the CPU- and GPU-based solutions.

14

Parallel RANSAC for point cloud registration

Koguciuk D.

Foundations of Computing and Decision Sciences | 2017 | Vol. 42, No. 3 | 204--217

EN

In this paper, the design and implementation of a parallel RANSAC algorithm in the CUDA architecture for point cloud registration are presented. First, a serial state-of-the-art method with several heuristic improvements from the literature over basic RANSAC is introduced. Subsequently, its algorithmic parallelization and CUDA implementation details are discussed. A comparative test demonstrated a significant acceleration of program execution. The result is that the local coordinate system of the object in the scene is found in near real-time conditions. The source code is shared on the Internet as a part of the Heuros system.
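The basic RANSAC loop that the abstract above takes as its starting point can be sketched on a deliberately simplified registration problem. The sketch assumes 2D points with known correspondences and a pure translation model, so a single correspondence is enough to form a hypothesis. It is not the Heuros implementation and includes none of the heuristic improvements mentioned in the paper; the function name, tolerance and data are illustrative assumptions.

```python
import random


def ransac_translation(src, dst, iterations=50, tol=0.1, seed=0):
    """Estimate the translation t with dst[i] ~ src[i] + t despite outlier pairs."""
    rng = random.Random(seed)
    best_t, best_inliers = None, -1
    for _ in range(iterations):
        # 1. Sample a minimal set: one correspondence defines a translation.
        i = rng.randrange(len(src))
        tx, ty = dst[i][0] - src[i][0], dst[i][1] - src[i][1]
        # 2. Score the hypothesis by counting correspondences it explains.
        inliers = sum(
            1
            for (sx, sy), (dx, dy) in zip(src, dst)
            if abs(sx + tx - dx) <= tol and abs(sy + ty - dy) <= tol
        )
        # 3. Keep the hypothesis with the largest consensus set.
        if inliers > best_inliers:
            best_t, best_inliers = (tx, ty), inliers
    return best_t, best_inliers


# Hypothetical data: five pairs related by the translation (2, -1), one outlier.
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 3.0)]
dst = [(2.0, -1.0), (3.0, -1.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (10.0, 10.0)]
best_t, best_inliers = ransac_translation(src, dst)
```

Each hypothesis is generated and scored without reference to the others, which is the property a parallel version could exploit by evaluating many hypotheses at once.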

15

Benchmark of 6D SLAM (6d simultaneous localization and mapping) algorithms with robotic mobile mapping systems

Będkowski J., Röhling T., Hoeller F., Shulz D., Schneider F. E.

Foundations of Computing and Decision Sciences | 2017 | Vol. 42, No. 3 | 276--295

EN

This work concerns the study of 6DSLAM algorithms applied to robotic mobile mapping systems. The architecture of the 6DSLAM algorithm is designed for the evaluation of different data registration strategies. The algorithm is built around an iterative registration component, so ICP (Iterative Closest Point), ICP (point to projection), ICP with semantic discrimination of points, LS3D (Least Square Surface Matching) or NDT (Normal Distribution Transform) can be chosen. Loop closing is based on LUM and LS3D. The main research goal was to investigate the semantic discrimination of measured points, which improves the accuracy of the final map, especially in demanding scenarios such as multi-level maps (e.g., climbing stairs). Nearest neighborhood search is implemented with parallel programming for the point to point, point to projection and semantic discrimination of points variants. The 6DSLAM framework is based on modified 3DTK and PCL open source libraries and parallel programming techniques using NVIDIA CUDA. The paper shows experiments that demonstrate the advantages of the proposed approach in relation to practical applications. The major added value of the presented research is the qualitative and quantitative evaluation based on realistic scenarios, including ground truth data obtained by geodetic survey. The research novelty from the mobile robotics perspective is the evaluation of the LS3D algorithm, which is well known in geodesy.

16

Porównanie metod obliczeń równoległych OpenMP i CUDA

Maj Michał

Zeszyty Naukowe WSEI. Seria Transport i Informatyka | 2015 | T. 5, nr 1 | 19--27

PL

Programowanie równoległe oznacza tworzenie programów w taki sposób, by można je było wykonywać równocześnie na wielu procesorach. Na potrzeby niniejszego artykułu napisane zostały dwa programy zrównoleglone – jeden w CUDA C oraz jeden w OpenMP, przeznaczony dla CPU – oraz jeden sekwencyjny (niewspółbieżny). Najszybszym sposobem zrównoleglania okazał się program napisany w CUDA, w którym wykorzystuje się pamięć niekopiowaną. Wadą CUDA jest to, że działa tylko ze sprzętem firmy NVIDIA.

EN

Parallel programming means developing programs that can be executed truly concurrently on multiprocessor platforms. For the purposes of this article, two parallel programs were developed: one in the CUDA C language and one using the OpenMP library. An equivalent sequential (non-parallel) program was also developed. The most efficient parallelization was achieved in the CUDA program with page-locked memory. A drawback of CUDA is that it is limited to NVIDIA hardware.

17

Implementacja metody momentów z wykorzystaniem kart graficznych i architektury CUDA

Topa T., Noga A.

Przegląd Elektrotechniczny | 2015 | R. 91, nr 3 | 34--40

PL

W artykule omówiono możliwości zastosowania kart graficznych do przyspieszania obliczeń numerycznych bazujących na metodzie momentów. Opisano algorytmy metody momentów implementowane w heterogenicznym środowisku CPU/GPU oraz przeprowadzono szczegółową analizę możliwych do uzyskania przyspieszeń dla różnych generacji architektury CUDA.

EN

The use of GPUs to accelerate numerical simulations based on the Method of Moments (MoM) is presented in this paper. The implementation of the MoM on a heterogeneous CPU/GPU platform and the measured speedups for three generations of the CUDA architecture are also demonstrated.

18

Effective biclustering on GPU - capabilities and constraints

Orzechowski P., Boryczko K.

Przegląd Elektrotechniczny | 2015 | R. 91, nr 8 | 131--134

PL

W artykule przedstawiono korzyści i ograniczenia związane z projektowaniem równoległego algorytmu biklasteryzacji, przeznaczonego na GPU. Zaprezentowano definicję biklasteryzacji oraz skrótowo opisano architekturę GPU. Zestawiono popularne wzorce strategii implementacji algorytmów, przydatne w projektowaniu efektywnych rozwiązań na GPU. Publikacja zawiera także praktyczne wskazówki programistyczne, w kontekście implementacji algorytmów biklasteryzacji w języku CUDA/OpenCL.

EN

This article presents the benefits and limitations related to designing a parallel biclustering algorithm on a GPU. A definition of biclustering is provided together with a brief description of the GPU architecture. We then review algorithm strategy patterns, which are helpful in providing efficient implementations on the GPU. Finally, we highlight programming aspects of implementing biclustering algorithms in the CUDA/OpenCL programming languages.

19

The Java profiler based on byte code analysis and instrumentation for many-core hardware accelerators

Pietroń M., Karwatowski M., Wiatr K.

Measurement Automation Monitoring | 2015 | Vol. 61, No. 7 | 385--387

EN

One of the most challenging issues with many-core and multi-core architectures is how to exploit their potential computing power in legacy systems without deep knowledge of their architecture. The analysis of static and dynamic data dependences of a program run can help to identify independent paths that could be computed by individual parallel threads. Statistics on data reuse and data size are also crucial in adapting an application to the GPU many-core hardware architecture because of its specific memory hierarchies. The proposed profiling system performs static data analysis and computes dynamic dependencies for Java programs, and recommends the parts of the source code with the highest potential for parallelization on the GPU. Such an analysis can also provide a starting point for automatic parallelization.

20

Simulating P Systems on GPU Device : A Survey

Martínez-del-Amor M. A., García-Quismondo M., Macías-Ramos L. F., Valencia-Cabrera L., Riscos-Núñez A., Pérez-Jiménez M. J.

Fundamenta Informaticae | 2015 | Vol. 136, nr 3 | 269--284

EN

P systems have been proven to be useful as modeling tools in many fields, such as Systems Biology and Ecological Modeling. For such applications, the acceleration of P system simulation is often desired, given the computational needs derived from these kinds of models. One promising solution is to implement the inherent parallelism of P systems on platforms with parallel architectures. In this respect, GPU computing has proved to be an alternative to more classic approaches in Parallel Computing. It provides a low-cost manycore platform with a high level of parallelism. The GPU has already been employed to speed up the simulation of P systems. In this paper, we survey the available parallel P system simulators on the GPU, with special emphasis on those included in the PMCGPU project, and analyze some useful guidelines for future implementations and developments.