Results found: 59
Search results
Searched for:
in keywords: CUDA
EN
Swarm intelligence algorithms are widely recognized for their efficiency in solving complex optimization problems. However, their scalability poses challenges, particularly with large problem instances. This study investigates the time performance of swarm intelligence algorithms by leveraging parallel computing on both central processing units (CPUs) and graphics processing units (GPUs). The focus is on optimizing algorithms designed for range search in Euclidean space to enhance GPU execution. Additionally, the study explores swarm-inspired solutions specifically tailored for GPU implementations, with an emphasis on improving efficiency in video rendering and computer simulations. The findings highlight the potential of GPU-accelerated swarm intelligence solutions to address scalability challenges in large-scale optimization.
2
Content available Accelerating the Clarke-Wright algorithm using GPUs
EN
The Capacitated Vehicle Routing Problem (CVRP) is a combinatorial optimization problem that seeks the optimal set of routes for a fleet of capacity-limited vehicles delivering goods to customers while minimizing the total cost. Because the problem is NP-hard, finding exact solutions for large-scale CVRP instances is computationally intractable. Therefore, heuristics and metaheuristics are widely employed to find near-optimal solutions. Among these, the Clarke-Wright (CW) algorithm is a popular greedy approach that constructs routes by iteratively merging nodes to minimize transportation costs. This study presents an implementation of the CW algorithm on graphics processing units (GPUs) using the CUDA (Compute Unified Device Architecture) framework. The GPU implementation is compared to its CPU counterpart in terms of execution time and performance. The results demonstrate significant speed-ups achieved by the GPU implementation, particularly for large-scale instances. The performance gains can be attributed to the parallel processing capabilities of GPUs, which enable efficient execution of the algorithm's computational steps.
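The savings-and-merge step that the abstract above describes can be sketched sequentially as follows. This is a minimal CPU illustration in plain Python, not the paper's CUDA implementation; the function name `clarke_wright`, its argument layout (depot at index 0) and the assumption of symmetric Euclidean distances are hypothetical choices made for the example.

```python
import math

def clarke_wright(coords, demands, capacity):
    """Savings heuristic sketch; coords[0] is the depot, demands[0] is ignored."""
    n = len(coords)

    def dist(a, b):
        return math.dist(coords[a], coords[b])

    # Start with one route per customer: depot -> i -> depot.
    routes = {i: [i] for i in range(1, n)}        # route id -> ordered customers
    route_of = {i: i for i in range(1, n)}        # customer -> route id
    load = {i: demands[i] for i in range(1, n)}   # route id -> total demand

    # Saving obtained by visiting i and j consecutively instead of separately.
    savings = sorted(
        ((dist(0, i) + dist(0, j) - dist(i, j), i, j)
         for i in range(1, n) for j in range(i + 1, n)),
        reverse=True,
    )

    for s, i, j in savings:
        ri, rj = route_of[i], route_of[j]
        if s <= 0 or ri == rj or load[ri] + load[rj] > capacity:
            continue
        a, b = routes[ri], routes[rj]
        # Only route endpoints can be joined.
        if i not in (a[0], a[-1]) or j not in (b[0], b[-1]):
            continue
        # Orient the routes so that a ends at i and b starts at j
        # (reversal is free because distances are assumed symmetric).
        if a[-1] != i:
            a = a[::-1]
        if b[0] != j:
            b = b[::-1]
        routes[ri] = a + b
        load[ri] += load[rj]
        for c in b:
            route_of[c] = ri
        del routes[rj], load[rj]

    return list(routes.values())
```

The savings list is the naturally data-parallel part, since each pair (i, j) is evaluated independently; the merge loop is inherently more sequential.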
EN
High-density electroencephalographic (EEG) systems are used to study the human brain and its underlying behaviors. However, working with EEG data requires a well-cleaned signal, which is often achieved with independent component analysis (ICA) methods. The computation time of such algorithms grows with the amount of data. This article presents a hybrid implementation of the fastICA algorithm that uses parallel programming techniques (libraries and extensions for Intel processors, and CUDA programming), which significantly reduces execution time on selected architectures.
EN
Medical segmentation metrics are crucial for the development of correct segmentation algorithms in the medical imaging domain. For large three-dimensional arrays representing studies such as CT, PET/CT or MRI, the availability of a library implementing high-performance metrics is of critical importance. MedEval3D was created to fulfill this need through CUDA acceleration. Most of the implemented metrics, such as the Dice coefficient and the Jaccard coefficient, are based on the confusion matrix, which enables effective reuse of calculations across multiple metrics and improves performance in such use cases. Additionally, algorithms such as interclass correlation and Mahalanobis distance are introduced. In both cases their implementations are significantly faster than their counterparts in other available libraries. Lastly, a programming interface to all of the metrics was created in the Julia programming language.
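The reuse of confusion-matrix counts across metrics mentioned above can be illustrated with a short sketch. It is plain Python over flat binary masks and is not the MedEval3D API; the function names `confusion_counts` and `overlap_metrics` are hypothetical, and returning 1.0 for two empty masks is an assumed convention.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN once for two equally long binary masks."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def overlap_metrics(pred, truth):
    """Derive several metrics from the same counts instead of rescanning."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    union = tp + fp + fn
    if union == 0:
        # Both masks are empty: treated as perfect agreement (assumption).
        return {"dice": 1.0, "jaccard": 1.0}
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / union,
    }
```

On a large 3D volume the single pass that produces the counts dominates the cost, so computing further metrics from the same four numbers is nearly free.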
EN
This article presents ways to implement a resource-consuming algorithm on hardware with a limited amount of memory, which is the GPU. Numerical integration for higher-order finite element approximation was chosen as an example algorithm. To perform computational tests, we use a non-linear geometric element and solve the convection-diffusion-reaction problem. For calculations, a Tesla K20m graphics card based on Kepler architecture and Radeon R9 280X based on Tahiti XT architecture were used. The results of computational experiments were compared with the theoretical performance of both GPUs, which allowed an assessment of actual performance. Our research gives suggestions for choosing the optimal design of algorithms as well as the right hardware for such a resource-demanding task.
EN
Objectives: The electroencephalographic signal is highly exposed to external disturbances. Therefore, thorough cleaning is an important element of its processing. Methods: One of the common methods of signal improvement is independent component analysis (ICA). However, it is a computationally expensive algorithm, so methods are needed to decrease its execution time. One of the ICA algorithms (fastICA) and parallel computing on the CPU and GPU were used to reduce the execution time. Results: This paper presents the results of a study on an implementation of fastICA that uses a multi-core architecture and the computational capabilities of the GPU. Conclusions: The use of such a hybrid approach shortens the execution time of the algorithm.
EN
We parallelized the sequential algorithm of the four-body correlation function such that each combination of two pairs (i, j) and (k, l) was averaged over time in a separate calculation thread. The generator of pairs used as the input for this algorithm was also parallelized and connected with the 4-body correlation function calculations. We used our algorithm to accelerate extremely intensive calculations of the 4-body polarizability anisotropy correlation functions, which are very important for estimating the interaction-induced light scattering spectrum. The resulting C code was used to test our algorithm on Graphics Processing Units (GPUs) with the Compute Unified Device Architecture (CUDA) technology from NVIDIA® Corporation. As a result, we achieved a 12-fold acceleration of the 4-body correlation function calculations in comparison to a Central Processing Unit (CPU) core. The peak performance of the GPU calculations was registered at 19 times faster than the CPU core. We also found that the acceleration depended on memory consumption. In single precision mode, the relative error between the CPU and GPU calculations was found to be within 0.1%.
PL
This article presents a method of using graphics processors to compute the levels of non-ionizing electromagnetic fields originating from radiocommunication systems, which are a potential source of public exposure to electromagnetic fields. Computation times were compared with methods that use parallel processing on CPUs.
EN
This article presents a method of using GPGPU to estimate the levels of human exposure to non-ionizing EMF originating from wireless systems. The calculation time on the GPGPU was compared with the time taken by parallel calculations performed on the CPU.
EN
The paper discusses the possible acceleration of radiolocation signal processing algorithms in seekers using graphics processing units. A concept and implementation examples of algorithms performing digital data filtering on general-purpose central and graphics processing units are introduced. The results of a performance comparison of central and graphics processing units when computing a discrete convolution are presented at the end of the paper.
PL
The article discusses the possibility of accelerating radar signal processing algorithms in homing seekers using graphics processors. A concept and example implementations of algorithms performing digital filtering on conventional processors and on general-purpose graphics processors are presented. The results of a performance comparison of central and graphics processing units when computing a discrete convolution are given at the end of the article.
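The discrete convolution used as the benchmark in the entry above can be written directly from its definition, y[n] = sum over k of h[k] * x[n - k]. The sketch below is a plain-Python reference and is not taken from the paper; the function name `convolve` is hypothetical. Each output sample depends only on the inputs, which is why the outer loop is a natural candidate for one GPU thread per sample.

```python
def convolve(signal, kernel):
    """Full discrete convolution of two finite sequences (direct form)."""
    if not signal or not kernel:
        return []
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for n in range(len(out)):
        # Every output sample is independent of the others.
        for k, h in enumerate(kernel):
            if 0 <= n - k < len(signal):
                out[n] += h * signal[n - k]
    return out
```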
EN
The construction of basins of attraction, used for the analysis of nonlinear dynamical systems that exhibit multistability, is computationally very expensive. Because of the long runtime needed, in many cases the construction of basins has no practical use. Numerical time integration is currently the bottleneck of algorithms used for the construction of such basins. The integrations related to each set of initial conditions are independent of each other. Assigning each integration to a separate thread therefore seems very attractive, and parallel algorithms that use this approach to construct the basins are presented here. Two versions are considered, one for multi-core and another for many-core architectures, both based on an SPMD approach. The algorithm is tested on three systems: the classic nonlinear Duffing system, a non-ideal system exhibiting the Sommerfeld effect, and an immunodynamic system. The results for all examples demonstrate the versatility of the proposed parallel algorithm, showing that the multi-core parallel algorithm using MPI has nearly ideal speedup and efficiency.
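The independence of the integrations noted above can be sketched with a toy example. Everything here is an assumption made for illustration, not the paper's setup: an unforced, damped double-well Duffing oscillator x'' + DELTA x' - x + x^3 = 0, a fixed-step RK4 integrator, and a thread pool standing in for the MPI and many-core versions. In CPython the threads show the decomposition only, not a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

DELTA = 0.5  # assumed damping coefficient (hypothetical value)

def accel(x, v):
    """Acceleration of the assumed oscillator: x'' = -DELTA*x' + x - x^3."""
    return -DELTA * v + x - x ** 3

def final_well(ic, dt=0.01, steps=5000):
    """Integrate one initial condition with RK4; return +1 or -1 for the well reached."""
    x, v = ic
    for _ in range(steps):
        k1x, k1v = v, accel(x, v)
        k2x, k2v = v + 0.5 * dt * k1v, accel(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v)
        k3x, k3v = v + 0.5 * dt * k2v, accel(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v)
        k4x, k4v = v + dt * k3v, accel(x + dt * k3x, v + dt * k3v)
        x += dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
        v += dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
    return 1 if x > 0 else -1

def basin_map(initial_conditions, workers=4):
    """Classify every initial condition independently (one task per point)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(final_well, initial_conditions))
```

Because no task reads another task's state, the same per-point function could be handed to processes, MPI ranks or GPU threads without changing its body.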
PL
SLAM is an algorithm for simultaneously mapping the surroundings and localizing within the map being built. It is used in autonomous robots intended to work in unknown or dynamically changing environments. In its basic form it uses a distance sensor, such as a lidar or radar, together with displacement data obtained from encoders. With appropriate strategies for adding successive scans and filtering the acquired data, accurate maps are obtained; however, the use of encoders is not always possible. The article addresses positioning and mapping using a lidar without additional sensors providing odometry data. A suitable algorithm is proposed, along with a discussion of the processors on which it is run (CPU only, and with a GPU supporting CUDA technology). Results are presented as plots of time versus iteration, the resulting point clouds, and hardware parameters observed while the algorithm was running.
EN
SLAM stands for simultaneous localization and mapping. It is used in the construction of autonomous robots designed to work in topographically unknown areas or dynamically changing environments. In its simplest form it uses a distance sensor, for example a lidar, and displacement data obtained from encoders. With appropriate strategies for adding successive scans and filtering the acquired data, it can create accurate maps with minimal computing power. However, the use of encoders is not always possible, as in the case of boats, legged robots or drones. To solve this problem, an algorithm is proposed that performs localization and mapping in such situations, together with a discussion of the type of processor used by the program. Because of the nature of the task, many simultaneously acquired measurements must be matched against the map being built. For this purpose, a version of the algorithm that uses only the CPU, spreading the task across processor threads, was compared with a version that uses GPU acceleration, performing the calculations on many parallel CUDA cores. Both implementations were tested in a corridor inside a building, with results presented as charts comparing the time needed to complete individual iterations.
12
Content available remote Handling Non-determinism in Spiking Neural P Systems : Algorithms and Simulations
EN
A Spiking Neural P system is a computing model inspired by how the neurons in a living being are interconnected and exchange information. As a model in membrane computing, it is a non-deterministic and massively parallel system. The latter property makes the GPU a good candidate for accelerating the simulation of these models. A matrix representation for systems with and without delay has been previously designed, and algorithms for simulating deterministic systems with it were also developed. So far, non-determinism has been problematic for the design of parallel simulators. In this work, an algorithm for simulating non-deterministic spiking neural P systems with delays is presented. In order to study how the simulations are accelerated on a GPU, this algorithm was implemented in CUDA and used to simulate non-uniform and uniform solutions to the Subset Sum problem as a case study. The analysis is completed with a comparison of the time and space resources such simulations use on the GPU.
PL
The article presents a practical implementation of an application that runs an example genetic algorithm using GPU accelerators. In this case, the genetic algorithm was used to solve a typical optimization problem, the travelling salesman problem. In addition, to exploit the power of the graphics card in the application, a GPU programming technology, Nvidia CUDA, was used.
EN
The paper presents a practical implementation of a local desktop application that runs an example genetic algorithm using GPU accelerators. In this case, the genetic algorithm was used to solve a typical optimization problem, the travelling salesman problem. In addition, Nvidia CUDA programming technology was used to exploit the power of the GPU in the application.
EN
This paper addresses the problem of efficiently searching for Nonlinear Feedback Shift Registers (NLFSRs) with a guaranteed full period. The maximum possible period for an n-bit NLFSR is 2ⁿ - 1 (the all-zero state is omitted). A multi-stage hybrid algorithm that uses the power of Graphics Processing Units (GPUs) was developed for processing data-parallel throughput computation. Using this algorithm made it possible to give an extended list of n-bit NLFSRs with maximum period for 7 cryptographically applicable types of feedback functions.
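The full-period condition stated above, a cycle of length 2ⁿ - 1 through all nonzero states, can be checked by brute force for small n. The sketch below is illustrative only and is not the paper's multi-stage GPU algorithm; the names `period` and `has_full_period`, the tuple state layout and the shift direction are assumptions. The 3-bit feedback x0 XOR x1 is a linear reference case, and x0 XOR (x1 AND x2) is a nonlinear candidate that fails the test.

```python
def period(feedback, n, start=None):
    """Length of the cycle through `start`, or None if start is not on a cycle.

    The state is a tuple of n bits; each step drops bit 0 and appends
    the feedback bit at the end (assumed shift convention).
    """
    start = start or (1,) + (0,) * (n - 1)
    state = start
    for step in range(1, 2 ** n + 1):
        state = state[1:] + (feedback(state) & 1,)
        if state == start:
            return step
    return None

def has_full_period(feedback, n):
    """True if the register cycles through all 2**n - 1 nonzero states."""
    return period(feedback, n) == 2 ** n - 1
```

Each candidate feedback function is tested independently of the others, which is what makes an exhaustive search of this kind data-parallel.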
EN
A robust finite-difference time-domain (FDTD) scheme to model non-linear elastic wave propagation in a homogeneous isotropic material is presented. A formulation based on a rotated staggered grid scheme in a displacement-velocity-stress configuration, incorporating both geometric and material nonlinearities, is proposed. By adopting a parsimonious algorithm, the computational memory requirement is reduced by 50%. Simulations are accelerated by exploiting the massive data parallelism innate to the FDTD approach, using parallel computation on Graphics Processing Units with the NVIDIA CUDA API. For the proposed numerical scheme, the grid convergence criterion and the accuracy over propagation distance are investigated. The study is also extended to determine the contribution of the geometric and material models at various input amplitude levels. The time and frequency domain signals obtained from the proposed scheme are verified against a commercial finite element solver. The simulation runtime for an aluminium sample of dimensions 20 mm x 10 mm using a 5 MHz pulse is of the order of one minute, which makes the proposed numerical scheme attractive for modelling nonlinear elastic waves in large domains.
PL
The article presents a robust finite-difference time-domain (FDTD) scheme for modelling the propagation of nonlinear elastic waves in a homogeneous isotropic material. An approach based on rotated staggered grids in a displacement-velocity-stress configuration, covering both geometric and material nonlinearity, is proposed. Applying a parsimonious algorithm reduced the computational memory requirement by 50%. Simulations are accelerated by exploiting the massive data parallelism inherent in the FDTD approach, using parallel computation on graphics processing units (GPUs) with the NVIDIA CUDA API. For the proposed numerical scheme, the grid convergence criterion and the accuracy as a function of propagation distance are investigated. The study was also extended to determine the contribution of the geometric and material models at various input amplitude levels. Time- and frequency-domain signals obtained from the proposed scheme are verified against commercial software based on the finite element method. Run times for simulating the propagation of a 5 MHz pulse in an aluminium sample measuring 20 mm x 10 mm are of the order of one minute, which makes the proposed numerical scheme attractive for modelling nonlinear elastic waves in large domains.
EN
This paper presents an alternative approach to sequential data classification, based on traditional machine learning algorithms (neural networks, principal component analysis, multivariate Gaussian anomaly detector) and on finding the shortest path in a directed acyclic graph using the A* algorithm with a regression-based heuristic. Palm gestures were used as an example of sequential data, and a quadrocopter was the controlled object. The study includes the creation of a conceptual model and the practical construction of a system that uses the GPU to ensure real-time operation. The results present the classification accuracy for the chosen gestures and a comparison of the computation time between the CPU- and GPU-based solutions.
17
Content available remote Parallel RANSAC for point cloud registration
EN
In this paper, the design and implementation of a parallel RANSAC algorithm on the CUDA architecture for point cloud registration are presented. First, a serial state-of-the-art method with several heuristic improvements from the literature over basic RANSAC is introduced. Subsequently, its algorithmic parallelization and CUDA implementation details are discussed. A comparative test showed a significant acceleration of program execution. The result is the local coordinate system of the object in the scene, found in near real-time conditions. The source code is shared on the Internet as a part of the Heuros system.
EN
This work concerns the study of 6DSLAM algorithms with application to robotic mobile mapping systems. The architecture of the 6DSLAM algorithm is designed for the evaluation of different data registration strategies. The algorithm includes an iterative registration component, for which ICP (Iterative Closest Point), ICP (point to projection), ICP with semantic discrimination of points, LS3D (Least Square Surface Matching) or NDT (Normal Distribution Transform) can be chosen. Loop closing is based on LUM and LS3D. The main research goal was to investigate the semantic discrimination of measured points, which improves the accuracy of the final map, especially in demanding scenarios such as multi-level maps (e.g., climbing stairs). Nearest-neighbor search variants (point to point, point to projection, semantic discrimination of points) are implemented with parallel programming. The 6DSLAM framework is based on modified 3DTK and PCL open source libraries and on parallel programming techniques using NVIDIA CUDA. The paper presents experiments that demonstrate the advantages of the proposed approach in practical applications. The major added value of the presented research is the qualitative and quantitative evaluation based on realistic scenarios, including ground-truth data obtained by geodetic survey. From the mobile robotics perspective, the research novelty is the evaluation of the LS3D algorithm, which is well known in geodesy.
19
Content available remote Comparison of the OpenMP and CUDA parallel computing methods (Porównanie metod obliczeń równoległych OpenMP i CUDA)
PL
Parallel programming means writing programs so that they can be executed simultaneously on multiple processors. For this article, two parallelized programs were written, one in CUDA C and one in OpenMP, intended for the CPU, along with one sequential (non-concurrent) program. The fastest parallelization turned out to be the program written in CUDA that uses non-copied (page-locked) memory. A drawback of CUDA is that it works only with NVIDIA hardware.
EN
Parallel programming means developing programs that can be executed truly concurrently on multiprocessor platforms. For the present tests, two parallel programs were developed, one in the CUDA C language and the other using the OpenMP library. An equivalent sequential (non-parallel) program was also developed. The most efficient parallelization was achieved in the CUDA program with page-locked memory. CUDA is handicapped by its limitation to NVIDIA hardware.
20
PL
The article discusses the possibilities of using graphics cards to accelerate numerical computations based on the method of moments. Method-of-moments algorithms implemented in a heterogeneous CPU/GPU environment are described, and a detailed analysis is given of the speedups achievable for different generations of the CUDA architecture.
EN
The use of GPUs to accelerate numerical simulations based on the Method of Moments (MoM) is presented in this paper. An implementation of the MoM on a heterogeneous CPU/GPU platform and the measured speedups for three generations of the CUDA architecture are also demonstrated.