Results found: 54

Search results
Searched for:
in keywords: CUDA
1
Implementation of numerical integration to high-order elements on the GPUs
EN
This article presents ways to implement a resource-consuming algorithm on hardware with a limited amount of memory, namely the GPU. Numerical integration for higher-order finite element approximation was chosen as an example algorithm. To perform computational tests, we use a non-linear geometric element and solve the convection-diffusion-reaction problem. For calculations, a Tesla K20m graphics card based on the Kepler architecture and a Radeon R9 280X based on the Tahiti XT architecture were used. The results of computational experiments were compared with the theoretical performance of both GPUs, which allowed an assessment of actual performance. Our research gives suggestions for choosing the optimal design of algorithms as well as the right hardware for such a resource-demanding task.
EN
Objectives: The electroencephalographic signal is highly exposed to external disturbances. Therefore, thorough cleaning is an important element of its processing. Methods: One of the common methods of signal improvement is independent component analysis (ICA). However, it is a computationally expensive algorithm, so methods are needed to decrease its execution time. One of the ICA algorithms (fastICA) and parallel computing on the CPU and GPU were used to reduce the execution time. Results: This paper presents the results of a study on an implementation of fastICA that uses a multi-core architecture and the computational capabilities of the GPU. Conclusions: The use of such a hybrid approach shortens the execution time of the algorithm.
EN
We parallelized the sequential algorithm for the four-body correlation function so that each combination of two pairs (i, j) and (k, l) is averaged over time in a separate calculation thread. The generator of pairs used as the input for this algorithm was also parallelized and connected with the 4-body correlation function calculations. We used our algorithm to accelerate extremely intensive calculations of the 4-body polarizability anisotropy correlation functions, which are very important for estimating the interaction-induced light scattering spectrum. The resulting C code was used to test our algorithm on Graphics Processing Units (GPUs) with the Compute Unified Device Architecture (CUDA) technology from NVIDIA® Corporation. As a result, we achieved a 12-fold acceleration of the 4-body correlation function calculations in comparison with a Central Processing Unit (CPU) core. The peak performance of the GPU calculations was registered at 19 times faster than the CPU core. We also found that the acceleration depended on memory consumption. In single precision mode, the relative error between the CPU and GPU calculations was found to be within 0.1%.
PL
This article presents a method of using graphics processors to calculate the levels of non-ionizing electromagnetic fields originating from radiocommunication systems, which are a potential source of public exposure to electromagnetic fields. The calculation times were compared with methods that use parallel processing on CPUs.
EN
This article presents a method of using GPGPU to estimate the levels of human exposure to non-ionizing EMF originating from wireless systems. The calculation time on the GPGPU is compared with the time taken by parallel calculations performed on the CPU.
EN
The construction of basins of attraction, used for the analysis of nonlinear dynamical systems that exhibit multistability, is computationally very expensive. Because of the long runtime needed, in many cases the construction of basins is of no practical use. Numerical time integration is currently the bottleneck of algorithms used for the construction of such basins. The integrations related to each set of initial conditions are independent of each other. The assignment of each integration to a separate thread seems very attractive, and parallel algorithms which use this approach to construct the basins are presented here. Two versions are considered, one for multi-core and another for many-core architectures, both based on an SPMD approach. The algorithm is tested on three systems: the classic nonlinear Duffing system, a non-ideal system exhibiting the Sommerfeld effect, and an immunodynamic system. The results for all examples demonstrate the versatility of the proposed parallel algorithm, showing that the multi-core parallel algorithm using MPI has nearly ideal speedup and efficiency.
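The independence of the integrations described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the unforced double-well Duffing equation, the damping value `delta`, the fixed-step RK4 integrator, and the names `duffing_rhs`, `basin_label` and `basin_grid` are all assumptions chosen for illustration. Each initial condition is integrated on its own and labelled by the well it settles in.

```python
def duffing_rhs(x, v, delta):
    """Assumed model: unforced double-well Duffing, x'' + delta*x' - x + x**3 = 0."""
    return v, -delta * v + x - x ** 3


def basin_label(x, v, delta=0.5, dt=0.01, steps=5000):
    """Integrate one initial condition with RK4; return +1 or -1 for the well reached."""
    for _ in range(steps):
        k1x, k1v = duffing_rhs(x, v, delta)
        k2x, k2v = duffing_rhs(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v, delta)
        k3x, k3v = duffing_rhs(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v, delta)
        k4x, k4v = duffing_rhs(x + dt * k3x, v + dt * k3v, delta)
        x += dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
        v += dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
    return 1 if x > 0 else -1


def basin_grid(xs, vs):
    """Label every (x, v) pair. No call depends on another, so the work can be split freely."""
    return [[basin_label(x, v) for x in xs] for v in vs]
```

Because `basin_label` shares no state between calls, the nested loop in `basin_grid` could be handed to separate threads, processes or GPU threads without changing the result.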
PL
SLAM is an algorithm for simultaneously mapping the surroundings and localizing within the map being created. It is used in autonomous robots intended to work in unknown or dynamically changing environments. In its basic form it uses a distance sensor, such as a lidar or radar, and displacement data obtained from encoders. With appropriate strategies for adding successive scans and filtering the acquired data, accurate maps are obtained; however, the use of encoders is not always possible. The article addresses positioning and mapping using a lidar without additional sensors providing odometry data. A suitable algorithm is proposed, together with a discussion of the processors on which it is run (CPU only, and with a GPU supporting CUDA technology). Results are presented in the form of plots of time versus iteration, the point clouds obtained, and the hardware parameters observed while the algorithm was running.
EN
SLAM stands for simultaneous localization and mapping. It is used in the construction of autonomous robots designed to work in topographically unknown areas or dynamically changing environments. In its simplest form it uses a distance sensor, for example a lidar, and displacement data obtained from encoders. With appropriate strategies for adding successive scan iterations and filtering the obtained data, it makes it possible to create accurate maps with minimal computing power. However, the use of encoders is not always possible, as in the case of boats, legged robots or drones. To solve this problem, an algorithm is proposed that allows localization and mapping in the described situation, together with a discussion of the type of processors used by the program. Because of the specifics of the task, many simultaneously obtained measurements must be matched with the created map. For this purpose, the differences were examined between a version of the algorithm that uses only the CPU, spreading the task across different processor threads, and a version that uses graphics computing acceleration, performing the calculations on many parallel CUDA cores. Both implementations were tested in a corridor inside a building, with results presented as charts comparing the time needed for individual iterations to complete.
EN
The Spiking Neural P system is a computing model inspired by how the neurons in a living being are interconnected and exchange information. As a model in membrane computing, it is a non-deterministic and massively parallel system. The latter makes the GPU a good candidate for accelerating the simulation of these models. A matrix representation for systems with and without delay has been previously designed, and algorithms for simulating them with deterministic systems were also developed. So far, non-determinism has been problematic for the design of parallel simulators. In this work, an algorithm for simulating non-deterministic spiking neural P systems with delays is presented. In order to study how the simulations are accelerated on a GPU, this algorithm was implemented in CUDA and used to simulate non-uniform and uniform solutions to the Subset Sum problem as a case study. The analysis is completed with a comparison of the time and space resources used on the GPU by such simulations.
PL
The article presents a practical implementation of an application that solves an example genetic algorithm using GPU accelerators. In this case, it was decided to use a genetic algorithm to solve a typical optimization problem, the travelling salesman problem. Additionally, to exploit the power of the graphics card in the application, a technology for programming on the graphics card was used: Nvidia CUDA.
EN
The paper presents a practical implementation of a local desktop application that solves an example genetic algorithm using GPU accelerators. In this case, a genetic algorithm was used to solve a typical optimization problem, the travelling salesman problem. Additionally, Nvidia CUDA programming technology was used to exploit the power of the GPU in the application.
EN
This paper addresses the problem of efficiently searching for Nonlinear Feedback Shift Registers (NLFSRs) with a guaranteed full period. The maximum possible period for an n-bit NLFSR is 2ⁿ - 1 (the all-zero state is omitted). A multi-stage hybrid algorithm that exploits the power of Graphics Processing Units (GPUs) was developed for data-parallel throughput computation. Using this algorithm makes it possible to give an extended list of n-bit NLFSRs with maximum period for 7 cryptographically applicable types of feedback functions.
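The full-period condition stated in this abstract can be checked by brute force for small n. The sketch below is only that, not the multi-stage hybrid GPU algorithm the paper describes. The register convention (a Fibonacci register that shifts left, with the feedback bit entering on the right) and the names `nlfsr_period`, `has_full_period`, `linear` and `nonlinear` are assumptions made for illustration.

```python
def nlfsr_period(feedback, n, start=None):
    """Cycle length of an n-bit feedback shift register from `start` (default 0...01)."""
    state = start if start is not None else (0,) * (n - 1) + (1,)
    current = state
    for step in range(1, 2 ** n + 1):
        # Drop the oldest bit and append the new feedback bit.
        current = current[1:] + (feedback(current) & 1,)
        if current == state:
            return step
    return None  # the start state never recurs (singular feedback)


def has_full_period(feedback, n):
    """True when the register visits all 2**n - 1 non-zero states in one cycle."""
    return nlfsr_period(feedback, n) == 2 ** n - 1


# Linear special case with characteristic polynomial x^4 + x + 1 (n = 4).
linear = lambda s: s[0] ^ s[1]
# A small nonlinear feedback (n = 3) that does not reach the maximum period.
nonlinear = lambda s: s[0] ^ s[1] ^ (s[1] & s[2])
```

Each candidate feedback function is tested independently of the others, which is the data-parallel structure a GPU search can exploit; the cost of the walk itself grows as 2ⁿ, which is why brute force is only practical for small registers.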
EN
A robust finite-difference time-domain (FDTD) scheme to model non-linear elastic wave propagation in a homogeneous isotropic material is presented. A formulation based on a rotated staggered grid scheme in a displacement-velocity-stress configuration, incorporating both geometric and material nonlinearities, is proposed. By adopting a parsimonious algorithm, the computational memory requirement is reduced by 50%. Simulations are accelerated by exploiting the massive data parallelism innate to the FDTD approach, using parallel computation on Graphical Processing Units with NVIDIA CUDA's API. For the proposed numerical scheme, the grid convergence criterion and accuracy over propagating distances are investigated. The study is also extended to determine the contribution from geometric and material models at various input amplitude levels. The time and frequency domain signals obtained from the proposed scheme are verified with a commercial finite element solver. The simulation runtimes for an aluminium sample of dimensions 20 mm x 10 mm using a 5 MHz pulse are of the order of one minute, which makes the proposed numerical scheme attractive for modelling nonlinear elastic waves in large domains.
PL
The article presents a robust finite-difference time-domain (FDTD) scheme for modelling the propagation of nonlinear elastic waves in a homogeneous isotropic material. An approach based on rotated staggered grids in a displacement-velocity-stress configuration, covering both geometric and material nonlinearity, is proposed. Applying a parsimonious algorithm reduced the computational memory requirement by 50%. Simulations are accelerated by exploiting the massive data parallelism inherent in the FDTD approach, using parallel computation on graphics processing units (GPUs) with the NVIDIA CUDA API. For the proposed numerical scheme, the grid convergence criterion and the accuracy as a function of propagation distance are investigated. The study was also extended to determine the contribution of the geometric and material models at various input amplitude levels. The time- and frequency-domain signals obtained from the proposed scheme are verified against commercial software that uses the finite element method. Run times for simulating the propagation of a 5 MHz pulse in an aluminium sample of dimensions 20 mm x 10 mm are of the order of one minute, which makes the proposed numerical scheme attractive for modelling nonlinear elastic waves in large domains.
EN
This paper presents an alternative approach to the sequential data classification, based on traditional machine learning algorithms (neural networks, principal component analysis, multivariate Gaussian anomaly detector) and finding the shortest path in a directed acyclic graph, using A* algorithm with a regression-based heuristic. Palm gestures were used as an example of the sequential data and a quadrocopter was the controlled object. The study includes creation of a conceptual model and practical construction of a system using the GPU to ensure the realtime operation. The results present the classification accuracy of chosen gestures and comparison of the computation time between the CPU- and GPU-based solutions.
12
Parallel RANSAC for point cloud registration
EN
In this paper, the design and implementation of a parallel RANSAC algorithm in the CUDA architecture for point cloud registration are presented. First, a serial state-of-the-art method with several heuristic improvements from the literature over basic RANSAC is introduced. Subsequently, its algorithmic parallelization and CUDA implementation details are discussed. The comparative test showed a significant acceleration of program execution. The result is that the local coordinate system of the object in the scene is found in near real-time conditions. The source code is shared on the Internet as part of the Heuros system.
EN
This work concerns the study of 6DSLAM algorithms with an application to robotic mobile mapping systems. The architecture of the 6DSLAM algorithm is designed for the evaluation of different data registration strategies. The algorithm is built around an iterative registration component, for which ICP (Iterative Closest Point), ICP (point to projection), ICP with semantic discrimination of points, LS3D (Least Square Surface Matching), or NDT (Normal Distribution Transform) can be chosen. Loop closing is based on LUM and LS3D. The main research goal was to investigate the semantic discrimination of measured points, which improves the accuracy of the final map, especially in demanding scenarios such as multi-level maps (e.g., climbing stairs). A nearest neighborhood search implementation based on parallel programming is used for the point-to-point, point-to-projection, and semantic-discrimination variants. The 6DSLAM framework is based on modified 3DTK and PCL open source libraries and on parallel programming techniques using NVIDIA CUDA. The paper shows experiments that demonstrate the advantages of the proposed approach in relation to practical applications. The major added value of the presented research is the qualitative and quantitative evaluation based on realistic scenarios, including ground truth data obtained by geodetic survey. The research novelty from the point of view of mobile robotics is the evaluation of the LS3D algorithm, which is well known in geodesy.
14
Porównanie metod obliczeń równoległych OpenMP i CUDA [Comparison of the OpenMP and CUDA parallel computing methods]
PL
Parallel programming means writing programs in such a way that they can be executed simultaneously on many processors. For this article, two parallelized programs were written, one in CUDA C and one in OpenMP (intended for the CPU), as well as one sequential (non-concurrent) program. The fastest parallelization turned out to be the program written in CUDA that uses non-copied memory. A drawback of CUDA is that it works only with NVIDIA hardware.
EN
Parallel programming means developing programs that can be executed truly concurrently on multiprocessor platforms. For the present tests, two parallel programs were developed, one in the CUDA C language and the other using the OpenMP library. An equivalent sequential (non-parallel) program was also developed. The most efficient parallelization was achieved in the CUDA program with page-locked memory. CUDA is handicapped by its limitation to NVIDIA hardware.
15
PL
The article discusses the possibilities of using graphics cards to accelerate numerical computations based on the method of moments. Method-of-moments algorithms implemented in a heterogeneous CPU/GPU environment are described, and a detailed analysis of the achievable speedups for different generations of the CUDA architecture is carried out.
EN
This paper presents the use of GPUs to accelerate numerical simulations based on the Method of Moments (MoM). An implementation of the MoM on a heterogeneous CPU/GPU platform and the measured speedups for three generations of the CUDA architecture are also demonstrated.
16
Effective biclustering on GPU - capabilities and constraints
PL
The article presents the benefits and limitations of designing a parallel biclustering algorithm intended for the GPU. A definition of biclustering is given and the GPU architecture is briefly described. Popular algorithm implementation strategy patterns, useful in designing efficient GPU solutions, are summarized. The publication also contains practical programming tips in the context of implementing biclustering algorithms in CUDA/OpenCL.
EN
This article presents the benefits and limitations related to designing a parallel biclustering algorithm on a GPU. A definition of biclustering is provided together with a brief description of the GPU architecture. We then review algorithm strategy patterns, which are helpful in providing efficient implementations on the GPU. Finally, we highlight programming aspects of implementing biclustering algorithms in the CUDA/OpenCL programming languages.
EN
One of the most challenging issues in the case of many-core and multi-core architectures is how to exploit their potential computing power in legacy systems without a deep knowledge of their architecture. The analysis of static and dynamic data dependences of a program run can help to identify independent paths that could be computed by individual parallel threads. Statistics on data reuse and data size are also crucial in adapting an application to the GPU many-core hardware architecture because of its specific memory hierarchies. The proposed profiling system performs static data analysis and computes dynamic dependencies for Java programs, and recommends the parts of the source code with the highest potential for parallelization on the GPU. Such an analysis can also provide a starting point for automatic parallelization.
EN
P systems have been proven to be useful as modeling tools in many fields, such as Systems Biology and Ecological Modeling. For such applications, the acceleration of P system simulation is often desired, given the computational needs derived from these kinds of models. One promising solution is to implement the inherent parallelism of P systems on platforms with parallel architectures. In this respect, GPU computing has proved to be an alternative to more classic approaches in Parallel Computing. It provides a low-cost manycore platform with a high level of parallelism. The GPU has already been employed to speed up the simulation of P systems. In this paper, we look over the available parallel P system simulators on the GPU, with special emphasis on those included in the PMCGPU project, and analyze some useful guidelines for future implementations and developments.
EN
In this paper we present a new multi-frontal solver for the isogeometric collocation method (ISO-C) on the GPU. The ISO-C method constitutes an alternative to the isogeometric finite element method (ISO-FEM). The key advantage of ISO-C over ISO-FEM is that it does not include the computationally intensive operation of integrating the variational formulation. The ISO-C method requires only a single collocation point per basis function, whereas in ISO-FEM, Gaussian quadrature is applied at many points in each finite element. The presented multi-frontal solver for the collocation method results in logarithmic execution time, assuming that a large enough number of GPU processors is available. In this article, the method is employed for an exemplary 1D nanolithography problem of Step-and-Flash Imprint Lithography (SFIL). The algorithm, however, may be applied to a wide class of 2D and 3D problems.
PL
In this article we present a new multi-frontal solver for the isogeometric collocation method (ISO-C) on the GPU. The ISO-C method is an alternative to the isogeometric finite element method (ISO-FEM). The main advantage of ISO-C is that it reduces the considerable computational cost of integrating the variational formulation that occurs in ISO-FEM. ISO-C requires only one collocation point per basis function, whereas ISO-FEM involves applying Gaussian quadrature at many points on each finite element. The presented multi-frontal solver for the collocation method achieves logarithmic computational complexity, assuming a sufficiently large number of GPU processors. This publication presents a simple use of the method for an example one-dimensional nanolithography problem, Step-and-Flash Imprint Lithography (SFIL). The presented algorithm is, however, generally applicable to a wide class of problems in two and three dimensions.
20
Monte Carlo Simulations of the Ising Model on GPU
EN
Monte Carlo simulations of the two- and three-dimensional Ising model on graphics cards (GPU) are described. The standard Metropolis algorithm has been employed. In the framework of the implementation developed by us, simulations were up to 100 times faster than their sequential CPU analogues. It is possible to perform simulations for systems containing up to 10⁹ spins on a Tesla C2050 GPU. As a physical application, higher cumulants for the 3d Ising model have been calculated.
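A minimal sketch of the Metropolis update named in this abstract, for the two-dimensional model with coupling J = 1 and periodic boundaries. The checkerboard ordering is an assumption: it is a common way to expose parallelism, but the abstract does not say which update scheme the authors used. The names `metropolis_sweep` and `magnetization` are hypothetical, and the loops run sequentially here.

```python
import math
import random


def metropolis_sweep(spins, beta, rng):
    """One checkerboard sweep over an n x n lattice of +1/-1 spins (n even)."""
    n = len(spins)
    for colour in (0, 1):
        # Sites of one colour only have neighbours of the other colour, so on a
        # GPU all sites of the same colour could be updated at the same time.
        for i in range(n):
            for j in range(n):
                if (i + j) % 2 != colour:
                    continue
                neighbours = (spins[(i + 1) % n][j] + spins[(i - 1) % n][j]
                              + spins[i][(j + 1) % n] + spins[i][(j - 1) % n])
                delta_e = 2 * spins[i][j] * neighbours  # energy change of a flip
                if delta_e <= 0 or rng.random() < math.exp(-beta * delta_e):
                    spins[i][j] = -spins[i][j]


def magnetization(spins):
    """Mean spin of the lattice."""
    n = len(spins)
    return sum(sum(row) for row in spins) / (n * n)
```

A run would typically create `rng = random.Random(seed)`, call `metropolis_sweep` many times at a fixed inverse temperature `beta`, and average `magnetization` or higher moments over the later sweeps.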