Wyznaczanie równoległości pętli programowych w aplikacjach dedykowanych dla procesorów graficznych

Bielecki, W.; Pałkowski, M.

Title variants

Parallelizing program loops for graphics processing in general purpose computing

Abstracts

Extracting parallelism in the form of independent code fragments allows parallel program loops to be generated automatically. Such code makes it possible to exploit the computing power of parallel machines, including multiprocessor graphics cards. This article analyzes the application of algorithms for extracting code fragments to applications targeting graphics processors. The speed-up and efficiency of the computations and the scalability of the generated parallel code were examined.

Extracting synchronization-free slices allows parallel loops to be generated automatically. The resulting code can be executed on multiprocessor machines in a reduced period of time. Slicing techniques also make it possible to generate parallel code for general-purpose computing on graphics processing units. Nowadays, graphics cards support the execution of multi-threaded applications. GPU systems consist of tens or hundreds of processors. CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA. Graphics processing units (GPUs) are accessible to software developers through variants of industry-standard programming languages. Using CUDA, the latest NVIDIA GPUs become accessible for computation like CPUs. The model for GPU computing is to use a CPU and a GPU together in a heterogeneous co-processing computing model. The sequential part of the application runs on the CPU and the computationally intensive part is accelerated by the GPU. From the user's perspective, the application simply runs faster because it exploits the high performance of the GPU. In this paper, slicing algorithms for generating parallel code for graphics cards are examined. A short example of the code is presented. CUDA statements and techniques are explained. Memory cost and data transfer are considered. The speed-up, efficiency and scalability of the code are analyzed.
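
The idea of synchronization-free slices described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm or their generated CUDA code: the loop nest, the function names (run_sequential, run_slice, run_sliced) and the use of a Python thread pool in place of GPU threads are all assumptions made for illustration. In the assumed loop, every dependence runs along i for a fixed j, so each column j forms an independent slice that needs no synchronization with the others.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sequential(a):
    """Reference loop nest (hypothetical example): a[i][j] = a[i-1][j] + 1."""
    out = [row[:] for row in a]
    for i in range(1, len(out)):
        for j in range(len(out[0])):
            out[i][j] = out[i - 1][j] + 1
    return out


def run_slice(out, j):
    """Execute one slice: all iterations (i, j) for a fixed j, in order of i."""
    for i in range(1, len(out)):
        out[i][j] = out[i - 1][j] + 1


def run_sliced(a, workers=4):
    """Run every slice as a separate task; slices touch disjoint columns."""
    out = [row[:] for row in a]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list(...) waits for all slices and re-raises any worker exception.
        list(pool.map(lambda j: run_slice(out, j), range(len(out[0]))))
    return out
```

On a GPU, each slice would typically be mapped to one thread, with the input copied to device memory before the kernel launch and the result copied back afterwards. The thread pool here only demonstrates that the slices are independent; it is not expected to give a real speed-up in CPython.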

Keywords

loop parallelization, code fragments, GPU, CUDA, OpenCL, high-performance computing

loop parallelization, GPU, CUDA, OpenCL, slices

Publisher

Wydawnictwo PAK

Journal

Pomiary Automatyka Kontrola

Year

2011

Volume

Vol. 57, no. 8

Pages

963-965

Physical description

Bibliography: 16 items, figures, tables.

Contributors

author

Bielecki W.

author

Pałkowski M.

Department of Software Engineering, Faculty of Computer Science, West Pomeranian University of Technology (Zachodniopomorski Uniwersytet Technologiczny), ul. Żołnierska 49, 71-210 Szczecin, wbielecki@wi.zut.edu.pl

Bibliography

[1] OpenCL 1.1, The open standard for parallel programming of heterogeneous systems, http://www.khronos.org/opencl/, 2010.
[2] NVIDIA CUDA C Programming Guide, v. 3.1.11, NVIDIA Corporation, 2010. http://developer.nvidia.com/object/cuda_3_1_downloads.html
[3] Lim A. W., Lam M., Cheong G.: An affine partitioning algorithm to maximize parallelism and minimize communication. In ICS'99, pp. 228-237. ACM Press, 1999.
[4] Feautrier P.: Some efficient solutions to the affine scheduling problem, part I, II, one dimensional time. International Journal of Parallel Programming 21 (1992), pp. 313-348, 389-420.
[5] Banerjee U.: Unimodular transformations of double loops. Proceedings of the Third Workshop on Languages and Compilers for Parallel Computing, 1990, pp. 192-219.
[6] Beletska A., Bielecki W., Cohen A., Palkowski M.: Synchronization-free automatic parallelization: Beyond affine iteration-space slicing. In Languages and Compilers for Parallel Computing (LCPC'09), LNCS. Springer-Verlag, 2009.
[7] Pugh W., Rosser E.: Iteration Space Slicing and Its Application to Communication Optimization. Proceedings of the International Conference on Supercomputing, 1997, pp. 221-228.
[8] Pugh W., Wonnacott D.: An exact method for analysis of value-based array data dependences. In Sixth Annual Workshop on Programming Languages and Compilers for Parallel Computing. Springer-Verlag, 1993.
[9] The Omega project. http://www.cs.umd.edu/projects/omega
[10] Kelly W., Maslov V., Pugh W., Rosser E., Shpeisman T., Wonnacott D.: The Omega library interface guide. Technical report, USA, 1995.
[11] Beletska A., Bielecki W., Siedlecki K., San Pietro P.: Finding synchronization-free slices of operations in arbitrarily nested loops. In ICCSA (2), volume 5073 of Lecture Notes in Computer Science, pp. 871-886. Springer, 2008.
[12] Iteration Space Slicing Framework project website, http://sfs.zut.edu.pl
[13] Bastoul C.: Code generation in the polyhedral model is easier than you think. In PACT 2004, pp. 7-16, Juan-les-Pins, September 2004.
[14] Bielecki W., Beletska A., Palkowski M., San Pietro P.: Extracting synchronization-free trees composed of non-uniform loop operations. Algorithms and Architectures for Parallel Processing, Lecture Notes in Computer Science, volume 5022/2008, Springer Berlin / Heidelberg, 2008, pp. 185-195.
[15] Bielecki W., Pałkowski M.: Using message passing for developing coarse-grained applications in OpenMP. Proceedings of Third International Conference on Software and Data - ICSOFT 2008, Porto, Portugal, 2008, pp. 145-153.
[16] NAS Parallel Benchmarks v.3.2, 2010, http://www.nas.nasa.gov/Resources/Software/npb.html

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BSW4-0104-0047