Programming NVIDIA cards by means of transitive closure based parallelization algorithms

Palkowski, M.; Bielecki, W.

Selected full texts from this journal

http://pe.org.pl/

Title variants

Metody tworzenia aplikacji równoległych dla wielordzeniowych komputerów (Methods for creating parallel applications for multi-core computers)

Abstracts

Massively parallel processing is a type of computing in which many separate CPUs or GPUs run in parallel to execute a single program. Because most computation takes place in program loops, automatically extracting the parallelism available in loops is extremely important for many-core systems. In this paper, we study the speed-up and scalability of parallel code that scans synchronization-free slices and time partitions on a Tesla S1070, a machine with 960 CUDA cores.

Large-scale parallel processing is carried out by many processors (including graphics processors) that simultaneously execute the instructions of a single program. Because most computation is located in program loops, automatic code parallelization is important for multi-core machines. The paper examines the speed-up and scalability of parallel code composed of independent fragments or produced by free scheduling, using a Tesla S1070 machine built from 960 CUDA cores.
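The two notions named in the abstracts, synchronization-free slices and the time partitions of a free schedule, can be illustrated on a toy loop. The sketch below is only an illustration under stated assumptions, not the authors' algorithm: the loop `a[i][j] = a[i][j-1] + 1` is a hypothetical example, the dependence relation is enumerated explicitly for a small `n` rather than handled symbolically, and all function names are invented here. Slices are taken to be the weakly connected components of the dependence graph, and each time partition groups the iterations whose longest chain of preceding dependences has the same length.

```python
from collections import defaultdict


def iterations(n):
    """Iteration space of the hypothetical loop nest:
    for i in 0..n-1, for j in 1..n-1: a[i][j] = a[i][j-1] + 1."""
    return [(i, j) for i in range(n) for j in range(1, n)]


def dependences(n):
    """Explicit dependence relation: iteration (i, j) writes a[i][j],
    which iteration (i, j + 1) reads."""
    return [((i, j), (i, j + 1)) for i in range(n) for j in range(1, n - 1)]


def synchronization_free_slices(nodes, edges):
    """Group iterations into weakly connected components of the
    dependence graph; no dependence crosses between two groups."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for src, dst in edges:
        parent[find(src)] = find(dst)

    groups = defaultdict(list)
    for v in nodes:
        groups[find(v)].append(v)
    return sorted(sorted(g) for g in groups.values())


def free_schedule(nodes, edges):
    """Assign each iteration the earliest time at which all of its
    dependence sources have executed; return the time partitions."""
    succ = defaultdict(list)
    indeg = {v: 0 for v in nodes}
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1

    time = {v: 0 for v in nodes}
    frontier = [v for v in nodes if indeg[v] == 0]
    while frontier:
        nxt = []
        for v in frontier:
            for w in succ[v]:
                time[w] = max(time[w], time[v] + 1)
                indeg[w] -= 1
                if indeg[w] == 0:
                    nxt.append(w)
        frontier = nxt

    partitions = defaultdict(list)
    for v in nodes:
        partitions[time[v]].append(v)
    return [sorted(partitions[t]) for t in sorted(partitions)]


nodes = iterations(4)
edges = dependences(4)
slices = synchronization_free_slices(nodes, edges)
partitions = free_schedule(nodes, edges)
print(len(slices), "slices;", len(partitions), "time partitions")
```

For this toy loop each row `i` forms one slice that can run without any synchronization with the other rows, while each time partition holds one column `j`, whose iterations can run in parallel once the previous partition has finished.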

Keywords

parallel program loops; many-core machines; synchronization-free slicing; free scheduling

parallel program loops; multi-core computers; independent code fragments; free scheduling

Publisher

Wydawnictwo SIGMA-NOT

Journal

Przegląd Elektrotechniczny

Year

2012

Volume

Vol. 88, no. 10b

Pages

217-222

Physical description

Bibliography: 16 items, figures, tables.

Contributors

author

Palkowski M.

author

Bielecki W.

Zachodniopomorski Uniwersytet Technologiczny, Katedra Inżynierii Oprogramowania, ul. Żołnierska 49, 71-210 Szczecin, mpalkowski@wi.zut.edu.pl

Bibliography

[1] Beletska A., Bielecki W., Cohen A., Palkowski M., Siedlecki K., Coarse-grained loop parallelization: Iteration space slicing vs affine transformations, Parallel Computing, No. 37, (2011), 479-497.
[2] Beletska A., Bielecki W., Siedlecki K., San Pietro P., Finding synchronization-free slices of operations in arbitrarily nested loops. In ICCSA (2), volume 5073 of Lecture Notes in Computer Science, Springer, (2008), 871-886.
[3] Pugh W., Wonnacott D., An exact method for analysis of value-based array data dependences. In Sixth Annual Workshop on Programming Languages and Compilers for Parallel Computing. Springer-Verlag, (1993).
[4] The Omega project. http://www.cs.umd.edu/projects/omega.
[5] Bielecki W., Palkowski M., Using Free Scheduling for Programming NVIDIA Cards, Proceedings of the 2nd Facing the Multicore-Challenge Conference, Karlsruhe, Germany, (2011).
[6] Bastoul C., Code generation in the polyhedral model is easier than you think, In IEEE Intl. Conf. on Parallel Architectures and Compilation Techniques PACT’04, (2004).
[7] Verdoolaege S., Integer Set Library: Manual, version 0.0.9, http://www.kotnet.org/~skimo/isl/manual.pdf, (2011).
[8] Specification, Tesla S1070 GPU Computing System, http://www.nvidia.com/docs/IO/43395/SP-04154-001_v02.pdf, (2008).
[9] Tesla Data Center Solutions, S1070 Product Brief, http://www.nvidia.com/object/preconfigured-clusters.html, (2011).
[10] The NAS benchmark suite, http://www.nas.nasa.gov.
[11] NVIDIA CUDA, Programming guide, http://developer.download.nvidia.com/compute/cuda/4_0/toolkit/docs/CUDA_C_Programming_Guide.pdf, version 4.0, (2011).
[12] Lengauer C., Loop Parallelization in the Polytope Model. In Eike Best, editor, CONCUR'93, number 715 in Lecture Notes in Computer Science, Springer-Verlag, (1993), 398-416.
[13] Baghdadi S., Größlinger A., Cohen A., Putting Automatic Polyhedral Compilation for GPGPU to Work. In Proc. of Compilers for Parallel Computers (CPC), (2010).
[14] Feautrier P., Some efficient solutions to the affine scheduling problem, part I and II, one and multidimensional time, International Journal of Parallel Programming 21, (1992), pp. 313-348 and 389-420.
[15] Lim A., Lam M., Cheong G., An affine partitioning algorithm to maximize parallelism and minimize communication. In ICS'99, ACM Press, (1999), 228-237.
[16] PLUTO - An automatic parallelizer and locality optimizer for multicores, http://pluto-compiler.sourceforge.net.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BPS1-0049-0057