Performance enhancement of CUDA applications by overlapping data transfer and Kernel execution

Raju, K.; Chiplunkar, Niranjan N

doi:10.23743/acs-2021-17

Abstract

The CPU-GPU combination is a widely used heterogeneous computing system in which the CPU and the GPU have separate address spaces. Because the GPU cannot directly access CPU memory, the input data must be available in GPU memory before a GPU function (kernel) is invoked. When the GPU function completes, the results of the computation are transferred back to CPU memory. CPU-GPU data transfer takes place over the PCI Express (PCI-E) bus, whose bandwidth is much lower than that of GPU memory. The transfer speed is therefore limited by the PCI-E bandwidth, and the PCI-E bus acts as a performance bottleneck. This paper discusses two approaches to minimizing the data transfer overhead: performing the data transfer while the GPU function is executing, and reducing the amount of data to be transferred to the GPU. The effect of these approaches on the execution time of a set of CUDA applications is evaluated using CUDA streams. The experimental results show that the execution time of the applications can be reduced with the proposed approaches.
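
The first approach named in the abstract, overlapping transfer with kernel execution, is commonly expressed in CUDA by splitting the input into chunks and issuing each chunk's copy, kernel launch, and copy-back on its own stream. The following is a minimal sketch of that general pattern, not the authors' implementation: the kernel `scale`, the sizes `N` and `NSTREAMS`, and the `CHECK` macro are hypothetical names chosen for illustration. It assumes a device that supports concurrent copy and execution, and it uses pinned host memory, which `cudaMemcpyAsync` needs in order to run asynchronously with respect to the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical error-check helper for this sketch.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel: doubles every element of its chunk.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int N = 1 << 22;            // total elements (assumed size)
    const int NSTREAMS = 4;           // number of chunks/streams (assumed)
    const int CHUNK = N / NSTREAMS;   // N is divisible by NSTREAMS here
    const size_t chunkBytes = CHUNK * sizeof(float);

    float *h = nullptr, *d = nullptr;
    CHECK(cudaMallocHost(&h, N * sizeof(float)));  // pinned host buffer
    CHECK(cudaMalloc(&d, N * sizeof(float)));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t streams[NSTREAMS];
    for (int k = 0; k < NSTREAMS; ++k) CHECK(cudaStreamCreate(&streams[k]));

    // Operations within one stream run in order; operations in different
    // streams may overlap, so the copy for chunk k+1 can proceed while
    // the kernel for chunk k is still running.
    for (int k = 0; k < NSTREAMS; ++k) {
        const int off = k * CHUNK;
        CHECK(cudaMemcpyAsync(d + off, h + off, chunkBytes,
                              cudaMemcpyHostToDevice, streams[k]));
        scale<<<(CHUNK + 255) / 256, 256, 0, streams[k]>>>(d + off, CHUNK);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h + off, d + off, chunkBytes,
                              cudaMemcpyDeviceToHost, streams[k]));
    }
    CHECK(cudaDeviceSynchronize());   // wait for all streams to finish

    int mismatches = 0;
    for (int i = 0; i < N; ++i) if (h[i] != 2.0f) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    for (int k = 0; k < NSTREAMS; ++k) CHECK(cudaStreamDestroy(streams[k]));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return mismatches != 0;
}
```

How much overlap is actually achieved depends on the device's copy engines and on the relative cost of transfer and computation, so any speedup should be measured, for example with a profiler timeline, rather than assumed.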

Keywords

CPU-GPU, high-performance computing, kernel, data transfer, CUDA streams

Publisher

Polskie Towarzystwo Promocji Wiedzy
Lublin University of Technology

Journal

Applied Computer Science

Year

2021

Volume

Vol. 17, no. 3

Pages

5–18

Physical description

Bibliography: 24 items, figures, tables.

Contributors

author

Raju K.

rajuk@nitte.edu.in

Department of CSE, NMAM Institute of Technology, Nitte, India

https://orcid.org/0000-0001-6731-5427

author

Chiplunkar Niranjan N

nchiplunkar@nitte.edu.in

Department of CSE, NMAM Institute of Technology, Nitte, India

https://orcid.org/0000-0003-4223-2355

Notes

Record prepared with funding from MNiSW, agreement no. 461252, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science), module: Popularisation of science and promotion of sport (2021).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-2e71f97b-a288-4c8e-9b16-f515e81e035b