Article title

Execution time prediction model for parallel GPU realizations of discrete transforms computation algorithms

Content
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
Parallel realizations of discrete transform (DT) computation algorithms (DTCAs) performed on graphics processing units (GPUs) play a significant role in many modern data processing methods utilized in numerous areas of human activity. In this paper the authors propose a novel execution time prediction model which allows for accurate and rapid estimation of the execution times of structurally different kinds of DTCAs performed on GPUs of distinct architectures, without the necessity of conducting the actual experiments on physical hardware. The model can serve as a guide for the system analyst in making the optimal choice of GPU hardware for a given computational task involving a particular DT calculation, or can help in choosing the most appropriate parallel implementation of the selected DT, given the limitations imposed by the available hardware. Restricting the model to only the key common features of DTCAs enables the authors to significantly simplify its structure, leading consequently to its design as a hybrid, analytical–simulational method that jointly exploits the main advantages of both techniques, namely time-effectiveness and high prediction accuracy, while at the same time mutually eliminating their major weaknesses within the proposed solution. The model is validated experimentally on two structurally different parallel methods of discrete wavelet transform (DWT) computation, i.e. the direct convolution-based and the lattice structure-based schemes, by comparing its predictions with actual measurements taken for six different graphics cards representing a fairly broad spectrum of GPU compute architectures.
Experimental results reveal the overall average execution time prediction accuracy of the model to be at a level of 97.2%, with a global maximum prediction error of 14.5% recorded throughout all the conducted experiments, while maintaining a high average evaluation speed of 3.5 ms per single simulation. These results support the model's generality and the possibility of extrapolating it to other DTCAs and different GPU architectures, which, along with its straightforwardness, time-effectiveness and ease of practical application, makes it, in the authors' opinion, a very interesting alternative to the related existing solutions.
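To make the contrast between the two validated DWT realizations concrete, the following is a minimal NumPy sketch (an editorial illustration, not the authors' code) of one decomposition level computed both ways for the Haar wavelet; the filter coefficients, rotation angle, and function names are assumptions chosen for simplicity, and the paper's experiments are not restricted to this basis.

```python
import numpy as np

def dwt_convolution(x, low, high):
    """Direct convolution-based scheme: filter the signal with the
    low-/high-pass analysis filters, then downsample by 2
    (even-length input assumed)."""
    return np.convolve(x, low)[1::2], np.convolve(x, high)[1::2]

def dwt_lattice_haar(x, theta=np.pi / 4):
    """Lattice structure-based scheme for the Haar case: a single
    plane rotation (butterfly) applied to each sample pair."""
    c, s = np.cos(theta), np.sin(theta)
    x0, x1 = x[0::2], x[1::2]
    return c * x0 + s * x1, -s * x0 + c * x1

c = 1.0 / np.sqrt(2.0)
low, high = np.array([c, c]), np.array([c, -c])  # Haar analysis filters

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a1, d1 = dwt_convolution(x, low, high)   # approximation, detail
a2, d2 = dwt_lattice_haar(x)             # same result via rotations
```

For longer filters the lattice scheme generalizes to a cascade of such rotation stages; this structural difference in how the arithmetic is organized is what makes the two GPU realizations distinct workloads for an execution time prediction model.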
Year
Pages
art. no. e139393
Physical description
Bibliography: 64 items, figures, tables
Authors
  • Institute of Information Technology, Łódź University of Technology, ul. Wólczańska 215, 90-924 Łódź, Poland
  • Institute of Information Technology, Łódź University of Technology, ul. Wólczańska 215, 90-924 Łódź, Poland
  • Institute of Information Technology, Łódź University of Technology, ul. Wólczańska 215, 90-924 Łódź, Poland
Bibliography
  • [1] U.N. Ahmed and K.R. Rao, Orthogonal Transforms for Digital Signal Processing. Secaucus, NJ, USA: Springer-Verlag, New York, Inc., 1974.
  • [2] Y. Su and Z. Xu, “Parallel implementation of wavelet-based image denoising on programmable pc-grade graphics hardware,” Signal Process., vol. 90, pp. 2396–2411, 2010, doi: 10.1016/j.sigpro.2009.06.019.
  • [3] P. Lipinski and D. Puchala, “Digital image watermarking using fast parametric transforms,” Bull. Pol. Acad. Sci. Tech. Sci., vol. 67, pp. 463–477, 2019.
  • [4] K.R. Rao and P. Yip, Discrete cosine transform: algorithms, advantages, applications. San Diego, CA, USA: Academic Press Professional, Inc., 1990.
  • [5] D. Salomon, A Guide to Data Compression Methods. New York: Springer-Verlag.
  • [6] D. Puchala and M. Yatsymirskyy, “Joint compression and encryption of visual data using orthogonal parametric transforms,” Bull. Pol. Acad. Sci. Tech. Sci., vol. 64, pp. 373–382, 2016.
  • [7] M. Akay, Time Frequency and Wavelets in Biomedical Signal Processing, ser. IEEE Press Series in Biomed. Eng. Wiley-IEEE Press, 1998.
  • [8] S. Babichev, J. Skvor, J. Fiser, and V. Lytvynenko, “Technology of gene expression profiles filtering based on wavelet analysis,” Int. J. Intell. Syst. Appl., vol. 10, pp. 1–7, 2018.
  • [9] Z. Jakovljevic, R. Puzovic, and M. Pajic, “Recognition of planar segments in point cloud based on wavelet transform,” IEEE Trans. Ind. Inf., vol. 11, no. 2, pp. 342–352, 2015.
  • [10] J. Cheng, M. Grossman, and T. McKercher, Professional CUDA C Programming. Indianapolis, IN 46256: John Wiley & Sons, Inc., 2014.
  • [11] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 2010.
  • [12] G. Barlas, Multicore and GPU Programming: An Integrated Approach. Morgan Kaufmann Publishers, 2015.
  • [13] K. Stokfiszewski and K. Wieloch, “Time effectiveness optimization of cross correlation methods for OCR systems with the use of graphics processing units,” J. Appl. Comput. Sci., vol. 23, no. 2, pp. 79–100, 2015.
  • [14] A. Wojciechowski and T. Gałaj, “GPU supported dual quaternions based skinning,” in Computer Game Innovations. A. Wojciechowski, P. Napieralski (Eds.), Lodz University of Technology Press, 2016, pp. 5–23.
  • [15] M. Wawrzonowski, D. Szajerman, M. Daszuta, and P. Napieralski, “Mobile devices’ GPUs in cloth dynamics simulation,” in Proceedings of the Federated Conference on Computer Science and Information Systems. M. Ganzha, L. Maciaszek, M. Paprzycki (Eds.), 2017, pp. 1283–1290.
  • [16] D. Puchala, K. Stokfiszewski, B. Szczepaniak, and M. Yatsymirskyy, “Effectiveness of fast Fourier transform implementations on GPU and CPU,” Przegląd Elektrotechniczny, vol. 92, no. 7, pp. 69–71, 2016.
  • [17] K. Stokfiszewski, K. Wieloch, and M. Yatsymirskyy, “The fast Fourier transform partitioning scheme for GPU’s computation effectiveness improvement,” in Advances in Intelligent Systems and Computing II (CSIT), N. Shakhovska and V. Stepashko (Eds.), Springer, Cham, 2017, vol. 689, no. 1, pp. 511–522.
  • [18] B.H.H. Juurlink and H.A.G. Wijshoff, “A quantitative comparison of parallel computation models,” ACM Trans. Comput. Syst., vol. 16, no. 3, pp. 271–318, 1998.
  • [19] S.G. Akl, Parallel Computation: Models and Methods. Upper Saddle River, NJ: Prentice Hall, 1997.
  • [20] A. Madougou, S. Varbanescu, C. Laat, and R. van Nieuwpoort, “The landscape of GPGPU performance modeling tools,” Parallel Comput., vol. 56, pp. 18–33, 2016.
  • [21] H. Sunpyo and K. Hyesoon, “An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness,” ACM SIGARCH Comput. Architect. News, vol. 37, pp. 152–163, 2009.
  • [22] C. Luo and R. Suda, “An execution time prediction analytical model for GPU with instruction-level and thread-level parallelism awareness,” IPSJ SIG Tech. Rep., vol. 2011-HPC-130, no. 19, pp. 1–9, 2011.
  • [23] M. Amaris, D. Cordeiro, A. Goldman, and R.Y. de Camargo, “A simple BSP-based model to predict execution time in GPU applications,” in Proc. IEEE 22nd International Conference on High Performance Computing (HiPC), 2015, pp. 285–294.
  • [24] L. Ma, R.D. Chamberlain, and K. Agrawal, “Performance modeling for highly-threaded many-core GPUs,” in Proc. IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors, 2014, pp. 84–91.
  • [25] K. Kothapalli, R. Mukherjee, M.S. Rehman, S. Patidar, P.J. Narayanan, and K. Srinathan, “A performance prediction model for the CUDA GPGPU platform,” in Proc. International Conference on High Performance Computing (HiPC), 2009, pp. 463–472.
  • [26] M. Amaris, R.Y. de Camargo, M. Dyab, A. Goldman, and D. Trystram, “A comparison of GPU execution time prediction using machine learning and analytical modeling,” in Proc. 15th IEEE International Symposium on Network Computing and Applications (NCA), 2016, pp. 326–333.
  • [27] A. Karami, S.A. Mirsoleimani, and F. Khunjush, “A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs,” in Proc. 17th CSI Int. Symposium on Computer Architecture & Digital Systems (CADS), 2013, pp. 15–22.
  • [28] A. Kerr, E. Anger, G. Hendry, and S. Yalamanchili, “Eiger: A framework for the automated synthesis of statistical performance models,” in Proc. 19th Int. Conference on High Performance Computing, 2012, pp. 1–6.
  • [29] Y. Zhang, Y. Hu, B. Li, and L. Peng, “Performance and power analysis of ATI GPU: A statistical approach,” in Proc. 6th IEEE International Conference on Networking, Architecture, and Storage, 2011, pp. 149–158.
  • [30] G. Wu, J.L. Greathouse, A. Lyashevsky, N. Jayasena, and D. Chiou, “GPGPU performance and power estimation using machine learning,” in Proc. 21st IEEE Int. Symposium on High Performance Computer Architecture (HPCA), 2015, pp. 564–576.
  • [31] E. Ipek, B. Supinski, M. Schulz, and S. McKee, “An approach to performance prediction for parallel applications,” in Proc. 11th International Euro-Par Conference on Parallel Processing, 2005, pp. 196–205.
  • [32] N. Ardalani, C. Lestourgeon, K. Sankaralingam, and X. Zhu, “Cross architecture performance prediction (XAPP) using CPU code to predict GPU performance,” in Proc. 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2015, pp. 725–737.
  • [33] “GPGPU-Sim project.” [Online]. Available: http://www.gpgpu-sim.org.
  • [34] A. Bakhoda, W.L. Fung, H. Wong, and G.L. Yuan, “Analyzing CUDA workloads using a detailed GPU simulator,” in Proc. ISPASS International Symposium on Performance Analysis of Systems and Software, 2009, pp. 163–174.
  • [35] “GPUSimPow – AES LPGPU Group Power Simulation Project.” [Online]. Available: https://www.aes.tu-berlin.de/menue/forschung/projekte/gpusimpow_simulator/.
  • [36] Z. Yu, L. Eeckhout, N. Goswami, T. Li, L.K. John, H. Jin, C. Xu, and J. Wu, “Accelerating GPGPU micro-architecture simulation,” IEEE Trans. Comput., vol. 64, no. 11, pp. 3153–3166, 2015.
  • [37] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, “Multi2Sim: a simulation framework for CPU-GPU computing,” in Proc. 21st International Conf. on Parallel Architectures and Compilation Techniques (PACT), 2012, pp. 335–344.
  • [38] G. Malhotra, S. Goel, and S. Sarangi, “GpuTejas: a parallel simulator for GPU architectures,” in Proc. 21st International Conference on High Performance Computing, 2014, pp. 1–10.
  • [39] Y. Arafa, A.A. Badawy, G. Chennupati, N. Santhi, and S. Eidenbenz, “PPT-GPU: Scalable GPU performance modeling,” IEEE Comput. Archit. Lett., vol. 18, no. 1, pp. 55–58, 2019.
  • [40] X. Wang, K. Huang, A. Knoll, and X. Qian, “A hybrid framework for fast and accurate GPU performance estimation through source-level analysis and trace-based simulation,” in Proc. IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 506–518.
  • [41] K. Punniyamurthy, B. Boroujerdian, and A. Gerstlauer, “GAT-Sim: Abstract timing simulation of GPUs,” in Proc. Design, Automation & Test, Europe Conf. & Exhibition (DATE), 2017, pp. 43–48.
  • [42] M. Khairy, Z. Shen, T.M. Aamodt, and T.G. Rogers, “Accel-Sim: An extensible simulation framework for validated GPU modeling,” in Proc. 47th IEEE/ACM Int. Symposium on Computer Architecture (ISCA), 2020, pp. 473–486.
  • [43] S. Collange, M. Daumas, D. Defour, and D. Parello, “Barra: A parallel functional simulator for GPGPU,” in Proc. IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2010, pp. 351–360.
  • [44] “GPU Ocelot project: a dynamic compilation framework for GPU computing.” [Online]. Available: http://www.gpuocelot.gatech.edu
  • [45] J. Power, J. Hestness, M.S. Orr, M.D. Hill, and D.A. Wood, “gem5-gpu: A heterogeneous CPU-GPU simulator,” IEEE Comput. Archit. Lett., vol. 14, no. 1, pp. 34–36, 2015.
  • [46] “FusionSim GPU simulator project.” [Online]. Available: https://sites.google.com/site/fusionsimulator/
  • [47] A. Nakonechny and Z. Veres, “The wavelet based trained filter for image interpolation,” in Proc. IEEE 1st International Conference on Data Stream Mining & Processing, 2016, pp. 218–221.
  • [48] G. Strang and T. Nguyen, Wavelets and Filter Banks. Wellesley, MA, USA: Wellesley-Cambridge Press, 1996.
  • [49] P. Lipiński and J. Stolarek, “Improving watermark resistance against removal attacks using orthogonal wavelet adaptation,” in Proc. 38th Conference on Current Trends in Theory and Practice of Computer Science, vol. 7147, 2012, pp. 588–599.
  • [50] D. Bařina, M. Kula, and P. Zemčík, “Parallel wavelet schemes for images,” J. Real-Time Image Process., vol. 16, no. 5, pp. 1365–1381, 2019.
  • [51] D. Bařina, M. Kula, M. Matýšek, and P. Zemčík, “Accelerating discrete wavelet transforms on GPUs,” in Proc. International Conference on Image Processing (ICIP), 2017, pp. 2707–2710.
  • [52] D. Bařina, M. Kula, M. Matýšek, and P. Zemčík, “Accelerating discrete wavelet transforms on parallel architectures,” J. WSCG, vol. 25, no. 2, pp. 77–85, 2017.
  • [53] W. van der Laan, A. Jalba, and J. Roerdink, “Accelerating wavelet lifting on graphics hardware using CUDA,” IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 1, pp. 132–146, 2011.
  • [54] M. Yatsymirskyy, “A novel matrix model of two channel biorthogonal filter banks,” Metody Informatyki Stosowanej, pp. 205–212, 2011.
  • [55] M. Yatsymirskyy and K. Stokfiszewski, “Effectiveness of lattice factorization of two-channel orthogonal filter banks,” in Proc. Joint Conference NTAV/SPA, 2012, pp. 275–279.
  • [56] M. Yatsymirskyy, “Lattice structures for synthesis and implementation of wavelet transforms,” J. Appl. Comput. Sci., vol. 17, no. 1, pp. 133–141, 2009.
  • [57] J. Stolarek, “Adaptive synthesis of a wavelet transform using fast neural network,” Bull. Pol. Acad. Sci. Tech. Sci., vol. 59, pp. 9–13, 2011.
  • [58] D. Puchala, K. Stokfiszewski, K. Wieloch, and M. Yatsymirskyy, “Comparative study of massively parallel GPU realizations of wavelet transform computation with lattice structure and matrix-based approach,” in Proc. IEEE International Conference on Data Stream Mining & Processing, 2018, pp. 88–93.
  • [59] M. Harris, S. Sengupta, and J.D. Owens, “Parallel prefix sum (scan) with CUDA,” in GPU Gems 3, Part VI: GPU Computing, H. Nguyen, Ed. Addison Wesley, 2007, pp. 851–876.
  • [60] S. Sengupta, A.E. Lefohn, and J.D. Owens, “A work-efficient step-efficient prefix sum algorithm,” in Proc. Workshop on Edge Computing Using New Commodity Architectures, 2006, pp. D–26–27.
  • [61] J. Franco, G. Bernabe, J. Fernandez, and M.E. Acacio, “A parallel implementation of the 2D wavelet transform using CUDA,” in Proc. 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2009, pp. 111–118.
  • [62] H. Bantikyan, “CUDA based implementation of 2-D discrete Haar wavelet transformation,” in Proc. International Conference Parallel and Distributed Computing Systems, 2014, pp. 20–26.
  • [63] M.J. Flynn and S.F. Oberman, Advanced Computer Arithmetic Design. New York, NY, USA: John Wiley & Sons, Inc., 2001.
  • [64] Ł. Napierała, “Effectiveness measurements of linear transforms realized on graphics processing units with the use of GPGPUSim emulator” – MSc thesis, Institute of Information Technology, Łódź University of Technology, Poland, 2020.
Document type
YADDA identifier
bwmeta1.element.baztech-d3878e8d-ba8b-49cd-b563-56a5f904ece1