Parallelizing user-defined functions in the ETL workflow using orchestration style sheets

Ali, Syed Muhammad Fawad; Mey, Johannes; Thiele, Maik

doi:10.2478/amcs-2019-0005

Article - details

Article title

Parallelizing user-defined functions in the ETL workflow using orchestration style sheets

Authors

Ali Syed Muhammad Fawad, Mey Johannes, Thiele Maik

Content

Full texts:

05_ali_mey_thiele_parallelizing_user_defined_functions_2019_1.pdf

Identifiers

DOI

10.2478/amcs-2019-0005

Title variants

Publication languages

Abstracts

Today’s ETL tools provide capabilities to develop custom code as user-defined functions (UDFs) to extend the expressiveness of the standard ETL operators. However, while this allows us to easily add new functionalities, it also comes with the risk that the custom code is not intended to be optimized, e.g., by parallelism, and for this reason, it performs poorly for data-intensive ETL workflows. In this paper we present a novel framework, which allows the ETL developer to choose a design pattern in order to write parallelizable code and generates a configuration for the UDFs to be executed in a distributed environment. This enables ETL developers with minimum expertise in distributed and parallel computing to develop UDFs without taking care of parallelization configurations and complexities. We perform experiments on large-scale datasets based on TPC-DS and BigBench. The results show that our approach significantly reduces the effort of ETL developers and at the same time generates efficient parallel configurations to support complex and data-intensive ETL tasks.

Keywords

ETL workflow; parallel ETL operator; parallel algorithmic skeleton; user-defined function

Polish keywords (translated): ETL workflow; parallel algorithmic skeleton; user-defined function

Publisher

Oficyna Wydawnicza Uniwersytetu Zielonogórskiego

Journal

International Journal of Applied Mathematics and Computer Science

Year

2019

Volume

Vol. 29, no. 1

Pages

69–79

Physical description

Bibliography: 27 items, figures, charts.

Contributors

author

Ali Syed Muhammad Fawad

fawadali.ali@gmail.com

Faculty of Computing, Poznań University of Technology, Piotrowo 2, 60-965 Poznań, Poland; Data Engineering, trivago N.V. Leipzig, Bosestrasse 4, 04109, Leipzig, Germany

author

Mey Johannes

Faculty of Computer Science, Technical University of Dresden, Helmholtzstrasse 10, 01069, Dresden, Germany

author

Thiele Maik

Faculty of Computer Science, Technical University of Dresden, Helmholtzstrasse 10, 01069, Dresden, Germany

Bibliography

[1] Ali, S.M.F. (2018). Next-generation ETL framework to address the challenges posed by big data, Workshop Proceedings of the EDBT/ICDT Joint Conference, Vienna, Austria.
[2] Ali, S.M.F. and Wrembel, R. (2017). From conceptual design to performance optimization of ETL workflows: Current state of research and open problems, The VLDB Journal 26(6): 1–25.
[3] Aßmann, U. (2003). Invasive software composition, Invasive Software Composition, Springer, Berlin/Heidelberg, pp. 107–145.
[4] Battré, D., Ewen, S., Hueske, F., Kao, O., Markl, V. and Warneke, D. (2010). Nephele/PACTs: A programming model and execution framework for web-scale analytical processing, Proceedings of the Symposium on Cloud Computing, Indianapolis, IN, USA, pp. 119–130.
[5] Chaiken, R., Jenkins, B., Larson, P.-Å., Ramsey, B., Shakib, D., Weaver, S. and Zhou, J. (2008). Scope: Easy and efficient parallel processing of massive data sets, Proceedings of the VLDB Endowment 1(2): 1265–1276.
[6] Cloudera (2016). Example: Sentiment analysis using MapReduce custom counters, https://www.cloudera.com/documentation/other/tutorial/CDH5/topics/ht_example_4_sentiment_analysis.html.
[7] Dagum, L. and Menon, R. (1998). OpenMP: An industry standard API for shared-memory programming, IEEE Computational Science and Engineering 5(1): 46–55.
[8] Dean, J. and Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters, Communications of the ACM 51(1): 107–113.
[9] Ekman, T. and Hedin, G. (2007). The JastAdd system modular extensible compiler construction, Science of Computer Programming 69(1–3): 14–26.
[10] Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A. and Jacobsen, H.-A. (2013). BigBench: Towards an industry standard benchmark for big data analytics, Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, pp. 1197–1208.
[11] González-Vélez, H. and Kontagora, M. (2011). Performance evaluation of MapReduce using full virtualisation on a departmental cloud, International Journal of Applied Mathematics and Computer Science 21(2): 275–284, DOI: 10.2478/v10006-011-0020-3.
[12] Große, P., May, N. and Lehner, W. (2014). A study of partitioning and parallel UDF execution with the SAP HANA database, Proceedings of the 26th International Conference on Scientific and Statistical Database Management, Aalborg, Denmark, p. 36.
[13] Hedin, G. (2000). Reference attributed grammars, Informatica (Slovenia) 24(3): 301–317.
[14] Karagiannis, A., Vassiliadis, P. and Simitsis, A. (2013). Scheduling strategies for efficient ETL execution, Information Systems 38(6): 927–945.
[15] Karol, S. (2015). Well-formed and Scalable Invasive Software Composition, PhD dissertation, Technische Universität Dresden, Dresden.
[16] Kiczales, G., Lamping, J., Mendhekar, A., Maeda, C., Lopes, C., Loingtier, J.-M. and Irwin, J. (1997). Aspect-oriented programming, in M. Akşit and S. Matsuoka (Eds.), European Conference on Object-oriented Programming, Springer, Berlin/Heidelberg, pp. 220–242.
[17] Kumar, N. and Kumar, P.S. (2010). An efficient heuristic for logical optimization of ETL workflows, International Workshop on Business Intelligence for the Real-Time Enterprise, Singapore, Singapore, pp. 68–83.
[18] Liu, X., Thomsen, C. and Pedersen, T.B. (2013). ETLMR: A highly scalable dimensional ETL framework based on MapReduce, in A. Hameurlain et al. (Eds.), Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII, Springer, Berlin/Heidelberg, pp. 1–31.
[19] Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, pp. 55–60.
[20] Mey, J., Karol, S., Aßmann, U., Huismann, I., Stiller, J. and Fröhlich, J. (2016). Using semantics-aware composition and weaving for multi-variant progressive parallelization, Procedia Computer Science 80: 1554–1565.
[21] Nambiar, R.O. and Poess, M. (2006). The making of TPC-DS, Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, pp. 1049–1058.
[22] Simitsis, A., Vassiliadis, P. and Sellis, T. (2005). State-space optimization of ETL workflows, IEEE Transactions on Knowledge and Data Engineering 17(10): 1404–1419.
[23] Simitsis, A., Wilkinson, K., Dayal, U. and Castellanos, M. (2010). Optimizing ETL workflows for fault-tolerance, IEEE 26th International Conference on Data Engineering (ICDE), Long Beach, CA, USA, pp. 385–396.
[24] Thomsen, C. and Pedersen, T.B. (2011). Easy and effective parallel programmable ETL, Proceedings of the ACM 14th International Workshop on Data Warehousing and OLAP, New York, NY, USA, pp. 37–44.
[25] Tziovara, V., Vassiliadis, P. and Simitsis, A. (2007). Deciding the physical implementation of ETL workflows, Proceedings of the International Workshop on Data Warehousing and OLAP, New York, NY, USA, pp. 49–56.
[26] Vassiliadis, P., Simitsis, A. and Baikousi, E. (2009). A taxonomy of ETL activities, Proceedings of the ACM 12th International Workshop on Data Warehousing and OLAP, New York, NY, USA, pp. 25–32.
[27] Weinberg, A.I. and Last, M. (2017). Interpretable decision-tree induction in a big data parallel framework, International Journal of Applied Mathematics and Computer Science 27(4): 737–748, DOI: 10.1515/amcs-2017-0051.

Notes

Record prepared under agreement 509/P-DUN/2018 with funds from the Ministry of Science and Higher Education (MNiSW) allocated to science dissemination activities (2019).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-9fdf3e40-e26e-4e11-aee9-958bfaa85217