Intermediate Results Materialization Selection and Format for Data-Intensive Flows

Munir, R. F.; Nadal, S.; Romero, O.; Abelló, A.; Jovanovic, P.; Thiele, M.; Lehner, W.

doi:10.3233/FI-2018-1734

Article details

Article title

Intermediate Results Materialization Selection and Format for Data-Intensive Flows

Authors

Munir R. F., Nadal S., Romero O., Abelló A., Jovanovic P., Thiele M., Lehner W.

Selected full texts from this journal

https://fi.episciences.org/

Identifiers

DOI

10.3233/FI-2018-1734

Title variants

Publication languages

Abstracts

Data-intensive flows deploy a variety of complex data transformations to build information pipelines from data sources to different end users. As data are processed, these workflows generate large intermediate results, typically pipelined from one operator to the following ones. Materializing intermediate results that are shared among multiple flows brings benefits not only in terms of performance but also in resource usage and consistency. Similar ideas have been proposed in the context of data warehouses, where they are studied as the materialized view selection problem. With the rise of Big Data systems, new challenges emerge because new quality metrics, captured by service level agreements, must be taken into account. Moreover, the way such results are stored must be reconsidered, as different data layouts can be used to reduce the I/O cost. In this paper, we propose a novel approach for the automatic, multi-objective selection of intermediate results to materialize in data-intensive flows, which can tackle multiple and conflicting quality objectives. In addition, our approach chooses the optimal storage data format for the selected materialized intermediate results based on subsequent access patterns. The experimental results show that our approach provides 40% better average speedup with respect to the current state of the art, as well as an 18% improvement in disk access time compared to fixed-format solutions.

Keywords

big data; data-intensive flows; intermediate results; data format; HDFS

Publisher

IOS Press

Journal

Fundamenta Informaticae

Year

2018

Volume

Vol. 163, no. 2

Pages

111-138

Physical description

Bibliography: 44 items, figures, tables, charts.

Creators

author

Munir R. F.

fmunir@essi.upc.edu

Universitat Politècnica de Catalunya (UPC), Barcelona, Spain

author

Nadal S.

snadal@essi.upc.edu

Universitat Politècnica de Catalunya (UPC), Barcelona, Spain

author

Romero O.

oromero@essi.upc.edu

Universitat Politècnica de Catalunya (UPC), Barcelona, Spain

author

Abelló A.

aabello@essi.upc.edu

Universitat Politècnica de Catalunya (UPC), Barcelona, Spain

author

Jovanovic P.

petar@essi.upc.edu

Universitat Politècnica de Catalunya (UPC), Barcelona, Spain

author

Thiele M.

maik.thiele@tu-dresden.de

Technische Universität Dresden (TUD), Dresden, Germany

author

Lehner W.

wolfgang.lehner@tu-dresden.de

Technische Universität Dresden (TUD), Dresden, Germany

Bibliography

[1] Jovanovic P, Romero O, Abelló A. A Unified View of Data-Intensive Flows in Business Intelligence Systems: A Survey. T. Large-Scale Data- and Knowledge-Centered Systems, 2016. 29:66-107. doi:10.1007/978-3-662-54037-4_3.
[2] Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun. ACM, 2008. 51(1):107-113. doi:10.1145/1327452.1327492.
[3] Chen Y, Alspaugh S, Katz RH. Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads. PVLDB, 2012. 5(12):1802-1813. doi:10.14778/2367502.2367519.
[4] Harinarayan V, Rajaraman A, Ullman JD. Implementing Data Cubes Efficiently. In: SIGMOD. 1996 pp. 205-216. doi:10.1145/233269.233333.
[5] Gupta H, Mumick IS. Selection of Views to Materialize in a Data Warehouse. IEEE Trans. Knowl. Data Eng., 2005. 17(1):24-43. doi:10.1109/TKDE.2005.16.
[6] Alagiannis I, Idreos S, Ailamaki A. H2O: a hands-free adaptive store. In: SIGMOD. 2014 pp. 1103-1114. doi:10.1145/2588555.2610502.
[7] Munir RF, Romero O, Abelló A, Bilalli B, Thiele M, Lehner W. ResilientStore: A Heuristic-Based Data Format Selector for Intermediate Results. In: MEDI. 2016 pp. 42-56. doi:10.1007/978-3-319-45547-1_4.
[8] Azim T, Karpathiotakis M, Ailamaki A. ReCache: Reactive Caching for Fast Analytics over Heterogeneous Data. PVLDB, 2017. 11(3):324-337. doi:10.14778/3157794.3157801.
[9] Nykiel T, Potamias M, Mishra C, Kollios G, Koudas N. MRShare: Sharing Across Multiple Queries in MapReduce. PVLDB, 2010. 3(1):494-505. doi:10.14778/1920841.1920906.
[10] Elghandour I, Aboulnaga A. ReStore: Reusing Results of MapReduce Jobs. PVLDB, 2012. 5(6):586-597. doi:10.14778/2168651.2168659.
[11] Wang G, Chan C. Multi-Query Optimization in MapReduce Framework. PVLDB, 2013. 7(3):145-156. doi:10.14778/2732232.2732234.
[12] Simitsis A, Wilkinson K, Castellanos M, Dayal U. QoX-driven ETL design: reducing the cost of ETL consulting engagements. In: SIGMOD. 2009 pp. 953-960. doi:10.1145/1559845.1559954.
[13] Halevy AY. Answering queries using views: A survey. VLDB J., 2001. 10(4):270-294. doi:10.1007/s007780100054.
[14] Bian H, Yan Y, Tao W, Chen LJ, Chen Y, Du X, Moscibroda T. Wide Table Layout Optimization based on Column Ordering and Duplication. In: SIGMOD. 2017 pp. 299-314. doi:10.1145/3035918.3035930.
[15] Theodoratos D, Bouzeghoub M. A General Framework for the View Selection Problem for Data Warehouse Design and Evolution. In: DOLAP. 2000 pp. 1-8. doi:10.1145/355068.355309.
[16] Theodoratos D, Sellis TK. Dynamic Data Warehouse Design. In: DaWaK. 1999 pp. 1-10. doi:10.1007/3-540-48298-9_1.
[17] Theodoratos D, Sellis TK. Data Warehouse Configuration. In: VLDB. 1997 pp. 126-135. http://www.vldb.org/conf/1997/P126.PDF.
[18] Jovanovic P, Simitsis A, Wilkinson K. Engine independence for logical analytic flows. In: ICDE. 2014 pp. 1060-1071. doi:10.1109/ICDE.2014.6816723.
[19] Jovanovic P, Romero O, Simitsis A, Abelló A. Incremental Consolidation of Data-Intensive Multi-Flows. IEEE Trans. Knowl. Data Eng., 2016. 28(5):1203-1216. doi:10.1109/TKDE.2016.2515609.
[20] Marler RT, Arora JS. Survey of multi-objective optimization methods for engineering. Structural and multidisciplinary optimization, 2004. 26(6):369-395. doi:10.1007/s00158-003-0368-6.
[21] Halasipuram R, Deshpande PM, Padmanabhan S. Determining Essential Statistics for Cost Based Optimization of an ETL Workflow. In: EDBT. 2014 pp. 307-318. doi:10.5441/002/edbt.2014.29.
[22] Garcia-Molina H, Ullman JD, Widom J. Database systems - the complete book (2. ed.). Pearson Education, 2009. ISBN 978-0-13-187325-4.
[23] Aggarwal A, Vitter JS. The Input/Output Complexity of Sorting and Related Problems. Commun. ACM, 1988. 31(9):1116-1127. doi:10.1145/48529.48535.
[24] Afrati FN, Ullman JD. Optimizing Multiway Joins in a Map-Reduce Environment. IEEE Trans. Knowl. Data Eng., 2011. 23(9):1282-1298. doi:10.1109/TKDE.2011.47.
[25] Hueske F, Peters M, Sax M, Rheinländer A, Bergmann R, Krettek A, Tzoumas K. Opening the Black Boxes in Data Flow Optimization. PVLDB, 2012. 5(11):1256-1267. doi:10.14778/2350229.2350244.
[26] Simitsis A, Wilkinson K. Revisiting ETL Benchmarking: The Case for Hybrid Flows. In: TPCTC. 2012 pp. 75-91. doi:10.1007/978-3-642-36727-4_6.
[27] Nguyen T, Bimonte S, d’Orazio L, Darmont J. Cost models for view materialization in the cloud. In: EDBT/ICDT Workshops. 2012 pp. 47-54. doi:10.1145/2320765.2320788.
[28] Roukh A, Bellatreche L, Boukorca A, Bouarar S. Eco-DMW: Eco-Design Methodology for Data warehouses. In: DOLAP. 2015 pp. 1-10. doi:10.1145/2811222.2811230.
[29] Encyclopedia of Database Systems. chapter Data Quality Dimensions, pp. 612-615. Springer. ISBN 978-0-387-35544-3, 2009.
[30] Grodzevich O, Romanko O. Normalization and other topics in multi-objective optimization. In: FMIPW. 2006 pp. 42-56. http://www.maths-in-industry.org/miis/233/.
[31] Jovanovic P, Romero O, Simitsis A, Abelló A. Incremental Consolidation of Data-Intensive Multi-flows. In: TKDE. 2016. doi:10.1109/TKDE.2016.2515609.
[32] Agrawal S, Chaudhuri S, Kollár L, Marathe AP, Narasayya VR, Syamala M. Database Tuning Advisor for Microsoft SQL Server 2005. In: VLDB. 2004 pp. 1110-1121. ISBN 0-12-088469-0.
[33] Kalavri V, Shang H, Vlassov V. m2r2: A Framework for Results Materialization and Reuse in High-Level Dataflow Systems for Big Data. In: CSE. 2013 pp. 894-901. doi:10.1109/CSE.2013.134.
[34] Qu W, Dessloch S. A Real-time Materialized View Approach for Analytic Flows in Hybrid Cloud Environments. Datenbank-Spektrum, 2014. 14(2):97-106. doi:10.1007/s13222-014-0155-0.
[35] Roy P, Seshadri S, Sudarshan S, Bhobe S. Efficient and Extensible Algorithms for Multi Query Optimization. In: SIGMOD. 2000 pp. 249-260. doi:10.1145/342009.335419.
[36] Sellis TK. Multiple-Query Optimization. ACM Trans. Database Syst., 1988. 13(1):23-52. doi:10.1145/42201.42203.
[37] Färber F, Cha SK, Primsch J, Bornhövd C, Sigg S, Lehner W. SAP HANA database: data management for modern business applications. SIGMOD Record, 2011. 40(4):45-51. doi:10.1145/2094114.2094126.
[38] Raman V, Attaluri GK, Barber R, Chainani N, Kalmuk D, KulandaiSamy V, Leenstra J, Lightstone S, Liu S, Lohman GM, Malkemus T, Müller R, Pandis I, Schiefer B, Sharpe D, Sidle R, Storm AJ, Zhang L. DB2 with BLU Acceleration: So Much More than Just a Column Store. PVLDB, 2013. 6(11):1080-1091. doi:10.14778/2536222.2536233.
[39] DeWitt DJ, Halverson A, Nehme RV, Shankar S, Aguilar-Saborit J, Avanes A, Flasza M, Gramling J. Split query processing in polybase. In: SIGMOD. 2013 pp. 1255-1266. doi:10.1145/2463676.2463709.
[40] Idreos S, Alagiannis I, Johnson R, Ailamaki A. Here are my Data Files. Here are my Queries. Where are my Results? In: CIDR. 2011 pp. 57-68. http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf.
[41] Jindal A, Quiané-Ruiz J, Dittrich J. Trojan data layouts: right shoes for a running elephant. In: SOCC. 2011 p. 21. doi:10.1145/2038916.2038937.
[42] Kougka G, Gounaris A, Tsichlas K. Practical algorithms for execution engine selection in data flows. Future Generation Comp. Syst., 2015. 45:133-148. doi:10.1016/j.future.2014.11.011.
[43] Tziovara V, Vassiliadis P, Simitsis A. Deciding the physical implementation of ETL workflows. In: DOLAP. 2007 pp. 49-56. doi:10.1145/1317331.1317341.
[44] Tan W, Sun Y, Lu G, Tang A, Cui L. Trust Services-Oriented Multi-Objects Workflow Scheduling Model for Cloud Computing. In: ICPCA/SWS. 2012 pp. 617-630. doi:10.1007/978-3-642-37015-1_54.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-fa83db13-1c1c-4306-977c-aeb8ad2800ed