Article title

Intermediate Results Materialization Selection and Format for Data-Intensive Flows

Publication languages
EN
Abstracts
EN
Data-intensive flows deploy a variety of complex data transformations to build information pipelines from data sources to different end users. As data are processed, these workflows generate large intermediate results, typically pipelined from one operator to the following ones. Materializing intermediate results, shared among multiple flows, brings benefits not only in terms of performance but also in resource usage and consistency. Similar ideas have been proposed in the context of data warehouses, which are studied under the materialized view selection problem. With the rise of Big Data systems, new challenges emerge due to new quality metrics captured by service level agreements which must be taken into account. Moreover, the way such results are stored must be reconsidered, as different data layouts can be used to reduce the I/O cost. In this paper, we propose a novel approach for automatic selection of multi-objective materialization of intermediate results in data-intensive flows, which can tackle multiple and conflicting quality objectives. In addition, our approach chooses the optimal storage data format for selected materialized intermediate results based on subsequent access patterns. The experimental results show that our approach provides 40% better average speedup with respect to the current state-of-the-art, as well as an improvement on disk access time of 18% as compared to fixed format solutions.
Pages
111--138
Physical description
Bibliography: 44 items, figures, tables, charts.
Authors
author
  • Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
author
  • Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
author
  • Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
author
  • Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
author
  • Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
author
  • Technische Universität Dresden (TUD), Dresden, Germany
author
  • Technische Universität Dresden (TUD), Dresden, Germany
Bibliography
  • [1] Jovanovic P, Romero O, Abelló A. A Unified View of Data-Intensive Flows in Business Intelligence Systems: A Survey. T. Large-Scale Data- and Knowledge-Centered Systems, 2016. 29:66-107. doi:10.1007/978-3-662-54037-4_3.
  • [2] Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun. ACM, 2008. 51(1):107-113. doi:10.1145/1327452.1327492.
  • [3] Chen Y, Alspaugh S, Katz RH. Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads. PVLDB, 2012. 5(12):1802-1813. doi:10.14778/2367502.2367519.
  • [4] Harinarayan V, Rajaraman A, Ullman JD. Implementing Data Cubes Efficiently. In: SIGMOD. 1996 pp. 205-216. doi:10.1145/233269.233333.
  • [5] Gupta H, Mumick IS. Selection of Views to Materialize in a Data Warehouse. IEEE Trans. Knowl. Data Eng., 2005. 17(1):24-43. doi:10.1109/TKDE.2005.16.
  • [6] Alagiannis I, Idreos S, Ailamaki A. H2O: a hands-free adaptive store. In: SIGMOD. 2014 pp. 1103-1114. doi:10.1145/2588555.2610502.
  • [7] Munir RF, Romero O, Abelló A, Bilalli B, Thiele M, Lehner W. ResilientStore: A Heuristic-Based Data Format Selector for Intermediate Results. In: MEDI. 2016 pp. 42-56. doi:10.1007/978-3-319-45547-1_4.
  • [8] Azim T, Karpathiotakis M, Ailamaki A. ReCache: Reactive Caching for Fast Analytics over Heterogeneous Data. PVLDB, 2017. 11(3):324-337. doi:10.14778/3157794.3157801.
  • [9] Nykiel T, Potamias M, Mishra C, Kollios G, Koudas N. MRShare: Sharing Across Multiple Queries in MapReduce. PVLDB, 2010. 3(1):494-505. doi:10.14778/1920841.1920906.
  • [10] Elghandour I, Aboulnaga A. ReStore: Reusing Results of MapReduce Jobs. PVLDB, 2012. 5(6):586-597. doi:10.14778/2168651.2168659.
  • [11] Wang G, Chan C. Multi-Query Optimization in MapReduce Framework. PVLDB, 2013. 7(3):145-156. doi:10.14778/2732232.2732234.
  • [12] Simitsis A, Wilkinson K, Castellanos M, Dayal U. QoX-driven ETL design: reducing the cost of ETL consulting engagements. In: SIGMOD. 2009 pp. 953-960. doi:10.1145/1559845.1559954.
  • [13] Halevy AY. Answering queries using views: A survey. VLDB J., 2001. 10(4):270-294. doi:10.1007/s007780100054.
  • [14] Bian H, Yan Y, Tao W, Chen LJ, Chen Y, Du X, Moscibroda T. Wide Table Layout Optimization based on Column Ordering and Duplication. In: SIGMOD. 2017 pp. 299-314. doi:10.1145/3035918.3035930.
  • [15] Theodoratos D, Bouzeghoub M. A General Framework for the View Selection Problem for Data Warehouse Design and Evolution. In: DOLAP. 2000 pp. 1-8. doi:10.1145/355068.355309.
  • [16] Theodoratos D, Sellis TK. Dynamic Data Warehouse Design. In: DaWaK. 1999 pp. 1-10. doi:10.1007/3-540-48298-9_1.
  • [17] Theodoratos D, Sellis TK. Data Warehouse Configuration. In: VLDB. 1997 pp. 126-135. http://www.vldb.org/conf/1997/P126.PDF.
  • [18] Jovanovic P, Simitsis A, Wilkinson K. Engine independence for logical analytic flows. In: ICDE. 2014 pp. 1060-1071. doi:10.1109/ICDE.2014.6816723.
  • [19] Jovanovic P, Romero O, Simitsis A, Abelló A. Incremental Consolidation of Data-Intensive Multi-Flows. IEEE Trans. Knowl. Data Eng., 2016. 28(5):1203-1216. doi:10.1109/TKDE.2016.2515609.
  • [20] Marler RT, Arora JS. Survey of multi-objective optimization methods for engineering. Structural and multidisciplinary optimization, 2004. 26(6):369-395. doi:10.1007/s00158-003-0368-6.
  • [21] Halasipuram R, Deshpande PM, Padmanabhan S. Determining Essential Statistics for Cost Based Optimization of an ETL Workflow. In: EDBT. 2014 pp. 307-318. doi:10.5441/002/edbt.2014.29.
  • [22] Garcia-Molina H, Ullman JD, Widom J. Database systems - the complete book (2. ed.). Pearson Education, 2009. ISBN 978-0-13-187325-4.
  • [23] Aggarwal A, Vitter JS. The Input/Output Complexity of Sorting and Related Problems. Commun. ACM, 1988. 31(9):1116-1127. doi:10.1145/48529.48535.
  • [24] Afrati FN, Ullman JD. Optimizing Multiway Joins in a Map-Reduce Environment. IEEE Trans. Knowl. Data Eng., 2011. 23(9):1282-1298. doi:10.1109/TKDE.2011.47.
  • [25] Hueske F, Peters M, Sax M, Rheinländer A, Bergmann R, Krettek A, Tzoumas K. Opening the Black Boxes in Data Flow Optimization. PVLDB, 2012. 5(11):1256-1267. doi:10.14778/2350229.2350244.
  • [26] Simitsis A, Wilkinson K. Revisiting ETL Benchmarking: The Case for Hybrid Flows. In: TPCTC. 2012 pp. 75-91. doi:10.1007/978-3-642-36727-4_6.
  • [27] Nguyen T, Bimonte S, d’Orazio L, Darmont J. Cost models for view materialization in the cloud. In: EDBT/ICDT Workshops. 2012 pp. 47-54. doi:10.1145/2320765.2320788.
  • [28] Roukh A, Bellatreche L, Boukorca A, Bouarar S. Eco-DMW: Eco-Design Methodology for Data warehouses. In: DOLAP. 2015 pp. 1-10. doi:10.1145/2811222.2811230.
  • [29] Data Quality Dimensions. In: Encyclopedia of Database Systems. Springer, 2009 pp. 612-615. ISBN 978-0-387-35544-3.
  • [30] Grodzevich O, Romanko O. Normalization and other topics in multi-objective optimization. In: FMIPW. 2006 pp. 42-56. http://www.maths-in-industry.org/miis/233/.
  • [31] Jovanovic P, Romero O, Simitsis A, Abelló A. Incremental Consolidation of Data-Intensive Multi-flows. In: TKDE. 2016. doi:10.1109/TKDE.2016.2515609.
  • [32] Agrawal S, Chaudhuri S, Kollár L, Marathe AP, Narasayya VR, Syamala M. Database Tuning Advisor for Microsoft SQL Server 2005. In: VLDB. 2004 pp. 1110-1121. ISBN 0-12-088469-0.
  • [33] Kalavri V, Shang H, Vlassov V. m2r2: A Framework for Results Materialization and Reuse in High-Level Dataflow Systems for Big Data. In: CSE. 2013 pp. 894-901. doi:10.1109/CSE.2013.134.
  • [34] Qu W, Dessloch S. A Real-time Materialized View Approach for Analytic Flows in Hybrid Cloud Environments. Datenbank-Spektrum, 2014. 14(2):97-106. doi:10.1007/s13222-014-0155-0.
  • [35] Roy P, Seshadri S, Sudarshan S, Bhobe S. Efficient and Extensible Algorithms for Multi Query Optimization. In: SIGMOD. 2000 pp. 249-260. doi:10.1145/342009.335419.
  • [36] Sellis TK. Multiple-Query Optimization. ACM Trans. Database Syst., 1988. 13(1):23-52. doi:10.1145/42201.42203.
  • [37] Färber F, Cha SK, Primsch J, Bornhövd C, Sigg S, Lehner W. SAP HANA database: data management for modern business applications. SIGMOD Record, 2011. 40(4):45-51. doi:10.1145/2094114.2094126.
  • [38] Raman V, Attaluri GK, Barber R, Chainani N, Kalmuk D, KulandaiSamy V, Leenstra J, Lightstone S, Liu S, Lohman GM, Malkemus T, Müller R, Pandis I, Schiefer B, Sharpe D, Sidle R, Storm AJ, Zhang L. DB2 with BLU Acceleration: So Much More than Just a Column Store. PVLDB, 2013. 6(11):1080-1091. doi:10.14778/2536222.2536233.
  • [39] DeWitt DJ, Halverson A, Nehme RV, Shankar S, Aguilar-Saborit J, Avanes A, Flasza M, Gramling J. Split query processing in polybase. In: SIGMOD. 2013 pp. 1255-1266. doi:10.1145/2463676.2463709.
  • [40] Idreos S, Alagiannis I, Johnson R, Ailamaki A. Here are my Data Files. Here are my Queries. Where are my Results? In: CIDR. 2011 pp. 57-68. http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf.
  • [41] Jindal A, Quiané-Ruiz J, Dittrich J. Trojan data layouts: right shoes for a running elephant. In: SOCC. 2011 p. 21. doi:10.1145/2038916.2038937.
  • [42] Kougka G, Gounaris A, Tsichlas K. Practical algorithms for execution engine selection in data flows. Future Generation Comp. Syst., 2015. 45:133-148. doi:10.1016/j.future.2014.11.011.
  • [43] Tziovara V, Vassiliadis P, Simitsis A. Deciding the physical implementation of ETL workflows. In: DOLAP. 2007 pp. 49-56. doi:10.1145/1317331.1317341.
  • [44] Tan W, Sun Y, Lu G, Tang A, Cui L. Trust Services-Oriented Multi-Objects Workflow Scheduling Model for Cloud Computing. In: ICPCA/SWS. 2012 pp. 617-630. doi:10.1007/978-3-642-37015-1_54.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-fa83db13-1c1c-4306-977c-aeb8ad2800ed