Hybrid Analytic Flows : the Case for Optimization

Simitsis, A.; Wilkinson, K.; Dayal, U.

Article - details

Article title

Hybrid Analytic Flows : the Case for Optimization

Authors

Simitsis A. , Wilkinson K. , Dayal U.

Abstract

To remain competitive, enterprises are evolving in order to quickly respond to changing market conditions and customer needs. In this new environment, a single centralized data warehouse is no longer sufficient. Next generation business intelligence involves data flows that span multiple, diverse processing engines, that contain complex functionality like data/text analytics, machine learning operations, and that need to be optimized against various objectives. A common example is the use of Hadoop to analyze unstructured text and merging these results with relational database queries over the data warehouse. We refer to these multi-engine analytic data flows as hybrid flows. Currently, it is a cumbersome task to create and run hybrid flows. Custom scripts must be written to dispatch tasks to the individual processing engines and to exchange intermediate results. So, designing correct hybrid flows is a challenging task. Optimizing such flows is even harder. Additionally, when the underlying computing infrastructure changes, existing flows likely need modification and reoptimization. The current, ad-hoc design approach cannot scale as hybrid flows become more commonplace. To address this challenge, we are building a platform to design and manage hybrid flows. It supports the logical design of hybrid flows in which implementation details are not exposed. It generates code for the underlying processing engines and orchestrates their execution. But the key enabling technology in the platform is an optimizer that converts the logical flow to an executable form that is optimized for the underlying infrastructure according to user-specified objectives. In this paper, we describe challenges in designing the optimizer and our solutions. We illustrate the optimizer through a real-world use case. We present a logical design and optimized designs for the use case. We show how the performance of the use case varies depending on the system configuration and how the optimizer is able to generate different optimized flows for different configurations.
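
As a rough illustration of the idea in the abstract (one logical flow, different executable plans depending on the configuration), the following minimal sketch assigns each operator of a linear flow to an engine so that total cost is minimized. It is not the authors' optimizer: the operator names, engine names, cost numbers, the single `transfer_cost` parameter, and the dynamic-programming formulation are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# A linear logical flow: (operator name, {engine: execution cost}).
# All names and numbers below are hypothetical, for illustration only.
Flow = List[Tuple[str, Dict[str, float]]]


def assign_engines(flow: Flow, transfer_cost: float) -> Tuple[float, List[str]]:
    """Pick one engine per operator, minimizing total cost.

    Total cost = sum of operator costs on their chosen engines, plus
    `transfer_cost` for every pair of adjacent operators that run on
    different engines. Solved with dynamic programming over the chain.
    """
    # best[engine] = (cheapest cost of a plan ending on `engine`, that plan)
    _, first_costs = flow[0]
    best = {engine: (cost, [engine]) for engine, cost in first_costs.items()}

    for _, costs in flow[1:]:
        nxt = {}
        for engine, cost in costs.items():
            candidates = [
                (
                    prev_cost + (0.0 if prev_engine == engine else transfer_cost) + cost,
                    plan + [engine],
                )
                for prev_engine, (prev_cost, plan) in best.items()
            ]
            nxt[engine] = min(candidates, key=lambda t: t[0])
        best = nxt

    return min(best.values(), key=lambda t: t[0])


# Hypothetical flow: text analysis is cheap on Hadoop, relational work on the RDBMS.
flow: Flow = [
    ("extract_text", {"hadoop": 4.0, "rdbms": 20.0}),
    ("sentiment", {"hadoop": 6.0, "rdbms": 30.0}),
    ("join_sales", {"hadoop": 15.0, "rdbms": 3.0}),
    ("aggregate", {"hadoop": 8.0, "rdbms": 2.0}),
]

cheap_link = assign_engines(flow, transfer_cost=5.0)
costly_link = assign_engines(flow, transfer_cost=100.0)
print(cheap_link)   # hybrid plan: Hadoop for text, RDBMS for the join and aggregate
print(costly_link)  # single-engine plan: moving data costs more than it saves
```

With a cheap link between engines the sketch splits the flow across both engines; when moving intermediate results is expensive it keeps everything on one engine. This mirrors, in a very simplified form, the abstract's point that different configurations lead to different optimized flows.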

Keywords

data flows, databases, map-reduce, optimization

Publisher

IOS Press

Journal

Fundamenta Informaticae

Year

2013

Volume

Vol. 128, no. 3

Pages

303–335

Physical description

Bibliography: 34 items.

Contributors

author

Simitsis A.

alkis@hp.com

HP Labs, Palo Alto, CA 94304, USA

author

Wilkinson K.

kevin.wilkinson@hp.com

HP Labs, Palo Alto, CA 94304, USA

author

Dayal U.

umeshwar.dayal@hp.com

HP Labs, Palo Alto, CA 94304, USA

Bibliography

[1] Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D. J., Rasin, A., Silberschatz, A.: HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads, PVLDB, 2(1), 2009, 922–933.
[2] Acher, M., Collet, P., Gaignard, A., Lahire, P., Montagnat, J., France, R. B.: Composing multiple variability artifacts to assemble coherent workflows, Software Quality Journal, 20(3-4), 2012, 689–734.
[3] Battré, D., Ewen, S., Hueske, F., Kao, O., Markl, V., Warneke, D.: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing, SoCC, 2010.
[4] Beyer, K. S., Ercegovac, V., Gemulla, R., Balmin, A., Eltabakh, M. Y., Kanne, C.-C., Özcan, F., Shekita, E. J.: Jaql: A Scripting Language for Large Scale Semistructured Data Analysis, PVLDB, 4(12), 2011, 1272–1283.
[5] Briand, L. C., Morasca, S., Basili, V. R.: Property-Based Software Engineering Measurement, IEEE Trans. Software Eng., 22(1), 1996, 68–86.
[6] Dayal, U.: Processing Queries Over Generalization Hierarchies in a Multidatabase System, VLDB, 1983.
[7] Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing), PVLDB, 3(1), 2010, 518–529.
[8] Du, W., Krishnamurthy, R., Shan, M.-C.: Query Optimization in a Heterogeneous DBMS, VLDB, 1992.
[9] Ewen, S., Ortega-Binderberger, M., Markl, V.: A Learning Optimizer for a Federated Database Management System, BTW, 2005.
[10] Gardarin, G., Sha, F., Tang, Z.-H.: Calibrating the Query Optimizer Cost Model of IRO-DB, an Object-Oriented Federated Database System, VLDB, 1996.
[11] Gupta, C., Mehta, A., Dayal, U.: PQR: Predicting Query Execution Times for Autonomous Workload Management, ICAC, 2008.
[12] Han, W.-S., Kwak, W., Lee, J., Lohman, G. M., Markl, V.: Parallelizing query optimization, PVLDB, 1(1), 2008, 188–200.
[13] Informatica: PowerCenter Pushdown Optimization Option Datasheet, 2011.
[14] Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks, EuroSys, 2007.
[15] Jiang, D., Ooi, B. C., Shi, L., Wu, S.: The Performance of MapReduce: An In-depth Study, PVLDB, 3(1), 2010, 472–483.
[16] Kllapi, H., Sitaridi, E., Tsangaris, M. M., Ioannidis, Y. E.: Schedule optimization for data processing flows on the cloud, SIGMOD Conference, 2011.
[17] Lohman, G. M., Mohan, C., Haas, L. M., Daniels, D., Lindsay, B. G., Selinger, P. G., Wilms, P. F.: Query Processing in R*, in: Query Processing in Database Systems, Springer, 1985, 31–47.
[18] Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M. B., Lee, E. A., Tao, J., Zhao, Y.: Scientific workflow management and the Kepler system, Concurrency and Computation: Practice and Experience, 18(10), 2006, 1039–1065.
[19] Murray, D. G., Schwarzkopf, M., Smowton, C., Smith, S., Madhavapeddy, A., Hand, S.: CIEL: A Universal Execution Engine for Distributed Data-flow Computing, USENIX NSDI, 2011.
[20] Naacke, H., Tomasic, A., Valduriez, P.: Validating Mediator Cost Models with DISCO, Networking and Information Systems Journal, 2(5), 2000.
[21] Ogasawara, E. S., Paulino, C. E., Murta, L. G. P., Werner, C., Mattoso, M.: Experiment Line: Software Reuse in Scientific Workflows, SSDBM, 2009.
[22] Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing, SIGMOD Conference, 2008.
[23] Roth, M. T., Arya, M., Haas, L. M., Carey, M. J., Cody, W. F., Fagin, R., Schwarz, P. M., Thomas II, J., Wimmers, E. L.: The Garlic Project, SIGMOD Conference, 1996.
[24] Sellis, T. K.: Multiple-Query Optimization, ACM Trans. Database Syst., 13(1), 1988, 23–52.
[25] Simitsis, A., Vassiliadis, P., Sellis, T. K.: Optimizing ETL Processes in Data Warehouses, ICDE, 2005.
[26] Simitsis, A., Wilkinson, K.: Revisiting ETL Benchmarking: The Case for Hybrid Flows, TPCTC, 2012.
[27] Simitsis, A., Wilkinson, K., Castellanos, M., Dayal, U.: QoX-driven ETL design: reducing the cost of ETL consulting engagements, SIGMOD Conference, 2009.
[28] Simitsis, A., Wilkinson, K., Castellanos, M., Dayal, U.: Optimizing analytic data flows for multiple execution engines, SIGMOD Conference, 2012.
[29] Simitsis, A., Wilkinson, K., Dayal, U., Castellanos, M.: Optimizing ETL workflows for fault-tolerance, ICDE, 2010.
[30] Simitsis, A., Wilkinson, K., Dayal, U., Hsu, M.: HFMS: Managing the Lifecycle and Complexity of Hybrid Analytic Data Flows, ICDE, 2013.
[31] Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Anthony, S., Liu, H., Murthy, R.: Hive - a petabyte scale data warehouse using Hadoop, ICDE, 2010.
[32] Vassiliadis, P., Simitsis, A., Terrovitis, M., Skiadopoulos, S.: Blueprints and Measures for ETL Workflows, ER, 2005.
[33] Verma, A., Cherkasova, L., Campbell, R. H.: ARIA: automatic resource inference and allocation for mapreduce environments, ICAC, 2011.
[34] Verma, A., Cherkasova, L., Campbell, R. H.: Resource Provisioning Framework for MapReduce Jobs with Performance Goals, Middleware, 2011.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-20bfc14c-33e9-4c53-9e34-4a451c950a9b