Scaling evolutionary programming with the use of Apache Spark

Funika, W.; Koperek, P.

doi:10.7494/csci.2016.17.1.69

Abstract

Organizations across the globe gather more and more data, encouraged by easy-to-use and cheap cloud storage services. Large datasets require new approaches to analysis and processing, including methods based on machine learning. In particular, symbolic regression can provide many useful insights. Unfortunately, due to its high resource requirements, using this method for large-scale dataset analysis may be infeasible. In this paper, we analyze a bottleneck in hubert, an open-source implementation of this method. We identify the evaluation of individuals as the most costly operation. As a solution to this problem, we propose a new evaluation service based on the Apache Spark framework, which attempts to speed up computations by executing them in a distributed manner on a cluster of machines. We analyze the performance of the service by comparing the evaluation execution times of a number of samples under both implementations. Finally, we draw conclusions and outline plans for further research.

Keywords

distributed systems; evolutionary programming; symbolic regression; scaling; Apache Spark

Publisher

Wydawnictwa AGH

Journal

Computer Science

Year

2016

Volume

Vol. 17 (1)

Pages

69–82

Physical description

Bibliography: 20 items; figures, charts, tables.

Authors

author

Funika W.

funika@agh.edu.pl

AGH University of Science and Technology, ACC CYFRONET AGH, Krakow, Poland

author

Koperek P.

pkoperek@gmail.com

AGH University of Science and Technology

References

1. Amazon.com, Inc.: AWS Amazon Elastic Compute Cloud (EC2) – Scalable Cloud Hosting. http://aws.amazon.com/ec2, 2014, accessed 2.12.2014.
2. Apache Software Foundation: Welcome to Apache Hadoop! http://hadoop.apache.org/, 2014, accessed 11.11.2014.
3. Baldeschwieler E.: Yahoo! Launches Worlds Largest Hadoop Production Application. https://goo.gl/wrOZ2v, 2008, accessed 11.11.2014.
4. Du X., Ni Y., Yao Z., Xiao R., Xie D.: High performance parallel evolutionary algorithm model based on MapReduce framework. International Journal of Computer Applications in Technology, vol. 46(3), pp. 290–295, 2013.
5. Evans J., Rzhetsky A.: Machine Science. Science, vol. 329, pp. 399–400, 2010.
6. Fernández F., Sánchez J.M., Tomassini M., Gómez J.A.: A Parallel Genetic Programming Tool Based on PVM. In: J. Dongarra, E. Luque, T. Margalef, eds., Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science, vol. 1697, pp. 241–248, Springer, Berlin–Heidelberg, 1999.
7. Funika W., Godowski P., Pegiel P., Król D.: Semantic-Oriented Performance Monitoring of Distributed Applications. Computing and Informatics, vol. 31(2), pp. 427–446, 2012.
8. Funika W., Koperek P.: Genetic Programming in Automatic Discovery of Relationships in Computer System Monitoring Data. In: Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science, vol. 8384, pp. 371–380, Springer, Berlin–Heidelberg, 2014.
9. Funika W., Kupisz M., Koperek P.: Towards Autonomic Semantic-Based Management of Distributed Applications. Computer Science, vol. 11, pp. 51–64, 2010.
10. Hindman B., Konwinski A., Zaharia M., Ghodsi A., Joseph A.D., Katz R., Shenker S., Stoica I.: Mesos: A Platform for Fine-grained Resource Sharing in the Data Center. In: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI’11, pp. 295–308, USENIX Association, Berkeley, CA, USA, 2011.
11. hubert: project source code. https://github.com/pkoperek/hubert, 2015, accessed 15.02.2015.
12. King R.D., Rowland J., Oliver S.G., et al.: The Automation of Science. Science, vol. 324, pp. 85–89, 2009.
13. Koza J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA, 1992.
14. Ryan A.: Under the Hood: Hadoop Distributed Filesystem reliability with Namenode and Avatarnode. https://goo.gl/ifnqx, 2012, accessed 11.11.2014.
15. Salhi A., Glaser H., De Roure D.: Parallel Implementation of a Genetic-programming Based Tool for Symbolic Regression. Inf. Process. Lett., vol. 66(6), pp. 299–307, 1998.
16. Schmidt M., Lipson H.: Distilling free-form natural laws from experimental data. Science, vol. 324, pp. 81–85, 2009.
17. Schmidt M.D., Lipson H.: Data-Mining Dynamical Systems: Automated Symbolic System Identification for Exploratory Analysis. ASME Conference Proceedings, vol. 2008(48364), pp. 643–649, 2008.
18. Schmidt M.D., Lipson H.: Age-fitness pareto optimization. In: M. Pelikan, J. Branke, eds., GECCO, pp. 543–544, ACM, 2010.
19. Schwarzkopf M., Konwinski A., Abd-El-Malek M., Wilkes J.: Omega: flexible, scalable schedulers for large compute clusters. In: SIGOPS European Conference on Computer Systems (EuroSys), pp. 351–364, Prague, Czech Republic, 2013.
20. Zaharia M., Chowdhury M., Das T., Dave A., Ma J., McCauley M., Franklin M.J., Shenker S., Stoica I.: Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, pp. 2–2, USENIX Association, Berkeley, CA, USA, 2012.

Notes

Prepared with funds from the Ministry of Science and Higher Education (MNiSW) under agreement 812/P-DUN/2016 for science dissemination activities.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-9f0bde84-2b94-491b-8aac-5463fb7296ea