Identifiers
Title variants
Publication languages
Abstracts
Organizations across the globe gather more and more data, encouraged by cheap and easy-to-use cloud storage services. Large datasets require new approaches to analysis and processing, including methods based on machine learning. In particular, symbolic regression can provide many useful insights. Unfortunately, its high resource requirements can make this method infeasible for large-scale dataset analysis. In this paper, we analyze a bottleneck in an open-source implementation of this method, which we call hubert. We identify the evaluation of individuals as the most costly operation. To address this problem, we propose a new evaluation service based on the Apache Spark framework, which attempts to speed up computations by executing them in a distributed manner on a cluster of machines. We analyze the performance of the service by comparing the evaluation execution times of a number of samples under both implementations. Finally, we draw conclusions and outline plans for further research.
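The idea of evaluating an individual in a distributed manner can be illustrated with a minimal sketch. It is not the hubert code or the Spark-based service from the paper: a local thread pool stands in for the cluster, mean squared error stands in for whatever fitness measure the service uses, and all names (`split`, `partial_error`, `evaluate`, the `(x, y)` row layout) are hypothetical. Each partition of the dataset is scored independently (map step), and the partial results are combined into one fitness value (reduce step).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Row = Tuple[float, float]               # (x, y) observation; assumed data layout
Individual = Callable[[float], float]   # candidate expression y = f(x)


def split(rows: Sequence[Row], parts: int) -> List[Sequence[Row]]:
    """Cut the dataset into roughly equal partitions."""
    size = max(1, -(-len(rows) // parts))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_error(individual: Individual, partition: Sequence[Row]) -> Tuple[float, int]:
    """Map step: sum of squared errors and row count for one partition."""
    sse = sum((individual(x) - y) ** 2 for x, y in partition)
    return sse, len(partition)


def evaluate(individual: Individual, rows: Sequence[Row], parts: int = 4) -> float:
    """Reduce step: combine per-partition results into a mean squared error."""
    if not rows:
        raise ValueError("cannot evaluate an individual on an empty dataset")
    partitions = split(rows, parts)
    # The thread pool is only a local stand-in for workers on a cluster.
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = list(pool.map(lambda p: partial_error(individual, p), partitions))
    total_sse = sum(s for s, _ in results)
    total_n = sum(n for _, n in results)
    return total_sse / total_n


# Synthetic data generated from y = 2x + 1, plus two candidate individuals.
rows = [(float(x), 2.0 * x + 1.0) for x in range(100)]
exact = lambda x: 2.0 * x + 1.0
off_by_one = lambda x: 2.0 * x
```

Calling `evaluate(exact, rows)` scores the candidate that matches the generating expression, and `evaluate(off_by_one, rows)` scores one that misses every point by a constant. Because each partition returns only a sum and a count, the amount of data sent back to the coordinator stays small regardless of dataset size.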
Publisher
Journal
Year
Volume
Pages
69--82
Physical description
Bibliography: 20 items, figures, charts, tables.
Authors
author
- AGH University of Science and Technology, ACC CYFRONET AGH, Krakow, Poland
author
- AGH University of Science and Technology
Bibliography
- 1. Amazon.com, Inc.: AWS Amazon Elastic Compute Cloud (EC2) – Scalable Cloud Hosting. http://aws.amazon.com/ec2, 2014, accessed 2.12.2014.
- 2. Apache Software Foundation: Welcome to Apache Hadoop! http://hadoop.apache.org/, 2014, accessed 11.11.2014.
- 3. Baldeschwieler E.: Yahoo! Launches World's Largest Hadoop Production Application. https://goo.gl/wrOZ2v, 2008, accessed 11.11.2014.
- 4. Du X., Ni Y., Yao Z., Xiao R., Xie D.: High performance parallel evolutionary algorithm model based on MapReduce framework. International Journal of Computer Applications in Technology, vol. 46(3), pp. 290–295, 2013.
- 5. Evans J., Rzhetsky A.: Machine Science. Science, vol. 329, pp. 399–400, 2010.
- 6. Fernández F., Sánchez J.M., Tomassini M., Gómez J.A.: A Parallel Genetic Programming Tool Based on PVM. In: J. Dongarra, E. Luque, T. Margalef, eds., Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science, vol. 1697, pp. 241–248, Springer, Berlin–Heidelberg, 1999.
- 7. Funika W., Godowski P., Pegiel P., Król D.: Semantic-Oriented Performance Monitoring of Distributed Applications. Computing and Informatics, vol. 31(2), pp. 427–446, 2012.
- 8. Funika W., Koperek P.: Genetic Programming in Automatic Discovery of Relationships in Computer System Monitoring Data. In: Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science, vol. 8384, pp. 371–380, Springer, Berlin–Heidelberg, 2014.
- 9. Funika W., Kupisz M., Koperek P.: Towards Autonomic Semantic-Based Management of Distributed Applications. Computer Science, vol. 11, pp. 51–64, 2010.
- 10. Hindman B., Konwinski A., Zaharia M., Ghodsi A., Joseph A.D., Katz R., Shenker S., Stoica I.: Mesos: A Platform for Fine-grained Resource Sharing in the Data Center. In: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI’11, pp. 295–308, USENIX Association, Berkeley, CA, USA, 2011.
- 11. hubert: project source code. https://github.com/pkoperek/hubert, 2015, accessed 15.02.2015.
- 12. King R.D., Rowland J., Oliver S.G., et al.: The Automation of Science. Science, vol. 324, pp. 85–89, 2009.
- 13. Koza J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA, 1992.
- 14. Ryan A.: Under the Hood: Hadoop Distributed Filesystem reliability with Namenode and Avatarnode. https://goo.gl/ifnqx, 2012, accessed 11.11.2014.
- 15. Salhi A., Glaser H., De Roure D.: Parallel Implementation of a Genetic-programming Based Tool for Symbolic Regression. Inf. Process. Lett., vol. 66(6), pp. 299–307, 1998.
- 16. Schmidt M., Lipson H.: Distilling free-form natural laws from experimental data. Science, vol. 324, pp. 81–85, 2009.
- 17. Schmidt M.D., Lipson H.: Data-Mining Dynamical Systems: Automated Symbolic System Identification for Exploratory Analysis. ASME Conference Proceedings, vol. 2008(48364), pp. 643–649, 2008.
- 18. Schmidt M.D., Lipson H.: Age-fitness pareto optimization. In: M. Pelikan, J. Branke, eds., GECCO, pp. 543–544, ACM, 2010.
- 19. Schwarzkopf M., Konwinski A., Abd-El-Malek M., Wilkes J.: Omega: flexible, scalable schedulers for large compute clusters. In: SIGOPS European Conference on Computer Systems (EuroSys), pp. 351–364, Prague, Czech Republic, 2013.
- 20. Zaharia M., Chowdhury M., Das T., Dave A., Ma J., McCauley M., Franklin M.J., Shenker S., Stoica I.: Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, pp. 2–2, USENIX Association, Berkeley, CA, USA, 2012.
Notes
PL
Prepared with funds from the Ministry of Science and Higher Education (MNiSW) under agreement 812/P-DUN/2016 for science dissemination activities.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-9f0bde84-2b94-491b-8aac-5463fb7296ea