Data locality in Hadoop

Kałużka, J.; Napieralska, M.; Romero, O.; Jovanovic, P.

Abstract

The Apache Hadoop framework responds to the market's need to store and process rapidly growing amounts of data by providing fault-tolerant distributed storage and data processing. When dealing with large volumes of data, Hadoop and its storage system HDFS (Hadoop Distributed File System) face the challenge of remaining efficient while completing computations in a reasonable time. The typical Hadoop implementation transfers computation to the data. However, in the isolated configuration, the namenode (which plays the role of master in the cluster) still favours the closer nodes. In practice, this means that before the whole task has run, moving single blocks of data closer to the starting datanode can cause significant delays. Currently, a Hadoop user has no influence over how data is distributed across the cluster. This paper presents an innovative feature for the Hadoop Distributed File System (HDFS) that enables moving data blocks on request within the cluster. Data can be moved either by a user running the appropriate HDFS shell command or programmatically by other modules, such as an appropriate scheduler.
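
The idea of moving a block on request so that a later task finds it closer can be illustrated with a small toy model. The sketch below is not the paper's implementation and does not use any Hadoop API: the `ToyCluster` class, its `locality` and `move_block` methods, and the node and rack names are all hypothetical, and the locality labels ("node-local", "rack-local", "off-rack") are the terms commonly used for Hadoop locality levels rather than terms taken from the abstract.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy model (hypothetical): tracks which datanodes hold a replica of each block."""

    rack_of: dict[str, str]  # datanode name -> rack name
    replicas: dict[str, set[str]] = field(default_factory=dict)  # block id -> datanodes

    def locality(self, block: str, node: str) -> str:
        """Classify how close `node` is to the nearest replica of `block`."""
        holders = self.replicas[block]
        if node in holders:
            return "node-local"
        if any(self.rack_of[h] == self.rack_of[node] for h in holders):
            return "rack-local"
        return "off-rack"

    def move_block(self, block: str, src: str, dst: str) -> None:
        """Move one replica of `block` from `src` to `dst` (illustrative only)."""
        holders = self.replicas[block]
        if src not in holders:
            raise ValueError(f"{src} holds no replica of {block}")
        if dst in holders:
            raise ValueError(f"{dst} already holds a replica of {block}")
        holders.remove(src)
        holders.add(dst)


# Assumed example topology: two racks, one block stored only on dn1.
cluster = ToyCluster(
    rack_of={"dn1": "rackA", "dn2": "rackA", "dn3": "rackB"},
    replicas={"blk_1": {"dn1"}},
)

before = cluster.locality("blk_1", "dn3")   # task would run on dn3
cluster.move_block("blk_1", "dn1", "dn3")   # move the block ahead of the task
after = cluster.locality("blk_1", "dn3")
print(before, "->", after)
```

In this toy setting, a scheduler that knows where a task will run could call `move_block` in advance, so the transfer cost is paid before the task starts rather than during it.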

Keywords

distributed file system, big data, Apache Hadoop, HDFS

Publisher

Lodz University of Technology. Department of Microelectronics and Computer Science

Journal

International Journal of Microelectronics and Computer Science

Year

2017

Volume

Vol. 8, no. 1

Pages

16-20

Physical description

Bibliography: 14 items, colour illustrations.

Authors

author

Kałużka J.

justyna.kaluzka@skyscanner.net

Skyscanner Ltd, Barcelona, Spain

author

Napieralska M.

mnapier@dmcs.pl

Department of Microelectronics and Computer Science (DMCS), Lodz University of Technology, Łódź, Poland

author

Romero O.

oromero@essi.upc.edu

Departament d’Enginyeria de Serveis i Sistemes d’Informació (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain

author

Jovanovic P.

petar@essi.upc.edu

Departament d’Enginyeria de Serveis i Sistemes d’Informació (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain

References

[1] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, “Big data: The next frontier for innovation, competition, and productivity,” McKinsey Global Institute, Tech. Rep., June 2011.
[2] D. Laney, “3D data management: Controlling data volume, velocity, and variety,” META Group, Tech. Rep., February 2001. [Online]. Available: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3DData-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
[3] K. Normandeau, “Beyond volume, variety and velocity is the issue of big data veracity,” September 2013. [Online]. Available: http://insidebigdata.com/2013/09/12/beyond-volume-varietyvelocity-issue-big-data-veracity/
[4] B. Marr, “Why only one of the 5 vs of big data really matters,” March 2015. [Online]. Available: http://www.ibmbigdatahub.com/blog/whyonly-one-5-vs-big-data-really-matters
[5] Apache Software Foundation, “Apache hadoop.” [Online]. Available: http://hadoop.apache.org/docs/current/
[6] P. Jovanovic, O. Romero, T. Calders, and A. Abelló, “H-word: Supporting job scheduling in hadoop with workload-driven data redistribution,” 2016.
[7] B. Hedlund, “Understanding hadoop clusters and the network,” September 2011. [Online]. Available: http://bradhedlund.com/2011/09/10/understanding-hadoop-clustersand-the-network/
[8] “Apache hadoop (mapreduce) internals - diagrams,” http://ercoppa.github.io/HadoopInternals/.
[9] Hortonworks, Inc, “Hortonworks.” [Online]. Available: https://hortonworks.com
[10] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008. [Online]. Available: http://doi.acm.org/10.1145/1327452.1327492
[11] H. Herodotou, “Hadoop performance models,” CoRR, vol. abs/1106.0940, 2011. [Online]. Available: http://arxiv.org/abs/1106.0940
[12] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler, “Apache hadoop YARN: yet another resource negotiator,” ACM Symposium on Cloud Computing, SOCC ’13, Santa Clara, CA, USA, October 1-3, 2013, pp. 5:1-5:16, 2013. [Online]. Available: http://doi.acm.org/10.1145/2523616.2523633
[13] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The hadoop distributed file system,” 2010, pp. 1-10. [Online]. Available: http://dx.doi.org/10.1109/MSST.2010.5496972
[14] J. Kałużka, “Data locality in hadoop,” M.Sc. Thesis (Lodz University of Technology, Universitat Politècnica de Catalunya), October 2016.

Notes

Record prepared under agreement 509/P-DUN/2018, with funds from the Ministry of Science and Higher Education (MNiSW) allocated to science dissemination activities (2018).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-b6a34d58-c748-41f4-8a6e-d878e1c1f17d