Article title

Data locality in Hadoop

Publication languages
EN
Abstracts
EN
The Apache Hadoop framework responds to the market's need to store and process rapidly growing amounts of data by providing fault-tolerant distributed storage and data processing. When dealing with large volumes of data, Hadoop and its storage system, HDFS (Hadoop Distributed File System), face the challenge of keeping computation efficient and completing it in a reasonable time. The typical Hadoop implementation moves computation to the data. However, in the isolated configuration, the namenode (which plays the role of master in the cluster) still favours the closer nodes. In practice, this means that before the whole task has run, significant delays can be caused by moving single blocks of data closer to the starting datanode. Currently, a Hadoop user has no influence over how the data is distributed across the cluster. This paper presents an innovative functionality for HDFS that enables moving data blocks on request within the cluster. Data can be moved either by a user running the appropriate HDFS shell command or programmatically by other modules, such as a suitable scheduler.
Authors
author
  • Skyscanner Ltd, Barcelona, Spain
  • Department of Microelectronics and Computer Science (DMCS), Lodz University of Technology, Łódź, Poland
author
  • Departament d’Enginyeria de Serveis i Sistemes d’Informació (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain
author
  • Departament d’Enginyeria de Serveis i Sistemes d’Informació (ESSI), Universitat Politècnica de Catalunya, Barcelona, Spain
Bibliography
  • [1] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, “Big data: The next frontier for innovation, competition, and productivity,” McKinsey Global Institute, Tech. Rep., June 2011.
  • [2] D. Laney, “3D data management: Controlling data volume, velocity, and variety,” META Group, Tech. Rep., February 2001. [Online]. Available: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3DData-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
  • [3] K. Normandeau, “Beyond volume, variety and velocity is the issue of big data veracity,” September 2013. [Online]. Available: http://insidebigdata.com/2013/09/12/beyond-volume-varietyvelocity-issue-big-data-veracity/
  • [4] B. Marr, “Why only one of the 5 vs of big data really matters,” March 2015. [Online]. Available: http://www.ibmbigdatahub.com/blog/whyonly-one-5-vs-big-data-really-matters
  • [5] Apache Software Foundation, “Apache hadoop.” [Online]. Available: http://hadoop.apache.org/docs/current/
  • [6] P. Jovanovic, O. Romero, T. Calders, and A. Abelló, “H-word: Supporting job scheduling in hadoop with workload-driven data redistribution,” 2016.
  • [7] B. Hedlund, “Understanding hadoop clusters and the network,” September 2011. [Online]. Available: http://bradhedlund.com/2011/09/10/understanding-hadoop-clustersand-the-network/
  • [8] “Apache hadoop (mapreduce) internals - diagrams,” http://ercoppa.github.io/HadoopInternals/.
  • [9] Hortonworks, Inc, “Hortonworks.” [Online]. Available: https://hortonworks.com
  • [10] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008. [Online]. Available: http://doi.acm.org/10.1145/1327452.1327492
  • [11] H. Herodotou, “Hadoop performance models,” CoRR, vol. abs/1106.0940, 2011. [Online]. Available: http://arxiv.org/abs/1106.0940
  • [12] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler, “Apache hadoop YARN: yet another resource negotiator,” ACM Symposium on Cloud Computing, SOCC ’13, Santa Clara, CA, USA, October 1-3, 2013, pp. 5:1-5:16, 2013. [Online]. Available: http://doi.acm.org/10.1145/2523616.2523633
  • [13] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The hadoop distributed file system,” 2010, pp. 1-10. [Online]. Available: http://dx.doi.org/10.1109/MSST.2010.5496972
  • [14] J. Kałużka, “Data locality in hadoop,” M.Sc. Thesis (Lodz University of Technology, Universitat Politècnica de Catalunya), October 2016.
Notes
Record prepared under agreement 509/P-DUN/2018 with funds from the Polish Ministry of Science and Higher Education (MNiSW) allocated to science dissemination activities (2018).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-b6a34d58-c748-41f4-8a6e-d878e1c1f17d