Search results
Searched in keywords: HDFS
Results found: 5
1
EN
Increasing development in information and communication technology leads to the generation of large amounts of data from various sources. These data, collected from multiple sources, grow exponentially and may not be structurally uniform; in general, they are heterogeneous and distributed across multiple databases. Because of the large volume, high velocity and variety of data, mining knowledge in this environment becomes a big data challenge. Distributed Association Rule Mining (DARM) in these circumstances becomes a tedious task for an effective global Decision Support System (DSS). DARM algorithms generate a large number of association rules and frequent itemsets in the big data environment, and synthesizing high-frequency rules from such a big database becomes even more challenging. Many algorithms for synthesizing association rules have been proposed in multiple-database mining environments, but they face enormous challenges in terms of high availability, scalability, efficiency, the high cost of storing and processing large intermediate results, and multiple redundant rules. In this paper, we propose a model to collect data from multiple sources into a big data storage framework based on HDFS. Secondly, a weighted multi-partitioned method for synthesizing high-frequency rules using the MapReduce programming paradigm is proposed. Experiments have been conducted in a parallel and distributed environment using commodity hardware, and the results confirm the efficiency, scalability, high availability and cost-effectiveness of the proposed method.
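As a rough illustration of how rule statistics gathered from many partitions could be combined on Hadoop (a minimal sketch, not the authors' weighted multi-partitioned algorithm; the input format, field order and class names below are assumptions), a mapper can emit each rule with its weighted local support and a reducer can sum those contributions across partitions:

// A minimal sketch (not the paper's algorithm) of aggregating per-partition
// association-rule support with Hadoop MapReduce. Input lines are assumed to be
// "ruleId<TAB>localSupport<TAB>partitionWeight"; names and format are hypothetical.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WeightedRuleSynthesis {

  /** Emits (rule, localSupport * partitionWeight) for each input line. */
  public static class RuleMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length < 3) {
        return; // skip malformed lines
      }
      double weighted = Double.parseDouble(fields[1]) * Double.parseDouble(fields[2]);
      context.write(new Text(fields[0]), new DoubleWritable(weighted));
    }
  }

  /** Sums the weighted supports of a rule across all partitions. */
  public static class RuleReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text rule, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double total = 0.0;
      for (DoubleWritable v : values) {
        total += v.get();
      }
      context.write(rule, new DoubleWritable(total));
    }
  }
}

A driver class configuring the job (input and output paths, key and value classes) would still be needed to run this on a cluster; it is omitted here for brevity.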
2
Efficient processing and integration of large data sets in the Hadoop environment
PL
The development of new channels of electronic information exchange contributes to the creation of ever larger amounts of data. These data are often diverse, heterogeneous and stored without a strictly defined structure. 90% of all the data generated since the beginning of humanity was produced within the last two years. The article presents the architecture and capabilities of the Hadoop environment, created for the effective processing and integration of large data sets. The features of this platform and its scalability are described. The method of operation of the HDFS file system and its resistance to storage errors are discussed. The idea of cooperation between the nodes of a Hadoop cluster and the execution of MapReduce operations is presented.
EN
The development of new channels of electronic information exchange contributes to the emergence of more and more data. These data are often diverse, heterogeneous and stored without a strictly defined structure. 90% of all the data generated since the beginning of human civilization was created over the past two years. The article presents the architecture and capabilities of the Hadoop environment for the effective processing and integration of large data sets. It also presents the features of this platform and its scalability, and discusses the method of operation of the HDFS file system and its resistance to storage errors. The scheme of cooperation between Hadoop cluster nodes when performing MapReduce operations is also presented.
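The namenode/datanode cooperation described above is hidden behind the HDFS client API: a program asks the namenode where a file lives and the blocks are streamed back from whichever datanodes hold replicas. A minimal sketch of reading a file from HDFS in Java follows; the namenode address and file path are placeholders, not values from the article.

// A minimal sketch of accessing HDFS from Java; the cluster address and path
// below are hypothetical placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS points at the namenode; host and port are placeholders.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    try (FileSystem fs = FileSystem.get(conf);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/input/sample.txt")),
                                   StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line); // each block of the file is fetched from a datanode
      }
    }
  }
}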
3
Intermediate Results Materialization Selection and Format for Data-Intensive Flows
EN
Data-intensive flows deploy a variety of complex data transformations to build information pipelines from data sources to different end users. As data are processed, these workflows generate large intermediate results, typically pipelined from one operator to the following ones. Materializing intermediate results shared among multiple flows brings benefits not only in terms of performance but also in resource usage and consistency. Similar ideas have been proposed in the context of data warehouses, where they are studied as the materialized view selection problem. With the rise of Big Data systems, new challenges emerge due to new quality metrics captured by service level agreements, which must be taken into account. Moreover, the way such results are stored must be reconsidered, as different data layouts can be used to reduce the I/O cost. In this paper, we propose a novel approach for automatic, multi-objective selection of materialized intermediate results in data-intensive flows, which can tackle multiple and conflicting quality objectives. In addition, our approach chooses the optimal storage data format for the selected materialized intermediate results based on subsequent access patterns. The experimental results show that our approach provides a 40% better average speedup with respect to the current state of the art, as well as an 18% improvement in disk access time compared to fixed-format solutions.
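As a toy illustration of the format-selection idea (a hypothetical heuristic, not the selection algorithm proposed in the paper; the class name and threshold are made up), the layout of a materialized intermediate result can be picked from how many of its columns downstream flows actually read:

// A toy heuristic illustrating layout choice from downstream access patterns.
// Not the paper's optimizer; threshold and names are hypothetical.
public class FormatChooser {

  public enum Layout { ROW_ORIENTED, COLUMNAR }

  /**
   * If downstream operators touch only a small fraction of the columns
   * (analytical scans), a columnar layout reduces I/O; if they read whole
   * records, a row-oriented layout is preferred.
   */
  public static Layout choose(int totalColumns, int columnsAccessedDownstream) {
    double fraction = (double) columnsAccessedDownstream / totalColumns;
    return fraction <= 0.5 ? Layout.COLUMNAR : Layout.ROW_ORIENTED;
  }

  public static void main(String[] args) {
    // Example: a 20-column intermediate result that downstream flows read 3 columns of.
    System.out.println(choose(20, 3)); // COLUMNAR
  }
}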
4
EN
In the current scenario, data grows exponentially and accrues at the rate of petabytes. Big Data describes the amount of data available across different media and the wide communication medium of the internet; the term refers to the explosion in the quantity (and quality) of available and potentially relevant data. The volume involved is so large, and its complexity grows with it, that it can no longer be handled by conventional database systems and data warehouses. Many areas are involved in the production, generation and use of Big Data, such as news media, social networking sites, business applications, the industrial community, and much more. Handling Big Data raises concerns such as efficient management, proper storage, availability, scalability and processing; thus, new techniques, tools and architectures are required. In the present paper, we discuss the different technologies available for the implementation and management of Big Data and consider the formal tools and techniques used to solve its major difficulties. We evaluate the covariance factor for stock exchange data from different industries, showing the significance of the data through positive covariance results obtained with the Hive approach, and examining how efficient the Hive approach is in terms of HDFS and Hive queries. We also evaluate the covariance factors after applying the Hive and MapReduce approaches to a stock exchange dataset of around 3500. After processing the data with the Hive approach, we conclude that Hive is better than MapReduce and BigTable in terms of the storage and processing of Big Data.
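The quantity compared across the Hive and MapReduce runs is the covariance between two stock attributes; Hive exposes it directly through its built-in covar_pop aggregate, while a MapReduce job has to accumulate the underlying sums itself. A minimal sketch of that computation follows, with made-up sample values rather than data from the paper:

// Population covariance computed from running sums, the quantity that either a
// Hive covariance aggregate or a MapReduce job accumulates over the dataset.
// Field names and sample values are hypothetical.
public class Covariance {

  /** cov(X, Y) = E[XY] - E[X] * E[Y], computed from running sums. */
  public static double covariance(double[] x, double[] y) {
    if (x.length != y.length || x.length == 0) {
      throw new IllegalArgumentException("series must be non-empty and of equal length");
    }
    double sumX = 0, sumY = 0, sumXY = 0;
    for (int i = 0; i < x.length; i++) {
      sumX += x[i];
      sumY += y[i];
      sumXY += x[i] * y[i];
    }
    int n = x.length;
    return sumXY / n - (sumX / n) * (sumY / n);
  }

  public static void main(String[] args) {
    double[] open  = {10.0, 10.5, 11.2, 10.8};
    double[] close = {10.4, 11.0, 11.1, 11.3};
    // A positive value indicates the two attributes move together.
    System.out.println(covariance(open, close));
  }
}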
5
Data locality in Hadoop
EN
The Apache Hadoop framework is an answer to the market tendencies regarding the need for storing and processing rapidly growing amounts of data, providing fault-tolerant distributed storage and data processing. Dealing with large volumes of data, Hadoop and its storage system HDFS (Hadoop Distributed File System) face challenges in keeping high efficiency while computing in a reasonable time. The typical Hadoop implementation transfers computation to the data. However, in an isolated configuration, the namenode (playing the role of the master in the cluster) still favours the closer nodes. Basically, it means that before the whole task is run, significant delays can be caused by moving single blocks of data closer to the starting datanode. Currently, a Hadoop user has no influence on how the data is distributed across the cluster. This paper presents an innovative functionality for the Hadoop Distributed File System (HDFS) that enables moving data blocks on request within the cluster. Data can be shifted either by a user running the proper HDFS shell command or programmatically by other modules, such as an appropriate scheduler.
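The stock HDFS client already lets a program discover where the blocks of a file currently reside, which is the information any such relocation mechanism builds on; the block-moving shell command itself is the paper's extension and is not part of standard Hadoop, so only location inspection is sketched here (the file path is a placeholder):

// A minimal sketch showing how block placement can be inspected with the stock
// HDFS client API; the path below is a hypothetical placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/data/input/sample.txt");
      FileStatus status = fs.getFileStatus(file);
      BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
      for (int i = 0; i < blocks.length; i++) {
        // Each block lists the datanodes holding one of its replicas.
        System.out.printf("block %d on %s%n", i, String.join(", ", blocks[i].getHosts()));
      }
    }
  }
}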