Search results - BazTech

1

A distributed algorithm for protein identification from tandem mass spectrometry data

Orzechowska Katarzyna, Rubel Tymon, Kurjata Robert, Zaremba Krzysztof

Applied Computer Science | 2022 | Vol. 18, no 2 | 16--27

EN

Tandem mass spectrometry is an analytical technique widely used in proteomics for the high-throughput characterization of proteins in biological samples. Modern in-depth proteomic studies can require the collection of millions of mass spectra representing short protein fragments (peptides). To identify the peptides, the measured spectra are most often scored against a database of amino acid sequences of known proteins. Due to the volume of input data and the sizes of proteomic databases, this is a resource-intensive task that requires an efficient and scalable computational strategy. Here, we present SparkMS, an algorithm for peptide and protein identification from mass spectrometry data explicitly designed to work in a distributed computational environment. To achieve the required performance and scalability, we use Apache Spark, a modern framework that is becoming increasingly popular not only in the field of “big data” analysis but also in bioinformatics. This paper describes the algorithm in detail and demonstrates its performance on a large proteomic dataset. Experimental results indicate that SparkMS scales with the number of worker nodes and the increasing complexity of the search task. Furthermore, it exhibits a protein identification efficiency comparable to X!Tandem, a widely used proteomic search engine.

2

Detection of DDoS Attacks in OpenStack-based Private Cloud Using Apache Spark

Gumaste Shweta, G. Narayan D., Shinde Sumedha, K. Amit

Journal of Telecommunications and Information Technology | 2020 | No. 4 | 62--71

EN

Security is a critical concern for cloud service providers. Distributed denial of service (DDoS) attacks are the most frequent of all cloud security threats, and the damage they cause can be very serious. Thus, the design of an efficient DDoS detection system plays an important role in monitoring suspicious activity in the cloud. Real-time detection mechanisms that operate in cloud environments and rely on machine learning algorithms and distributed processing are an important research issue. In this work, we propose real-time detection of DDoS attacks using machine learning classifiers on a distributed processing platform. We evaluate the DDoS detection mechanism in an OpenStack-based cloud testbed using the Apache Spark framework. We compare the classification performance using benchmark and real-time cloud datasets. Results of the experiments reveal that the random forest method offers better classifier accuracy. Furthermore, we demonstrate the effectiveness of the proposed distributed approach in terms of training and detection time.

3

Cloud-based sentiment analysis for measuring customer satisfaction in the Moroccan banking sector using Naïve Bayes and Stanford NLP

Riadsolh Anouar, Lasri Imane, ElBelkacemi Mourad

Journal of Automation Mobile Robotics and Intelligent Systems | 2020 | Vol. 14, No. 4 | 64--71

EN

In a world where we produce 2.5 quintillion bytes of data every day, sentiment analysis has become key to making sense of that data. However, processing huge volumes of text data in real time requires building a data processing pipeline that minimizes the latency of processing data streams. In this paper, we explain and evaluate our proposed real-time customer sentiment analysis pipeline for the Moroccan banking sector, which uses data from the web and social networks and is built on open-source big data tools: Apache Kafka for data ingestion, Apache Spark for in-memory data processing, Apache HBase for storing tweets and the satisfaction indicator, ElasticSearch and Kibana for visualization, and NodeJS for building a web application. The performance evaluation of the Naïve Bayes model shows that the accuracy reached 76.19% for French tweets, while for English tweets the result was unsatisfactory, with an accuracy of 56%. To remedy this problem, we used Stanford CoreNLP, which reaches a precision of 80.7% for English tweets.

4

Influence of YARN schedulers on power consumption and processing time for various big data benchmarks

Drypczewski Krzysztof, Proficz Jerzy, Stepnowski Andrzej

TASK Quarterly : scientific bulletin of Academic Computer Centre in Gdansk | 2018 | Vol. 22, No 4 | 303--312

EN

Climate change caused by human activities can influence the lives of everybody on the planet. Environmental concerns must be taken into consideration by all fields of study, including ICT. Green Computing aims to reduce the negative effects of IT on the environment while, at the same time, maintaining all of the possible benefits it provides. Several Big Data platforms like Apache Spark or YARN have become widely used in analytics and High-Performance Computing systems due to the reliability and usability of Map Reduce implementations. The authors research the power consumption and energy efficiency of Hadoop YARN schedulers using Apache Spark under three different workloads. The test cases include: sorting large binary files, counting unique words in large text files and processing satellite imagery from the Sentinel-2 mission. The presented results show small (2%–11%) but distinct differences in the power consumption of FIFO and FAIR schedulers.
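As an illustration of the second workload named in this abstract (counting unique words), the following is a minimal single-machine sketch of the map and reduce steps in plain Python. It is not the authors' benchmark code and does not use Spark or YARN; the function names `map_partition` and `count_unique_words` and the sample partitions are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count word occurrences within one partition of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def count_unique_words(partitions):
    """Reduce step: merge per-partition counts, then count distinct words."""
    merged = reduce(lambda a, b: a + b, map(map_partition, partitions), Counter())
    return len(merged), merged


# Hypothetical input: two partitions of a small text file.
partitions = [
    ["big data on yarn", "spark on yarn"],
    ["fifo and fair schedulers", "big data benchmarks"],
]
unique, counts = count_unique_words(partitions)
print(unique, counts["yarn"])
```

On a cluster, each partition would be processed by a separate worker and the merge would run as a distributed reduction; the per-partition logic stays the same.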

5

Processing of satellite data in the cloud

Proficz J., Drypczewski K.

TASK Quarterly : scientific bulletin of Academic Computer Centre in Gdansk | 2017 | Vol. 21, No 4 | 365--377

EN

The dynamic development of digital technologies, especially those dedicated to devices generating large data streams, such as all kinds of measurement equipment (temperature and humidity sensors, cameras, radio telescopes and satellites – Internet of Things), enables more in-depth analysis of the surrounding reality, including a better understanding of various natural phenomena, from atomic-level reactions, through macroscopic processes (e.g. meteorology), to observation of the Earth and outer space. On the other hand, such a large quantitative improvement requires a great number of processing and storage resources, resulting in the recent rapid development of Big Data technologies. Since 2015, the European Space Agency (ESA) has been providing a great amount of data gathered by exploratory equipment: a collection of Sentinel satellites, which perform Earth observation using various measurement techniques. For example, Sentinel-2 provides a stream of digital photos, including images of the Baltic Sea and the whole territory of Poland. This data is used in an experimental installation of a Big Data processing system based on open-source software at the Academic Computer Center in Gdansk. The center has one of the most powerful supercomputers in Poland, the Tryton computing cluster, consisting of 1600 nodes interconnected by a fast InfiniBand network (56 Gbps) and over 6 PB of storage. Some of these nodes are used as a computational cloud supervised by an OpenStack platform, where the Sentinel-2 data is processed. A subsystem for automatic, continuous data download to object storage (based on Swift) has been deployed, the software libraries required for image processing have been configured, and an Apache Spark cluster has been set up. This system enables gathering and analysis of the recorded satellite images and the associated metadata, benefiting from parallel computation mechanisms. This paper describes the solution, including its technical aspects.

6

Scaling evolutionary programming with the use of Apache Spark

Funika W., Koperek P.

Computer Science | 2016 | Vol. 17 (1) | 69--82

EN

Organizations across the globe gather more and more data, encouraged by easy-to-use and cheap cloud storage services. Large datasets require new approaches to analysis and processing, which include methods based on machine learning. In particular, symbolic regression can provide many useful insights. Unfortunately, due to its high resource requirements, using this method for large-scale dataset analysis might be infeasible. In this paper, we analyze a bottleneck in an open-source implementation of this method, which we call hubert. We identify the evaluation of individuals as the most costly operation. As a solution to this problem, we propose a new evaluation service based on the Apache Spark framework, which attempts to speed up computations by executing them in a distributed manner on a cluster of machines. We analyze the performance of the service by comparing the evaluation execution time for a number of samples using both implementations. Finally, we draw conclusions and outline plans for further research.
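To make the bottleneck described in this abstract concrete, here is a minimal sketch of evaluating a population of candidate expressions against a set of samples. It is not the hubert implementation or the authors' Spark service: the mean squared error fitness, the names `mse` and `evaluate_population`, and the toy data are assumptions chosen for illustration. The `mapper` argument marks where a distributed map could be substituted for the built-in one.

```python
def mse(individual, samples):
    """Fitness of one individual: mean squared error over all samples."""
    return sum((individual(x) - y) ** 2 for x, y in samples) / len(samples)


def evaluate_population(population, samples, mapper=map):
    """Evaluate every individual; `mapper` can be swapped for a parallel map."""
    return list(mapper(lambda ind: mse(ind, samples), population))


# Hypothetical target function y = x^2 + 1 sampled at x = 0..4.
samples = [(x, x * x + 1) for x in range(5)]

# Hypothetical candidate expressions (individuals).
population = [
    lambda x: x * x + 1,
    lambda x: x + 1,
    lambda x: 2 * x,
]

errors = evaluate_population(population, samples)
best = min(range(len(population)), key=errors.__getitem__)
print(best, errors)
```

The cost of this step grows with the product of population size and sample count, which is why it is a natural candidate for distribution across a cluster.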