Wyniki wyszukiwania - BazTech

1

Oscilloscope based on small-size FPGA with VGA display

Rzeszut P., Jamro E., Wiatr K.

Measurement Automation Monitoring

|

2018

|

Vol. 64, No. 1

2--4

EN

A simple single-chip, FPGA based oscilloscope was designed both to supply a user with a low-budget oscilloscope and to teach operation of such devices. The device implements basic functions of real oscilloscope, providing clear insight in processes of signal acquisition (employing FPGA built-in analog-to-digital converter with aggregated sampling rate equal 1MS/s), processing and displaying acquired signals. Also some effort was made in order to fit the design in limited resources of the selected FPGA device. The project is open source [1].

2

Basic 3D graphics processor implemented on small FPGA

Panek K., Flak B., Koryciak S., Wiatr K.

Measurement Automation Monitoring

|

2018

|

Vol. 64, No. 1

8--10

EN

FPGAs have big computing possibilities and therefore are very popular as dedicated hardware accelerators. A few years ago, FPGAs were expensive and the cheapest ones had very limited capabilities, because of small amount of logic elements and slow internal clocks. Nowadays, cheap development boards are available at a price below 50€ with abilities to transmit even HDMI signals. This paper covers implementation of the soft processor with a 3D graphics coprocessor on the cheapest available FPGA board with HDMI connector, containing only 8k Logic Elements.

3

FPGA implementation of procedures for video quality assessment

Wielgosz M., Karwatowski M., Pietron M., Wiatr K.

Computer Science

|

2018

|

Vol. 19 (3)

279--305

EN

The video resolutions used in a variety of media are constantly rising. While manufacturers struggle to perfect their screens, it is also important to ensure the high quality of the displayed image. Overall quality can be measured using a Mean Opinion Score (MOS). Video quality can be affected by miscellaneous artifacts appearing at every stage of video creation and transmission. In this paper, we present a solution to calculate four distinct video quality metrics that can be applied to a real-time video quality assessment system. Our assessment module is capable of processing 8K resolution in real time set at a level of 30 frames per second. The throughput of 2.19 GB/s surpasses the performance of pure software solutions. The module was created using a high-level language to concentrate on architectural optimization.

4

Novel architecture for floating point accumulator with cancelation error detection

Jamro E., Dąbrowska-Boruch A., Russek P., Wielgosz M., Wiatr K.

Bulletin of the Polish Academy of Sciences. Technical Sciences

|

2018

|

Vol. 66, nr 5

579-587

EN

A floating point accumulator cannot be obtained straightforwardly due to its pipeline architecture and feedback loop. Therefore, an essential part of the proposed floating point accumulator is a critical accumulation loop which is limited to an integer adder and 16-bit shifter only. The proposed accumulator detects a catastrophic cancellation which occurs e.g. when two similar numbers are subtracted. Additionally, modules with reduced hardware resources for rough error evaluation are proposed. The proposed architecture does not comply with the IEEE-754 floating point standard but it guarantees that a correct result, with an arbitrarily defined number of significant bits, is obtained. The proposed calculation philosophy focuses on the desired result error rather than on calculation precision as such.

5

Assessment of various GPU acceleration strategies in text categorization processing flow

Korduła Ł., Wielgosz M., Karwatowski M., Pietroń M., Żurek D., Wiatr K.

Measurement Automation Monitoring

|

2017

|

Vol. 63, No. 6

203--205

EN

Automatic text categorization presents many difficulties. Modern algorithms are getting better in extracting meaningful information from human language. However, they often significantly increase complexity of computations. This increased demand for computational capabilities can be facilitated by the usage of hardware accelerators like general purpose graphic cards. In this paper we present a full processing flow for document categorization system. Gram-Schmidt process signatures calculation up to 12 fold decrease in computing time of system components.

6

Real time 8K video quality assessment using FPGA

Wielgosz M., Pietroń M., Karwatowski M., Wiatr K.

Measurement Automation Monitoring

|

2016

|

Vol. 62, No. 6

187--189

EN

This paper presents a hardware architecture of the video quality assessment module. Two different metrics were implemented on FPGA using modern High Level Language for digital system design – Impulse C. FPGA resources consumption of the presented module is low, which enables module-level parallelization. Tests conducted for four modules working concurrently show that 1.96 GB/s throughput can be achieved. The module is capable of processing 8K video stream in a real-time manner i.e. 30 frames/second. Such high performance of the presented solution was achieved due to the series of architectural optimization introduced to the module, such as reduction of data precision and reuse of various module components.

7

A custom co-processor for the discovery of low autocorrelation binary sequences

Russek P., Karwatowski M., Jamro E., Wiatr K.

Measurement Automation Monitoring

|

2016

|

Vol. 62, No. 5

154--156

EN

We present a custom processor that was designed to enhance algorithms of finding Low Autocorrelation Binary Sequences (LABS). Finding LABS is very computationally exhaustive, but no custom computing solutions have been reported in the literature so far. A computational kernel which allowed creating an effective single-purpose processor was determined and an appropriate architecture was proposed. The selected elements of the architecture were coded in High-Level Synthesis (HLS) language to speed up the design process. Afterwards, the processor was verified and tested in Xilinx’s Virtex7 FPGA. At the beginning of the paper, we briefly present the finding LABS problem and its importance. Later, we deliver the algorithm, its custom processor structure, and implementation results in terms of the processor performance, size and power.

8

The versatile hardware accelerator framework for sparse vector calculations

Karwatowski R., Wiatr K.

Measurement Automation Monitoring

|

2015

|

Vol. 61, No. 7

327--329

EN

In this paper, we present the advantage of the ability of FPGAs to perform various computationally complex calculations using deep pipelining and parallelism. We propose an architecture that consists of many small stream processing blocks. The designed framework maintains proper data movement and synchronization. The architecture can be easily adapted to be implemented in FPGA devices of a various size and cost - from small SoC devices to high-end PCIe accelerator cards. It is capable to perform a selected operation on a sparse data that are loaded as the stream of vectors. As an example application, we have implemented the cosine similarity measure for the text similarity calculations that uses the TF-IDF weighting scheme. The presented example application calculates the similarity of texts from the set of input documents to documents from the large database. The scheme is used to find the most similar documents. The proposed design can decrease the service time of search queries in computer centers while reducing power consumption.

9

The Java profiler based on byte code analysis and instrumentation for many-core hardware accelerators

Pietroń M., Karwatowski M., Wiatr K.

Measurement Automation Monitoring

|

2015

|

Vol. 61, No. 7

385--387

EN

One of the most challenging issues in the case of many and multi-core architectures is how to exploit their potential computing power in legacy systems without a deep knowledge of their architecture. The analysis of static dependence and dynamic data dependences of a program run, can help to identify independent paths that could have been computed by individual parallel threads. The statistics of reusing the data and its size is also crucial in adapting the application in GPU many-core hardware architecture because of specific memory hierarchies. The proposed profiling system accomplishes static data analysis and computes dynamic dependencies for Java programs as well as recommends parts of source code with the highest potential for parallelization in GPU. Such an analysis can also provide starting point for automatic parallelization.

10

A study of parallel techniques for dimensionality reduction and its impact on the quality of text processing algorithms

Pietroń M., Wielgosz M., Karwatowski M., Wiatr K.

Measurement Automation Monitoring

|

2015

|

Vol. 61, No. 7

352--353

EN

The presented algorithms employ the Vector Space Model (VSM) and its enhancements such as TFIDF (Term Frequency Inverse Document Frequency) with Singular Value Decomposition (SVD). TFIDF were applied to emphasize the important features of documents and SVD was used to reduce the analysis space. Consequently, a series of experiments were conducted. They revealed important properties of the algorithms and their accuracy. The accuracy of the algorithms was estimated in terms of their ability to match the human classification of the subject. For unsupervised algorithms the entropy was used as a quality evaluation measure. The combination of VSM, TFIDF, and SVD came out to be the best performing unsupervised algorithm with entropy of 0.16.

11

Wykorzystanie akceleracji sprzętowej przy implementacji metryk podobieństwa tekstów

Iwanecki Ł., Koryciak S., Dąbrowska-Boruch A., Wiatr K.

Pomiary Automatyka Kontrola

|

2014

|

R. 60, nr 7

426--428

PL

Artykuł opisuje badania na temat klasyfikatorów tekstów. Zadanie polegało na zaprojektowaniu akceleratora sprzętowego, który przyspieszyłby proces klasyfikacji tekstów pod względem znaczeniowym. Projekt został podzielony na dwie części. Celem części pierwszej było zaproponowanie sprzętowej implementacji algorytmu realizującego metrykę do obliczania podobieństwa dokumentów. W drugiej części zaprojektowany został cały systemem akceleratora sprzętowego. Kolejnym etapem projektowym jest integracja modelu metryki z system akceleracji.

EN

The aim of this project is to propose a hardware accelerating system to improve the text categorization process. Text categorization is a task of categorizing electronic documents into the predefined groups, based on the content. This process is complex and requires a high performance computing system and a big number of comparisons. In this document, there is suggested a method to improve the text categorization using the FPGA technology. The main disadvantage of common processing systems is that they are single-threaded – it is possible to execute only one instruction per a single time unit. The FPGA technology improves concurrence. In this case, hundreds of big numbers may be compared in one clock cycle. The whole project is divided into two independent parts. Firstly, a hardware model of the required metrics is implemented. There are two useful metrics to compute a distance between two texts. Both of them are shown as equations (1) and (2). These formulas are similar to each other and the only difference is the denominator. This part results in two hardware models of the presented metrics. The main purpose of the second part of the project is to design a hardware accelerating system. The system is based on a Xilinx Zynq device. It consists of a Cortex-A9 ARM processor, a DMA controller and a dedicated IP Core with the accelerator. The block diagram of the system is presented in Fig.4. The DMA controller provides duplex transmission from the DDR3 memory to the accelerating unit omitting a CPU. The project is still in development. The last step is to integrate the hardware metrics model with the accelerating system.

12

Równoległa implementacja algorytmu winnowing dla operacji strumieniowej analizy tekstu

Wielgosz M., Żurek D., Pietroń M., Dąbrowska-Boruch A., Wiatr K.

Pomiary Automatyka Kontrola

|

2014

|

R. 60, nr 5

309--312

PL

W ramach praca przeprowadzona została analiza możliwości wykorzystania algorytmu winnowing do strumieniowego przetwarzania informacji tekstowej. W szczególności nacisk został położony na operacje generacji odcisku jako jej zredukowanej reprezentacji wiadomości tekstowej. Autorzy przeprowadzili szereg eksperymentów, w celu określenia efektywności działania algorytmu oraz możliwego do uzyskania przyspieszenia obliczeń, z wykorzy-staniem węzła procesorów Intel Xeon E5645 2.40GHz oraz karty GPU Nvidia Tesla m2090.

EN

There are several models available for information retrieval and text analysis but the two are considered to be the dominant ones, namely Boolean and the vector space model (VSM). A model maps the existing words or text into a new representation space. This paper presents a boolean n-gram-based algorithm - winnowing for fast text search and comparison of documents with main focus on its implementation and performance analysis. The algorithm is used to generate fingerprints (i.e. a set of hashes) of the analyzed documents. A dedicated test framework was designed and implemented to handle the task of the algorithm evaluation which utilizes PAN test corpus and programming environment. Several tests were conducted in order to determine the comparison quality of the obfuscated and not obfuscated text for the winnowing algorithm and different window and n-gram size. The tests revealed interesting properties of the algorithms with respect to comparison of documents as well as defied the limits of their applicability. The n-gram-based algorithms due to their simplicity are well suited for hardware implementation. Thus, the authors implemented compu-tationally demanding part of both fingerprint generation both on CPU and GPU. Performance measurements for Intel Xeon E5645, 2.40GHz and Nvidia Tesla m2090 implementation of Ngram-based algorithm show approximately 14x computational speedup.

13

Optymalizacja kompresji Huffmana pod kątem podziału na bloki

Rybak K., Jamro E., Wielgosz M., Wiatr K.

Pomiary Automatyka Kontrola

|

2014

|

R. 60, nr 7

519--521

PL

Prezentowane w pracy badania dotyczą bezstratnej kompresji danych opartej o metodę Huffmana i zgodnej ze standardem deflate stosowanym w plikach .zip / .gz. Zaproponowana jest optymalizacja kodera Huffmana polegająca na podziale na bloki, w których stosuje się różne książki kodowe. Wprowadzenie dodatkowego bloku z reguły poprawia stopień kompresji kosztem narzutu spowodowanego koniecznością przesłania dodatkowej książki kodowej. Dlatego w artykule zaproponowano nowy algorytm podziału na bloki.

EN

According to deflate [2] standard (used e.g. in .zip / .gz files), an input file can be divided into different blocks, which are compressed employing different Huffman [1] codewords. Usually the smaller the block size, the better the compression ratio. Nevertheless each block requires additional header (codewords) overhead. Consequently, introduction of a new block is a compromise between pure data compression ratio and headers size. This paper introduces a novel algorithm for block Huffman compression, which compares sub-block data statistics (histograms) based on current sub-block entropy E(x) (1) and entropy-based estimated average word bitlength Emod(x) for which codewords are obtained for the previous sub-block (2). When Emod(x) - E(x) > T (T - a threshold), then a new block is inserted. Otherwise, the current sub-block is merged into the previous block. The typical header size is 50 B, therefore theoretical threshold T for different sub-block sizes S is as in (3) and is given in Tab. 2. Nevertheless, the results presented in Tab. 1 indicate that optimal T should be slightly different - smaller for small sub-block size S and larger for big S. The deflate standard was selected due to its optimal compression size to compression speed ratio [3]. This standard was selected for hardware implementation in FPGA [4, 5, 6, 7].

14

Komunikacja ze sprzętowym akceleratorem haszowania n-gramów dla procesora ARM z wykorzystaniem portu ACP

Barszczowski M., Koryciak S., Dąbrowska-Boruch A., Wiatr K.

Pomiary Automatyka Kontrola

|

2014

|

R. 60, nr 7

486--488

PL

Artykuł opisuje uruchomienie portu ACP w układzie EPP firmy Xilinx przy użyciu CDMA zarządzającego transmisją pomiędzy akceleratorem, a rdzeniami procesora. Głównym celem badań było utworzenie modułu dokonującego tak zwanego haszowania zbiorów danych. Do wykonania tej operacji wykorzystany został układ Zynq 7000 posiadający zasoby logiki programowalnej oraz dwa rdzenie ARM A9. Powstały dwie koncepcje realizacji akceleratora. Pierwsza wersja zakładała bezpośredni przepływ danych ze źródła do akceleratora, a następnie do rdzeni ARM. Drugie rozwiązanie zakłada wykorzystanie portu ACP.

EN

This paper introduces a new approach to hardware acceleration using the ACP(Acceleration Coherency Port) in Xilinx Zynq-7000 EPP XC7Z020. The first prototype allocated BRAM memory and transferred data through the ACP. The second one used a hardware hashing module to process data outside the CPU. The module received and returned data through the ACP port. The main task of the system is to replace a set of data with its shorter representative of constant length without interference of the processing unit. The main benefit of hashing data lies within the constant length of function outcome, which leads to data compression. Compression is highly desirable while comparing large subsets of data, especially in data mining. The execution of a hashing function requires high performance of the CPU due to the computational complexity of the algorithm. Two concepts where established. The first one assumed transferring data directly do the hardware accelerator and later to ARM cores. This solution is attractive due to its simplicity and relatively fast. Unfortunately, the data cannot be processed before hashing with the same CPU without significant speed reduction. The second approach used the ACP port which can transfer data very fast between L2/L3 cache memory without flushing of validating cache. The data can be processed by the software driven CPU, sent to the accelerator and then sent back to CPU for further processing. To accomplish the established task, the Zynq 7000 EPP with double ARM A9 core and programmable logic in one chip was used.

15

Implementacja oraz porównanie algorytmów tekstowych w środowiskach przetwarzania równoległego na przykładzie procesorów wielordzeniowych i kart graficznych

Pietroń M., Wielgosz M., Wiatr K.

Pomiary Automatyka Kontrola

|

2014

|

R. 60, nr 5

301--304

PL

Artykuł przedstawia implementację algorytmów tekstowych w wybranych platformach przetwarzania równoległego. Dostępność procesorów wielordzeniowych oraz kart graficznych ogólnego przeznaczenia sprawia, iż badania nad równoległą implementacją algorytmów w celu ich akceleracji nabierają coraz większego znaczenia. Algorytmy tekstowe są niezwykle istotnym i często niezbędnym elementem zaawansowanych algorytmów analizy tekstu oraz są także składowymi funkcji wyszukiwania wzorców w tekście wielu języków programowania. W pracy dokonano analizy najpopularniejszych algorytmów tekstowych oraz dokonano ich analizy pod kątem ich zrównoleglenia w celu ich implementacji w procesorze wielordzeniowym oraz karcie graficznej ogólnego przeznaczenia. Analizowanymi algorytmami są: boyer-moore, algorytm naiwny oraz algorytm knuth-morris-pratt. Następnie dokonano porównania efektywności ich realizacji na wymienionych platformach sprzętowych.

EN

This paper presents implementation of text algorithms in multicore CPU and GPGPU. The text algorithms are very common algorithms used in text analysis process and they are a part of functions used for text patterns recognition. The library functions for text searching implemented in many languages very often use most popular text-algorithms. The paper describes the analysis of these algorithms for parallel implementations in multicore processors and general purpose graphic cards. The research work presented in this paper shows that text algorithms can be partially parallelized. The process of acceleration can be done by appropriate dividing the input text between parallel threads (data parallelism). The comparative studies were performed for the following algorithms: boyer-moore (horspool) , naive and knuth-morris-pratt algorithm. The presented results show the efficiency of these algorithms in the case of different type and size of patterns. In the case of GPU the implementation was made in the CUDA framework. The OpenMP library was used for a multicore version.

16

Medical Visualizer 3D: Hardware Controller for Dmd Module

Koryciak S., Barszczowski M, Dąbrowska-Boruch A., Wiatr K.

Image Processing & Communications

|

2014

|

Vol. 19, no. 2-3

15--23

EN

In this paper an implementation of the module responsible for the control of micro-mirror array for later use in projection is described. Existing technologies allow for projections of medical images in Digital Imaging and Communications in Medicine format only in the form of a flat 2D image. The 3D Visualizer will allow to display medical images in three dimensions using its own projection surface. The matrix controlling device has been largely developed on the basis of reverse engineering studies carried out on the functional system based on a driver from Texas Instruments. Driver is built on the FPGA with implemented soft processor from Xilinx - MicroBlaze.

17

Optymalizacja sprzętowej architektury kompresji danych metodą słownikową

Gwiazdoń M., Jamro E., Wiatr K.

Pomiary Automatyka Kontrola

|

2013

|

R. 59, nr 8

827--829

PL

Niniejszy artykuł opisuje nową architekturę sprzętową kompresji słownikowej, np. LZ77, LZSS czy też Deflate. Zaproponowana architektura oparta jest na funkcji haszującej. Poprzednie publikacje były oparte na sekwencyjnym odczycie adresu wskazywanego przez pamięć hasz, niniejszy artykuł opisuje układ, w którym możliwe jest równoległe odczytywanie tego adresu z wielu pamięci hasz, w konsekwencji możliwa jest kompresja słownikowa z szybkością na poziomie 1B ciągu wejściowego na takt zegara. Duża szybkość kompresji jest okupiona nieznacznym spadkiem stopnia kompresji.

EN

This paper describes a novel parallel architecture for hardware (ASIC or FPGA) implementation of dictionary compressor, e.g. LZ77 [1], LZSS [2] or Deflate [4]. The proposed architecture allows for very fast compression – 1B of input data per clock cycle. A standard compression architecture [8, 9] is based on sequential hash address reading (see Fig. 2) and requires M clock cycles per 1B of input data, where M is the number of candidates for string matching, i.e. hashes look ups (M varies for different input data). In this paper every hash address is looked up in parallel (see Fig. 3). The drawback of the presented method is that the number of M is defined (limited), therefore the compression ratio is slightly degraded (see Fig. 4). To improve compression ratio, a different sting length may be searched independently, i.e. not only 3B, but also 4B, … N B hashes (see results in Fig. 5, 6). Every hash memory (M(N-2)) usually requires a direct look-up in the dictionary to eliminate hash false positive cases or to check whether a larger length sting was found. In order to reduce the number of dictionary reads, an additional pre-elimination algorithm is proposed, thus the number of dictionary reads does not increase rapidly with growing N (see Fig. 7).

18

Implementacja w układach FPGA dekompresji danych zgodnie ze standardem Deflate

Jamro E., Wiatr K.

Pomiary Automatyka Kontrola

|

2013

|

R. 59, nr 8

739--741

PL

Otwarty standard kompresji danych, Deflate, jest szeroko stosowanym standardem w plikach .gz / .zip i stanowi kombinację kompresji metodą LZ77 / LZSS oraz kodowania Huffmana. Niniejszy artykuł opisuje implementację w układach FPGA dekompresji danych według tego standardu. Niniejszy moduł jest w stanie dokonać dekompresji co najmniej 1B na takt zegara, co przy zegarze 100MHz daje 100MB/s. Aby zwiększyć szybkość, możliwa jest praca wielu równoległych modułów dla różnych strumieni danych wejściowych.

EN

This paper describes FPGA implementation of the Deflate standard decoder. Deflate [1] is a commonly used compression standard employed e.g. in zip and gz files. It is based on dictionary compression (LZ77 / LZSS) [4] and Huffman coding [5]. The proposed Huffman decoded is similar to [9], nevertheless several improvements are proposed. Instead of employing barrel shifter a different translation function is proposed (see Tab. 1). This is a very important modification as the barrel shifter is a part of the time-critical feedback loop (see Fig. 1). Besides, the Deflate standard specifies extra bits, which causes that a single input word might be up to 15+13=28 bits wide, but this width is very rare. Consequently, as the input buffer might not feed the decoder width such wide input date, a conditional decoding is proposed, for which the validity of the input data is checked after decoding the input symbol, thus when the actual input symbol bit widths is known. The implementation results (Tab. 2) show that the occupied hardware resources are mostly defined by the number of BRAM modules, which are mostly required by the 32kB dictionary memory. For example, comparable logic (LUT / FF) resources to the Deflate standard decoder are required by the AXI DMA module which transfers data to / from the decoder.

19

The comparison of parallel sorting algorithms implemented on different hardware platforms

Żurek D., Pietroń M., Wielgosz M., Wiatr K.

Computer Science

|

2013

|

Vol. 14 (4)

679--691

EN

Sorting is a common problem in computer science. There are a lot of well-known sorting algorithms created for sequential execution on a single processor. Recently, many-core and multi-core platforms have enabled the creation of wide parallel algorithms. We have standard processors that consist of multiple cores and hardware accelerators, like the GPU. Graphic cards, with their parallel architecture, provide new opportunities to speed up many algorithms. In this paper, we describe the results from the implementation of a few different parallel sorting algorithms on GPU cards and multi-core processors. Then, a hybrid algorithm will be presented, consisting of parts executed on both platforms (a standard CPU and GPU). In recent literature about the implementation of sorting algorithms in the GPU, a fair comparison between many core and multi-core platforms is lacking. In most cases, these describe the resulting time of sorting algorithm executions on the GPU platform and a single CPU core.

20

Parallelized algorithms for finding similar images and object recognition

Frączek R., Cyganek B., Wiatr K.

Computer Science

|

2013

|

Vol. 14 (1)

113--127

EN

The paper addresses the issue of searching for similar images and objects in arepository of information. The contained images are annotated with the help of the sparse descriptors. In the presented research, different color and edge histogram descriptors were used. To measure similarities among images,various color descriptors are compared. For this purpose different distance measures were employed. In order to decrease execution time, several code optimization and parallelization methods are proposed. Results of these experiments, as well as discussion of the advantages and limitations of different combinations of metods are presented.