Search results - BazTech

1

Oscilloscope based on small-size FPGA with VGA display

Rzeszut P., Jamro E., Wiatr K.

Measurement Automation Monitoring

|

2018

|

Vol. 64, No. 1

2-4

EN

A simple single-chip FPGA-based oscilloscope was designed both to give users a low-budget oscilloscope and to teach how such devices operate. The device implements the basic functions of a real oscilloscope and gives clear insight into signal acquisition (using the FPGA's built-in analog-to-digital converter with an aggregate sampling rate of 1 MS/s), processing and display of the acquired signals. Some effort was also made to fit the design into the limited resources of the selected FPGA device. The project is open source [1].

2

Novel architecture for floating point accumulator with cancelation error detection

Jamro E., Dąbrowska-Boruch A., Russek P., Wielgosz M., Wiatr K.

Bulletin of the Polish Academy of Sciences. Technical Sciences

|

2018

|

Vol. 66, No. 5

579-587

EN

A floating point accumulator cannot be obtained straightforwardly due to its pipeline architecture and feedback loop. Therefore, an essential part of the proposed floating point accumulator is a critical accumulation loop which is limited to an integer adder and 16-bit shifter only. The proposed accumulator detects a catastrophic cancellation which occurs e.g. when two similar numbers are subtracted. Additionally, modules with reduced hardware resources for rough error evaluation are proposed. The proposed architecture does not comply with the IEEE-754 floating point standard but it guarantees that a correct result, with an arbitrarily defined number of significant bits, is obtained. The proposed calculation philosophy focuses on the desired result error rather than on calculation precision as such.
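
Catastrophic cancellation, as mentioned in the abstract above, can be illustrated in a few lines. The sketch below is not the accumulator architecture from the paper: it is a software illustration that assumes IEEE-754 double precision and uses a hypothetical helper, `accumulate_with_cancellation_check`, which estimates the number of cancelled leading bits as log2(sum|x| / |sum x|), a common condition-number estimate for summation.

```python
import math

def accumulate_with_cancellation_check(values):
    """Sum values and estimate how many leading bits cancelled.

    Returns (total, lost_bits), where lost_bits = log2(sum|x| / |sum x|).
    This estimate is an illustrative assumption, not the paper's detector.
    """
    total = math.fsum(values)                       # correctly rounded sum
    magnitude = math.fsum(abs(v) for v in values)   # size of the operands
    if total == 0.0:
        return total, (math.inf if magnitude > 0 else 0.0)
    return total, math.log2(magnitude / abs(total))

# Naive left-to-right accumulation: 1.0 is absorbed by 1e16, then cancelled.
naive = (1e16 + 1.0) - 1e16

# Exact accumulation keeps the small term and reports heavy cancellation.
total, lost_bits = accumulate_with_cancellation_check([1e16, 1.0, -1e16])
```

Here the naive sum loses the small operand entirely, while the large `lost_bits` value signals that most of the leading bits of the operands cancelled.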

3

A custom co-processor for the discovery of low autocorrelation binary sequences

Russek P., Karwatowski M., Jamro E., Wiatr K.

Measurement Automation Monitoring

|

2016

|

Vol. 62, No. 5

154-156

EN

We present a custom processor designed to speed up algorithms for finding Low Autocorrelation Binary Sequences (LABS). Finding LABS is computationally very demanding, but no custom computing solutions have been reported in the literature so far. A computational kernel that allows an effective single-purpose processor to be built was identified, and an appropriate architecture was proposed. Selected elements of the architecture were coded in a High-Level Synthesis (HLS) language to speed up the design process. The processor was then verified and tested in a Xilinx Virtex7 FPGA. At the beginning of the paper, we briefly present the LABS search problem and its importance. Later, we describe the algorithm, the structure of its custom processor, and implementation results in terms of processor performance, size and power.
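
For readers unfamiliar with the LABS problem, the quantity being minimized can be sketched as follows. This is the standard textbook definition (aperiodic autocorrelation energy and merit factor), not the computational kernel of the paper's processor; the function names are chosen here for illustration only.

```python
def autocorrelations(seq):
    """Aperiodic autocorrelations C_k for k = 1 .. N-1 of a +1/-1 sequence."""
    n = len(seq)
    return [sum(seq[i] * seq[i + k] for i in range(n - k)) for k in range(1, n)]

def energy(seq):
    """Sidelobe energy E = sum of C_k squared; LABS search minimizes this."""
    return sum(c * c for c in autocorrelations(seq))

def merit_factor(seq):
    """Merit factor F = N^2 / (2E); larger is better."""
    n = len(seq)
    return n * n / (2 * energy(seq))

# The length-13 Barker sequence, a well-known low-autocorrelation sequence.
BARKER_13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]
```

An exhaustive search evaluates `energy` for every candidate sequence, which is why the cost grows exponentially with the sequence length.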

4

Optymalizacja kompresji Huffmana pod kątem podziału na bloki [Optimization of Huffman compression with respect to block division]

Rybak K., Jamro E., Wielgosz M., Wiatr K.

Pomiary Automatyka Kontrola

|

2014

|

Vol. 60, No. 7

519-521

PL

The research presented in this paper concerns lossless data compression based on the Huffman method and compliant with the deflate standard used in .zip / .gz files. An optimization of the Huffman coder is proposed, based on division into blocks that use different codebooks. Introducing an additional block generally improves the compression ratio at the cost of the overhead of transmitting an additional codebook. The paper therefore proposes a new block-division algorithm.

EN

According to the deflate [2] standard (used e.g. in .zip / .gz files), an input file can be divided into blocks that are compressed with different Huffman [1] codewords. Usually, the smaller the block size, the better the compression ratio. Nevertheless, each block requires additional header (codeword) overhead. Consequently, introducing a new block is a compromise between the pure data compression ratio and the header size. This paper introduces a novel algorithm for block Huffman compression, which compares sub-block data statistics (histograms) using the current sub-block entropy E(x) (1) and the entropy-based estimated average word bit length Emod(x) obtained when the codewords of the previous sub-block are used (2). When Emod(x) - E(x) > T (where T is a threshold), a new block is inserted. Otherwise, the current sub-block is merged into the previous block. The typical header size is 50 B, so the theoretical threshold T for different sub-block sizes S is as in (3) and is given in Tab. 2. Nevertheless, the results presented in Tab. 1 indicate that the optimal T should be slightly different: smaller for small sub-block sizes S and larger for large S. The deflate standard was selected for its optimal ratio of compressed size to compression speed [3]. This standard was also selected for hardware implementation in FPGA [4, 5, 6, 7].
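
The block-splitting decision Emod(x) - E(x) > T can be sketched as below. The abstract does not reproduce equations (1)-(3), so several details are assumptions made here: E(x) is taken as the Shannon entropy of the sub-block, Emod(x) as the cross-entropy against the previous block's histogram, symbols absent from the previous block are charged 15 bits (the longest Deflate codeword), and the default threshold is the 50 B header spread over the S symbols of the sub-block.

```python
import math
from collections import Counter

HEADER_BITS = 50 * 8      # typical header size of 50 B quoted in the abstract
MISSING_SYMBOL_BITS = 15  # assumed cost of a symbol unseen in the previous block

def entropy(hist, total):
    """E(x): average bits per symbol with codewords fitted to this sub-block."""
    return -sum(c / total * math.log2(c / total) for c in hist.values())

def estimated_bits(hist, total, prev_hist, prev_total):
    """Emod(x): average bits per symbol when the previous block's code is reused."""
    bits = 0.0
    for sym, c in hist.items():
        prev = prev_hist.get(sym, 0)
        cost = -math.log2(prev / prev_total) if prev else MISSING_SYMBOL_BITS
        bits += c / total * cost
    return bits

def should_start_new_block(sub_block, prev_block, threshold=None):
    """Return True when Emod(x) - E(x) exceeds the threshold T (bits per symbol)."""
    hist, prev_hist = Counter(sub_block), Counter(prev_block)
    size = len(sub_block)
    if threshold is None:
        threshold = HEADER_BITS / size   # assumed form of the theoretical T
    e = entropy(hist, size)
    e_mod = estimated_bits(hist, size, prev_hist, len(prev_block))
    return e_mod - e > threshold
```

Passing an explicit `threshold` allows experimenting with values other than the theoretical one, in line with the abstract's remark that the optimal T differs slightly from it.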

5

Optymalizacja sprzętowej architektury kompresji danych metodą słownikową [Optimization of a hardware architecture for dictionary-based data compression]

Gwiazdoń M., Jamro E., Wiatr K.

Pomiary Automatyka Kontrola

|

2013

|

Vol. 59, No. 8

827-829

PL

This paper describes a new hardware architecture for dictionary compression, e.g. LZ77, LZSS or Deflate. The proposed architecture is based on a hash function. Previous publications relied on sequential reading of the address indicated by the hash memory; this paper describes a circuit in which that address can be read in parallel from multiple hash memories, so dictionary compression at a rate of 1 B of input per clock cycle becomes possible. The high compression speed comes at the cost of a slight drop in compression ratio.

EN

This paper describes a novel parallel architecture for hardware (ASIC or FPGA) implementation of a dictionary compressor, e.g. LZ77 [1], LZSS [2] or Deflate [4]. The proposed architecture allows very fast compression: 1 B of input data per clock cycle. A standard compression architecture [8, 9] is based on sequential hash address reading (see Fig. 2) and requires M clock cycles per 1 B of input data, where M is the number of candidates for string matching, i.e. hash look-ups (M varies for different input data). In this paper every hash address is looked up in parallel (see Fig. 3). The drawback of the presented method is that M is fixed (limited), so the compression ratio is slightly degraded (see Fig. 4). To improve the compression ratio, different string lengths may be searched independently, i.e. not only 3 B but also 4 B, … N B hashes (see results in Figs. 5, 6). Every hash memory (M(N-2)) usually requires a direct look-up in the dictionary to eliminate hash false positives or to check whether a longer string was found. To reduce the number of dictionary reads, an additional pre-elimination algorithm is proposed, so the number of dictionary reads does not increase rapidly with growing N (see Fig. 7).
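
The role of the M match candidates can be illustrated with a small software model. This is a sequential sketch, not the parallel hardware from the paper: the hash function `hash3`, the token format and the parameter `m` are assumptions made for illustration. Each 3-byte hash keeps at most `m` recent positions, and every candidate is verified against the dictionary, which also rejects hash false positives.

```python
from collections import defaultdict, deque

MIN_MATCH = 3        # 3-byte hashes, as in the abstract
MAX_MATCH = 258      # longest match allowed by Deflate
WINDOW = 32 * 1024   # Deflate dictionary size

def hash3(key):
    """Hypothetical 15-bit hash of a 3-byte string (collisions are possible)."""
    return ((key[0] << 10) ^ (key[1] << 5) ^ key[2]) & 0x7FFF

def compress(data, m=4):
    """Return literals (ints) and (distance, length) pairs, probing m candidates."""
    table = defaultdict(lambda: deque(maxlen=m))   # hash -> last m positions
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        if i + MIN_MATCH <= len(data):
            # Hardware would probe all m candidates in parallel; here we loop.
            for pos in table[hash3(data[i:i + MIN_MATCH])]:
                if i - pos > WINDOW:
                    continue
                length = 0
                while (length < MAX_MATCH and i + length < len(data)
                       and data[pos + length] == data[i + length]):
                    length += 1       # dictionary read verifies the candidate
                if length > best_len:
                    best_len, best_dist = length, i - pos
        if best_len >= MIN_MATCH:
            tokens.append((best_dist, best_len))
            step = best_len
        else:
            tokens.append(data[i])    # no usable match: emit a literal
            step = 1
        for j in range(i, min(i + step, len(data) - MIN_MATCH + 1)):
            table[hash3(data[j:j + MIN_MATCH])].append(j)
        i += step
    return tokens

def decompress(tokens):
    """Inverse of compress, used to check the sketch."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            dist, length = token
            for _ in range(length):
                out.append(out[-dist])
        else:
            out.append(token)
    return bytes(out)
```

Lowering `m` bounds the work per input byte, which is the property the hardware exploits, while possibly missing longer matches.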

6

Implementacja w układach FPGA dekompresji danych zgodnie ze standardem Deflate [FPGA implementation of data decompression compliant with the Deflate standard]

Jamro E., Wiatr K.

Pomiary Automatyka Kontrola

|

2013

|

Vol. 59, No. 8

739-741

PL

The open data compression standard Deflate is widely used in .gz / .zip files and combines LZ77 / LZSS compression with Huffman coding. This paper describes an FPGA implementation of data decompression according to this standard. The module can decompress at least 1 B per clock cycle, which at a 100 MHz clock gives 100 MB/s. To increase throughput further, several modules can work in parallel on different input data streams.

EN

This paper describes an FPGA implementation of a Deflate standard decoder. Deflate [1] is a commonly used compression standard employed e.g. in zip and gz files. It is based on dictionary compression (LZ77 / LZSS) [4] and Huffman coding [5]. The proposed Huffman decoder is similar to [9], but several improvements are proposed. Instead of a barrel shifter, a different translation function is proposed (see Tab. 1). This is a very important modification, as the barrel shifter is part of the time-critical feedback loop (see Fig. 1). In addition, the Deflate standard specifies extra bits, so a single input word might be up to 15+13=28 bits wide, although this width is very rare. Consequently, as the input buffer might not feed the decoder with such wide input data, conditional decoding is proposed, in which the validity of the input data is checked after decoding the input symbol, i.e. once the actual bit width of the input symbol is known. The implementation results (Tab. 2) show that the occupied hardware resources are mostly determined by the number of BRAM modules, which are mostly required by the 32 kB dictionary memory. For example, the AXI DMA module, which transfers data to / from the decoder, requires logic (LUT / FF) resources comparable to those of the Deflate decoder.
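
A software reference for the same stream format is handy when checking a hardware decoder. The sketch below uses Python's standard `zlib` module with `wbits=-15`, which selects a raw Deflate stream with a 32 KB window; the payload is made up for illustration, and the constants merely restate the figures quoted in the abstracts (15-bit codewords, 13 extra bits, 1 B per clock at 100 MHz).

```python
import zlib

payload = b"FPGA Deflate decoder " * 1000   # illustrative test data

# wbits=-15: raw Deflate (no zlib/gzip header), 2**15 = 32 KB dictionary.
compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
raw = compressor.compress(payload) + compressor.flush()

# A reference decode, e.g. to compare against the output of a hardware decoder.
restored = zlib.decompress(raw, -15)

MAX_CODE_BITS = 15     # longest Huffman codeword in Deflate
MAX_EXTRA_BITS = 13    # most extra bits attached to a distance code
widest_symbol_bits = MAX_CODE_BITS + MAX_EXTRA_BITS

bytes_per_clock = 1
clock_hz = 100_000_000
throughput_mb_per_s = bytes_per_clock * clock_hz / 1_000_000
```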

7

Realizacja kompresji danych metodą Huffmana z ograniczeniem długości słów kodowych [Implementation of Huffman data compression with limited codeword length]

Rybak K., Jamro E., Wiatr K.

Pomiary Automatyka Kontrola

|

2012

|

Vol. 58, No. 7

662-664

PL

The paper describes a modified method of building the Huffman codebook. The codebook was optimized for hardware implementation of a Huffman coder and decoder in FPGA programmable devices. A dynamic coding method is described: the codebook can change depending on the varying format of the compressed data, and it must also be sent from the coder to the decoder. The hardware implementation of the Huffman codec forces a limit on the maximum codeword length, assumed here to be 12 bits, which in turn requires modifying the Huffman tree construction algorithm.

EN

This paper presents a modified algorithm for constructing a Huffman codebook. The Huffman coder, decoder and histogram calculations are implemented in FPGA similarly to [2, 3]. To reduce the hardware resources, the maximum codeword length is limited to 12 bits. This reduces the compression ratio only insignificantly [2, 3]. The key problem solved in this paper is how to reduce the maximum codeword length while constructing the Huffman tree [1]. A standard solution is to use prefix coding, as in the JPEG standard. In this paper alternative solutions are presented: modification of the histogram or modification of the Huffman tree. Modification of the histogram is based on incrementing (disrupting) the histogram values of each input symbol whose codeword length is greater than 12 bits and then constructing the Huffman tree again from the very beginning. Unfortunately, this algorithm is not deterministic, i.e. it is not known how much the histogram should be disrupted to limit the maximum codeword length to 12 bits. Therefore several iterations might be required. Another solution is to modify the Huffman tree (see Fig. 2). This algorithm is more complicated to design, but its execution time is more deterministic. Implementation results (see Tab. 1) show that modification of the Huffman tree results in a slightly better compression ratio.
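
The histogram-modification variant described above can be sketched in software. The increment of exactly 1 per iteration and the function names are assumptions made here; the abstract only says that the histogram values of symbols with over-long codewords are incremented and the tree is rebuilt.

```python
import heapq
from itertools import count

def code_lengths(hist):
    """Return {symbol: codeword length} of a Huffman tree built from hist."""
    tie = count()   # unique tie-breaker so symbol lists are never compared
    heap = [(w, next(tie), [s]) for s, w in hist.items() if w > 0]
    if len(heap) == 1:
        return {heap[0][2][0]: 1}
    lengths = {entry[2][0]: 0 for entry in heap}
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1           # merged symbols move one level deeper
        heapq.heappush(heap, (w1 + w2, next(tie), syms1 + syms2))
    return lengths

def limit_by_histogram(hist, max_len=12):
    """Disrupt the histogram until no codeword exceeds max_len bits.

    Returns (lengths, iterations); the iteration count is not known in advance.
    """
    hist = {s: w for s, w in hist.items() if w > 0}
    if len(hist) > (1 << max_len):
        raise ValueError("too many symbols for the requested codeword length")
    iterations = 0
    while True:
        lengths = code_lengths(hist)
        too_long = [s for s, n in lengths.items() if n > max_len]
        if not too_long:
            return lengths, iterations
        for s in too_long:
            hist[s] += 1              # assumed disruption step of 1
        iterations += 1
```

For example, `limit_by_histogram({"a": 1, "b": 1, "c": 2, "d": 4, "e": 8}, max_len=3)` shortens the two 4-bit codewords of the unconstrained tree to 3 bits. With large histogram values the unit step may need many iterations, which reflects the non-determinism noted in the abstract.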

8

Moduł wydajnego przetwarzania sygnałów dedykowany dla systemu wbudowanego opartego na układzie FPGA [Efficient signal processing module dedicated to an FPGA-based embedded system]

Jamro E., Wielgosz M., Cioch W., Bieniasz S.

Pomiary Automatyka Kontrola

|

2012

|

Vol. 58, No. 7

629-631

PL

This paper describes a dedicated module, named xsp_calc, that accelerates FIR (finite impulse response) filter computations. The module is compatible with the Xilinx EDK (Embedded Development Kit) environment and the PLB (Processor Local Bus). On the PLB the module acts as a master and can perform 8 MAC (multiply and accumulate) operations per clock cycle. In addition, the module can compute the maximum, minimum, mean and RMS value of a signal.

EN

This paper describes a dedicated module compatible with the PLB (Processor Local Bus) and EDK (Embedded Development Kit) provided by Xilinx. The module accelerates FIR (Finite Impulse Response) operations as well as average value and RMS (Root Mean Square) calculations. It was employed in the Programmable Unit for Diagnostics (PUD) [4, 5] and for the Procedure of Linear Decimation (PLD) [6, 7]. For PLD the decimation ratio depends on the angular speed of the rotary machinery, so the number of FIR filter taps changes from 20 to 2000. Consequently, no standard FIR filter architecture for FPGA can be employed efficiently, and the dedicated module presented in Fig. 2 was designed. The module is a master on the PLB, so it can transfer input/output data independently of the MicroBlaze processor. The processor only initializes the calculation by writing proper data to selected control registers. The module can perform up to 8 MAC (Multiply and Accumulate) operations per clock cycle, which is sufficient for the presented system and comparable to the computational power of a DSP (Digital Signal Processor). The implementation results presented in Tab. 1 show that the module requires roughly twice the resources of the MicroBlaze and can speed up FIR calculations roughly 20 times compared with the MicroBlaze.
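
The operations the module accelerates can be written down as a plain software reference. This is a behavioural sketch only, with function names chosen here for illustration; `cycles_needed` is a rough estimate that assumes all 8 MAC units are kept busy every clock cycle and ignores bus transfers.

```python
import math

def fir_filter(samples, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]   # one multiply-accumulate (MAC)
        out.append(acc)
    return out

def signal_stats(samples):
    """Minimum, maximum, mean and RMS value of a signal."""
    mean = sum(samples) / len(samples)
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return min(samples), max(samples), mean, rms

def cycles_needed(num_samples, num_taps, macs_per_cycle=8):
    """Idealized clock-cycle count with macs_per_cycle parallel MAC units."""
    return math.ceil(num_samples * num_taps / macs_per_cycle)
```

With 2000 taps, the upper end of the range quoted in the abstract, each output sample costs 250 cycles under this idealized model.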

9

System wbudowany oparty na procesorze ARM oraz układzie FPGA [Embedded system based on an ARM processor and an FPGA]

Wielgosz M., Jamro E., Cioch W., Bieniasz S.

Pomiary Automatyka Kontrola

|

2011

|

Vol. 57, No. 8

877-879

PL

This paper presents a system for the analysis and processing of vibroacoustic signals, based on a processor with an ARM core and an FPGA. One of several algorithms implemented in the system is the Linear Decimation Procedure, widely used for diagnosing rotating machinery synchronized with the working cycle. Signal pre-processing in FPGAs is much faster than on DSP processors, so the system can analyze diagnostic signals in real time.

EN

The paper presents an embedded system for monitoring and analysis of vibroacoustic signals. The system is based on an ARM processor and an FPGA, which provides both flexibility and real-time processing capabilities. The Linear Decimation Procedure was implemented as one of the vital algorithms for rotary machinery analysis, along with a set of other calculation procedures widely employed in vibroacoustics. The Exp() function was used to benchmark the DEVKIT8000 and PANDA platforms against a 3.4 GHz Core i7 desktop processor. The presented system can also work in real-time mode thanks to its high data processing rate, which results from the adopted architecture and the high-performance components used. A number of original algorithms that can be used for non-stationary signal analysis were implemented in the FPGA. Numerical procedures that do not fit into the FPGA because of their high resource usage were implemented on the ARM processor. The whole system runs under Ubuntu, which provides great flexibility in the number of software packages available as well as system stability. Some additional widely available environments (e.g. Octave) were installed on the platform to facilitate data analysis and processing. The software can be easily modified or replaced independently of the hardware, which allows fast upgrades. Other Linux or Windows distributions are also being considered for installation in the future.

10

Implementacja w układach FPGA wybranych fragmentów metody szybkiej segmentacji obrazów [FPGA implementation of selected parts of a fast image segmentation method]

Żurek D., Wielgosz M., Jamro E., Wiatr K.

Pomiary Automatyka Kontrola

|

2011

|

Vol. 57, No. 8

871-873

PL

The research presented in this paper concerns image segmentation using the Support Vector Machine (SVM) method. The method relies on a group of a dozen or so support vectors that carry the features of selected objects in the image. The presented support vector classification procedure was implemented both in software, in C++ on a general-purpose AMD AthlonII P320 Dual-Core 2.10 GHz processor, and in hardware, in VHDL. The support vector classification module was implemented in a Xilinx Spartan 6 device.

EN

The paper presents preliminary implementation results of image segmentation with the SVM (Support Vector Machine) algorithm. SVM is a dedicated mathematical formula which allows selected objects to be extracted from an input picture and assigned to an appropriate class. Consequently, black-and-white images reflecting the occurrence of the desired feature are derived from the original picture fed into the classifier. This work focuses primarily on the FPGA implementation aspects of the algorithm and on a comparison of hardware and software performance. A human skin classifier was used as an example and implemented both on an AMD AthlonII P320 Dual-Core 2.10 GHz processor and in a Xilinx Spartan 6 FPGA. The critical hardware components were designed using HDL, whereas the less demanding standard ones, such as communication interfaces, FIFOs and FSMs, were implemented in an HLL (High Level Language). This approach both shortened the design time and preserved the high performance of the hardware classification module. This work is part of the Synat project, which embraces several initiatives aimed at creating a repository of images that are to be assigned descriptive names according to their contents. Such a database of tagged images will significantly reduce search time, since only picture tags will be processed instead of images, so the process will involve simple string operations rather than image recognition. The project is a huge challenge because of the immense volume of data collected over the past years, known today as Internet resources. Therefore, the core part of the undertaking is to design and implement a classification system that is both reliable and fast. To achieve high search engine performance, the most computationally intensive operations are to be ported to hardware.

11

Efektywna komunikacja ARM-FPGA z użyciem interfejsu SPI [Efficient ARM-FPGA communication using the SPI interface]

Jamro E., Wielgosz M., Cioch W., Bieniasz S.

Pomiary Automatyka Kontrola

|

2011

|

Vol. 57, No. 8

874-876

PL

In embedded systems, using an independent ARM processor together with an FPGA gives much greater design flexibility and better performance than homogeneous systems (based on a single platform). The drawback of this approach is the need for efficient, fast communication, which here is provided by the SPI interface. To obtain higher data throughput, a dedicated hardware module handling SPI was designed inside the FPGA; it works as a slave on the SPI side and as a master on the PLB (Processor Local Bus).

EN

Implementing fast and reliable data transfer between an FPGA and a processor is a significant challenge for a designer of heterogeneous embedded systems. The presented system uses two separate Printed Circuit Boards (PCBs): an ARM-based OMAP3530 [4] and a Spartan3 FPGA [2]. SPI (Serial Peripheral Interface) [5] is used as the communication interface because of the limited choice of communication interfaces on the OMAP3530. For the FPGA module, the Xilinx Embedded Development Kit (EDK) and the MicroBlaze soft-processor are used. The EDK provides an SPI hardware module [9] compatible with the Processor Local Bus (PLB). However, this module is a slave on the PLB and therefore requires interaction with the MicroBlaze soft-processor, which limits the transfer speed. Consequently, a dedicated hardware module compatible with the PLB and EDK was designed. This module is a master on the PLB and a slave on the SPI interface, and is further denoted as xps_spi_master. As a result, the MicroBlaze is not engaged in the data transfer and the data throughput is significantly higher. The FPGA does not generate any wait states, so the SPI transfer protocol is simplified. The SPI clock speed is 24 MHz and the measured data transfer rate is roughly 2 MB/s. In summary, the designed xps_spi_master module significantly speeds up data transfer and consumes significantly fewer FPGA resources than the original EDK solution, which employs the MicroBlaze and a PLB-slave-based SPI interface.

12

FPGA implementation of exchange-correlation potential calculation for DFT

Wielgosz M., Jamro E., Russek P., Wiatr K.

Automatyka / Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie

|

2011

|

Vol. 15, No. 3

485-498

EN

Implementation results of the exchange-correlation module are presented in this paper. The authors have ported a computationally intensive part of quantum chemistry code to FPGA, which involved a substantial modification of its structure so that it matches the platform profile. Additionally, a set of the authors' customized modules for floating-point operations has been created, along with software procedures handling FPGA-GPP intercommunication. Furthermore, several tests have been conducted to determine the speed-up achieved. Some more advanced computational cases have also been investigated to examine how the module's performance increases with the number of atomic orbitals. The tests conducted for the orbital module revealed significantly higher acceleration for higher atomic shells. This work also contains implementation results of the S matrix generation module, which are promising since the presented logic allows calculations to be conducted for 16 points simultaneously.

PL

This paper presents implementation results of a module that computes the exchange-correlation potential for the DFT procedure. The authors implemented the computationally demanding parts of the DFT algorithm, which required substantial modification of the algorithm to fully exploit the capabilities of reconfigurable structures. As a result, a set of hardware floating-point modules was created, together with procedures for communication between the hardware and software parts of the accelerator. Tests on the RASC platform showed a 3x speed-up for the module computing the value of an atomic orbital at a point, while a larger speed-up was obtained for the unit computing the S matrix.

13

Zmodyfikowane mnożenie o stałej szerokości bitowej [Modified fixed-width multiplication]

Jamro E., Wielgosz M., Russek P., Wiatr K.

Pomiary Automatyka Kontrola

|

2010

|

Vol. 56, No. 10

1133-1136

PL

This paper presents a new method of truncation error compensation for fixed-width multiplication, i.e. multiplication in which the bit-width of the input operands equals that of the output. Some previous publications were based on incorrect assumptions, so this publication aims to point out those errors and to present a new architecture for which the mean error tends to zero.

EN

Multiplication is usually implemented in hardware as a full bit-width parallel multiplier, i.e. the input bit-widths add up to the output bit-width. Nevertheless, in most real-world cases, the input bit-width n is the same as the output bit-width. Therefore, to reduce the multiplier area, the n LSB columns of the multiplier are truncated during the multiplication process (see Fig. 1). This introduces a truncation error, which can be reduced by an error compensation circuit. The truncation errors presented in previous papers, e.g. [3, 6, 7], are based on a false assumption: that during truncation error calculation it is sufficient to consider only the combinations of the partial input bit products a_i b_j instead of every input bit a_i and b_j (see Fig. 2 and Tab. 1). Therefore a proper fixed-width multiplier structure should be introduced (the old one should be redesigned). This paper focuses on optimizing the mean error (ME) of the truncated multiplier. As a result, a novel Improved Variable error Compensation Truncated Multiplier (IVCTM) is proposed which, in comparison to [2], reduces the number of AND gates in the error compensation circuit by 1 (see Fig. 3). For the IVCTM, the mean error is significantly lower than for previously published counterparts. The structure of the IVCTM is simplified in comparison to the previously published truncated multiplier [2], so it occupies less silicon area.
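
The size of the uncompensated truncation error can be checked by exhaustive simulation. The sketch below is not the IVCTM: it models only plain truncation of the n least significant partial-product columns for unsigned n-bit operands, and compares the measured mean error with the closed form ((n-1)2^n + 1) / 2^(n+2) output LSBs, which is derived here under the assumption of uniformly distributed, independent input bits.

```python
from fractions import Fraction

def truncated_product(a, b, n):
    """Sum only the partial products a_i * b_j in columns i + j >= n."""
    total = 0
    for i in range(n):
        for j in range(n):
            if i + j >= n and (a >> i) & 1 and (b >> j) & 1:
                total += 1 << (i + j)
    return total

def mean_truncation_error(n):
    """Exhaustive mean of (exact - truncated), in units of the output LSB (2**n)."""
    err = sum(a * b - truncated_product(a, b, n)
              for a in range(1 << n) for b in range(1 << n))
    return Fraction(err, (1 << (2 * n)) * (1 << n))

def expected_mean_error(n):
    """Closed form: each dropped a_i * b_j equals 1 with probability 1/4."""
    return Fraction((n - 1) * (1 << n) + 1, 1 << (n + 2))
```

The mean grows roughly as (n-1)/4 output LSBs, which is why a compensation circuit is needed to bring the mean error towards zero.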

14

Sprzętowa implementacja funkcji orbitalnej na potrzeby obliczeń kwantowo-chemicznych [Hardware implementation of the orbital function for quantum-chemical calculations]

Wielgosz M., Jamro E., Russek P., Wiatr K.

Pomiary Automatyka Kontrola

|

2010

|

Vol. 56, No. 7

705-707

PL

This paper presents implementation results of a module that computes the value of an atomic orbital at a point. The module is a component of a unit generating the exchange-correlation potential used in quantum-chemical calculations. The presented unit consists of pipelined floating-point blocks. The paper also reports the computation speed-up relative to a general-purpose Itanium2 1.6 GHz processor.

EN

The paper presents FPGA acceleration and implementation results of the orbital function calculation employed in quantum chemistry. The orbital function core is composed of the authors' customized floating-point hardware modules. These modules are scalable from single to double precision and can work at frequencies ranging from 100 to 200 MHz. Besides hardware implementation, the design process also involved reformulating the algorithm to adapt it to the platform profile. The computational procedure presented in this paper is part of the algorithm for generating the exchange-correlation potential, and is recognized as one of the most computationally intensive routines. This justifies the effort devoted to developing its hardware implementation. The precision of floating-point operations becomes a primary concern when dealing with low-level quantum chemistry procedures, so the authors have taken various measures to optimize them, both in terms of resource consumption and processing speed.

15

Mnożenie o stałej szerokości bitowej z zaokrąglaniem [Fixed-width multiplication with rounding]

Jamro E., Wielgosz M., Russek P., Wiatr K.

Pomiary Automatyka Kontrola

|

2010

|

Vol. 56, No. 7

769-771

PL

This paper presents fixed-width multiplication, in which the bit-width of the operands equals that of the output. The least significant bits of the result are discarded already during multiplication, so the circuit uses fewer resources at the cost of a small computation error, which can be reduced with additional guard bits, an error compensation circuit and a rounding operation. The paper proposes a new architecture that takes these operations into account.

EN

The paper deals with fixed-width multipliers, i.e. multipliers whose input and output bit-widths are the same. To reduce the hardware requirements of such a multiplier, some of the multiplier logic is truncated during the multiplication process (see Fig. 1). This, however, introduces a calculation error, which can be reduced both by special truncation-error compensation logic (e.g. presented in Fig. 2) and by additional guard bits. As presented in Tabs. 1 and 2, for a relatively small number of guard bits g, the overall error is determined by the rounding process rather than by truncation. Nevertheless, as proved in this paper, for g>0 the error compensation logic interferes with the rounding process, e.g. it offsets the Mean Error (ME). Therefore a novel multiplier denoted as the Mean Error optimized Rounded Truncated Multiplier (MERTM) is presented. Instead of rounding, the MERTM includes additional AND gates in comparison to the VCTM [1]. As a result, for the MERTM, ME approaches zero.

16

Zastosowanie układów rekonfigurowalnych we wspomaganiu operacji sortowania danych [Use of reconfigurable devices to support data sorting]

Russek R., Jamro E., Wielgosz M., Wiatr K.

Elektronika : konstrukcje, technologie, zastosowania

|

2010

|

Vol. 51, No. 9

155-157

PL

This paper concerns hardware acceleration of sorting. In the proposed solution sorting is performed in a hybrid manner: part of the work is done by a hardware processor and part by the general-purpose CPU. To speed up the design of the dedicated processor, the high-level design language Mitrion-C (HLS) was used for its description. Although the obtained speed-up of about 0.5 does not seem very attractive, it is acceptable for high-level design given the very short time needed to design and bring up the coprocessor. The paper presents several configurations of the sorting processor. A Xilinx Virtex4 reconfigurable device was used.

EN

Data sorting is a fundamental operation implemented by the majority of data mining systems. Consequently, in solutions such as databases it is critical for overall system performance. Undoubtedly, sorting is necessary for data indexing, which is essential for efficient implementation of such basic data mining operations as data storage, data analysis or searching. This article concerns hardware acceleration of sorting. For that purpose, a dedicated coprocessor was developed to support the CPU. To speed up the design process, a High Level Synthesis (HLS) language, Mitrion-C, was used for design entry. The article presents several configurations of the sorting processor. A Xilinx Virtex4 was used as the implementation platform.

17

Utilization of FPGA Architectures for High Performance Computations

Dąbrowska-Boruch A., Jamro E., Janiszewski M., Kuna D., Machaczek K., Russek P., Wiatr K., Wielgosz M.

Computational Methods in Science and Technology

|

2010

|

Vol. spec. iss. (1)

63-69

EN

The primary intention of this paper is to present the results of several cases where FPGA technology was used for high performance calculations. We gathered applications that had been developed over the last couple of years. Over this period we observed rapid growth of interest in the use of FPGAs for HPC. Based on our experience, we give selected metrics, results and conclusions which, in our opinion, are important for anyone interested in FPGAs as an alternative for faster computations. A brief description of the characteristics of FPGAs and of FPGA usage for acceleration is also included for novices on the subject.

18

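Sprzętowa implementacja części wielomianowej funkcji orbitalnej na potrzeby obliczeń kwantowo-chemicznych [Hardware implementation of the polynomial part of the orbital function for quantum-chemical calculations]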

Wielgosz M., Jamro E., Russek P., Wiatr K.

Automatyka / Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie

|

2010

|

Vol. 14, No. 3/2

939-949

PL

The paper presents implementation results of a module that computes the polynomial part of an atomic orbital. Generating the orbital function is one of the most computationally demanding parts of the DFT procedure, which is used in quantum chemistry to model complex polyatomic molecules. Such calculations often take a long time even on high-performance computers; for more complicated systems they can take several days. An attempt was therefore made to accelerate DFT calculations with FPGAs. The acceleration obtained depends strongly on the nature of the molecule being computed. The maximum speed-up achieved was 3.5x. A larger speed-up is expected once the complete algorithm for generating the exchange-correlation matrix is implemented in the FPGA.

EN

The hardware acceleration module for generating the polynomial part of the orbital function in quantum chemistry calculation is presented. Both implementation and acceleration results are provided in the paper along with the comparison tests (against Itanium 2 processor). The implementation described can be regarded as a milestone on the way towards introducing an efficient hardware implementation of the exchange-correlation potential. The FPGA-based SGI RASC accelerator was used to offload a processor in the most exhausting computations of the SCF routine. The paper also covers issues regarding an integration of the PP (polynomial part) module with the rest of the computational system.

19

Novel Reduced-Width Multiplier Structure Dedicated for FPGAs

Jamro E., Wielgosz M., Wiatr K.

Przegląd Elektrotechniczny

|

2009

|

Vol. 85, No. 8

66-69

EN

This paper describes a novel structure of a reduced-width multiplier. The main idea is to use a special architecture to compensate for the truncation error. The architecture is dedicated to FPGAs (Field Programmable Gate Arrays) and does not require any additional FPGA resources in comparison to direct truncation.

PL

This paper presents a new structure of a reduced-width multiplier with an additional truncation error compensation circuit. Unlike previously presented truncation error compensation techniques, the presented architecture is dedicated to FPGA programmable devices and requires no additional logic resources, yet it allows a significant reduction of the error.

20

Digital signal acquisition and processing in FPGAs

Jamro E., Cioch W.

Przegląd Elektrotechniczny

|

2009

|

Vol. 85, No. 2

7-9

EN

The rapid growth of FPGA resources allows a whole system to be included in a single FPGA chip. Such a system usually includes soft-processors, limited on-chip memory, input-output interfaces, etc. The main advantage of an FPGA solution, however, is the incorporation of dedicated hardware signal processing units, such as signal filtering or FFT units, whose processing speed significantly surpasses that of DSPs or general-purpose processors. This paper presents an example of a whole-system solution incorporating an FPGA and high-speed analog-to-digital converters.

PL

The rapid growth of available FPGA resources has made it possible to integrate a whole system in a single FPGA. Such a system typically contains a soft-processor, a small memory and input-output interfaces. The main advantage of an FPGA-based system is the ability to attach dedicated data processing units, e.g. for signal filtering or FFT computation. Dedicated units perform selected computations much faster than DSP or general-purpose processors. This paper presents an example of such a system, consisting of an FPGA and fast analog-to-digital converters.