Search results - BazTech

1

Implementacja metody momentów w heterogenicznym środowisku obliczeniowym CPU/GPU

Karwowski A., Topa T., Noga A.

Przegląd Telekomunikacyjny + Wiadomości Telekomunikacyjne

|

2016

|

nr 6

503--505, CD

PL

The paper describes an implementation of the method of moments, a flagship tool for analysing electromagnetic field engineering problems (antennas, EM compatibility, microwaves), in the heterogeneous CPU/GPU computing environment of a low-budget desktop workstation. It demonstrates that the performance of the method can be improved significantly by exploiting the parallel processing capabilities of the multi-core processor and of the stream processors of the graphics card.

EN

The paper describes an implementation of the Method of Moments, a tool for the analysis of various electromagnetic engineering problems (antennas, electromagnetic compatibility, microwaves), on the heterogeneous CPU/GPU platform of a typical low-cost desktop workstation. It demonstrates that a noticeable performance improvement can be attained by using the potential of both the multi-core CPU and the graphics card for parallel processing.

2

FPAA Accelerator for Machine Vision systems

Szczęsny S., Handkiewicz A., Naumowicz M., Melosik M.

Przegląd Elektrotechniczny

|

2015

|

R. 91, nr 9

184-187

EN

This article proposes an FPAA-type programmable accelerator for image preprocessing. The structure of the accelerator is modelled on CPLD digital circuits. The innovation lies in using current mode, which makes it possible to implement the accelerator in nanometre technologies. Another original solution proposed in the work is a reconfigurable multi-output current mirror. The article describes the hardware layer and a method for programming it. An implementation of an RGB-to-YCrCb colour space converter is presented, along with physical parameters obtained in post-layout simulations. The solution can be used as a standalone programmable circuit or as an IP core for a larger analogue-digital system.

PL

The article presents a proposal for a programmable FPAA-type accelerator for image preprocessing. The structure of the accelerator is modelled on digital CPLD circuits. The innovation lies in the use of current mode, which makes it possible to implement the accelerator in nanometre technologies. Another original solution proposed in the work is a reconfigurable multi-output current mirror. The article discusses the hardware layer and the method of programming it. An implementation of an RGB-to-YCrCb colour space converter in the accelerator is shown, together with physical parameters obtained in post-layout simulations. The solution can be used as a standalone programmable circuit or as an IP core of a larger analogue-digital system.
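
The colour space conversion named in this record can be illustrated in software. This is only a minimal sketch: the abstract does not give the coefficients used in the accelerator, so the full-range ITU-R BT.601 coefficients (as used by JFIF) are assumed here, and the function name `rgb_to_ycbcr` is hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range (Y, Cb, Cr).

    Assumption: BT.601 / JFIF coefficients; the paper may use others.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b

    def clamp(value):
        # Round to the nearest integer and keep the result in the 8-bit range.
        return max(0, min(255, int(round(value))))

    return clamp(y), clamp(cb), clamp(cr)


white = rgb_to_ycbcr(255, 255, 255)
black = rgb_to_ycbcr(0, 0, 0)
```

Note that the function returns the components in (Y, Cb, Cr) order, whereas the abstract writes the name as YCrCb.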

3

Wykorzystanie akceleracji sprzętowej przy implementacji metryk podobieństwa tekstów

Iwanecki Ł., Koryciak S., Dąbrowska-Boruch A., Wiatr K.

Pomiary Automatyka Kontrola

|

2014

|

R. 60, nr 7

426--428

PL

The article describes research on text classifiers. The task was to design a hardware accelerator that would speed up the semantic classification of texts. The project was divided into two parts. The aim of the first part was to propose a hardware implementation of an algorithm realising a metric for computing document similarity. In the second part, the complete hardware accelerator system was designed. The next design stage is the integration of the metric model with the acceleration system.

EN

The aim of this project is to propose a hardware acceleration system that improves the text categorization process. Text categorization is the task of assigning electronic documents to predefined groups based on their content. This process is complex and requires a high-performance computing system and a large number of comparisons. This document suggests a method for improving text categorization using FPGA technology. The main disadvantage of common processing systems is that they are single-threaded: only one instruction can be executed per time unit. FPGA technology improves concurrency; in this case, hundreds of large numbers may be compared in one clock cycle. The project is divided into two independent parts. First, a hardware model of the required metrics is implemented. Two metrics are useful for computing the distance between two texts; both are given as equations (1) and (2). The formulas are similar and differ only in the denominator. This part results in two hardware models of the presented metrics. The main purpose of the second part of the project is to design a hardware acceleration system. The system is based on a Xilinx Zynq device and consists of a Cortex-A9 ARM processor, a DMA controller and a dedicated IP core containing the accelerator. The block diagram of the system is presented in Fig. 4. The DMA controller provides duplex transmission between the DDR3 memory and the accelerating unit, bypassing the CPU. The project is still in development. The last step is to integrate the hardware metrics model with the acceleration system.

4

Komunikacja ze sprzętowym akceleratorem haszowania n-gramów dla procesora ARM z wykorzystaniem portu ACP

Barszczowski M., Koryciak S., Dąbrowska-Boruch A., Wiatr K.

Pomiary Automatyka Kontrola

|

2014

|

R. 60, nr 7

486--488

PL

The article describes bringing up the ACP port in a Xilinx EPP device using a CDMA controller that manages transfers between the accelerator and the processor cores. The main aim of the research was to create a module that performs so-called hashing of data sets. A Zynq 7000 device, which provides programmable logic resources and two ARM A9 cores, was used for this purpose. Two concepts for implementing the accelerator were developed. The first version assumed a direct flow of data from the source to the accelerator and then to the ARM cores. The second solution uses the ACP port.

EN

This paper introduces a new approach to hardware acceleration using the ACP (Accelerator Coherency Port) in the Xilinx Zynq-7000 EPP XC7Z020. The first prototype allocated BRAM memory and transferred data through the ACP. The second used a hardware hashing module to process data outside the CPU; the module received and returned data through the ACP. The main task of the system is to replace a set of data with a shorter, constant-length representative without involving the processing unit. The main benefit of hashing lies in the constant length of the function's output, which leads to data compression. Compression is highly desirable when comparing large subsets of data, especially in data mining. Executing a hashing function demands high CPU performance because of the computational complexity of the algorithm. Two concepts were established. The first assumed transferring data directly to the hardware accelerator and then to the ARM cores. This solution is attractive because it is simple and relatively fast. Unfortunately, the data cannot be processed before hashing by the same CPU without a significant reduction in speed. The second approach used the ACP, which can transfer data very quickly between L2/L3 cache memory without flushing or invalidating the cache. The data can be processed by the software-driven CPU, sent to the accelerator and then sent back to the CPU for further processing. To accomplish this task, the Zynq 7000 EPP, which combines a dual ARM A9 core and programmable logic in one chip, was used.
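
The idea of replacing data with a shorter representative of constant length can be sketched in software. The abstract does not state which hash function the accelerator implements or whether the n-grams are built from characters or words; the sketch below assumes character n-grams and uses BLAKE2b from the Python standard library purely as a stand-in. The names `ngrams` and `hash_ngram` are hypothetical.

```python
import hashlib


def ngrams(text, n):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hash_ngram(gram, digest_bytes=4):
    """Map an n-gram of any length to a fixed-length integer.

    Assumption: BLAKE2b is only a software stand-in for the hardware hash.
    """
    digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=digest_bytes).digest()
    return int.from_bytes(digest, "big")


grams = ngrams("accelerator", 3)
hashes = [hash_ngram(g) for g in grams]
```

Every value in `hashes` fits in 32 bits regardless of the n-gram length, which is what makes later comparisons of large data sets cheaper.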

5

Sprzętowa akceleracja wybranych algorytmów kompresji obrazu nieruchomego w standardzie JPEG

Koryciak S., Dąbrowska-Boruch A., Wiatr K.

Pomiary Automatyka Kontrola

|

2012

|

R. 58, nr 7

593-595

PL

The article describes the development of an accelerator for selected still image compression algorithms. The VHDL hardware description language was used for its hardware realisation. The result of the work was a successful implementation, on a programmable device, of a decompressor for still images stored in the JPEG standard ISO/IEC 10918-1 (1993), in Baseline mode, which is the basic and mandatory mode of this standard. Particular attention was paid to the selection and implementation of the two algorithms of the standard that the author considers most important.

EN

Image compression is one of the most important topics in industry, commerce and scientific research. Image compression algorithms need to perform a large number of operations on a large amount of data. In the case of compression and decompression of still images, the time needed to process a single image is not critical. However, the assumption of this project was to build a solution that would be fully parallel, sequential and synchronous. The paper describes the development of an accelerator for selected still image compression algorithms. The hardware description language VHDL was used for its hardware implementation. The result of this work was a successful implementation, on a programmable device, of a decompressor for still images saved in the JPEG standard ISO/IEC 10918-1 (1993), Baseline mode, which is the primary, fundamental and mandatory mode of this standard. The modular system and the method of connection allow a continuous input data stream. Particular attention was paid to the selection and implementation of the two algorithms in this standard that the authors consider most important. The IDCT module uses the IDCT-SQ transform algorithm, modified by the authors of this paper. It provides full pipelining by applying the same kind of arithmetic operations between each stage. The module used to decode Huffman's code proved to be a bottleneck
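
For orientation, the inverse DCT stage of a Baseline JPEG decoder can be written as a direct evaluation of the 8-point defining sum. This is a naive reference sketch, not the modified IDCT-SQ algorithm described in the paper (whose details the abstract does not give); the function name `idct_1d` is hypothetical.

```python
import math


def idct_1d(coeffs):
    """Naive 8-point inverse DCT: x[n] = 1/2 * sum_k C(k) X[k] cos((2n+1) k pi / 16)."""
    n_points = 8
    out = []
    for n in range(n_points):
        total = 0.0
        for k in range(n_points):
            # C(0) = 1/sqrt(2), C(k) = 1 otherwise.
            scale = 1 / math.sqrt(2) if k == 0 else 1.0
            total += scale * coeffs[k] * math.cos((2 * n + 1) * k * math.pi / 16)
        out.append(total / 2)
    return out


# A block containing only a DC coefficient decodes to a constant signal.
samples = idct_1d([8, 0, 0, 0, 0, 0, 0, 0])
```

The 2-D transform used on 8x8 blocks applies this 1-D transform to every row and then to every column.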

6

Real-time implementation of moving object detection in video surveillance systems using FPGA

Kryjak T., Gorgoń M.

Computer Science

|

2011

|

Vol. 12

149-162

EN

The article presents the concept of real-time implementation of computing tasks in video surveillance systems. It describes a pipelined implementation, in the reconfigurable resources of an FPGA device, of a multimodal background generation algorithm for a colour video stream and of moving object segmentation based on brightness, colour and texture information. The system architecture, resource usage and segmentation results are presented.

PL

The article presents a concept for the real-time implementation of computing tasks used in video surveillance systems. It describes the implementation, in the reconfigurable resources of FPGA devices, of a multimodal background generation method for video sequences recorded in colour and of moving object segmentation using brightness, colour and texture information. The system architecture, resource usage and sample segmentation results are presented.
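
The general idea of background generation followed by moving object segmentation can be illustrated with a much simpler technique than the one in the paper: a single-mode running-average background model with a brightness threshold. The multimodal model and the colour and texture cues of the paper are not reproduced here; the function names, `alpha` and `threshold` are all assumptions chosen for illustration.

```python
def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the background estimate (running average)."""
    return [[(1 - alpha) * b + alpha * f for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]


def foreground_mask(background, frame, threshold=30):
    """Mark pixels whose brightness differs from the background by more than threshold."""
    return [[abs(f - b) > threshold for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]


# 2x2 grey-level example: one pixel changes sharply and is flagged as moving.
background = [[10.0, 10.0], [10.0, 10.0]]
frame = [[10, 200], [12, 10]]
mask = foreground_mask(background, frame)
background = update_background(background, frame)
```

Both functions operate on each pixel independently, which is the property that makes this kind of processing a natural fit for a pipelined hardware implementation.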

7

FPGA implementation of exchange-correlation potential calculation for DFT

Wielgosz M., Jamro E., Russek P., Wiatr K.

Automatyka / Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie

|

2011

|

T. 15, z. 3

485-498

EN

Implementation results of the exchange-correlation module are presented in this paper. The authors have ported a computationally intensive part of a quantum chemistry code to an FPGA, which involved substantial modification of its structure so that it matches the platform profile. Additionally, a set of the authors' customized modules for floating-point operations has been created, along with software procedures handling FPGA-GPP intercommunication. Furthermore, several tests have been conducted to determine the speed-up achieved. Some more advanced computational cases have also been investigated to examine how the module's performance increases with the number of atomic orbitals. The tests conducted for the orbital module revealed significantly higher acceleration for higher atomic shells. This work also contains implementation results of the S matrix generation module, which are promising since the presented logic allows calculations to be conducted for 16 points simultaneously.

PL

This article presents the implementation results of a module that computes the exchange-correlation potential for the DFT procedure. The authors implemented the computationally demanding parts of the DFT algorithm, which required substantial modification of the algorithm so as to exploit fully the capabilities of reconfigurable structures. The result is a set of hardware floating-point modules and procedures that provide communication between the hardware part and the software-programmed part of the accelerator. Tests carried out on the RASC platform showed a 3x speed-up for the module that computes the value of an atomic orbital at a point, while a greater speed-up was obtained for the unit that computes the S matrix.

8

Sprzętowa implementacja funkcji orbitalnej na potrzeby obliczeń kwantowo-chemicznych

Wielgosz M., Jamro E., Russek P., Wiatr K.

Pomiary Automatyka Kontrola

|

2010

|

R. 56, nr 7

705-707

PL

This article presents the implementation results of a module that computes the value of an atomic orbital at a point. The module is a component of the unit that generates the value of the exchange-correlation potential used in quantum chemistry calculations. The presented unit consists of pipelined floating-point blocks. The paper also presents the acceleration obtained relative to a general-purpose Itanium2 1.6 GHz processor.

EN

The paper presents FPGA acceleration and implementation results of the orbital function calculation employed in quantum chemistry. The orbital function core is composed of the authors' customized floating-point hardware modules. These modules are scalable from single to double precision and can work at frequencies ranging from 100 to 200 MHz. Besides hardware implementation, the design process also involved reformulating the algorithm in order to adapt it to the platform profile. The computational procedure presented in this paper is part of the algorithm for generating the exchange-correlation potential and is recognized as one of the most computationally intensive routines. This justifies the effort devoted to developing its hardware implementation. The precision of floating-point operations becomes a primary concern when dealing with low-level quantum chemistry procedures, so the authors have taken various measures to optimize them in terms of both resource consumption and processing speed.

9

Zastosowanie układów rekonfigurowalnych we wspomaganiu operacji sortowania danych

Russek R., Jamro E., Wielgosz M., Wiatr K.

Elektronika : konstrukcje, technologie, zastosowania

|

2010

|

Vol. 51, nr 9

155-157

PL

This article concerns hardware acceleration of the sorting operation. In the proposed solution, sorting is performed in a hybrid manner: part of the operation is carried out by a hardware processor and part by the general-purpose CPU. To speed up the design of the dedicated processor, the high-level design (HLS) language Mitrion-C was used as the description language. Although the obtained speed-up, of the order of 0.5, does not seem very attractive, it is acceptable for high-level design given the very short time needed to design and bring up the coprocessor. The article presents several configurations of the sorting processor. A Xilinx Virtex4 reconfigurable device was used.

EN

Data sorting is a fundamental operation implemented by the majority of data mining systems. Consequently, in solutions such as databases it is critical for overall system performance. Sorting is necessary for data indexing, which is essential for the efficient implementation of basic data mining operations such as data storage, data analysis and searching. This article concerns hardware acceleration of sorting. For that purpose, a dedicated coprocessor was developed to support the CPU. To speed up the design process, a High Level Synthesis (HLS) language, Mitrion-C, was used for design entry. The article presents several configurations of the sorting processor. A Xilinx Virtex4 was used as the implementation platform.
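
One common way to split sorting between a coprocessor and a CPU is to sort fixed-size blocks in hardware and merge the sorted blocks in software. The abstracts do not say how the work is actually divided in this paper, so the split below is an assumption; `sort_block` is a hypothetical software stand-in for the coprocessor, and `block_size` is an arbitrary illustrative value.

```python
import heapq


def sort_block(block):
    """Stand-in for the hardware sorter: sorts one fixed-size block."""
    return sorted(block)


def hybrid_sort(data, block_size=4):
    """Sort blocks 'in hardware', then merge the sorted blocks on the CPU."""
    blocks = [sort_block(data[i:i + block_size])
              for i in range(0, len(data), block_size)]
    # heapq.merge performs a k-way merge of already sorted inputs.
    return list(heapq.merge(*blocks))


result = hybrid_sort([9, 3, 7, 1, 8, 2, 6, 4, 5, 0])
```

With this division the hardware only ever handles inputs of a bounded size, while the CPU performs the data-dependent merge.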

10

Sprzętowa implementacja części wielomianowej funkcji orbitalnej na potrzeby obliczeń kwantowo-chemicznych

Wielgosz M., Jamro E., Russek P., Wiatr K.

Automatyka / Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie

|

2010

|

T. 14, z. 3/2

939-949

PL

The article presents the implementation results of a module that computes the polynomial part of an atomic orbital. Generating the orbital function is one of the most computationally demanding parts of the DFT procedure. The procedure is used in quantum chemistry to model complex polyatomic molecules. Carrying out the calculations on high-performance computers often takes a long time, which for more complicated systems can reach several days. An attempt was therefore made to speed up DFT calculations using FPGA devices. The acceleration obtained depends strongly on the character of the molecule for which the calculations are performed. The maximum speed-up obtained was 3.5x. A greater speed-up should be expected once the complete algorithm for generating the exchange-correlation matrix is implemented in the FPGA.

EN

The paper presents a hardware acceleration module for generating the polynomial part of the orbital function in quantum chemistry calculations. Both implementation and acceleration results are provided, along with comparison tests against an Itanium 2 processor. The implementation described can be regarded as a milestone towards an efficient hardware implementation of the exchange-correlation potential. The FPGA-based SGI RASC accelerator was used to offload the processor in the most demanding computations of the SCF routine. The paper also covers issues regarding the integration of the PP (polynomial part) module with the rest of the computational system.

11

Akceleracja obliczeń zmiennoprzecinkowych na platformie RASC

Wielgosz M., Jamro E., Wiatr K.

Pomiary Automatyka Kontrola

|

2009

|

R. 55, nr 7

485-487

PL

The article presents the results of tests carried out to determine the maximum speed of floating-point operations on the RASC reconfigurable platform. Various available configuration modes of the Host and RASC units were implemented in order to identify the most efficient operating mode of the computing unit. The measurements showed that the combination of Direct I/O and DMA provides the highest throughput between the Host and RASC nodes. Nevertheless, for some applications the multi-buffering mode may prove more suitable, because it allows data to be transferred while operations are being executed. The exp() function in double-precision floating-point format was used as an example application to estimate the acceleration achievable on the RASC platform.

EN

This paper presents the results of tests performed to determine the high-speed calculation capabilities of the SGI RASC platform. Different data transfer modes and memory management approaches were examined to choose the most effective combination of Host and RASC memory settings. The work may be regarded as a case study of a contemporary FPGA-based accelerator which can nevertheless characterize the whole class of such devices. The paper focuses strongly on the floating-point calculation potential of the FPGA accelerator. From the processor's perspective, the RASC algorithm execution procedure is composed of several functions which reserve resources, queue commands and perform other preparation steps. It is noteworthy (Fig. 3) that the time consumed by these functions remains roughly the same regardless of the algorithm being executed. The resource reservation procedure, once conducted, allows many executions of the algorithm. This amounts to huge time savings, since the procedure takes approximately 7.5 ms, which is roughly 99 % of the overall execution time of the algorithm. The rasclib algorithm commit and rasclib algorithm wait calls are considered the key part (Fig. 3) of the RASC software execution routine. The first one activates the FPGA; the time between these two commands is the transfer and algorithm execution time. All curves (Fig. 4) reflect the overall processing time of the same amount of data but differ in the size of a single data chunk, which varies from 1024 x 64 bit = 8 kB to 1048576 x 64 bit = 8 MB. It has been observed that bigger chunks give much better results in terms of effective execution time. Above 1 MB, however, the improvement in effective execution time appears to saturate, so sending data in bigger portions may not improve system performance much further. The best effective execution time of a single exp() function in SRAM buffering mode is 12 ns, of which 9.5 ns is transport overhead due to bus delays. The theoretical calculation time of a single exp() function (data transfer not taken into account) is 2.5 ns, because two exp() units are implemented on the RASC and clocked at 200 MHz. The measurement results show that Direct I/O mode together with DMA transfer provides the highest data throughput between the Host and the RASC slice. Nevertheless, for some applications multi-buffering may be more suitable because it allows data transfer to run concurrently with FPGA algorithm execution. An exponential function is considered as a hardware acceleration example, which allows the maximum achievable data processing speed to be estimated.
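
The timing figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch assumes that each pipelined exp() unit delivers one result per clock cycle, which is not stated explicitly but is consistent with the quoted 2.5 ns; the variable names are illustrative only.

```python
CLOCK_HZ = 200e6      # clock frequency quoted in the abstract
EXP_UNITS = 2         # number of exp() units on the RASC
MEASURED_NS = 12.0    # best effective time per exp() in SRAM buffering mode

# One clock cycle at 200 MHz lasts 5 ns.
cycle_ns = 1e9 / CLOCK_HZ
# Assumption: each unit produces one result per cycle, so two units halve the time.
theoretical_ns = cycle_ns / EXP_UNITS
# Whatever remains of the measured time is transport overhead.
overhead_ns = MEASURED_NS - theoretical_ns

# Chunk sizes: a number of 64-bit words converted to bytes.
small_chunk_bytes = 1024 * 64 // 8
large_chunk_bytes = 1048576 * 64 // 8
```

Under that assumption the computation itself accounts for about a fifth of the measured 12 ns, and the rest is spent moving data.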

12

Implementacja w układach FPGA modułu obliczającego funkcję jednoelektronową

Wielgosz M., Jamro E., Wiatr K.

Automatyka / Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie

|

2009

|

T. 13, z. 3/1

1043-1050

PL

The article presents the implementation results of a module that computes the exponential part of an atomic orbital (the one-electron function). Generating one-electron functions is one of the most computationally demanding parts of the DFT procedure. The authors therefore decided to use FPGA devices to accelerate this algorithm. The hardware module was implemented on the SGI RASC platform in a Virtex-4 LX200 series FPGA. It consists of a number of floating-point units designed to work in a pipelined manner at frequencies of up to 200 MHz. Preliminary tests showed a speed-up of the order of 5x relative to the same calculations run on an Intel Itanium 2 1.6 GHz processor. It should be noted that the achievable speed-up is limited by platform constraints (the width of the communication interface).

EN

This paper presents an FPGA implementation of a module that calculates a finite sum of exponential products (the orbital function). The module is composed of several units, all of them specially designed, fully pipelined floating-point modules optimized for high-speed operation at up to 200 MHz. Execution results revealed a 5x speed-up for the finite sum of exponential products compared to an Intel Itanium 2 1.6 GHz processor. The orbital function is a computationally critical part of the Hartree-Fock algorithm. The approach presented here therefore aims to increase the performance of the whole quantum chemistry computational system by extending it with an FPGA-based accelerator composed of two Xilinx Virtex-4 LX200 chips. It is worth underlining that the achieved speed-up is limited by an external memory width constraint. It can thus be expected that the next generation of FPGA-based accelerators will allow the speed-up to be increased simply by porting the project to them, without any changes to the module's architecture.
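
A "finite sum of exponential products" can be sketched as follows. The abstract does not give the exact formula, so a Gaussian-type form, a sum of terms c_i * exp(-a_i * r^2), is assumed here; the function name, the coefficients and the exponents are hypothetical.

```python
import math


def exponential_sum(coeffs, exponents, r_squared):
    """Evaluate sum_i c_i * exp(-a_i * r^2) for one squared distance.

    Assumption: Gaussian-type terms; the paper's exact form is not given.
    """
    return sum(c * math.exp(-a * r_squared) for c, a in zip(coeffs, exponents))


# At r = 0 every exponential equals 1, so the sum reduces to the sum of coefficients.
value_at_origin = exponential_sum([1.0, 2.0], [0.5, 1.5], 0.0)
value_at_unit = exponential_sum([1.0], [1.0], 1.0)
```

Each term needs one multiplication, one exponential and one accumulation, a fixed sequence of operations that suits the fully pipelined floating-point units described above.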