Results found: 110

Search results
A simple single-chip, FPGA-based oscilloscope was designed both to provide users with a low-budget instrument and to teach how such devices operate. The device implements the basic functions of a real oscilloscope and gives clear insight into the processes of signal acquisition (using the FPGA's built-in analog-to-digital converter with an aggregate sampling rate of 1 MS/s), processing, and display of the acquired signals. Effort was also made to fit the design into the limited resources of the selected FPGA device. The project is open source [1].
FPGAs offer substantial computing capabilities and are therefore very popular as dedicated hardware accelerators. A few years ago, FPGAs were expensive, and the cheapest ones had very limited capabilities because of their small number of logic elements and slow internal clocks. Nowadays, cheap development boards are available for less than 50€, some of which can even transmit HDMI signals. This paper covers the implementation of a soft processor with a 3D graphics coprocessor on the cheapest available FPGA board with an HDMI connector, containing only 8k logic elements.
The video resolutions used in a variety of media are constantly rising. While manufacturers struggle to perfect their screens, it is also important to ensure the high quality of the displayed image. Overall quality can be measured using a Mean Opinion Score (MOS). Video quality can be affected by miscellaneous artifacts appearing at every stage of video creation and transmission. In this paper, we present a solution for calculating four distinct video quality metrics that can be applied in a real-time video quality assessment system. Our assessment module can process 8K-resolution video in real time, at 30 frames per second. The throughput of 2.19 GB/s surpasses the performance of pure software solutions. The module was created using a high-level language in order to concentrate on architectural optimization.
A floating point accumulator cannot be obtained straightforwardly due to its pipeline architecture and feedback loop. Therefore, an essential part of the proposed floating point accumulator is a critical accumulation loop which is limited to an integer adder and 16-bit shifter only. The proposed accumulator detects a catastrophic cancellation which occurs e.g. when two similar numbers are subtracted. Additionally, modules with reduced hardware resources for rough error evaluation are proposed. The proposed architecture does not comply with the IEEE-754 floating point standard but it guarantees that a correct result, with an arbitrarily defined number of significant bits, is obtained. The proposed calculation philosophy focuses on the desired result error rather than on calculation precision as such.
Automatic text categorization presents many difficulties. Modern algorithms are getting better at extracting meaningful information from human language. However, they often significantly increase the complexity of computations. This increased demand for computational capability can be met by using hardware accelerators such as general-purpose graphics cards. In this paper we present a full processing flow for a document categorization system. Accelerating the Gram-Schmidt process signatures calculation gives up to a 12-fold decrease in the computing time of system components.
Real time 8K video quality assessment using FPGA
This paper presents a hardware architecture of a video quality assessment module. Two different metrics were implemented on an FPGA using Impulse C, a modern high-level language for digital system design. The FPGA resource consumption of the presented module is low, which enables module-level parallelization. Tests conducted for four modules working concurrently show that a throughput of 1.96 GB/s can be achieved. The module is capable of processing an 8K video stream in real time, i.e. at 30 frames/second. This performance was achieved through a series of architectural optimizations introduced to the module, such as reduced data precision and reuse of various module components.
We present a custom processor designed to accelerate algorithms for finding Low Autocorrelation Binary Sequences (LABS). Finding LABS is computationally very demanding, but no custom computing solutions have been reported in the literature so far. A computational kernel that allows an effective single-purpose processor to be built was identified, and an appropriate architecture was proposed. Selected elements of the architecture were coded in a High-Level Synthesis (HLS) language to speed up the design process. Afterwards, the processor was verified and tested in a Xilinx Virtex7 FPGA. At the beginning of the paper, we briefly present the LABS search problem and its importance. Later, we present the algorithm, the structure of its custom processor, and implementation results in terms of processor performance, size, and power.
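The quantity that a LABS search minimizes can be illustrated with a short Python sketch. It uses the standard textbook definitions (aperiodic autocorrelation, sequence energy, merit factor); it is not the computational kernel of the processor described in the abstract, and the function names are hypothetical.

```python
def autocorrelation(seq, k):
    """Aperiodic autocorrelation C_k of a +1/-1 sequence at shift k."""
    return sum(seq[i] * seq[i + k] for i in range(len(seq) - k))


def energy(seq):
    """Sequence energy E = sum of C_k^2 over all non-zero shifts k."""
    return sum(autocorrelation(seq, k) ** 2 for k in range(1, len(seq)))


def merit_factor(seq):
    """Merit factor F = N^2 / (2 E); a LABS search looks for high F."""
    n = len(seq)
    return n * n / (2 * energy(seq))


# The length-13 Barker sequence, a classic low-autocorrelation example.
BARKER_13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]

if __name__ == "__main__":
    print(energy(BARKER_13), merit_factor(BARKER_13))
```

An exhaustive search evaluates this energy for a number of sequences that grows exponentially with the sequence length, which is why the inner autocorrelation loop is a natural target for custom hardware.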
In this paper, we show how FPGAs can perform various computationally complex calculations using deep pipelining and parallelism. We propose an architecture that consists of many small stream processing blocks. The designed framework maintains proper data movement and synchronization. The architecture can be easily adapted to FPGA devices of various sizes and costs, from small SoC devices to high-end PCIe accelerator cards. It is capable of performing a selected operation on sparse data loaded as a stream of vectors. As an example application, we have implemented the cosine similarity measure for text similarity calculations using the TF-IDF weighting scheme. The example application calculates the similarity of texts from a set of input documents to documents from a large database, in order to find the most similar documents. The proposed design can decrease the service time of search queries in computer centers while reducing power consumption.
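A minimal software sketch of the example application (cosine similarity over sparse TF-IDF vectors) is given below. It assumes one common TF-IDF variant (relative term frequency times the natural log of inverse document frequency) and non-empty documents; the abstract does not say which variant the hardware uses, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn a list of token lists into sparse {term: weight} dictionaries.

    Assumption: tf = count / document length, idf = ln(N / document frequency).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector only
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    docs = [["fpga", "stream"], ["fpga", "stream"], ["gpu", "sort"]]
    v = tfidf_vectors(docs)
    print(cosine(v[0], v[1]), cosine(v[0], v[2]))
```

Only the terms shared by both vectors contribute to the dot product, which is what makes the sparse, streamed representation attractive for a pipelined implementation.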
One of the most challenging issues with many-core and multi-core architectures is how to exploit their potential computing power in legacy systems without deep knowledge of their architecture. Analysis of the static and dynamic data dependences of a program run can help to identify independent paths that could be computed by individual parallel threads. Statistics on data reuse and data size are also crucial when adapting an application to a GPU many-core hardware architecture because of its specific memory hierarchies. The proposed profiling system performs static data analysis and computes dynamic dependencies for Java programs, and it recommends the parts of the source code with the highest potential for parallelization on a GPU. Such an analysis can also provide a starting point for automatic parallelization.
The presented algorithms employ the Vector Space Model (VSM) and its enhancements, such as TFIDF (Term Frequency Inverse Document Frequency) with Singular Value Decomposition (SVD). TFIDF was applied to emphasize the important features of documents, and SVD was used to reduce the analysis space. A series of experiments was then conducted. They revealed important properties of the algorithms and their accuracy. The accuracy of the algorithms was estimated in terms of their ability to match the human classification of the subject. For unsupervised algorithms, entropy was used as a quality evaluation measure. The combination of VSM, TFIDF, and SVD turned out to be the best-performing unsupervised algorithm, with an entropy of 0.16.
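One common way to use entropy as a quality measure for unsupervised grouping is sketched below. The abstract does not give its exact definition or normalization, so this size-weighted, base-2 cluster entropy is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter


def clustering_entropy(clusters, labels):
    """Size-weighted average entropy of human labels inside each cluster.

    clusters[i] is the cluster assigned to document i by the algorithm,
    labels[i] is the human-assigned class. 0.0 means every cluster is pure.
    """
    members = {}
    for cluster, label in zip(clusters, labels):
        members.setdefault(cluster, []).append(label)

    n = len(labels)
    total = 0.0
    for group in members.values():
        size = len(group)
        h = 0.0
        for count in Counter(group).values():
            p = count / size
            h -= p * math.log2(p)
        total += (size / n) * h
    return total


if __name__ == "__main__":
    print(clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"]))
    print(clustering_entropy([0, 0, 1, 1], ["a", "b", "a", "b"]))
```

Under this definition, lower values mean that the automatic grouping agrees more closely with the human classification.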
The article describes research on text classifiers. The task was to design a hardware accelerator that would speed up the semantic classification of texts. The project was divided into two parts. The aim of the first part was to propose a hardware implementation of an algorithm realizing a metric for computing document similarity. In the second part, the complete hardware accelerator system was designed. The next design stage is the integration of the metric model with the acceleration system.
The aim of this project is to propose a hardware acceleration system that improves the text categorization process. Text categorization is the task of assigning electronic documents to predefined groups based on their content. The process is complex and requires a high-performance computing system and a large number of comparisons. This document suggests a method for improving text categorization using FPGA technology. The main disadvantage of common processing systems is that they are single-threaded: only one instruction can be executed per time unit. FPGA technology improves concurrency; in this case, hundreds of large numbers can be compared in one clock cycle. The project is divided into two independent parts. First, a hardware model of the required metrics is implemented. Two metrics are useful for computing the distance between two texts; they are shown as equations (1) and (2). The formulas are similar, and the only difference is the denominator. This part results in two hardware models of the presented metrics. The main purpose of the second part of the project is to design a hardware acceleration system. The system is based on a Xilinx Zynq device and consists of a Cortex-A9 ARM processor, a DMA controller, and a dedicated IP core containing the accelerator. The block diagram of the system is presented in Fig. 4. The DMA controller provides duplex transmission between the DDR3 memory and the accelerating unit, bypassing the CPU. The project is still in development. The last step is to integrate the hardware metrics model with the acceleration system.
This work analyzes the feasibility of using the winnowing algorithm for stream processing of textual information. Particular emphasis was placed on fingerprint generation, where the fingerprint is a reduced representation of a text message. The authors conducted a series of experiments to determine the efficiency of the algorithm and the achievable computation speedup, using a node with Intel Xeon E5645 2.40 GHz processors and an Nvidia Tesla M2090 GPU card.
Several models are available for information retrieval and text analysis, but two are considered dominant: the Boolean model and the vector space model (VSM). A model maps the existing words or text into a new representation space. This paper presents winnowing, a Boolean n-gram-based algorithm for fast text search and document comparison, with the main focus on its implementation and performance analysis. The algorithm is used to generate fingerprints (i.e. sets of hashes) of the analyzed documents. A dedicated test framework, which utilizes the PAN test corpus and programming environment, was designed and implemented to evaluate the algorithm. Several tests were conducted to determine the comparison quality for obfuscated and non-obfuscated text for the winnowing algorithm with different window and n-gram sizes. The tests revealed interesting properties of the algorithms with respect to document comparison and defined the limits of their applicability. Owing to their simplicity, n-gram-based algorithms are well suited for hardware implementation. Thus, the authors implemented the computationally demanding part of fingerprint generation on both CPU and GPU. Performance measurements for the Intel Xeon E5645 (2.40 GHz) and Nvidia Tesla M2090 implementations of the n-gram-based algorithm show an approximately 14x computational speedup.
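Fingerprint generation with winnowing can be sketched in a few lines of Python. The sketch follows the commonly described form of the algorithm (hash every n-gram, then keep the rightmost minimum hash of each sliding window); the CRC32 hash, the byte-level n-grams, and the default sizes are illustrative assumptions, not the settings used in the paper.

```python
import zlib


def ngram_hashes(text, n):
    """Hash every byte-level n-gram of the text (CRC32 is an arbitrary choice)."""
    data = text.encode("utf-8")
    return [zlib.crc32(data[i:i + n]) for i in range(len(data) - n + 1)]


def winnow(text, n=5, w=4):
    """Return the fingerprint as a list of (hash, position) pairs.

    In every window of w consecutive n-gram hashes the minimum is selected;
    ties are broken in favour of the rightmost occurrence, and a position is
    recorded only once even if several windows select it.
    """
    hashes = ngram_hashes(text, n)
    fingerprint = []
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        pos = start + (w - 1 - window[::-1].index(smallest))
        if pos != last_pos:
            fingerprint.append((smallest, pos))
            last_pos = pos
    return fingerprint


if __name__ == "__main__":
    a = winnow("an FPGA based hardware accelerator design for text search")
    b = winnow("we describe a hardware accelerator design on a GPU card")
    print(len(a), len(b), len({h for h, _ in a} & {h for h, _ in b}))
```

Two documents that share a substring of at least `w + n - 1` bytes are guaranteed to share at least one fingerprint hash, which is the property document comparison relies on. Texts shorter than that yield an empty fingerprint in this sketch.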
The research presented in this paper concerns lossless data compression based on the Huffman method and compliant with the deflate standard used in .zip / .gz files. An optimization of the Huffman coder is proposed that divides the data into blocks, each of which uses a different codebook. Introducing an additional block generally improves the compression ratio, at the cost of the overhead of transmitting an additional codebook. The article therefore proposes a new algorithm for splitting the data into blocks.
According to the deflate [2] standard (used e.g. in .zip / .gz files), an input file can be divided into blocks that are compressed with different Huffman [1] codewords. Usually, the smaller the block size, the better the compression ratio. Nevertheless, each block incurs additional header (codeword) overhead. Consequently, introducing a new block is a compromise between the pure data compression ratio and the header size. This paper introduces a novel algorithm for block Huffman compression, which compares sub-block data statistics (histograms) based on the current sub-block entropy E(x) (1) and the entropy-based estimated average word bit length Emod(x) obtained when the codewords of the previous sub-block are used (2). When Emod(x) - E(x) > T (where T is a threshold), a new block is inserted. Otherwise, the current sub-block is merged into the previous block. The typical header size is 50 B; the theoretical threshold T for different sub-block sizes S therefore follows from (3) and is given in Tab. 2. Nevertheless, the results presented in Tab. 1 indicate that the optimal T should be slightly different: smaller for small sub-block sizes S and larger for big S. The deflate standard was selected for its optimal ratio of compression size to compression speed [3], and it was selected for hardware implementation in FPGA [4, 5, 6, 7].
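The split decision Emod(x) - E(x) > T can be sketched in Python under several assumptions that the abstract does not confirm: E(x) is taken as the Shannon entropy of the current sub-block, Emod(x) as the cross-entropy of the current sub-block against the previous sub-block's histogram, a small pseudo-count keeps unseen symbols at a finite cost, and the theoretical threshold is the 50 B header spread over the S symbols of a sub-block. Equations (1) to (3) are not reproduced in the abstract, so all function names and formulas here are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(block):
    """Assumed E(x): Shannon entropy of the sub-block, in bits per symbol."""
    n = len(block)
    return -sum(c / n * math.log2(c / n) for c in Counter(block).values())


def cross_bits(block, prev_block, alphabet=256):
    """Assumed Emod(x): average bits per symbol when the current sub-block is
    coded with probabilities estimated from the previous sub-block.

    Assumption: every symbol gets a pseudo-count of 1/alphabet so that symbols
    unseen in the previous sub-block still have a finite cost.
    """
    n = len(block)
    prev = Counter(prev_block)
    total = len(prev_block) + 1.0  # pseudo-counts add up to 1
    return -sum(
        c / n * math.log2((prev[s] + 1.0 / alphabet) / total)
        for s, c in Counter(block).items()
    )


def theoretical_threshold(sub_block_size, header_bytes=50):
    """Assumed form of T: header cost in bits spread over the sub-block."""
    return 8 * header_bytes / sub_block_size


def starts_new_block(block, prev_block, threshold):
    """True when the statistics changed enough to pay for a new header."""
    return cross_bits(block, prev_block) - entropy_bits(block) > threshold


if __name__ == "__main__":
    t = theoretical_threshold(4096)
    same = b"ab" * 2048
    other = b"xy" * 2048
    print(starts_new_block(same, same, t), starts_new_block(other, same, t))
```

With this reading, a sub-block whose histogram matches the previous one is merged, while a sub-block with clearly different statistics opens a new block.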
The article describes bringing up the ACP port in the Xilinx EPP device using a CDMA that manages the transmission between the accelerator and the processor cores. The main goal of the research was to create a module that performs so-called hashing of data sets. A Zynq 7000 device, which contains programmable logic resources and two ARM A9 cores, was used for this purpose. Two concepts for implementing the accelerator were developed. The first version assumed a direct data flow from the source to the accelerator and then to the ARM cores. The second solution uses the ACP port.
This paper introduces a new approach to hardware acceleration using the ACP (Accelerator Coherency Port) in the Xilinx Zynq-7000 EPP XC7Z020. The first prototype allocated BRAM memory and transferred data through the ACP. The second one used a hardware hashing module to process data outside the CPU. The module received and returned data through the ACP port. The main task of the system is to replace a set of data with a shorter, constant-length representative without involving the processing unit. The main benefit of hashing data lies in the constant length of the function output, which leads to data compression. Compression is highly desirable when comparing large subsets of data, especially in data mining. Executing a hashing function requires high CPU performance because of the computational complexity of the algorithm. Two concepts were established. The first one assumed transferring data directly to the hardware accelerator and later to the ARM cores. This solution is attractive because it is simple and relatively fast. Unfortunately, the data cannot be processed before hashing with the same CPU without a significant speed reduction. The second approach used the ACP port, which can transfer data very fast between L2/L3 cache memory without flushing or invalidating the cache. The data can be processed by the software-driven CPU, sent to the accelerator, and then sent back to the CPU for further processing. To accomplish the established task, the Zynq 7000 EPP, with a dual ARM A9 core and programmable logic in one chip, was used.
The article presents the implementation of text algorithms on selected parallel processing platforms. The availability of multicore processors and general-purpose graphics cards makes research on parallel implementations of algorithms, aimed at accelerating them, increasingly important. Text algorithms are an extremely important and often indispensable element of advanced text analysis algorithms, and they are also components of the pattern-matching functions of many programming languages. The paper analyzes the most popular text algorithms with regard to their parallelization for implementation on a multicore processor and a general-purpose graphics card. The analyzed algorithms are Boyer-Moore, the naive algorithm, and Knuth-Morris-Pratt. The efficiency of their implementations on these hardware platforms is then compared.
This paper presents the implementation of text algorithms on a multicore CPU and a GPGPU. Text algorithms are very commonly used in text analysis, and they are part of the functions used for text pattern recognition. The library functions for text searching implemented in many languages very often use the most popular text algorithms. The paper describes the analysis of these algorithms for parallel implementation on multicore processors and general-purpose graphics cards. The research presented in this paper shows that text algorithms can be partially parallelized. Acceleration can be achieved by appropriately dividing the input text between parallel threads (data parallelism). The comparative studies were performed for the following algorithms: Boyer-Moore (Horspool), naive, and Knuth-Morris-Pratt. The presented results show the efficiency of these algorithms for different types and sizes of patterns. The GPU implementation was made in the CUDA framework. The OpenMP library was used for the multicore version.
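The data-parallel division of the input text can be sketched as follows. This Python sketch only illustrates the partitioning: each worker searches its own chunk with a Horspool search, and chunks overlap by `len(pattern) - 1` characters so that matches crossing a boundary are not lost. The paper used OpenMP and CUDA; the thread pool here is a stand-in (CPython threads do not give a real speedup for this code), and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def horspool_find_all(text, pattern, base=0):
    """Return the start offsets (plus base) of all matches, Horspool variant."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Bad-character table: distance from the last occurrence to the pattern end.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(base + pos)
        pos += shift.get(text[pos + m - 1], m)
    return hits


def parallel_find_all(text, pattern, workers=4):
    """Split the text into overlapping chunks and search them concurrently."""
    m = len(pattern)
    if m == 0:
        return []
    chunk = max(m, -(-len(text) // workers))  # ceiling division
    tasks = []
    for start in range(0, len(text), chunk):
        # Extend each chunk by m - 1 characters to catch boundary matches.
        end = min(len(text), start + chunk + m - 1)
        tasks.append((text[start:end], pattern, start))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda task: horspool_find_all(*task), tasks)
        return [hit for part in parts for hit in part]


if __name__ == "__main__":
    print(parallel_find_all("abracadabra" * 3, "abra", workers=3))
```

Because a chunk extends only `m - 1` characters past its nominal end, every match is reported by exactly one worker and the merged result needs no deduplication.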
Medical Visualizer 3D: Hardware Controller for DMD Module
This paper describes an implementation of the module that controls a micro-mirror array for later use in projection. Existing technologies allow medical images in the Digital Imaging and Communications in Medicine format to be projected only as flat 2D images. The 3D Visualizer will make it possible to display medical images in three dimensions using its own projection surface. The matrix-controlling device was largely developed on the basis of reverse engineering studies carried out on a functional system based on a driver from Texas Instruments. The driver is built on an FPGA with an implemented MicroBlaze soft processor from Xilinx.
This article describes a new hardware architecture for dictionary compression, e.g. LZ77, LZSS, or Deflate. The proposed architecture is based on a hash function. Previous publications relied on sequential reading of the address indicated by the hash memory; this article describes a circuit in which this address can be read in parallel from many hash memories, so that dictionary compression is possible at a rate of 1 B of the input stream per clock cycle. The high compression speed comes at the cost of a slight decrease in the compression ratio.
This paper describes a novel parallel architecture for hardware (ASIC or FPGA) implementation of a dictionary compressor, e.g. LZ77 [1], LZSS [2] or Deflate [4]. The proposed architecture allows for very fast compression: 1 B of input data per clock cycle. A standard compression architecture [8, 9] is based on sequential hash address reading (see Fig. 2) and requires M clock cycles per 1 B of input data, where M is the number of candidates for string matching, i.e. hash look-ups (M varies for different input data). In this paper every hash address is looked up in parallel (see Fig. 3). The drawback of the presented method is that M is fixed (limited), so the compression ratio is slightly degraded (see Fig. 4). To improve the compression ratio, different string lengths may be searched independently, i.e. not only 3 B, but also 4 B, … N B hashes (see results in Fig. 5, 6). Every hash memory (M(N-2)) usually requires a direct look-up in the dictionary to eliminate hash false positives or to check whether a longer string was found. To reduce the number of dictionary reads, an additional pre-elimination algorithm is proposed, so the number of dictionary reads does not increase rapidly with growing N (see Fig. 7).
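The effect of limiting the number of match candidates to a fixed M can be illustrated with a small software model. The Python sketch below keeps at most `m` recent positions per 3-byte key and examines them one after another, whereas the hardware described above reads them in parallel. It uses exact 3-byte keys instead of a hash function, searches a single string length only, and omits the pre-elimination step, so it is a simplified assumption rather than the proposed architecture; all names are hypothetical.

```python
from collections import defaultdict, deque

MIN_MATCH = 3          # shortest match worth encoding (3 B, as in Deflate)
MAX_MATCH = 258        # longest match length allowed by Deflate
WINDOW = 32 * 1024     # dictionary size


def compress(data, m=4):
    """Greedy LZ77-style parse with at most m candidates per 3-byte key.

    Returns a list of tokens: an int is a literal byte, a (distance, length)
    tuple is a back-reference.
    """
    table = defaultdict(lambda: deque(maxlen=m))  # key -> last m positions
    tokens, pos = [], 0
    while pos < len(data):
        key = data[pos:pos + MIN_MATCH]
        best_len, best_dist = 0, 0
        if len(key) == MIN_MATCH:
            for cand in table[key]:  # hardware would check these in parallel
                if pos - cand > WINDOW:
                    continue
                length = 0
                while (length < MAX_MATCH and pos + length < len(data)
                       and data[cand + length] == data[pos + length]):
                    length += 1
                if length > best_len:
                    best_len, best_dist = length, pos - cand
        if best_len >= MIN_MATCH:
            tokens.append((best_dist, best_len))
            step = best_len
        else:
            tokens.append(data[pos])
            step = 1
        for i in range(pos, pos + step):  # register the consumed positions
            k = data[i:i + MIN_MATCH]
            if len(k) == MIN_MATCH:
                table[k].append(i)
        pos += step
    return tokens


def decompress(tokens):
    """Rebuild the original bytes from literals and back-references."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            dist, length = token
            for _ in range(length):
                out.append(out[-dist])  # byte-wise copy handles overlaps
        else:
            out.append(token)
    return bytes(out)


if __name__ == "__main__":
    sample = b"abcabcabcabcxyz" * 20
    for m in (1, 4):
        print(m, len(compress(sample, m)))
```

A smaller `m` means fewer candidates are compared per input position, which is the trade-off between hardware cost and compression ratio that the abstract describes.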
The open data compression standard Deflate is widely used in .gz / .zip files and combines LZ77 / LZSS compression with Huffman coding. This article describes an FPGA implementation of data decompression according to this standard. The module is able to decompress at least 1 B per clock cycle, which gives 100 MB/s with a 100 MHz clock. To increase the speed, multiple modules can work in parallel on different input data streams.
This paper describes an FPGA implementation of a Deflate standard decoder. Deflate [1] is a commonly used compression standard employed e.g. in zip and gz files. It is based on dictionary compression (LZ77 / LZSS) [4] and Huffman coding [5]. The proposed Huffman decoder is similar to [9]; nevertheless, several improvements are proposed. Instead of a barrel shifter, a different translation function is proposed (see Tab. 1). This is a very important modification, as the barrel shifter is part of the time-critical feedback loop (see Fig. 1). In addition, the Deflate standard specifies extra bits, so a single input word may be up to 15+13=28 bits wide, although this width is very rare. Consequently, as the input buffer might not be able to feed the decoder with such wide input data, conditional decoding is proposed, in which the validity of the input data is checked after the input symbol is decoded, that is, once the actual bit width of the input symbol is known. The implementation results (Tab. 2) show that the occupied hardware resources are dominated by the number of BRAM modules, which are mostly required by the 32 kB dictionary memory. For comparison, the AXI DMA module that transfers data to and from the decoder requires logic (LUT / FF) resources comparable to those of the Deflate decoder itself.
Sorting is a common problem in computer science. There are many well-known sorting algorithms created for sequential execution on a single processor. Recently, many-core and multi-core platforms have enabled the creation of highly parallel algorithms. Standard processors now consist of multiple cores and are accompanied by hardware accelerators such as GPUs. Graphics cards, with their parallel architecture, provide new opportunities to speed up many algorithms. In this paper, we describe the results of implementing several parallel sorting algorithms on GPU cards and multi-core processors. Then, a hybrid algorithm is presented, consisting of parts executed on both platforms (a standard CPU and a GPU). The recent literature on implementing sorting algorithms on the GPU lacks a fair comparison between many-core and multi-core platforms. In most cases, it reports the execution time of sorting algorithms on the GPU platform and on a single CPU core.
The paper addresses the issue of searching for similar images and objects in a repository of information. The contained images are annotated with the help of sparse descriptors. In the presented research, different color and edge histogram descriptors were used. To measure similarities among images, various color descriptors are compared. For this purpose, different distance measures were employed. In order to decrease execution time, several code optimization and parallelization methods are proposed. The results of these experiments, as well as a discussion of the advantages and limitations of different combinations of methods, are presented.