SwePub
Search the SwePub database


Hit list for the search "L773:1936 7414 OR L773:1936 7406"

Search: L773:1936 7414 OR L773:1936 7406

  • Result 1-9 of 9
1.
  • Cevrero, Alessandro, et al. (author)
  • Field Programmable Compressor Trees : Acceleration of Multi-Input Addition on FPGAs
  • 2009
  • In: ACM Transactions on Reconfigurable Technology and Systems. - : Association for Computing Machinery (ACM). - 1936-7406. ; 2:2, s. 1-36
  • Journal article (peer-reviewed). Abstract:
    • Multi-input addition occurs in a variety of arithmetically intensive signal processing applications. The DSP blocks embedded in high-performance FPGAs perform fixed bitwidth parallel multiplication and Multiply-ACcumulate (MAC) operations. In theory, the compressor trees contained within the multipliers could implement multi-input addition; however, they are not exposed to the programmer. To improve FPGA performance for these applications, this article introduces the Field Programmable Compressor Tree (FPCT) as an alternative to the DSP blocks. By providing just a compressor tree, the FPCT can perform multi-input addition along with parallel multiplication and MAC in conjunction with a small amount of FPGA general logic. Furthermore, the user can configure the FPCT to precisely match the bitwidths of the operands being summed. Although an FPCT cannot beat the performance of a well-designed ASIC compressor tree of fixed bitwidth, for example, 9×9 and 18×18-bit multipliers/MACs in DSP blocks, its configurable bitwidth and ability to perform multi-input addition is ideal for reconfigurable devices that are used across a variety of applications.
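The compressor-tree idea above can be illustrated in plain software. The sketch below (Python, assuming unsigned integer operands; it models the arithmetic principle, not the FPCT hardware) reduces many addends with 3:2 compressors so that only one carry-propagate addition remains:

    # Sketch of multi-input addition via a carry-save compressor tree.
    # A 3:2 compressor replaces three addends with two (a "sum" vector and a
    # "carry" vector) using only bitwise logic; carries propagate only in the
    # single final addition, which is what makes compressor trees fast.

    def compress_3to2(a: int, b: int, c: int) -> tuple[int, int]:
        """Reduce three addends to two so that a + b + c == s + carry."""
        s = a ^ b ^ c                               # per-bit sum, carries ignored
        carry = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, shifted left
        return s, carry

    def multi_input_add(operands: list[int]) -> int:
        """Sum many operands with a compressor tree plus one final adder."""
        ops = list(operands)
        while len(ops) > 2:
            a, b, c = ops.pop(), ops.pop(), ops.pop()
            ops.extend(compress_3to2(a, b, c))
        return sum(ops)                             # the one carry-propagate addition

    values = [13, 7, 255, 42, 9, 1000, 3]
    assert multi_input_add(values) == sum(values)
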
2.
  • Ioannou, Aggelos D., et al. (author)
  • UNILOGIC: A Novel Architecture for Highly Parallel Reconfigurable Systems
  • 2020
  • In: ACM Transactions on Reconfigurable Technology and Systems. - : Association for Computing Machinery (ACM). - 1936-7414 .- 1936-7406. ; 13:4
  • Journal article (peer-reviewed). Abstract:
    • One of the main characteristics of High-performance Computing (HPC) applications is that they become increasingly performance and power demanding, pushing HPC systems to their limits. Existing HPC systems have not yet reached exascale performance, mainly due to power limitations. Extrapolating from today's top HPC systems, about 100-200 MWatts would be required to sustain exaflop-level performance. A promising solution for tackling power limitations is the deployment of energy-efficient reconfigurable resources (in the form of Field-Programmable Gate Arrays (FPGAs)) tightly integrated with conventional CPUs. However, current FPGA tools and programming environments are optimized for accelerating a single application, or even a single task, on a single FPGA device. In this work, we present UNILOGIC (Unified Logic), a novel HPC-tailored parallel architecture that efficiently incorporates FPGAs. UNILOGIC adopts the Partitioned Global Address Space (PGAS) model and extends it to include hardware accelerators, i.e., tasks implemented on the reconfigurable resources. The main advantages of UNILOGIC are that (i) the hardware accelerators can be accessed directly by any processor in the system, and (ii) the hardware accelerators can access any memory location in the system. In this way, the proposed architecture offers a unified environment where all the reconfigurable resources can be seamlessly used by any processor/operating system. The UNILOGIC architecture also provides hardware virtualization of the reconfigurable logic so that the hardware accelerators can be shared among multiple applications or tasks. The FPGA layer of the architecture is implemented by splitting its reconfigurable resources into (i) a static partition, which provides the PGAS-related communication infrastructure, and (ii) fixed-size, dynamically reconfigurable slots that can be programmed and accessed independently or combined together to support both fine-grain and coarse-grain reconfiguration. Finally, the UNILOGIC architecture has been evaluated on a custom prototype that consists of two 1U chassis, each of which includes eight interconnected daughter boards, called Quad-FPGA Daughter Boards (QFDBs); each QFDB supports four tightly coupled Xilinx Zynq UltraScale+ MPSoCs as well as 64 Gigabytes of DDR4 memory, so the prototype features a total of 64 Zynq MPSoCs and 1 Terabyte of memory. We tuned and evaluated the UNILOGIC prototype using both low-level (bare-metal) performance tests and two popular real-world HPC applications, one compute-intensive and one data-intensive. Our evaluation shows that UNILOGIC offers impressive performance, ranging from 2.5 to 400 times faster and 46 to 300 times more energy efficient than conventional parallel systems utilizing only high-end CPUs, while it also outperforms GPUs by a factor of 3 to 6 in terms of time to solution and 10 to 20 in terms of energy to solution.
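A minimal software sketch of the PGAS idea described above: one global address space in which both remote memory and remote accelerator windows are directly addressable by any processor. The field layout and names below are assumptions for illustration, not the UNILOGIC specification:

    from dataclasses import dataclass

    NODE_BITS, ACCEL_FLAG_BIT, OFFSET_BITS = 8, 47, 40   # assumed layout, illustration only

    @dataclass
    class GlobalAddress:
        node: int              # which node/board in the system
        is_accelerator: bool   # accelerator register window vs. ordinary DRAM
        offset: int            # local byte offset within that window

    def decode(addr: int) -> GlobalAddress:
        """Split a 64-bit global address into (node, accelerator?, offset)."""
        node = (addr >> 48) & ((1 << NODE_BITS) - 1)
        is_accel = bool((addr >> ACCEL_FLAG_BIT) & 1)
        offset = addr & ((1 << OFFSET_BITS) - 1)
        return GlobalAddress(node, is_accel, offset)

    # A load/store to this address would be routed to node 3's accelerator
    # window rather than to its local DRAM -- the property that lets any
    # processor reach any accelerator or memory location directly.
    print(decode((3 << 48) | (1 << 47) | 0x1000))
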
3.
  • Martin, Kevin, et al. (author)
  • Constraint Programming Approach to Reconfigurable Processor Extension Generation and Application Compilation
  • 2012
  • In: ACM Transactions on Reconfigurable Technology and Systems. - : Association for Computing Machinery (ACM). - 1936-7406 .- 1936-7414. ; 5:2, s. 1-38
  • Journal article (peer-reviewed). Abstract:
    • In this article, we present a constraint programming approach to solving the hard design problems that arise when automatically designing specialized processor extensions. Specifically, we discuss our approach for automatic selection and synthesis of processor extensions as well as efficient application compilation for these newly generated extensions. The approach is implemented in our integrated design framework, IFPEC, built using Constraint Programming (CP). In our framework, custom instructions, implemented as processor extensions, are defined as computational patterns and represented as graphs. This, along with the graph representation of an application, provides a way to use our CP framework, equipped with subgraph-isomorphism and connected-component constraints, for the identification of processor extensions as well as their selection, application scheduling, binding, and routing. All design steps assume architectures composed of runtime-reconfigurable cells, implementing the selected extensions, tightly connected to a processor. An advantage of our approach is the possibility of combining heterogeneous constraints to represent and solve all our design problems. Moreover, the flexibility and expressiveness of the CP framework make it possible to solve extension selection, application scheduling, and binding simultaneously and improve the quality of the generated results. The article is extensively illustrated with experimental results.
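To make the extension-identification step above concrete, here is a toy Python sketch (not the IFPEC framework, and with no CP solver): it finds occurrences of a candidate custom-instruction pattern in an application dataflow graph by naive labelled subgraph matching, which is the question the paper's subgraph-isomorphism constraints answer far more efficiently:

    from itertools import permutations

    # Dataflow graphs as {node: (operation, [predecessor nodes])}; names are made up.
    app = {
        "a": ("load", []), "b": ("load", []),
        "c": ("mul", ["a", "b"]), "d": ("add", ["c", "a"]), "e": ("store", ["d"]),
    }
    pattern = {"x": ("mul", []), "y": ("add", ["x"])}   # a multiply-accumulate shape

    def matches(assign):
        """True if operations and pattern edges are preserved under 'assign'."""
        for p, (op, preds) in pattern.items():
            if app[assign[p]][0] != op:
                return False
            if any(assign[q] not in app[assign[p]][1] for q in preds):
                return False
        return True

    found = [dict(zip(pattern, combo))
             for combo in permutations(app, len(pattern))
             if matches(dict(zip(pattern, combo)))]
    print(found)   # [{'x': 'c', 'y': 'd'}]: one MAC-shaped candidate extension
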
4.
  • Martorell, Xavier, et al. (author)
  • Introduction to the Special Section on FPL 2019
  • 2021
  • In: ACM Transactions on Reconfigurable Technology and Systems. - : Association for Computing Machinery (ACM). - 1936-7414 .- 1936-7406. ; 14:2
  • Journal article (other academic/artistic)
5.
  • Mentens, Nele, et al. (author)
  • Introduction to the Special Section on FPL 2020
  • 2022
  • In: ACM Transactions on Reconfigurable Technology and Systems. - : Association for Computing Machinery (ACM). - 1936-7414 .- 1936-7406. ; 15:4
  • Journal article (other academic/artistic)
6.
  • Panerati, Jacopo, et al. (author)
  • Coordination of Independent Loops in Self-Adaptive Systems
  • 2014
  • In: ACM Transactions on Reconfigurable Technology and Systems. - : Association for Computing Machinery (ACM). - 1936-7406 .- 1936-7414. ; 7:2, s. 12-16
  • Journal article (peer-reviewed). Abstract:
    • Nowadays, the same piece of code should run on different architectures, providing performance guarantees in a variety of environments and situations. To this end, designers often integrate existing systems with ad-hoc adaptive strategies able to tune specific parameters that impact performance or energy—for example, frequency scaling. However, these strategies interfere with one another and unpredictable performance degradation may occur due to the interaction between different entities. In this article, we propose a software approach to reconfiguration when different strategies, called loops, are encapsulated in the system and are available to be activated. Our solution to loop coordination is based on machine learning and it selects a policy for the activation of loops inside of a system without prior knowledge. We implemented our solution on top of GNU/Linux and evaluated it with a significant subset of the PARSEC benchmark suite.
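The loop-selection problem above can be pictured with a small stand-in: the sketch below uses a simple epsilon-greedy bandit (a substitution for illustration, not the authors' learning method) to decide, without prior knowledge, which hypothetical adaptation loop to activate each epoch, based on a measured reward such as performance per watt:

    import random

    loops = ["frequency_scaling", "thread_tuning", "idle"]   # hypothetical loop names
    value = {name: 0.0 for name in loops}    # running estimate of each loop's reward
    count = {name: 0 for name in loops}
    EPSILON = 0.1

    def choose() -> str:
        if random.random() < EPSILON:                  # explore occasionally
            return random.choice(loops)
        return max(loops, key=lambda n: value[n])      # otherwise exploit the best estimate

    def update(name: str, reward: float) -> None:
        """Incrementally average the observed reward for the activated loop."""
        count[name] += 1
        value[name] += (reward - value[name]) / count[name]

    true_reward = {"frequency_scaling": 0.6, "thread_tuning": 0.8, "idle": 0.3}
    for epoch in range(200):
        loop = choose()
        update(loop, true_reward[loop] + random.uniform(-0.05, 0.05))  # noisy measurement

    print(max(loops, key=lambda n: value[n]))   # converges to "thread_tuning"
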
7.
  • Ramakrishnan Geethakumari, Prajith, 1986, et al. (author)
  • Stream Aggregation with Compressed Sliding Windows
  • 2023
  • In: ACM Transactions on Reconfigurable Technology and Systems. - 1936-7414 .- 1936-7406. ; 16:3
  • Journal article (peer-reviewed). Abstract:
    • High performance stream aggregation is critical for many emerging applications that analyze massive volumes of data. Incoming data needs to be stored in a sliding window during processing, in case the aggregation functions cannot be computed incrementally. Updating the window with new incoming values and reading it to feed the aggregation functions are the two primary steps in stream aggregation. Although window updates can be supported efficiently using multi-level queues, frequent window aggregations remain a performance bottleneck as they put tremendous pressure on the memory bandwidth and capacity. This article addresses this problem by enhancing StreamZip, a dataflow stream aggregation engine that is able to compress the sliding windows. StreamZip deals with a number of data and control dependency challenges to integrate a compressor in the stream aggregation pipeline and alleviate the memory pressure posed by frequent aggregations. In addition, StreamZip incorporates a caching mechanism for dealing with skewed-key distributions in the incoming data stream. In doing so, StreamZip offers higher throughput as well as larger effective window capacity to support larger problems. StreamZip supports diverse compression algorithms offering both lossless and lossy compression to integers as well as floating-point numbers. Compared to designs without compression, StreamZip lossless and lossy designs achieve up to 7.5× and 22× higher throughput, while improving the effective memory capacity by up to 5× and 23×, respectively.
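The core trade-off above, spending compute on compression to relieve memory pressure, can be shown with a toy Python model. It is a software sketch, not StreamZip's dataflow hardware, and uses plain delta encoding of integers as a stand-in compressor:

    from collections import deque

    class CompressedWindow:
        """Count-based sliding window kept as one base value plus deltas."""

        def __init__(self, size: int):
            self.size = size
            self.base = None        # oldest value, stored in full
            self.deltas = deque()   # each later value as a delta from its predecessor

        def insert(self, value: int) -> None:
            """Slide the window: append the new value, evict the oldest if full."""
            if self.base is None:
                self.base = value
                return
            self.deltas.append(value - self._last())
            if 1 + len(self.deltas) > self.size:
                self.base += self.deltas.popleft()   # fold the evicted delta into the base

        def _last(self) -> int:
            return self.base + sum(self.deltas)

        def aggregate_sum(self) -> int:
            """Decompress on the fly and feed the aggregation function (here: sum)."""
            total = current = self.base
            for d in self.deltas:
                current += d
                total += current
            return total

    w = CompressedWindow(size=4)
    for x in [10, 12, 11, 15, 20, 21]:
        w.insert(x)
    print(w.aggregate_sum())   # 11 + 15 + 20 + 21 == 67
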
8.
  • Ul-Abdin, Zain, 1975-, et al. (author)
  • A Retargetable Compilation Framework for Heterogeneous Reconfigurable Computing
  • 2016
  • In: ACM Transactions on Reconfigurable Technology and Systems. - New York, NY : ACM Special Interest Group on Computer Science Education. - 1936-7406 .- 1936-7414. ; 9:4
  • Journal article (peer-reviewed). Abstract:
    • The trend in microprocessors for advanced embedded systems is toward massively parallel reconfigurable architectures, consisting of heterogeneous ensembles of hundreds of processing elements communicating over a reconfigurable interconnection network. However, mastering the low-level micro-architectural details involved in programming such massively parallel platforms is cumbersome, which limits their adoption in many applications. There is thus a pressing need for an approach that produces high-performance, scalable implementations harnessing the computational resources of the emerging reconfigurable platforms. This paper addresses the challenge of making these diverse reconfigurable platforms accessible by suggesting the use of a high-level language, occam-pi, and developing a complete design flow for building, compiling, and generating machine code for heterogeneous coarse-grained hardware. We have evaluated the approach by implementing complex industrial case studies and three common signal processing algorithms. The results of the implemented case studies suggest that the occam-pi-based approach, because of its well-defined semantics for expressing concurrency and reconfigurability, simplifies the development of applications employing run-time reconfigurable devices. The associated compiler framework ensures portability as well as performance benefits across heterogeneous platforms.
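For readers unfamiliar with occam-pi's programming model, the toy below mimics its process-and-channel (CSP) style with Python threads and queues. It only illustrates the style of program the compiler framework maps onto heterogeneous hardware; it is not the authors' tool chain:

    from queue import Queue
    from threading import Thread

    def producer(out: Queue) -> None:
        for x in range(5):
            out.put(x)          # roughly "out ! x" in occam-pi
        out.put(None)           # end-of-stream marker

    def square(inp: Queue, out: Queue) -> None:
        while (x := inp.get()) is not None:   # roughly "inp ? x"
            out.put(x * x)
        out.put(None)

    def consumer(inp: Queue) -> None:
        while (x := inp.get()) is not None:
            print(x)

    a, b = Queue(), Queue()
    processes = [Thread(target=producer, args=(a,)),
                 Thread(target=square, args=(a, b)),
                 Thread(target=consumer, args=(b,))]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
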
9.
  • Umuroglu, Yaman, et al. (author)
  • Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing
  • 2019
  • In: ACM Transactions on Reconfigurable Technology and Systems. - : Association for Computing Machinery (ACM). - 1936-7406 .- 1936-7414. ; 12:3
  • Journal article (peer-reviewed). Abstract:
    • Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing, previously utilized the excellent binary-operation performance of FPGAs to offer a matrix multiplication performance that scales with required precision and parallelism. We show how BISMO can be scaled up on Xilinx FPGAs using an arithmetic architecture that better utilizes six-input LUTs. The improved BISMO achieves a peak performance of 15.4 binary TOPS on the Ultra96 board with a Xilinx UltraScale+ MPSoC.
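The bit-serial scheme above has a compact software analogue. The Python sketch below (a software model of the arithmetic, not the BISMO overlay itself) splits unsigned matrices into bit planes, multiplies each pair of planes with AND plus popcount, and weights the binary partial products by powers of two, so cost and precision scale with the number of planes processed:

    def bit_plane(M, i):
        """Binary matrix holding bit i of every element of M."""
        return [[(x >> i) & 1 for x in row] for row in M]

    def binary_matmul(A, B):
        """Product of 0/1 matrices: AND plus popcount per output element."""
        return [[sum(a & b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

    def bit_serial_matmul(A, B, wa, wb):
        """Multiply unsigned wa-bit and wb-bit matrices one pair of bit planes at a time."""
        rows, cols = len(A), len(B[0])
        C = [[0] * cols for _ in range(rows)]
        for i in range(wa):
            for j in range(wb):
                P = binary_matmul(bit_plane(A, i), bit_plane(B, j))
                for r in range(rows):
                    for c in range(cols):
                        C[r][c] += P[r][c] << (i + j)    # weight the partial product by 2^(i+j)
        return C

    A = [[3, 1], [2, 5]]          # 3-bit unsigned operands
    B = [[7, 2], [4, 6]]
    print(bit_serial_matmul(A, B, wa=3, wb=3))   # [[25, 12], [34, 34]]
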